Friday 25 February 2011

T-SQL: Joins Explained

I dare say that if you write T-SQL, you'll have had to write a JOIN in your time. Although much of this is quite basic, it doesn't hurt to be reminded, and I hope the examples are clear enough to illustrate the different types of JOIN.

USE tempdb
GO
-- setup tables
CREATE TABLE dbo.tblA (i1 INT, c1 CHAR(1), i2 INT)
GO
CREATE TABLE dbo.tblB (i1 INT, c1 CHAR(1), i2 INT)
GO

-- insert some dummy data
INSERT INTO dbo.tblA
SELECT 1, 'A', 6 UNION
SELECT 2, 'A', 5 UNION
SELECT 3, 'A', 4 UNION
SELECT 4, 'A', 3 UNION
SELECT 5, 'A', 2 UNION
SELECT 6, 'A', 1
GO

INSERT INTO dbo.tblB
SELECT 1, 'B', 6 UNION
SELECT 2, 'B', 3 UNION
SELECT 3, 'B', 4 UNION
SELECT 7, 'B', 1
GO

-- INNER JOIN (Just matches)
-- so total rows will be number of matched records
SELECT *
FROM dbo.tblA a
INNER JOIN dbo.tblB b
  ON a.i1 = b.i1
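-- with the sample data, this should return just the 3 rows whose i1
-- values (1, 2 and 3) exist in both tables:
/*
i1  c1  i2  i1  c1  i2
--- --- --- --- --- ---
1   A   6   1   B   6
2   A   5   2   B   3
3   A   4   3   B   4
*/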

-- LEFT JOIN (All records from left table, NULLs in the right table if no match)
-- so total rows will be number of records in the left table (tblA)
SELECT *
FROM dbo.tblA a
LEFT OUTER JOIN dbo.tblB b
  ON a.i1 = b.i1
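-- with the sample data, all 6 tblA rows should come back, with NULLs on
-- the tblB side for the unmatched i1 values 4, 5 and 6:
/*
i1  c1  i2  i1    c1    i2
--- --- --- ----  ----  ----
1   A   6   1     B     6
2   A   5   2     B     3
3   A   4   3     B     4
4   A   3   NULL  NULL  NULL
5   A   2   NULL  NULL  NULL
6   A   1   NULL  NULL  NULL
*/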

-- RIGHT JOIN (All records from the right table, NULLs in the left table if no match)
-- so total rows will be number of records in the right table (tblB)
SELECT *
FROM dbo.tblA a
RIGHT OUTER JOIN dbo.tblB b
  ON a.i1 = b.i1
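-- with the sample data, all 4 tblB rows should come back, with NULLs on
-- the tblA side for the unmatched i1 value 7:
/*
i1    c1    i2    i1  c1  i2
----  ----  ----  --- --- ---
1     A     6     1   B   6
2     A     5     2   B   3
3     A     4     3   B   4
NULL  NULL  NULL  7   B   1
*/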

-- FULL JOIN (all records from both tables, matched or not)
-- so total rows will be rows in the left table + rows in the right table - number of matches
SELECT *
FROM dbo.tblA a
FULL OUTER JOIN dbo.tblB b
  ON a.i1 = b.i1
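-- with the sample data: 6 + 4 - 3 matches = 7 rows (the 3 matched rows,
-- the 3 unmatched tblA rows with NULLs on the right, and the unmatched
-- tblB row with NULLs on the left)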

-- CROSS JOIN (cartesian product of both tables)
-- total rows of left table * right table
SELECT *
FROM dbo.tblA
CROSS JOIN dbo.tblB
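-- with the sample data: 6 x 4 = 24 rows, every tblA row paired with every tblB row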

-- "de Facto" INNER JOINs (using an OUTER join with a WHERE clause)
SELECT *
FROM dbo.tblA a
  
LEFT OUTER JOIN dbo.tblB b
      
ON a.i1 = b.i1
WHERE b.i1 = 1
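-- the WHERE clause discards every row except the one where b.i1 = 1,
-- including all the NULL rows the outer join would have preserved:
/*
i1  c1  i2  i1  c1  i2
--- --- --- --- --- ---
1   A   6   1   B   6
*/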

-- "de Facto" INNER JOINs (using an OUTER join with a WHERE clause)
-- here the outer join is negated by the fact that you have included
-- filtering criteria which removes the NULL values
SELECT *
FROM dbo.tblA a
  
LEFT OUTER JOIN dbo.tblB b
      
ON a.i1 = b.i1
WHERE a.i2 = b.i2
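-- NULL = NULL evaluates to UNKNOWN, so the unmatched rows are filtered
-- out and, with the sample data, only the 2 rows matching on both
-- columns remain:
/*
i1  c1  i2  i1  c1  i2
--- --- --- --- --- ---
1   A   6   1   B   6
3   A   4   3   B   4
*/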

-- it's called a de facto INNER JOIN as it's the same as writing:
SELECT *
FROM dbo.tblA a
INNER JOIN dbo.tblB b
  ON a.i1 = b.i1
  AND a.i2 = b.i2

-- the correct way of writing this query as a LEFT JOIN would be
SELECT *
FROM dbo.tblA a
LEFT OUTER JOIN dbo.tblB b
  ON a.i1 = b.i1
  AND a.i2 = b.i2
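-- this preserves all 6 tblA rows; with the sample data only i1 = 1 and
-- i1 = 3 match on both columns:
/*
i1  c1  i2  i1    c1    i2
--- --- --- ----  ----  ----
1   A   6   1     B     6
2   A   5   NULL  NULL  NULL
3   A   4   3     B     4
4   A   3   NULL  NULL  NULL
5   A   2   NULL  NULL  NULL
6   A   1   NULL  NULL  NULL
*/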

-- tidy up
DROP TABLE dbo.tblA
DROP TABLE dbo.tblB


Some resources:

INNER JOIN - http://msdn.microsoft.com/en-us/library/ms190014.aspx
OUTER JOIN - http://msdn.microsoft.com/en-us/library/ms187518.aspx
CROSS JOIN - http://msdn.microsoft.com/en-us/library/ms190690.aspx

Friday 18 February 2011

Admin: Data Compression Storage Savings

Data compression was introduced in SQL2008 (though sadly just in Enterprise/Developer Editions) and recently I was asked what sort of benefits you can get from it. So I thought I'd jot down a quick example to illustrate the reductions in space the different types of compression give you, and also an unexpected behaviour you get from using the REBUILD command.
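
Incidentally, if you want a feel for the likely savings before committing to a rebuild, SQL2008 also ships with sp_estimate_data_compression_savings, which samples the object into tempdb and reports current versus estimated sizes. A minimal sketch against the table used below (NULL for the index and partition parameters covers everything):

-- estimate PAGE compression savings without actually rebuilding anything
EXEC sp_estimate_data_compression_savings
  @schema_name = 'dbo',
  @object_name = 'tblOutput',
  @index_id = NULL,
  @partition_number = NULL,
  @data_compression = 'PAGE'
GO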

Here I'm only speaking about storage savings. As with most things, there is a cloud to this silver lining, although just how gloomy the cloud is will be the subject of a different post. Essentially though, compression gains disk space and IO read performance at the cost of CPU. Anyway, on to the example:

-- create the object
CREATE TABLE [dbo].[tblOutput](
  [Output_ID] [int] NULL,
  [T1] [int] NULL,
  [T2] [int] NULL,
  [Value] [varchar](255) NULL,
  [Run_ID] [int] NULL,
  [Date_Created] [datetime] NULL
) ON [PRIMARY]

GO
CREATE NONCLUSTERED INDEX [IDX_tblOutput] ON [dbo].[tblOutput]
(
  [Output_ID] ASC
) ON [PRIMARY]
GO

-- I populate this table with a 3rd party application which throws in plenty of rows and "random" values.

-- let's take a look at how much space this takes
EXEC sp_spaceused 'tblOutput'
GO

/*
name      rows        reserved    data        index_size  unused
--------- ----------- ----------  ----------  ----------  -------
tblOutput 19240000    1666896 KB  1025216 KB  641440 KB   240 KB
*/

-- and look into the physical stats DMV
SELECT index_type_desc, avg_page_space_used_in_percent, avg_fragmentation_in_percent,
  page_count, record_count, compressed_page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('tblOutput'), NULL, NULL, 'DETAILED')

/*

NB: WOW CHECK OUT THAT FRAGMENTATION ON THE NONCLUSTERED INDEX!!

index_type_desc     avg_page_space_   avg_fragmentation                            compressed_
                    used_in_percent   _in_percent        page_count  record_count  page_count
------------------- ----------------- -----------------  ----------- ------------  ------------
HEAP                99.8007659995058  6.26560159760359   128152      19240000      0
NONCLUSTERED INDEX  53.5752656288609  99.4513134653689   79827       19240000      0
NONCLUSTERED INDEX  68.7832592043489  100                344         79827         0
NONCLUSTERED INDEX  50.9574993822585  100                2           344           0
NONCLUSTERED INDEX  0.531257721769212 0                  1           2             0

*/

-- first let's try ROW compression
ALTER TABLE tblOutput
REBUILD WITH (DATA_COMPRESSION = ROW)
GO

/*
name      rows        reserved    data        index_size  unused
--------- ----------- ----------  ----------  ----------  -------
tblOutput 19240000    1153064 KB  808888 KB   343896 KB   280 KB
*/

/*

NB: THE NONCLUSTERED INDEX HAS BEEN DEFRAGMENTED!! WASN'T EXPECTING THAT!

index_type_desc     avg_page_space_   avg_fragmentation                           compressed_
                    used_in_percent   _in_percent        page_count record_count  page_count
------------------------------------  -----------------  ---------- ------------  -----------
HEAP                99.7074623177662  0.126572264852464  101112     19240000      0
NONCLUSTERED INDEX  99.8241907585866  0.01               42852      19240000      0
NONCLUSTERED INDEX  99.2432666172473  2.34375            128        42852         0
NONCLUSTERED INDEX  37.929330368174   0                  1          128           0
*/

-- now let's enable some PAGE compression (remember, this includes ROW compression)
ALTER TABLE tblOutput
REBUILD WITH (DATA_COMPRESSION = PAGE)
GO

/*
name      rows        reserved    data        index_size  unused
--------- ----------- ----------  ----------  ----------  -------
tblOutput 19240000    904104 KB   559920 KB   343896 KB   288 KB
*/

/*
index_type_desc     avg_page_space_   avg_fragmentation                           compressed_
                    used_in_percent   _in_percent        page_count record_count  page_count
------------------  ----------------  -----------------  ---------- ------------  -----------
HEAP                99.8160612799605  0.136798905608755  70149      19240000      70145
NONCLUSTERED INDEX  99.8241907585866  0.01               42852      19240000      0
NONCLUSTERED INDEX  99.2432666172473  2.34375            128        42852         0
NONCLUSTERED INDEX  37.929330368174   0                  1          128           0
*/
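
-- note the compressed_page_count for the nonclustered index is still 0:
-- ALTER TABLE ... REBUILD only compresses the heap (or clustered index),
-- so to compress the index as well you'd rebuild it separately, e.g.
-- (left commented out so the figures above stay comparable):
-- ALTER INDEX [IDX_tblOutput] ON [dbo].[tblOutput] REBUILD WITH (DATA_COMPRESSION = PAGE)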

-- and just revert the compression on the table
ALTER TABLE tblOutput
REBUILD WITH (DATA_COMPRESSION = NONE)
GO

/*
name      rows        reserved    data        index_size  unused
--------- ------------ ----------  ----------  ----------  -------
tblOutput 19240000    1369384 KB  1025232 KB  343896 KB   256 KB
*/

/*
index_type_desc     avg_page_space_   avg_fragmentation                           compressed_
                    used_in_percent   _in_percent        page_count record_count  page_count
------------------  ----------------  -----------------  ---------- ------------  -----------
HEAP                99.7992092908327  0.118586942953439  128154     19240000      0
NONCLUSTERED INDEX  99.8241907585866  0.01               42852      19240000      0
NONCLUSTERED INDEX  99.2432666172473  3.125              128        42852         0
NONCLUSTERED INDEX  37.929330368174   0                  1          128           0
*/



You can view the documentation on the ALTER TABLE REBUILD command here, but it doesn't mention that nonclustered indexes will be rebuilt as a result of a REBUILD command.

Friday 11 February 2011

Admin: how much data do/will you have?

It's important to understand the storage requirements of your database. In many applications I've worked with, the bulk of the data is stored in one table, and it's this table which is most sensitive to data load.

So with that in mind, you need to be able to predict just how large your table(s) is likely to get. If you've "inherited" a database, you can use existing data to make current predictions and harness some of the procedures and DMVs that SQL Server makes available. If you're designing a database from scratch, you can make an informed guess by using the algorithm posted on MSDN.

Existing Database

So how can you gauge the size of a database and the objects within it? Let's look at the things at your disposal, which give varying degrees of accuracy:

1) The last backup size - this will be for the entire database and will also include the data in your transaction log
2) The size of the mdf (+ndfs) - this won't tell you exactly how much data you have, as the files may have been set to a certain size and not be "full" of data
3) Right-click the database in SSMS (or run DBCC SHOWFILESTATS) - this gives you the total extents used in the database, from which you can get a ballpark figure of how much space is required (an extent is eight 8KB pages, i.e. 64KB), but you'll need to remember that not all pages/extents will be "full" of data
4) EXEC sp_spaceused - this can be run for the entire database or just a single table
5) Query sys.partitions and sys.allocation_units to see how many pages an object is using

Empty Database

This article provides a method of estimating the space required in a table:
http://msdn.microsoft.com/en-us/library/ms175991.aspx
The article gives some caveats as to why this is only an estimate; in my analysis, the actual size of the data in the table was approximately 10% more than the estimate.
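
To give a feel for the arithmetic, here is the MSDN heap calculation sketched out for the table below. The 18-byte average for the Value column is my assumption, so treat the final figure as illustrative rather than a reproduction of the estimates quoted in the script:

-- Fixed_Data_Size    = 4 x INT + 1 x DATETIME = 16 + 8  = 24 bytes
-- Null_Bitmap        = 2 + ((6 columns + 7) / 8)        = 3 bytes
-- Variable_Data_Size = 2 + (1 var column x 2) + 18 avg  = 22 bytes
-- Row_Size           = 24 + 22 + 3 + 4 (row header)     = 53 bytes
-- Rows_Per_Page      = 8096 / (53 + 2)                  = 147
-- Num_Pages          = 19240000 / 147                   = 130885
-- Heap_Size          = 130885 x 8192 bytes              ~ 1047080 KB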

Example:

CREATE TABLE [dbo].[tblOutput](
  [Output_ID] INT NULL,
  [T1] INT NULL,
  [T2] INT NULL,
  [Value] VARCHAR(255) NULL,
  [Run_ID] INT NULL,
  [Date_Created] DATETIME NULL
) ON [PRIMARY]
GO
-- 19240000 rows in table
-- ESTIMATED SIZE INFORMATION
-- 995820 kb simple estimate based upon #rows x average row size (not accurate, assumes pages are full)
-- 1012624 kb based upon MSDN algorithm to estimate data

-- ACTUAL SIZE INFORMATION
DBCC SHOWCONTIG('tblOutput')
-- 19678 Total Extents for the table DBCC SHOWCONTIG
-- 157392 Total Pages for the table DBCC SHOWCONTIG
EXEC sp_spaceused 'tblOutput'
-- 1905488 kb (including indexes)
-- 1259136 kb (just data)
DBCC SHOWFILESTATS
-- 1907648 kb based upon 29807 Extents (DBCC SHOWFILESTATS)

SELECT *
FROM sys.allocation_units u
INNER JOIN sys.partitions p
  ON p.partition_id = u.container_id
WHERE OBJECT_NAME(p.OBJECT_ID) = 'tblOutput'
-- used 157393 pages ~ 1259144 kb

SELECT page_count, record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('tblOutput'), NULL, NULL, 'DETAILED')
-- 157393 pages, 19240000 rows


You can see that, depending on where you look for your information, you get differing results, but the differences are small enough that I wouldn't be too concerned about which method is used when it comes to predicting data growth. There are so many factors that can be introduced (notably indexes) that it's questionable just how valuable these figures are in isolation.

As for which one to use, I suppose it makes sense to use sp_spaceused, as this is likely to be maintained by MS as the truth on storage - or be deprecated and replaced.

Friday 4 February 2011

SSRS: where should I do my grouping?

I'm no SSRS expert and one of the things I often struggle with is where to group my data in a report. Essentially, you have the option of doing the aggregations within your T-SQL query or returning a standard dataset and then letting the Report Server engine manage the groupings.

Here are a few screenshots to illustrate the different approaches which both yield the same results.

Using T-SQL to Group

1) Write your grouping query in the dataset dialog box:
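
The dataset query does the aggregation itself; it was something along these lines (table and column names here are illustrative only):

SELECT Customer_ID, COUNT(*) AS Order_Count
FROM dbo.tblOrders
GROUP BY Customer_ID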



2) Attach your dataset to a table and drop in the columns - no additional changes are necessary in the reporting layer:



Using SSRS to Group

1) Write a simple query returning the dataset of records you want to display:
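
This time the dataset query returns one row per record and leaves the aggregation to the report; again with illustrative names:

SELECT Customer_ID
FROM dbo.tblOrders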



2) Create a table and attach your dataset but edit the Details group to add your grouping:



3) Drop your columns into the table but notice how there is a function around the aggregate value (by default SUM which I have changed to COUNT)
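
Using the illustrative names from the queries above, the generated expression would be something like =Sum(Fields!Customer_ID.Value), which I would edit to =Count(Fields!Customer_ID.Value) to match the results of the T-SQL version.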



I am much more comfortable writing T-SQL statements and have less of an understanding of SSRS grouping behaviours, so this tends to lead me down the T-SQL route. However, if you are running reports on a high-volume OLTP database then it may make sense to move as much load as possible off the database and onto the report server. My example is also quite limited in that it only caters for a single grouping. As you add aggregations to your queries you may want to make use of SSRS rendering groups to avoid repetition of data. In this instance, it probably makes sense to do all the grouping in one place rather than have a hybrid solution of some T-SQL grouping and some SSRS grouping.