Tuesday 10 May 2011

T-SQL: Deduping with the CTE


NOTE: I'm re-publishing this blog post to join in T-SQL Tuesday #18 hosted by Bob Pusateri (blog | @SQLBob). This post had previously been published on 08/04/2011.


I blogged a while back on just how much I love CTEs due to their simplicity, ease of use, intuitiveness and flexibility. One of my most common uses of them is removing duplicates from a table. Its easy enough to write a query to find dupes, especially with the ranking functions but the great thing about using a CTE is that you can run your UPDATE/DELETE command on the resultant CTE and it effects the base table. This means you don't have to run a separate query to find the dupes and then join this back to your original table.

Here's an example:

-- set up our test dataset of all FIFA World Cup Winners
CREATE TABLE dbo.WorldCupWinners (Yr DATE, Winner NVARCHAR(255))
INSERT INTO dbo.WorldCupWinners VALUES ('19300101', 'Uruguay')
INSERT INTO dbo.WorldCupWinners VALUES ('19340101', 'Italy')
INSERT INTO dbo.WorldCupWinners VALUES ('19380101', 'Italy')
INSERT INTO dbo.WorldCupWinners VALUES ('19500101', 'Uruguay')
INSERT INTO dbo.WorldCupWinners VALUES ('19540101', 'Germany')
INSERT INTO dbo.WorldCupWinners VALUES ('19580101', 'Brazil')
INSERT INTO dbo.WorldCupWinners VALUES ('19620101', 'Brazil')
INSERT INTO dbo.WorldCupWinners VALUES ('19660101', 'England')
INSERT INTO dbo.WorldCupWinners VALUES ('19700101', 'Brazil')
INSERT INTO dbo.WorldCupWinners VALUES ('19740101', 'Germany')
INSERT INTO dbo.WorldCupWinners VALUES ('19780101', 'Argentina')
INSERT INTO dbo.WorldCupWinners VALUES ('19820101', 'Italy')
INSERT INTO dbo.WorldCupWinners VALUES ('19860101', 'Argentina')
INSERT INTO dbo.WorldCupWinners VALUES ('19900101', 'Germany')
INSERT INTO dbo.WorldCupWinners VALUES ('19940101', 'Brazil')
INSERT INTO dbo.WorldCupWinners VALUES ('19980101', 'France')
INSERT INTO dbo.WorldCupWinners VALUES ('20020101', 'Brazil')
INSERT INTO dbo.WorldCupWinners VALUES ('20060101', 'Italy')
INSERT INTO dbo.WorldCupWinners VALUES ('20100101', 'Spain');

-- we only want to get the distinct countries that have won
-- along with their first triumph which will be the FirstWin = 1
WITH wins
AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Winner ORDER BY Yr ASC) AS FirstWin
FROM dbo.WorldCupWinners
)
DELETE
FROM
wins -- note how we're deleting from the CTE and NOT the base table
WHERE FirstWin > 1
GO

-- And hey presto, we have our desired output
SELECT *
FROM dbo.WorldCupWinners

/*
Yr         Winner
1930-01-01 Uruguay
1934-01-01 Italy
1954-01-01 Germany
1958-01-01 Brazil
1966-01-01 England
1978-01-01 Argentina
1998-01-01 France
2010-01-01 Spain
*/

2 comments:

  1. Great example, and thanks so much for contributing to T-SQL Tuesday! I may have to adapt this for my own use....

    ReplyDelete
  2. Excellent example. I've done such deduping a few times in the past, but it was good to see it laid out so clearly in a good example.

    ReplyDelete

/* add this crazy stuff in so i can use syntax highlighter