Duplicate data has a way of sneaking into database tables, so it's nice to have a quick way to find and remove those duplicate rows...

Remove Duplicate Rows using a CTE

June 18, 2012

Duplicate data has a way of sneaking into database tables, so it's nice to have a quick way to find and remove those duplicate rows. My favorite method is to use a Common Table Expression (CTE) that utilizes ROW\_NUMBER, OVER, and PARTITION BY to group the duplicates together. Then, you can simply DELETE WHERE RowNumber > 1 (assuming you want to keep the first row only).

In the sample below, we create a temp table, and intentionally allow some duplicate data to sneak in. If you run the sample as is, you will see the raw output of the CTE, with the duplicated rows containing multiple RowNumbers (our primary key in this case would be TimeZoneId and AdjustmentStart).

--Create a temp table to demonstrate deduping
CREATE TABLE #TimeZoneOffsets (TimeZoneId varchar(75), AdjustmentStart datetime, 
                               AdjustmentEnd datetime, UTCOffset varchar(6)) ;
INSERT INTO #TimeZoneOffsets
SELECT 'Central Standard Time', '20120101 00:00', '20120311 01:59:59', '-06:00' UNION
SELECT 'Central Standard Time', '20120311 02:00', '20121104 01:59:59', '-05:00' UNION
SELECT 'Central Standard Time', '20121104 02:00', '20121231 23:59:59', '-06:00' UNION ALL --Allow duplicate
SELECT 'Central Standard Time', '20121104 02:00', '20121231 23:59:59', '-06:00' ;

WITH GroupedDupes AS
(SELECT RowNumber = ROW_NUMBER() 
    OVER (PARTITION BY TimeZoneId, AdjustmentStart ORDER BY TimeZoneId, AdjustmentStart) 
, TimeZoneId
, AdjustmentStart
FROM #TimeZoneOffsets)

SELECT * FROM GroupedDupes ; -- Use to display duplicate rows
--DELETE FROM GroupedDupes WHERE RowNumber > 1 ; -- Use to delete duplicate rows 
DROP TABLE #TimeZoneOffsets ;