Remove Duplicate Rows using a CTE
June 18, 2012
Duplicate data has a way of sneaking into database tables, so it's nice to have a quick way to find and remove those duplicate rows. My favorite method is to use a Common Table Expression (CTE) that utilizes ROW\_NUMBER, OVER, and PARTITION BY to group the duplicates together. Then, you can simply DELETE WHERE RowNumber > 1 (assuming you want to keep the first row only).
In the sample below, we create a temp table, and intentionally allow some duplicate data to sneak in. If you run the sample as is, you will see the raw output of the CTE, with the duplicated rows containing multiple RowNumbers (our primary key in this case would be TimeZoneId and AdjustmentStart).
--Create a temp table to demonstrate deduping CREATE TABLE #TimeZoneOffsets (TimeZoneId varchar(75), AdjustmentStart datetime, AdjustmentEnd datetime, UTCOffset varchar(6)) ; INSERT INTO #TimeZoneOffsets SELECT 'Central Standard Time', '20120101 00:00', '20120311 01:59:59', '-06:00' UNION SELECT 'Central Standard Time', '20120311 02:00', '20121104 01:59:59', '-05:00' UNION SELECT 'Central Standard Time', '20121104 02:00', '20121231 23:59:59', '-06:00' UNION ALL --Allow duplicate SELECT 'Central Standard Time', '20121104 02:00', '20121231 23:59:59', '-06:00' ; WITH GroupedDupes AS (SELECT RowNumber = ROW_NUMBER() OVER (PARTITION BY TimeZoneId, AdjustmentStart ORDER BY TimeZoneId, AdjustmentStart) , TimeZoneId , AdjustmentStart FROM #TimeZoneOffsets) SELECT * FROM GroupedDupes ; -- Use to display duplicate rows --DELETE FROM GroupedDupes WHERE RowNumber > 1 ; -- Use to delete duplicate rows DROP TABLE #TimeZoneOffsets ;