|
Return to Newsletter Contents...
Finding and Removing Duplicate Records?
by: Charles Tournear, Sr. BI Consultant, CRCP, MCSE, MCT, MCDBA, MCSD
Suppose a report has been created on a SQL Server table and it is showing what
appears to be duplicate data for several of the records.
This allows two possibilities.
Either the query for the report is not defined correctly or there are
duplicate records in the table.
The first step then, is to determine whether duplicate data exists.
This can be done with a simple query using a group by clause.
We need to first determine which column or columns indicate that
duplicate data exists. For example
if the Social Security Number field is not supposed to be duplicated in a table
of personal, then this may be the only field we need to check for duplicates.
For other tables we may need to use all the columns in the table to see
true duplicates.
The easiest way to find duplicates is to use the count(*) aggregate function in
association with the group by clause.
For the SSN scenario above the code would look like:
SELECT
ssn, count(*)
FROM
personal
GROUP BY
ssn
HAVING
count(*) >
1
For the scenario using all fields:
SELECT
first_name, middle_init, last_name,
address1,
address2, city, state, zip, phone, count(*)
FROM
personal
GROUP BY
first_name, middle_init, last_name,
address1,
address2, city, state, zip, phone
HAVING
count(*)>
1
Note the Having clause in each query.
Those records that have duplicates in all fields listed on the group by
line will have a count(*) greater than 1.
This code then lets us know whether or not duplicates do indeed exist or
not. If no rows are returned then
there are no duplicates based on the columns that we have chosen to group by.
If duplicates are found then the next step is to somehow select all but one of
each record. The choice for how to
do this will depend on whether unique identity columns exist within the table or
not.
Let’s first look at the example where the table has a unique identity column. In
this case it’s a simple id field of integer type that has an identity (1,1)
starting with 1 and incrementing by 1.
This guarantees that even with duplicate records there is one field that
uniquely defines a specific row in the table.
We need to first be able to uniquely identify each row within a group of
duplicate records. In T-SQL we can
use the RANK function to give a number for each record in the group of
duplicates.
SELECT
ssn, RANK() OVER(PARTITION BY ssn
ORDER BY id) AS
rank
FROM
personal
This provides a list of rows ranking each group of duplicates.
The RANK function automatically numbers rows 1,2,3,4, etc. with a group.
The OVER clause can be used to determine the sort order that the RANK
function will use to number the records.
It can also be used to determine the groupings through the PARTITION BY
clause inside the OVER. The fields
after PARTITION BY need to be all the fields that were used to determine the
duplicates. In other words all the
fields that were used in the GROUP BY clause to find the duplicates in the first
place. The ORDER BY determines the sort
order of the records within the group.
So why didn’t I sort or ORDER BY the ssn field?
The RANK function will supply the same rank value for rows that have the
same value in the ORDER BY clause.
For example if I had ordered by ssn field the output would look like.
111-11-1111
1
111-11-1111
1
222-22-2222
1
222-22-2222
1
By using the unique ID field the output becomes:
111-11-1111
1
111-11-1111
2
222-22-2222
1
222-22-2222
2
We could also have used other fields as long as the field uniquely defines the
difference between each record within the group.
For example if we had an date_updated field in the table and the
date_updated was different for each duplicate record with in the group then we
may have wanted to sort by the date_updated field in descending order:
SELECT
ssn, RANK() OVER(PARTITION BY ssn
`
ORDER BY date_updated
DESC) AS rank
FROM
personal
Now that we have uniquely defined each duplicate by a combination of a unique
identifier field and a ranking with a group we can select and delete all but the
first ranking record with the grouping of duplicates.
We use the above query as a sub-query to define a new table of data and
select a list of id’s that do not include the first ranking value within each
group:
Example 1:
DELETE FROM
personal
WHERE
id IN
(SELECT
t.id
FROM
(SELECT
id
,RANK() OVER(PARTITION BY ssn
ORDER BY
date_updated DESC) AS
rank
FROM
personal) AS t
WHERE
rank <> 1)
Example 2:
DELETE FROM
personal
WHERE
id
IN
(SELECT
id
FROM
(SELECT
id
,RANK() OVER (PARTITION BY first_name,
middle_init,
last_name, address1, address2, city, state,
zip, phone ORDER BY id) AS rank
FROM
personal) AS t
WHERE
rank <> 1)
Now let’s look at the scenario where we don’t have a unique identifier for each
row in the table. In this case we
can’t easily determine which records to keep and which records to delete.
Our only choice is to delete all of the duplicated records and then
re-insert a single copy of each.
So our first step this time is to uniquely identify each record so that we can
store a single copy of each row in each group of duplicates.
For this we can use the ROW_NUMBER function.
The ROW_NUMBER function creates an id number for each record.
SELECT
*, ROW_NUMBER() OVER(ORDER BY ssn)
AS row_id FROM
personal
Now that we have a unique row number we can rank the duplicates the same as
before. Keep in mind that the new
row number field isn’t real, so we can’t use it to delete specific rows.
SELECT
t.*
,RANK()
OVER(PARTITION BY t.ssn
ORDER BY t.row_id) AS rank
FROM
(SELECT
*, ROW_NUMBER() OVER(ORDER BY ssn)
AS row_id
FROM personal)
AS t
We can then insert the first ranking values of the duplicates into a temporary
table, delete all of the duplicate records and then reinsert the single records
back in.
Example 1:
SELECT
ssn, field2,
field3, field4, field5
INTO
#temptable
FROM
(SELECT
t1.*
,RANK() OVER(PARTITION BY t1.ssn
ORDER BY t1.row_id)
AS
rank
FROM
(SELECT *, ROW_NUMBER()
OVER(ORDER BY ssn) AS row_id
FROM personal) AS
t1
)
AS t2
WHERE
t2.rank = 1
DELETE FROM
personal
WHERE
ssn IN
(SELECT ssn
FROM personal
GROUP
BY ssn
HAVING count(*)
> 1)
INSERT INTO
personal
SELECT
* FROM #temptable
Example 2:
SELECT
t2.first_name, t2.middle_init, t2.last_name, t2.address1
,t2.address2,
t2.city, t2.state, t2.zip, t2.phone
INTO
#temptable
FROM
(SELECT
t1.first_name, t1.middle_init, t1.last_name
,t1.address1, t1.address2, t1.city, t1.state, t1.zip
,t1.phone, RANK() OVER(PARTITION BY t1.ssn
ORDER BY
t1.row_id) AS rank
FROM
(SELECT *, ROW_NUMBER()
OVER(ORDER BY
last_name) AS row_id
FROM personal) AS
t1
)
AS t2
WHERE
t2.rank = 1
DELETE FROM
personal
FROM
personal
GROUP BY
first_name, middle_init, last_name,
address1, address2, city, state, zip, phone
HAVING
count(*)>
1
INSERT INTO
personal
SELECT
* FROM #temptable
As you can see the process is much easier if there is already a unique identity
field within the table. This
is one reason why it is recommended that all tables have a new integer id field
added to them that is an identity field.
Query on and avoid duplicity.
Go to Top |
Return to Newsletter Contents
|