The Result Data Newsletter   
Volume 804 - April 2008   
© Copyright 2008 Result Data Consulting, Ltd.  614-505-0770  www.resultdata.com   

    Result Data Home Page  |  Newsletter Archive  |  Upcoming Events  |  Classes & Workshops  |  Request Information
Upcoming Events:  Mid Ohio BusinessObjects User Group: 5/20, Discover the Power of BusinessObjects XI 3.0:  5/20

Return to Newsletter Contents...

Finding and Removing Duplicate Records?

by: Charles Tournear, Sr. BI Consultant, CRCP, MCSE, MCT, MCDBA, MCSD

Suppose a report has been created on a SQL Server table and it is showing what appears to be duplicate data for several of the records.  This allows two possibilities.  Either the query for the report is not defined correctly or there are duplicate records in the table.

The first step then, is to determine whether duplicate data exists.  This can be done with a simple query using a group by clause.  We need to first determine which column or columns indicate that duplicate data exists.  For example if the Social Security Number field is not supposed to be duplicated in a table of personal, then this may be the only field we need to check for duplicates.  For other tables we may need to use all the columns in the table to see true duplicates.

The easiest way to find duplicates is to use the count(*) aggregate function in association with the group by clause.  For the SSN scenario above the code would look like:

SELECT ssn, count(*)

FROM personal

GROUP BY ssn

HAVING count(*) > 1

For the scenario using all fields:

SELECT first_name, middle_init, last_name,

     address1, address2, city, state, zip, phone, count(*)

FROM personal

GROUP BY first_name, middle_init, last_name,

     address1, address2, city, state, zip, phone

HAVING count(*)> 1

Note the Having clause in each query.  Those records that have duplicates in all fields listed on the group by line will have a count(*) greater than 1.  This code then lets us know whether or not duplicates do indeed exist or not.  If no rows are returned then there are no duplicates based on the columns that we have chosen to group by.

If duplicates are found then the next step is to somehow select all but one of each record.  The choice for how to do this will depend on whether unique identity columns exist within the table or not.

Let’s first look at the example where the table has a unique identity column. In this case it’s a simple id field of integer type that has an identity (1,1) starting with 1 and incrementing by 1.  This guarantees that even with duplicate records there is one field that uniquely defines a specific row in the table.  We need to first be able to uniquely identify each row within a group of duplicate records.  In T-SQL we can use the RANK function to give a number for each record in the group of duplicates.

SELECT  ssn, RANK() OVER(PARTITION BY ssn ORDER BY id) AS rank

FROM personal

This provides a list of rows ranking each group of duplicates.  The RANK function automatically numbers rows 1,2,3,4, etc. with a group.  The OVER clause can be used to determine the sort order that the RANK function will use to number the records.  It can also be used to determine the groupings through the PARTITION BY clause inside the OVER.  The fields after PARTITION BY need to be all the fields that were used to determine the duplicates.  In other words all the fields that were used in the GROUP BY clause to find the duplicates in the first place.  The ORDER BY determines the sort order of the records within the group. 

So why didn’t I sort or ORDER BY the ssn field?  The RANK function will supply the same rank value for rows that have the same value in the ORDER BY clause.

For example if I had ordered by ssn field the output would look like.

111-11-1111                                    1

111-11-1111                  1

222-22-2222                  1

222-22-2222                  1

By using the unique ID field the output becomes:

111-11-1111                  1

111-11-1111                  2

222-22-2222                  1

222-22-2222                  2

We could also have used other fields as long as the field uniquely defines the difference between each record within the group.  For example if we had an date_updated field in the table and the date_updated was different for each duplicate record with in the group then we may have wanted to sort by the date_updated field in descending order:

SELECT ssn, RANK() OVER(PARTITION BY ssn

`     ORDER BY date_updated DESC) AS rank

FROM personal

Now that we have uniquely defined each duplicate by a combination of a unique identifier field and a ranking with a group we can select and delete all but the first ranking record with the grouping of duplicates.  We use the above query as a sub-query to define a new table of data and select a list of id’s that do not include the first ranking value within each group:

Example 1:

DELETE FROM personal

WHERE id IN

(SELECT

     t.id

 FROM

     (SELECT

          id

          ,RANK() OVER(PARTITION BY ssn

               ORDER BY date_updated DESC) AS rank

      FROM personal) AS t

 WHERE rank <> 1)

Example 2:

DELETE FROM personal

WHERE id IN

(SELECT id

 FROM

     (SELECT

          id

          ,RANK() OVER (PARTITION BY first_name, middle_init,

               last_name, address1, address2, city, state,

                    zip, phone ORDER BY id) AS rank

          FROM personal) AS t

 WHERE rank <> 1)

Now let’s look at the scenario where we don’t have a unique identifier for each row in the table.  In this case we can’t easily determine which records to keep and which records to delete.  Our only choice is to delete all of the duplicated records and then re-insert a single copy of each.

So our first step this time is to uniquely identify each record so that we can store a single copy of each row in each group of duplicates.  For this we can use the ROW_NUMBER function.  The ROW_NUMBER function creates an id number for each record.

SELECT *, ROW_NUMBER() OVER(ORDER BY ssn) AS row_id FROM personal

Now that we have a unique row number we can rank the duplicates the same as before.  Keep in mind that the new row number field isn’t real, so we can’t use it to delete specific rows.

SELECT

     t.*

     ,RANK() OVER(PARTITION BY t.ssn ORDER BY t.row_id) AS rank

FROM

     (SELECT *, ROW_NUMBER() OVER(ORDER BY ssn) AS row_id

     FROM personal) AS t

We can then insert the first ranking values of the duplicates into a temporary table, delete all of the duplicate records and then reinsert the single records back in.

Example 1:

SELECT

      ssn, field2, field3, field4, field5

INTO #temptable

FROM

     (SELECT

          t1.*

          ,RANK() OVER(PARTITION BY t1.ssn ORDER BY t1.row_id)

               AS rank

     FROM

          (SELECT *, ROW_NUMBER() OVER(ORDER BY ssn) AS row_id

          FROM personal) AS t1

     ) AS t2

WHERE t2.rank = 1

 

DELETE FROM personal

WHERE ssn IN

     (SELECT ssn

     FROM personal

     GROUP BY ssn

HAVING count(*) > 1)

INSERT INTO personal

SELECT * FROM #temptable

Example 2:

SELECT

     t2.first_name, t2.middle_init, t2.last_name, t2.address1

     ,t2.address2, t2.city, t2.state, t2.zip, t2.phone

INTO #temptable

FROM

     (SELECT

          t1.first_name, t1.middle_init, t1.last_name

          ,t1.address1, t1.address2, t1.city, t1.state, t1.zip

          ,t1.phone, RANK() OVER(PARTITION BY t1.ssn

               ORDER BY t1.row_id) AS rank

     FROM

          (SELECT *, ROW_NUMBER()

               OVER(ORDER BY last_name) AS row_id

          FROM personal) AS t1

      ) AS t2

WHERE t2.rank = 1

DELETE FROM personal

FROM personal

GROUP BY first_name, middle_init, last_name,

address1, address2, city, state, zip, phone

HAVING count(*)> 1

INSERT INTO personal

SELECT * FROM #temptable

As you can see the process is much easier if there is already a unique identity field within the table.   This is one reason why it is recommended that all tables have a new integer id field added to them that is an identity field.  Query on and avoid duplicity.

Go to Top  |  Return to Newsletter Contents

The Result Data Newsletter is published approximately once a month to share the latest information on business intelligence, data management and CRM. There should be a link below to allow you to change or remove yourself from our list. We take your requests very seriously. If you have any difficulty please contact us at 614-505-0770 and we will make sure that your request is handled properly. This is not intended to be an unsolicited message and you can reach us in person if needed.

© Copyright 2007 Result Data Consulting, Ltd. - All Rights Reserved
All trademarks and copyrights are the property of their respective owners. This information is provided without warranty.
Announcements
Save Big $$ On Gas This Spring
Schedule and attend any one of our qualifying public training classes from April through June 30th to receive up to $385 in gas cards OR  receive 10% off the normal class price.  Please contact our office for further details on promotion amounts and qualifying classes.   Restrictions apply and you must mention the promotional code Gas08 at the time of registration to receive the promotion.  Call 614-505-0770 for further details and restrictions.
Columbus Roadshow - Discover the power of BusinessObjects XI 3.0
See how you could put complete, trustworthy business intelligence into the hands of all of your people wherever their location.  This event will be held on May 20, 2008 at the Fawcett Center.  Call 614-505-0770 or click here for more information and to reserve your seat.
Mid Ohio BusinessObjects User Group
Join us for the Spring MOBOUG User Group meeting on May 20th. Call 614-505-0770 or click here for more information and to reserve your seat.
Looking for a Few Good Men and Women
Join our award winning team of Business Intelligence consultants and .Net Software developers.
Send your resume and salary requirements to:
jobs@resultdata.com