How to merge duplicate records
Once you’ve identified the duplicate records and figured out which surviving records to keep, the last part of your deduplication logic is to merge the non-surviving records duplicate records into the surviving record. In some cases, you may want to simply discard or remove the non-surviving duplicate records. That simple scenario requires no further discussion.
First Establish a Default Logic, Then Exceptions
Chances are you have more than just a few data fields in the data you are looking to merge, perhaps even hundreds, and we have seen thousands. In order to scale, you should first establish a default merge logic that will be applied to all data fields. Once you have a default logic, then you can define exceptions for specific data fields. The most common default logis is “fill if empty”. We will discuss the various merge logics next.
Merge Logics
Fill If Empty
This is the most common merge logic, thus the most popular default merge logic. This logic says if any data field in the surviving record is empty, then attempt to fill it with a non-empty value from one of the non-surviving records. You also need to provide additional logic on what sequence to sort through the non-surviving records. Here is an example of 3 records in a duplicate set, with the non-surviving records sorted with more recently update record on top. The merge logic is fill if empty using latest modified record.
Surviving Original John Doe jdoe@acme.com
Non-surviving 1 J. Doe jdoe@acme.com VP Marketing
Non-surviving 2 John M. Doe jdoe@acme.com Acme Inc. CMO
——————————————————————————————————————————
Surviving Merged John Doe jdoe@acme.com Acme Inc. VP Marketing
Always Replace
This is exactly the same logic as the one above, except it doesn’t require the surviving record data field to be empty. It applies the merge logic to all the records in the duplicate group, including the surviving record, pick the value that meets the requirement, then replace the value in the surviving record, empty or not. Common examples include:
- Always take contact information from the last modified record
- Always take lead source from the earliest created record
Here is an example of 3 records in a duplicate set sorted by latest modified date on top. The merge logic for email is to use the latest modified date. The merge logic for lead source is to use the earliest modified date. The default merge logic is fill if empty.
Non-surviving 1 J. Doe jdoe@acme.com Acme Inc. Webinar
Surviving Original John Doe jdoe@looney.com Dreamforce 16
Non-surviving 2 John M. Doe jdoe@tunes.com Tunes Corp. Free Trial
——————————————————————————————————————————
Surviving Merged John Doe jdoe@acme.com Acme Inc. Free Trial
Append
With most merge logic you are throwing away some data you believe is not as good as the ones you are keeping. In some cases, you want to keep them all. This is common with unstructured data like notes or multi-value categories and segmentation data. For these data fields, use the append logic. Here is the same example above, but instead of keeping only the earliest modified lead source, we want to append lead source.
Non-surviving 1 J. Doe jdoe@acme.com Webinar
Surviving Original John Doe jdoe@looney.com Dreamforce 16
Non-surviving 2 John M. Doe jdoe@tunes.com Free Trial
——————————————————————————————————————————
Surviving Merged John Doe jdoe@acme.com Webinar, Dreamforce 16, Free Trial
Base on a Formula
For numerical or binary data fields, it often makes sense to apply a mathematical formula, such as:
- Pick the maximum or minimum value
- Calculate a sum or an average value
- True if only all records are true
Here is the same example above with 2 numerical fields: behavior score and demographic score. The merge logic is to pick the highest demographic score, but sum the behavior score.
Behavior Demographic
Non-surviving 1 J. Doe jdoe@acme.com 15 100
Surviving Original John Doe jdoe@looney.com 10 10
Non-surviving 2 John M. Doe jdoe@tunes.com 50 50
——————————————————————————————————————————
Surviving Merged John Doe jdoe@acme.com 75 100
Do Not Merge
This one is simple, for some data fields you just do not want to merge.
Leave a comment