Data deduplication
Administrators and data managers can create data maintenance subscriptions to identify duplicate HCP and HCO records in their Network instance. Data maintenance subscriptions are user-directed automated jobs that improve data quality by targeting specific data quality issues, such as sub-object inactivation and duplicate detection. Duplicate records can result from poor match rules, DCR processing errors, or records being added without first searching for existing records. Use data deduplication jobs to compare specific records in your Network instance against all other records in your instance. Data deduplication subscriptions are not supported for custom objects.
This feature is enabled by default in your Network instance.
Finding duplicate records
Similar to source subscriptions, data deduplication subscriptions use the following tools:
- Data groups - to narrow the number of records being compared
- Match rules - to determine if records are the same or not
Match logs contain a summary of the records that were merged or suspect matched during the job. Comparisons are done at the entity level. Existing merge rules are used for sub-objects when records are merged.
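As a rough illustration of how data groups and match rules work together, the following Python sketch groups records by a shared key and then applies a match rule only within each group. It is not how Network implements matching. The record fields, the `data_group_key` and `is_match` functions, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; the field names are illustrative, not Network's data model.
records = [
    {"vid": "1001", "last_name": "Smith", "postal_code": "10001", "npi": "111"},
    {"vid": "1002", "last_name": "Smith", "postal_code": "10001", "npi": "111"},
    {"vid": "1003", "last_name": "Smith", "postal_code": "10001", "npi": "222"},
    {"vid": "1004", "last_name": "Jones", "postal_code": "94105", "npi": "111"},
]


def data_group_key(record):
    """Data group (assumed): only records that share this key are compared."""
    return (record["last_name"].lower(), record["postal_code"])


def is_match(a, b):
    """Match rule (assumed): two records in the same group are the same entity."""
    return a["npi"] == b["npi"]


def find_duplicate_pairs(records):
    # Step 1: bucket records by data group to limit the number of comparisons.
    groups = defaultdict(list)
    for record in records:
        groups[data_group_key(record)].append(record)

    # Step 2: apply the match rule to every pair inside each bucket.
    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            if is_match(a, b):
                pairs.append((a["vid"], b["vid"]))
    return pairs


duplicate_pairs = find_duplicate_pairs(records)
print(duplicate_pairs)
```

In this sketch, record 1004 shares an identifier with 1001 and 1002 but is never compared to them because it falls in a different group. This is the trade-off of a narrow data group: fewer comparisons, but some possible duplicates are never evaluated.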
All previous unmerges that have occurred in the instance are tracked and can be excluded from comparisons when you create a data deduplication job. However, rejected suspect match tasks and duplicates are tracked only from the time the feature was implemented onward.
The jobs can be thoroughly tested before you commit to merging any records.
Considerations for data deduplication subscriptions
Before you create a subscription, review the following considerations:
- Match types - It is a best practice to begin using data deduplication with ASK matches only. After you have tested the job and are confident in your match rules, you can introduce ACT matches. An ASK match is an ambiguous match: a single incoming record matches multiple incoming or Network records without one clear "best" match, so it typically requires human review. An ACT match is a high confidence match between two records and results in a merge without any human review.
- Job size - Jobs can take some time to run, so prioritize your data and work with small, logical sets of data for each data deduplication job.
- Applicable records - Only active and valid customer-owned HCP and HCO records from the specified Network Entity IDs (VIDs) are compared. These records are compared against all of the records in the instance, which includes Veeva OpenData records, third-party mastered records, and other customer-owned records.
Valid records for a country where match rules are not defined are skipped because the data deduplication process cannot match those records.
The following record types are excluded from the comparison: invalid, under review (records that have not yet been validated by a data steward), externally mastered (Veeva OpenData or third party), and candidate records. Active and inactive sub-objects are considered in the match comparisons to help identify duplicate entities in your Network instance.
- Match against Veeva OpenData - Customer-owned records are compared against all records that exist in your Network instance. Records are not compared to records in the Veeva OpenData master instance.
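The applicable-records rules above can be read as a filter on the records that you specify for a job. The following Python sketch shows one way to express that filter. It is only an illustration: the record shape, the field names and values, and the `is_applicable` and `make_record` helpers are assumptions and do not reflect Network's data model or API.

```python
def is_applicable(record, countries_with_match_rules):
    """Return True if a specified record would be picked up for comparison.

    All field names and values here are hypothetical.
    """
    if record["entity_type"] not in {"HCP", "HCO"}:
        return False  # custom objects are not supported
    if record["owner"] != "customer":
        return False  # externally mastered (Veeva OpenData or third party)
    if record["is_candidate"]:
        return False  # candidate records are excluded
    if record["record_state"] != "VALID" or not record["is_active"]:
        return False  # invalid, under review, or inactive records are excluded
    # Records for a country without match rules are skipped.
    return record["country"] in countries_with_match_rules


def make_record(vid, **overrides):
    """Build a sample record that passes every check unless overridden."""
    record = {
        "vid": vid,
        "entity_type": "HCP",
        "owner": "customer",
        "is_candidate": False,
        "record_state": "VALID",
        "is_active": True,
        "country": "US",
    }
    record.update(overrides)
    return record


sample = [
    make_record("2001"),
    make_record("2002", owner="opendata"),
    make_record("2003", record_state="UNDER_REVIEW"),
    make_record("2004", country="FR"),
    make_record("2005", is_candidate=True),
    make_record("2006", is_active=False),
]

# Assume match rules are defined only for the US in this example.
applicable_vids = [r["vid"] for r in sample if is_applicable(r, {"US"})]
print(applicable_vids)
```

The filter applies only to the records selected for the job. The records they are compared against still include Veeva OpenData and third-party mastered records in the instance.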