Understanding the Duplicate Records Validation Rules

July 7, 2020

Our recent webinar, Confusing Validation Rules Explained, generated lots of follow-up questions from you. We are addressing those questions in a series of posts. In this edition, we clarify the meaning and purpose of the duplicate records validation rules, then answer some frequently asked questions about duplicate records.

The Purpose of the Duplicate Records Validation Rules

There are three validation rules that report duplicate records in tabulation data:

  • SD1117 for Findings domains
  • SD1201 for Events domains
  • SD1352 for Interventions domains

The purpose of the duplicate records validation rules is to identify multiple observations, events, or interventions collected for the same subject at the exact same time. It is important to check for these situations because they can produce unanticipated results during analysis and can complicate review by causing issues with FDA study data review tools.

“The Records Aren't Technically Duplicate Because the Results Are Different”

This is a common response to duplicate records findings. It is a misunderstanding to read the “Duplicate Records” issue message as meaning that the records are exact duplicates. That is not the intention of the validation rule. If your study collects the same test more than once at the exact same time, you should explain why. The result variable is deliberately left out of the rule's key variables: the goal is to flag these unexpected situations, whatever the results are, so that they can be explained in the Reviewers Guide.
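To make the point concrete, here is a minimal sketch of a check of this kind. It is not the actual rule implementation, and the key list `ILLUSTRATIVE_KEYS`, the function name, and the sample records are assumptions made up for illustration. What it shows is that two records with different results are still flagged, because the result variable is not part of the key.

```python
from collections import defaultdict

# Illustrative key set only (an assumption); the real rule uses its own, broader keys.
ILLUSTRATIVE_KEYS = ("USUBJID", "LBTESTCD", "LBSPEC", "LBDTC")


def find_same_time_groups(records, keys=ILLUSTRATIVE_KEYS):
    """Group records by the key variables and keep groups with more than one record."""
    groups = defaultdict(list)
    for rec in records:
        # The result variable (LBORRES) is deliberately not part of the key.
        groups[tuple(rec.get(k, "") for k in keys)].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Made-up sample data: records 1 and 2 share subject, test, specimen, and time.
lb = [
    {"USUBJID": "S1-001", "LBSEQ": 1, "LBTESTCD": "GLUC", "LBSPEC": "BLOOD",
     "LBDTC": "2020-01-15T08:00", "LBORRES": "5.1"},
    {"USUBJID": "S1-001", "LBSEQ": 2, "LBTESTCD": "GLUC", "LBSPEC": "BLOOD",
     "LBDTC": "2020-01-15T08:00", "LBORRES": "5.4"},
    {"USUBJID": "S1-001", "LBSEQ": 3, "LBTESTCD": "GLUC", "LBSPEC": "BLOOD",
     "LBDTC": "2020-01-15T12:00", "LBORRES": "5.0"},
]

flagged = find_same_time_groups(lb)
for key, recs in flagged.items():
    print(key, [r["LBSEQ"] for r in recs])
```

Records 1 and 2 are reported together even though their results differ, which is exactly the situation the Reviewers Guide should explain.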

Key Variables Used by the Validation Rules

In general, the key variables used by the validation rules are a set of meaningful, common, industry-wide keys. The duplicate records rules have been revised several times to include additional key variables, such as new location-related or timing variables. The current keys now appear to be sufficient, and false-positive validation messages have been reduced. As new versions of the SDTMIG are released, the keys will be updated as needed for new variables.

Sponsor Keys vs Keys Used by the Rules

Many people ask why we don't use the sponsor's key variables listed in the define.xml. This is a good idea. In fact, we plan to introduce a new validation rule that checks the sponsor's data against the key variables listed for each domain in the define.xml. However, even when that new rule is in place, we will still need the duplicate records validation rules as they are now. As previously noted, the rules are not looking for exact duplicate records. They identify multiple observations, events, and interventions collected at the exact same time. Also, unfortunately, the list of key variables is still, in many cases, the most overlooked or intentionally invalid information in define.xml files. For example, some users incorrectly use --SEQ as a key variable.
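The following sketch illustrates why --SEQ is a poor key variable. It is a simplified uniqueness check written for this post, not the planned rule; the function name `is_unique` and the sample records are hypothetical. Because --SEQ is unique within a subject by construction, any key list that contains it passes trivially and hides the same-time records.

```python
def is_unique(records, keys):
    """Return True when no two records share the same values for the key variables."""
    seen = set()
    for rec in records:
        key = tuple(rec.get(k, "") for k in keys)
        if key in seen:
            return False
        seen.add(key)
    return True


# Made-up sample data: two systolic blood pressure records at the same date/time.
vs = [
    {"USUBJID": "S1-001", "VSSEQ": 1, "VSTESTCD": "SYSBP",
     "VSDTC": "2020-02-01T09:30", "VSORRES": "120"},
    {"USUBJID": "S1-001", "VSSEQ": 2, "VSTESTCD": "SYSBP",
     "VSDTC": "2020-02-01T09:30", "VSORRES": "124"},
]

# Keys built on --SEQ always look unique, so they say nothing about the data.
print(is_unique(vs, ["USUBJID", "VSSEQ"]))
# Natural keys expose the two collections at the same time.
print(is_unique(vs, ["USUBJID", "VSTESTCD", "VSDTC"]))
```

The first check passes and the second fails, which is why a define.xml key list built on --SEQ cannot replace the duplicate records rules.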

Sponsor-Defined Variables and Supplemental Qualifier Variables as Keys

The Reviewers Guide should include an explanation whenever sponsor-defined variables or supplemental qualifier variables are needed to make records unique. You should also list these variables as key variables for the domain in the define.xml. Yes, SUPPQUAL variables can be used as key variables in your define.xml. Pinnacle 21's Sergiy Sirichenko wrote a paper on the subject, titled “How to use SUPPQUAL for specifying natural key variables in define.xml?”… be sure to check it out.
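As a rough illustration of how a supplemental qualifier can restore uniqueness, the sketch below merges SUPP-- records onto their parent records through IDVAR and IDVARVAL and then repeats a uniqueness check. The qualifier name `LBREPNUM`, the helper functions, and the data are hypothetical, and the merge is simplified (it assumes IDVAR is populated and ignores STUDYID and RDOMAIN).

```python
def merge_supp(parent, supp):
    """Copy each QNAM/QVAL pair onto the parent record identified by IDVAR/IDVARVAL."""
    merged = [dict(rec) for rec in parent]  # leave the input records untouched
    for q in supp:
        for rec in merged:
            same_subject = rec["USUBJID"] == q["USUBJID"]
            # IDVARVAL is a character variable, so compare as text.
            same_record = str(rec.get(q["IDVAR"])) == q["IDVARVAL"]
            if same_subject and same_record:
                rec[q["QNAM"]] = q["QVAL"]
    return merged


def is_unique(records, keys):
    """Return True when no two records share the same values for the key variables."""
    combos = [tuple(rec.get(k, "") for k in keys) for rec in records]
    return len(combos) == len(set(combos))


# Made-up sample data: replicate samples taken at the same time.
lb = [
    {"USUBJID": "S1-001", "LBSEQ": 1, "LBTESTCD": "GLUC", "LBDTC": "2020-01-15T08:00"},
    {"USUBJID": "S1-001", "LBSEQ": 2, "LBTESTCD": "GLUC", "LBDTC": "2020-01-15T08:00"},
]
supplb = [
    {"USUBJID": "S1-001", "IDVAR": "LBSEQ", "IDVARVAL": "1", "QNAM": "LBREPNUM", "QVAL": "1"},
    {"USUBJID": "S1-001", "IDVAR": "LBSEQ", "IDVARVAL": "2", "QNAM": "LBREPNUM", "QVAL": "2"},
]

natural_keys = ["USUBJID", "LBTESTCD", "LBDTC"]
print(is_unique(lb, natural_keys))
print(is_unique(merge_supp(lb, supplb), natural_keys + ["LBREPNUM"]))
```

The parent records alone are not unique on the natural keys, while the merged records are once the hypothetical qualifier is added to the key list. That qualifier is the kind of variable to list as a key in the define.xml and to explain in the Reviewers Guide.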

What If I Have Actual Duplicate Records?

Often, a sponsor-defined variable can differentiate the records, although the Duplicate Records issue still exists whenever you have multiple observations for a subject at the same time. If the issue cannot be fixed for these records, you should provide a detailed explanation in the Reviewers Guide. Explain why the situation exists, whether any variables differentiate the records, and what those variables contain.

If you have actual duplicate records and there are no variables (--SEQ doesn't count!) to differentiate them, be sure to work with your data management group to identify why this situation exists. If the duplication cannot be corrected, include the explanation in the Reviewers Guide.  
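One way to tell actual duplicates apart from records that merely share a time point is to compare every variable except --SEQ. The sketch below is a hypothetical helper with made-up data, not part of any validation rule, and it assumes all variable values are simple strings or numbers.

```python
from collections import defaultdict


def exact_duplicate_groups(records, seq_var):
    """Return lists of --SEQ values whose records match on every other variable."""
    groups = defaultdict(list)
    for rec in records:
        # Fingerprint the record on all variables except the sequence number.
        fingerprint = tuple(sorted((k, v) for k, v in rec.items() if k != seq_var))
        groups[fingerprint].append(rec[seq_var])
    return [seqs for seqs in groups.values() if len(seqs) > 1]


# Made-up sample data: records 1 and 2 differ only in AESEQ.
ae = [
    {"USUBJID": "S1-001", "AESEQ": 1, "AETERM": "HEADACHE", "AESTDTC": "2020-03-02"},
    {"USUBJID": "S1-001", "AESEQ": 2, "AETERM": "HEADACHE", "AESTDTC": "2020-03-02"},
    {"USUBJID": "S1-001", "AESEQ": 3, "AETERM": "NAUSEA", "AESTDTC": "2020-03-02"},
]

print(exact_duplicate_groups(ae, "AESEQ"))
```

Groups reported by a check like this have nothing but --SEQ to tell them apart, so they are the ones to take to your data management group first.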

Planned Improvements to the Rules

We do plan to improve the way the Duplicate Records rules work. One planned change is to support domain-specific key variables in the validation rules, which should resolve situations where key variables are specific to a single domain. Examples include the ECG Test Results (EG) domain, where, according to the SDTMIG, multiple findings for the same test are allowed, and the Drug Accountability (DA) domain, where the Reference ID (DAREFID) variable and the Sponsor-Defined Identifier (DASPID) variable are used for capturing label information. This improvement will be introduced in a future release.
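The idea of domain-specific keys can be pictured as a shared base key list plus per-domain additions. The sketch below is only one possible shape for such a configuration; the names `BASE_FINDINGS_KEYS`, `DOMAIN_EXTRA_KEYS`, and `keys_for`, the base key list, and the sample data are all assumptions, and the design of the future release may differ.

```python
# Hypothetical base keys for a Findings domain; "{d}" stands for the domain prefix.
BASE_FINDINGS_KEYS = ("USUBJID", "{d}TESTCD", "{d}DTC")

# Hypothetical per-domain additions, using the DA variables mentioned above.
DOMAIN_EXTRA_KEYS = {"DA": ("DAREFID", "DASPID")}


def keys_for(domain):
    """Build the key list for a domain: base keys plus any domain-specific keys."""
    base = tuple(k.format(d=domain) for k in BASE_FINDINGS_KEYS)
    return base + DOMAIN_EXTRA_KEYS.get(domain, ())


def is_unique(records, keys):
    """Return True when no two records share the same values for the key variables."""
    combos = [tuple(rec.get(k, "") for k in keys) for rec in records]
    return len(combos) == len(set(combos))


# Made-up sample data: two kits dispensed to the same subject on the same day.
da = [
    {"USUBJID": "S1-001", "DASEQ": 1, "DATESTCD": "DISPAMT",
     "DADTC": "2020-04-10", "DAREFID": "KIT-001", "DASPID": ""},
    {"USUBJID": "S1-001", "DASEQ": 2, "DATESTCD": "DISPAMT",
     "DADTC": "2020-04-10", "DAREFID": "KIT-002", "DASPID": ""},
]

print(keys_for("DA"))
print(is_unique(da, ("USUBJID", "DATESTCD", "DADTC")))
print(is_unique(da, keys_for("DA")))
```

With the base keys alone the two kits look like repeats; with DAREFID in the key list they do not, which is the kind of false positive that domain-specific keys are meant to remove.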