PROC COMPARE Tips and Techniques in SDTM Programming: A Comprehensive Guide with Examples
PROC COMPARE
is a powerful and versatile procedure in SAS that is extensively used in SDTM (Study Data Tabulation Model) programming for validating and verifying datasets. It allows you to compare two datasets to identify differences, ensuring data consistency, integrity, and accuracy. This expanded report provides detailed tips, techniques, and advanced strategies for effectively using PROC COMPARE
in SDTM programming, along with practical examples.
1. Basic Usage of PROC COMPARE
At its core, PROC COMPARE
compares two datasets—referred to as the "base" and "compare" datasets—to highlight differences. This is particularly useful in SDTM programming when verifying that derived datasets match the original data or when comparing outputs from independent programming.
/* Basic example of PROC COMPARE */
proc compare base=sdtm.dm compare=qc.dm;
run;
In this example, the SDTM domain dataset dm
is compared with a QC (Quality Control) dataset to identify any discrepancies. This straightforward use case highlights any differences in data content between the two datasets.
2. Key Options for PROC COMPARE
PROC COMPARE
offers various options to customize the comparison process, making it more suitable for specific tasks in SDTM programming. Below are some of the most useful options with detailed explanations and examples.
2.1. ID Statement
The ID
statement is crucial in SDTM datasets, where each record typically represents a specific subject or event. This statement specifies key variables that uniquely identify observations in the datasets being compared.
/* Using ID statement */
proc compare base=sdtm.dm compare=qc.dm;
id usubjid;
run;
In this example, the unique subject identifier usubjid
is specified as the key variable, ensuring that records are compared correctly between the base and compare datasets. This helps in cases where the datasets may not be sorted identically but need to be compared observation by observation based on the unique identifier.
2.2. VAR Statement
The VAR
statement allows you to specify which variables to compare. This is particularly useful when you only need to validate specific variables within a dataset, reducing unnecessary output and focusing the comparison on critical variables.
/* Using VAR statement */
proc compare base=sdtm.lb compare=qc.lb;
id usubjid visitnum lbtestcd;
var lbtest lbcat lbreslb;
run;
In this example, only the variables lbtest
, lbcat
, and lbreslb
are compared between the two datasets. This targeted approach helps ensure that only the variables of interest are scrutinized, which is particularly useful in large datasets where comparing every variable might be unnecessary or overwhelming.
2.3. CRITERION Option
The CRITERION
option specifies the tolerance level for numerical comparisons. This is particularly useful when dealing with floating-point numbers, where minor differences might occur due to rounding. Setting an appropriate criterion ensures that small, insignificant differences do not trigger false discrepancies.
/* Using CRITERION option for tolerance */
proc compare base=sdtm.vs compare=qc.vs criterion=0.01;
id usubjid visitnum vstestcd;
run;
Here, a criterion of 0.01 is set, meaning that differences smaller than 0.01 between the base and compare datasets are ignored. This is particularly important in clinical data where small numerical differences may arise from different levels of precision or rounding methods in different software or programming environments.
2.4. LISTALL Option
The LISTALL
option generates a detailed report of all differences, including variable names, values, and observation numbers. This comprehensive output is useful for a thorough review of discrepancies, especially when you need to document or understand all the differences between datasets.
/* Using LISTALL option */
proc compare base=sdtm.ae compare=qc.ae listall;
id usubjid aeseq;
run;
This example generates a detailed list of all differences found in the ae
dataset, helping you pinpoint specific issues. The detailed nature of this output is particularly useful in audits or when you need to present a comprehensive comparison report to stakeholders.
2.5. OUT= and OUTBASE/OUTCOMP/OUTDIFF Options
These options allow you to output the results of the comparison to a dataset. The OUT=
option specifies the output dataset, while OUTBASE
, OUTCOMP
, and OUTDIFF
control which observations are included in the output dataset.
/* Outputting comparison results to a dataset */
proc compare base=sdtm.ex compare=qc.ex out=comp_results outbase outcomp outdiff noprint;
id usubjid exseq;
run;
This example outputs the comparison results to a dataset named comp_results
, including all base and compare observations, as well as those that differ. The noprint
option suppresses the printed output in the log, focusing the results on the output dataset, which can then be reviewed or used in further processing.
2.6. ALLOBS Option
The ALLOBS
option forces PROC COMPARE
to compare all observations, even if the datasets are of different lengths. This can be useful when you want to ensure that every record in one dataset is accounted for in the other.
/* Using ALLOBS option */
proc compare base=sdtm.vs compare=qc.vs allobs;
id usubjid visitnum vstestcd;
run;
This option is particularly useful when you suspect that there might be missing or extra records in one of the datasets, ensuring that the comparison is exhaustive.
2.7. NOVALUES Option
If you’re only interested in whether differences exist without needing to see the actual differing values, the NOVALUES
option can be used to suppress the detailed listing of differences.
/* Using NOVALUES option */
proc compare base=sdtm.dm compare=qc.dm novalues;
id usubjid;
run;
This example will report if differences are found but will not list the specific values that differ. This is useful for a quick check when detailed information about differences is not necessary.
3. Handling Common Issues in PROC COMPARE
While PROC COMPARE
is powerful, certain issues can arise during comparisons. Below are some common problems and strategies to address them effectively.
3.1. Dealing with Missing Values
Missing values can complicate comparisons, as PROC COMPARE
may treat them differently depending on the context. The COMPARE
and WITH
statements can help manage these situations by allowing you to explicitly compare specific variables, even when some may be missing.
/* Handling missing values */
proc compare base=sdtm.cm compare=qc.cm;
id usubjid cmseq;
with cmdose cmtrt;
run;
This example compares the variables cmdose
and cmtrt
between the base and compare datasets, even if some of these variables might have missing values. This targeted approach ensures that missing data does not result in an incomplete comparison.
3.2. Comparing Datasets with Different Structures
If the datasets have different structures (e.g., different variable names or lengths), use the NOMISSING
or NOSOURCE
options to focus the comparison on the variables that are common to both datasets. This ensures that the comparison is relevant and does not get bogged down by structural differences that are not meaningful to the comparison task.
/* Comparing datasets with different structures */
proc compare base=sdtm.suppae compare=qc.suppae nosource;
id usubjid qnam;
run;
The NOSOURCE
option excludes variables that are only present in one of the datasets, allowing for a focused comparison. This is particularly useful when datasets have evolved over time or when you’re comparing datasets from different sources or stages of data processing.
3.3. Comparing Character Variables with Different Lengths
When comparing character variables, differences in variable lengths can cause issues. Use the MAXPRINT=
option to limit the number of differing values printed in the output, especially when dealing with large text fields.
/* Comparing character variables with different lengths */
proc compare base=sdtm.cm compare=qc.cm maxprint=50;
id usubjid cmseq;
var cmtrt;
run;
This example limits the output to the first 50 differences found, which can be helpful when comparing datasets with many observations and potential differences in character variables.
4. Advanced Techniques for PROC COMPARE
For more complex comparisons or when working with large datasets, advanced techniques can enhance the functionality of PROC COMPARE
. Below are several strategies to take your comparisons to the next level.
4.1. Using Formats for Customized Comparisons
Applying formats to variables before comparison can help standardize the data and make the comparison more meaningful, especially for categorical variables. This approach is particularly useful in SDTM programming, where data may come from different sources or be transformed in different ways.
/* Using formats for comparison */
proc format;
value sexfmt
1 = 'Male'
2 = 'Female';
run;
proc compare base=sdtm.dm compare=qc.dm;
id usubjid;
var sex;
format sex sexfmt.;
run;
This example applies a custom format to the sex
variable, ensuring that the comparison is done on standardized values. This technique is useful when datasets may use different coding schemes or when you want to compare datasets in a more human-readable format.
4.2. Limiting the Comparison to Key Observations
If you're only interested in comparing key observations (e.g., critical time points or specific subjects), you can subset the data before running PROC COMPARE
. This focused approach ensures that the comparison is relevant and not cluttered by extraneous data.
/* Subsetting data before comparison */
data sdtm_sub;
set sdtm.vs;
where visitnum in (1, 4, 8); /* Only key visits */
run;
proc compare base=sdtm_sub compare=qc.vs;
id usubjid visitnum vstestcd;
run;
This example subsets the vs
dataset to only include key visits before performing the comparison, focusing on the most important data points. This technique is particularly useful when you are only interested in certain aspects of the data, such as baseline or endpoint visits.
4.3. Automating Comparisons with Macros
You can automate the comparison process across multiple datasets using SAS macros, making it easier to perform routine checks across an entire SDTM package. This approach is especially useful in large clinical trials where many datasets need to be compared on a regular basis.
%macro compare_datasets(base=, compare=, idvars=);
proc compare base=&base compare=&compare;
id &idvars;
run;
%mend compare_datasets;
%compare_datasets(base=sdtm.dm, compare=qc.dm, idvars=usubjid);
%compare_datasets(base=sdtm.lb, compare=qc.lb, idvars=usubjid lbtestcd visitnum);
%compare_datasets(base=sdtm.ae, compare=qc.ae, idvars=usubjid aeseq);
This macro simplifies the process of comparing multiple SDTM datasets, ensuring consistency across your validation efforts. By automating these comparisons, you can save time and reduce the risk of human error in routine QC processes.
4.4. Using BY-Group Comparisons
BY-group processing in PROC COMPARE
allows you to compare subsets of your data independently. This is particularly useful in SDTM datasets where comparisons need to be made within specific groups, such as treatment arms or study phases.
/* Using BY-group comparisons */
proc compare base=sdtm.lb compare=qc.lb;
by usubjid visitnum;
id lbtestcd;
run;
In this example, the comparison is done within each combination of usubjid
and visitnum
, allowing for a more granular comparison that considers the structure of the data. This is especially useful when the same variables may have different expected values in different groups, such as treatment vs. control groups.
4.5. Comparing Large Datasets Efficiently
When working with large datasets, PROC COMPARE
can become resource-intensive. Using the MAXPRINT=
option to limit output, and the OUTNOEQUAL
option to exclude matching records from the output, can help manage performance.
/* Efficiently comparing large datasets */
proc compare base=sdtm.lb compare=qc.lb maxprint=(50,10) outnoequal;
id usubjid visitnum lbtestcd;
run;
In this example, the output is limited to the first 50 differences and 10 variable differences per observation. The OUTNOEQUAL
option ensures that only records with differences are included in the output, reducing the amount of data that needs to be processed and reviewed.
5. Interpreting PROC COMPARE Output
Understanding the output from PROC COMPARE
is crucial for identifying and addressing discrepancies. The key sections of the output include:
- Summary Section: Provides a quick overview of the comparison, including the number of matching and differing observations and variables.
- Observation Comparison: Details specific differences at the observation level, including values from both the base and compare datasets.
- Variable Comparison: Lists variables that differ between the datasets, including discrepancies in attributes like type and length.
/* Example output interpretation */
proc compare base=sdtm.dm compare=qc.dm;
id usubjid;
run;
/* Sample output:
Comparison Results for Data Set Comparison
Summary of Results:
-------------------
Variables compared: 10
Observations compared: 100
Observations with discrepancies: 2
Observation Comparison:
------------------------
usubjid: 101
Base: Age=35, Compare: Age=36
usubjid: 102
Base: Sex=Male, Compare: Sex=Female
*/
The summary section gives you a quick snapshot of the comparison, helping you quickly assess whether the datasets are largely in agreement or if significant differences exist. The detailed observation comparison highlights exactly where the datasets diverge, making it easier to pinpoint and correct errors.
5.1. Using Output Datasets for Further Analysis
Sometimes, the comparison output needs to be analyzed further or shared with others. By directing the output to a dataset using the OUT=
option, you can perform additional analyses, create custom reports, or easily integrate the results into other processes.
/* Creating an output dataset for further analysis */
proc compare base=sdtm.vs compare=qc.vs out=comp_results outbase outcomp outdiff;
id usubjid visitnum vstestcd;
run;
/* Further analysis on the comparison results */
proc freq data=comp_results;
tables _type_;
run;
In this example, the comparison results are saved to a dataset, and a frequency analysis is performed on the comparison type. This can help identify common issues, such as frequent mismatches in certain variables or patterns in the differences.
5.2. Generating Reports from Comparison Results
For documentation or auditing purposes, you may need to generate reports from the comparison results. By using ODS (Output Delivery System) along with PROC COMPARE
, you can create professional-looking reports in various formats.
/* Generating a PDF report of comparison results */
ods pdf file="comparison_report.pdf";
proc compare base=sdtm.ae compare=qc.ae listall;
id usubjid aeseq;
run;
ods pdf close;
This example generates a PDF report that documents all differences found in the comparison, making it easy to share and review with team members or auditors.
6. Practical Applications of PROC COMPARE in SDTM Programming
In SDTM programming, PROC COMPARE
can be used for a variety of practical applications beyond simple dataset comparison. Here are some advanced use cases:
6.1. Validating Derived Variables
Use PROC COMPARE
to ensure that derived variables in SDTM datasets are calculated correctly by comparing them to independently calculated values.
/* Validating derived variables */
proc compare base=sdtm.vs compare=derived.vs;
id usubjid visitnum vstestcd;
var vsorres vsstresn;
run;
In this example, the original SDTM dataset is compared with a derived dataset to validate the derivation of variables such as vsstresn
(numeric standard result).
6.2. Cross-Checking SDTM Domains
Ensure consistency across related SDTM domains by comparing them with relevant metadata or reference datasets.
/* Cross-checking SDTM domains */
proc compare base=sdtm.ex compare=sdtm.dm;
id usubjid;
var exstdtc exendtc;
run;
This comparison checks that the exposure start and end dates in the EX domain align with the demographics data in the DM domain.
6.3. QC of SDTM Mapping
During SDTM mapping, compare the mapped datasets with the original source datasets to ensure that all required transformations were applied correctly.
/* QC of SDTM mapping */
proc compare base=raw_data compare=sdtm.mapped_data;
id usubjid visitnum;
run;
This example compares the raw source data with the mapped SDTM dataset, ensuring that the mapping process has been performed correctly.
Conclusion
PROC COMPARE
is an indispensable tool in SDTM programming, providing robust capabilities for data validation, quality control, and dataset comparison. By mastering the tips, techniques, and advanced strategies outlined in this report, you can ensure that your SDTM datasets are accurate, consistent, and ready for submission or further analysis. Whether you're dealing with simple comparisons or complex dataset validations, PROC COMPARE
offers the flexibility and power needed to meet the stringent requirements of clinical data management.