
Harnessing the Power of CALL EXECUTE in SAS for Dynamic Code Execution

As SAS programmers, we often encounter situations where we need to execute a certain procedure or set of steps multiple times, typically based on different subsets of data. Manually writing out code for each instance can be time-consuming, but SAS offers a powerful tool to make this process more efficient: CALL EXECUTE.

What is CALL EXECUTE?

CALL EXECUTE is a SAS routine that allows you to dynamically generate and execute SAS code during a data step’s execution. Instead of hardcoding the logic for every individual case, CALL EXECUTE can generate the code on the fly and execute it as part of the same data step. This technique is invaluable when you have repetitive tasks across different datasets, procedures, or even report generation.

Basic Example: Dynamic PROC PRINT Execution

Let's say you have multiple datasets in the WORK library, and you want to run a PROC PRINT for each dataset. Instead of manually writing a PROC PRINT for each one, you can use CALL EXECUTE to automate this process:

proc sql noprint;
    /* Build one PROC PRINT statement per WORK dataset; CATS strips the blank padding from LIBNAME and MEMNAME */
    select cats('proc print data=', libname, '.', memname, '; run;')
        into :code_list separated by ' '
    from sashelp.vtable
    where libname='WORK';
quit;

data _null_;
    call execute("&code_list");
run;

This code does the following:

  • The PROC SQL step queries the SAS dictionary table sashelp.vtable to generate a list of all datasets in the WORK library. It concatenates each dataset name into a PROC PRINT statement and stores them in the macro variable code_list.
  • The CALL EXECUTE routine inside the DATA _NULL_ step dynamically executes each PROC PRINT statement, printing each dataset without manual intervention.
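
One practical caution with this approach: if the library contains many datasets, the generated statements can become unwieldy or even exceed the macro variable length limit. A minimal alternative sketch (same goal, same dictionary view) feeds CALL EXECUTE directly from a DATA step, one statement per dataset, with no macro variable at all:

data _null_;
    set sashelp.vtable(where=(libname='WORK'));
    /* Queue one PROC PRINT per dataset; the queued steps run after this DATA step finishes */
    call execute(cats('proc print data=', libname, '.', memname, '; run;'));
run;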

Benefits of Using CALL EXECUTE

The ability to dynamically generate and execute code gives you tremendous flexibility. Here are some key benefits:

  • Automated Task Execution: Use CALL EXECUTE to run procedures on multiple datasets, making automation easier in iterative tasks like generating reports.
  • Reduced Code Duplication: Eliminate the need to manually write out repetitive code, making your programs cleaner and more maintainable.
  • Increased Flexibility: Dynamically adjust the logic based on changing data or parameters without modifying the core program.

Advanced Example: Conditional Execution of Procedures

In some cases, you might want to execute different procedures based on the content of the data. Here's an example that loops through the columns of every dataset in the WORK library and runs PROC FREQ for each character variable and PROC MEANS for each numeric variable:

data _null_;
    set sashelp.vcolumn(where=(libname='WORK'));
    /* TYPE in SASHELP.VCOLUMN is either 'char' or 'num' */
    if type = 'char' then
        call execute('proc freq data=work.' || trim(memname) || '; tables ' || trim(name) || '; run;');
    else if type = 'num' then
        call execute('proc means data=work.' || trim(memname) || '; var ' || trim(name) || '; run;');
run;

In this code:

  • The sashelp.vcolumn table provides information about the columns in each dataset, including the variable type (character or numeric).
  • For each variable, CALL EXECUTE queues either a PROC FREQ step (character variables) or a PROC MEANS step (numeric variables); the queued steps run after the DATA step finishes.
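
A related nuance when you extend this pattern to macro calls: text pushed by CALL EXECUTE is resolved while the generating DATA step is still running, so an unprotected macro invocation executes too early. Wrapping the call in %NRSTR defers the macro until the queued code is actually submitted. A minimal sketch, assuming a hypothetical macro %report_ds that accepts a dataset name:

data _null_;
    set sashelp.vtable(where=(libname='WORK'));
    /* %nrstr delays execution of the (hypothetical) %report_ds macro
       until the generated code runs after this step */
    call execute('%nrstr(%report_ds(' || strip(memname) || '))');
run;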

Conclusion

Using CALL EXECUTE in SAS is an efficient way to dynamically generate and execute code, particularly in situations that involve repetitive tasks. Whether you’re working on large datasets or need to run different procedures conditionally, CALL EXECUTE can significantly simplify your workflow and reduce manual intervention. Mastering this tool will help make your SAS programming more efficient and flexible.

Have you used CALL EXECUTE in your SAS programs? Share your experiences in the comments below!

Finding Duplicate Records Across SAS Datasets in an Entire Library

Author: Sarath

Date: October 10, 2024

Introduction

In SAS programming, identifying and managing duplicate records within datasets is an essential part of data cleaning and quality control. However, when working with multiple datasets in an entire library, the task of finding duplicates becomes more complex. In this article, we will explore different ways to identify duplicate records in SAS datasets across an entire library using several approaches: PROC SORT, PROC SQL, and DATA STEP. Additionally, we will provide advanced techniques to streamline the process for multiple datasets in a library.

Why Identify Duplicates?

Duplicate records can cause significant issues in data analysis, leading to inaccurate results, inflated counts, or incorrect statistical conclusions. Detecting and addressing duplicates in your SAS datasets ensures data integrity and improves the reliability of your analysis. By finding and removing duplicates, you can avoid skewed reports and maintain high-quality datasets for regulatory or research purposes.

Methods to Find Duplicate Records in SAS

SAS offers several methods to identify duplicates. The most common approaches involve using PROC SORT with the NODUPKEY or NODUP options, PROC SQL with grouping and counting, and DATA STEP with conditional logic. Let's explore each approach and how they can be used for individual datasets as well as across an entire library.

1. Using PROC SORT

The PROC SORT procedure is one of the simplest and most common ways to identify duplicate records in a SAS dataset. You can use the NODUPKEY option to eliminate duplicates based on specific key variables. Here's an example:

proc sort data=mylib.dataset1 nodupkey;
    by key_variable1 key_variable2;
run;
        

This code removes duplicate records based on the values of key_variable1 and key_variable2 from the dataset dataset1, but it doesn't tell you which records were duplicates. To capture the discarded duplicates in a separate dataset for further analysis, add the DUPOUT= option alongside OUT=:

proc sort data=mylib.dataset1 out=dataset1_nodup dupout=duplicates_found nodupkey;
    by key_variable1 key_variable2;
run;

The dataset duplicates_found now contains every record that PROC SORT removed, while dataset1_nodup holds the de-duplicated data and the original dataset is left unchanged.

2. Using PROC SQL

PROC SQL is a powerful way to find duplicate records using SQL queries. You can group data by key variables and use the COUNT function to identify duplicates:

proc sql;
    create table duplicates as
    select key_variable1, key_variable2, count(*) as freq
    from mylib.dataset1
    group by key_variable1, key_variable2
    having freq > 1;
quit;
        

This code creates a table listing each combination of key_variable1 and key_variable2 that appears more than once in the dataset, along with its frequency. PROC SQL is versatile, and you can easily extend it to work across multiple datasets or libraries.
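
Because this query returns only the key combinations and their counts, you may also want the full duplicate records themselves. A minimal follow-up sketch, assuming the same dataset and key variables, joins the summary table back to the original data:

proc sql;
    create table duplicate_rows as
    select a.*
    from mylib.dataset1 as a
         inner join duplicates as b
            on  a.key_variable1 = b.key_variable1
            and a.key_variable2 = b.key_variable2;
quit;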

3. Using DATA STEP

The DATA STEP method offers more flexibility when identifying duplicates. After sorting the dataset by the key variables, you can use BY-group processing to create a separate dataset containing only the duplicates:

data duplicates_found;
    set mylib.dataset1; /* must already be sorted by the key variables */
    by key_variable1 key_variable2;
    /* A record that is both first and last in its key group is unique */
    if first.key_variable2 and last.key_variable2 then delete;
run;

This example identifies duplicates using the BY statement and the FIRST. and LAST. indicators of the last BY variable. If a record is both the first and the last occurrence of its key combination, it is unique and is deleted; every record that remains shares its key combination with at least one other record. Note that the input dataset must be sorted by the key variables before this step.

Advanced Techniques: Finding Duplicates in an Entire Library

When working with an entire library of datasets, manually checking each dataset for duplicates can be time-consuming. SAS provides efficient ways to automate this process, especially when dealing with a large number of datasets.

1. Using Macro Loops

A SAS macro can be used to loop through all datasets in a library and check for duplicates. Here's an example of how to automate this using PROC SQL and a macro:

%macro find_dups(lib=);
    /* Retrieve all dataset names in the library */
    proc sql noprint;
        select memname into :dslist separated by ' '
        from dictionary.tables
        where libname = upcase("&lib");
    quit;

    /* Loop through each dataset and find duplicates */
    %let count = 1;
    %let dsname = %scan(&dslist, &count);
    %do %while (&dsname ne );
        /* Assumes every dataset in &lib contains key_variable1 and key_variable2 */
        proc sql;
            create table duplicates_&dsname as
            select key_variable1, key_variable2, count(*) as freq
            from &lib..&dsname
            group by key_variable1, key_variable2
            having freq > 1;
        quit;

        %let count = %eval(&count + 1);
        %let dsname = %scan(&dslist, &count);
    %end;
%mend find_dups;

/* Call the macro to find duplicates in all datasets */
%find_dups(lib=mylib);
        

This macro loops through every dataset in the library mylib, checks for duplicates based on the hard-coded key variables key_variable1 and key_variable2, and creates a separate duplicates dataset for each dataset processed. This automation saves time when searching across all datasets, but it assumes that every dataset actually contains those key variables; a parameterized variant is sketched below.
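
In practice the key variables usually need to be supplied by the caller rather than hard-coded. One way to generalize the macro, shown here as a rough sketch (the macro name find_dups2 and its parameters are illustrative), is to pass the key variables as a parameter and build the SQL grouping list from it:

%macro find_dups2(lib=, keys=);
    %local i dsname dslist keylist;

    proc sql noprint;
        select memname into :dslist separated by ' '
        from dictionary.tables
        where libname = upcase("&lib");
    quit;

    /* Turn the blank-separated key list into a comma-separated list for SQL */
    %let keylist = %sysfunc(translate(%sysfunc(compbl(&keys)), %str(,), %str( )));

    %do i = 1 %to %sysfunc(countw(&dslist));
        %let dsname = %scan(&dslist, &i);

        /* Note: assumes dataset names are short enough that dups_<name> stays within 32 characters */
        proc sql;
            create table dups_&dsname as
            select &keylist, count(*) as freq
            from &lib..&dsname
            group by &keylist
            having count(*) > 1;
        quit;
    %end;
%mend find_dups2;

%find_dups2(lib=mylib, keys=key_variable1 key_variable2);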

2. Using Dictionary Tables

Another approach to process datasets across an entire library is by leveraging SAS dictionary tables. Dictionary tables store metadata about your SAS environment, including information about datasets in a library. Here's an example of how to use dictionary tables to find duplicate records:

proc sql;
    select memname into :dslist separated by ' '
    from dictionary.tables
    where libname='MYLIB';
quit;

%macro find_duplications;
    %do i=1 %to %sysfunc(countw(&dslist));
        %let dsname=%scan(&dslist, &i);

        /* Sort a copy of each dataset by the key variables */
        proc sort data=mylib.&dsname out=sorted_&dsname;
            by key_variable1 key_variable2;
        run;

        /* Keep every record whose key combination occurs more than once */
        data check_dup_&dsname;
            set sorted_&dsname;
            by key_variable1 key_variable2;
            if not (first.key_variable2 and last.key_variable2);
        run;
    %end;
%mend;

%find_duplications;
        

This macro uses dictionary tables to get the list of datasets, sorts each one by the key variables, and then uses BY-group processing to keep only the records whose key combination appears more than once. It produces a dataset for each original dataset that contains only the duplicate records.

Conclusion

Identifying duplicate records across multiple datasets in a library is a common task in data quality control. Using techniques like PROC SORT, PROC SQL, and DATA STEP, you can efficiently find and handle duplicates. For larger projects involving multiple datasets, leveraging macros and dictionary tables allows you to automate the process, saving time and reducing errors.

These methods not only help in managing duplicates but also ensure the integrity of your data, leading to more accurate and reliable results in your analysis. Incorporate these techniques into your workflow to improve your data management practices.

Have questions or additional tips on finding duplicates in SAS datasets? Share your thoughts in the comments below!

Comprehensive Guide to Define.xml Package Generation and QC Process

Author: Sarath

Date: October 10, 2024

Introduction

The Define.xml file, also known as the Case Report Tabulation Data Definition (CRT-DD), is a key component in regulatory submissions for clinical trials. It describes the metadata for the datasets submitted to regulatory agencies such as the FDA and EMA, providing transparency and traceability for clinical trial data. In this post, we’ll explore both the steps involved in generating the Define.xml package and the necessary Quality Control (QC) process to ensure its accuracy and compliance with regulatory requirements.

What is Define.xml and Why Is It Important?

The Define.xml file serves as the metadata backbone for clinical trial datasets submitted for regulatory review. It describes the structure and relationships of the datasets, variables, controlled terminologies, and derivations in the submission. Regulatory reviewers rely on the Define.xml file to understand the data, its origins, and how derived variables were created. A well-constructed Define.xml file ensures smooth data review and promotes transparency.

The Define.xml is mandatory for submissions following CDISC (Clinical Data Interchange Standards Consortium) standards, such as SDTM (Study Data Tabulation Model) and ADaM (Analysis Data Model) datasets.

Steps for Define.xml Package Generation

1. Metadata Preparation

The first step is to prepare the metadata for all datasets and variables included in the submission. This includes:

  • Dataset metadata: The names, labels, and descriptions of each dataset.
  • Variable metadata: Details for each variable, including its name, type (character or numeric), length, format, controlled terminologies (if applicable), and derivations.
  • Value-level metadata: When applicable, value-level metadata is necessary for variables that may take different attributes based on specific values.

This metadata is often compiled in spreadsheets or specialized data definition tables within your programming environment.
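
If you work in SAS, a convenient starting point for this compilation is to pull the existing attributes straight from the datasets and then enrich them with origins, derivations, and controlled terminology by hand. A minimal sketch, assuming the submission datasets live in a library referenced as SDTM:

/* Extract name, type, length, label, and format for every variable in the library */
proc contents data=sdtm._all_
              out=var_metadata(keep=memname name type length label format varnum)
              noprint;
run;

/* One row per variable, ordered by dataset and variable position */
proc sort data=var_metadata;
    by memname varnum;
run;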

2. Controlled Terminology Setup

Controlled terminology plays a crucial role in ensuring that values used in datasets are standardized. For example, MedDRA (Medical Dictionary for Regulatory Activities) is commonly used for adverse event terms, while CDISC-controlled terminology is used for other data points. Ensure that your controlled terminology is up-to-date with the latest regulatory requirements.
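
How controlled terminology is captured varies by organization, but it often ends up as a simple codelist table in the specification. Purely as an illustration (the codelist name and values below are just an example), such a table might look like this in SAS:

/* Hypothetical layout for recording a codelist in the define spec */
data codelist_spec;
    length codelist $32 coded_value $40 decode $200;
    infile datalines dsd dlm='|';
    input codelist $ coded_value $ decode $;
datalines;
AESEV|MILD|Mild
AESEV|MODERATE|Moderate
AESEV|SEVERE|Severe
;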

3. Defining Derivation Rules

All derived variables should be clearly documented, including how they were calculated or derived from other variables in the dataset. This step ensures that the regulatory agency understands how complex variables were generated and can trace them back to their raw origins.

4. Generate Define.xml File Using Tools

Tools such as Pinnacle 21 (formerly OpenCDISC) can be used to generate the Define.xml file from the prepared metadata. These tools automate the conversion of metadata into the XML format required by regulatory agencies. Here’s how the generation process typically works:

  • Input your metadata into the tool (often via Excel spreadsheets or metadata tables).
  • The tool generates the Define.xml file and any associated codelist files.
  • The output is an XML file that can be submitted along with the clinical datasets.

5. Assemble the Define.xml Package

The complete Define.xml package includes:

  • Define.xml file
  • Annotated CRF (Case Report Form)
  • Study Data Reviewer’s Guide (SDRG) and Analysis Data Reviewer’s Guide (ADRG), if applicable

Ensure all necessary documentation is compiled as part of the submission package.

Quality Control (QC) Process for Define.xml

Once the Define.xml file is generated, it must undergo a rigorous QC process to ensure compliance with CDISC standards and avoid issues during regulatory review. Below are the key steps in the QC process:

1. Validate Using Pinnacle 21

One of the most important QC steps is to validate the Define.xml file using a tool like Pinnacle 21. This tool checks your file against CDISC standards and provides a report highlighting any potential errors or warnings. Some common issues that are flagged during validation include:

  • Missing or incorrect metadata
  • Inconsistencies in variable attributes (e.g., variable length or type)
  • Unreferenced codelists or controlled terminologies

Always review the validation report carefully and resolve any issues before submission.

2. Cross-Check Metadata Against Raw Data

A crucial aspect of QC is to cross-check the metadata in the Define.xml file against the raw and derived datasets. This ensures that the variable names, labels, and formats specified in the metadata align with the actual datasets submitted. Common checks include:

  • Are the variable names and labels consistent between the datasets and the Define.xml file?
  • Do the controlled terminologies used match those in the datasets?
  • Are the derivations correctly documented and traceable?
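
Part of this cross-check can be scripted. The sketch below shows one way to do it in SAS, assuming your define specification has been imported into a dataset called spec_variables with MEMNAME, NAME, and LABEL columns, and that the submitted datasets sit in a library referenced as SDTM; it flags label mismatches and variables present on only one side:

/* Actual attributes of the submitted datasets */
proc contents data=sdtm._all_ out=actual(keep=memname name label) noprint;
run;

/* Assumes MEMNAME and NAME use consistent (e.g., uppercase) values on both sides */
proc sort data=actual;         by memname name; run;
proc sort data=spec_variables; by memname name; run;

/* Flag label mismatches and variables missing from either side */
data metadata_issues;
    merge actual(in=in_data rename=(label=data_label))
          spec_variables(in=in_spec rename=(label=spec_label));
    by memname name;
    if not (in_data and in_spec) or data_label ne spec_label;
run;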

3. Check for Completeness and Accuracy

Ensuring completeness is critical. Each dataset, variable, codelist, and derivation that is part of your submission must be documented in the Define.xml. Missing or incomplete metadata can lead to delays in regulatory review. During QC, verify the following:

  • Every dataset and variable is present in the Define.xml file.
  • All codelists are correctly referenced, and their values match the dataset contents.
  • Derived variables have clear and complete descriptions of how they were calculated.

4. Verify Value-Level Metadata (If Applicable)

For variables that require value-level metadata (e.g., variables that behave differently based on their values), verify that the detailed metadata is present and correct. Ensure that any conditions described for value-level metadata accurately reflect the dataset contents.

5. Manual Review of XML File

While automated tools like Pinnacle 21 are invaluable, it is also important to perform a manual review of the XML file. Open the Define.xml file in a text editor or XML viewer and check for any formatting issues, such as missing tags or improperly nested elements.

6. Documentation and Sign-Off

Once the QC process is complete and all issues have been resolved, document the QC activities. This can include a QC checklist or summary that describes the steps taken to validate the file. Obtain sign-off from team members or stakeholders to confirm that the Define.xml file is ready for submission.

Common Pitfalls and How to Avoid Them

Below are some common pitfalls encountered during Define.xml generation and QC, along with tips on how to avoid them:

  • Outdated Controlled Terminology: Ensure you’re using the most up-to-date versions of controlled terminologies (e.g., MedDRA, CDISC).
  • Inconsistent Metadata: Cross-check metadata between the Define.xml file and datasets to prevent mismatches.
  • Missing Documentation: Don’t overlook the need for additional documents like the Annotated CRF and Reviewer’s Guide.
  • Overlooking Value-Level Metadata: If required, always include value-level metadata and double-check its accuracy.
  • Skipping Manual Review: While validation tools are helpful, always conduct a manual review of the XML file to catch formatting issues that may not be flagged by automated tools.

Conclusion

Generating and validating a Define.xml package is a critical part of clinical trial submissions. By following a structured approach to both generation and QC, you can ensure your submission meets regulatory standards and avoid potential delays during the review process. Always use tools like Pinnacle 21 for validation, but don’t forget the importance of manual review and cross-checking metadata for completeness and accuracy.

Investing time in the QC process is essential for a successful submission, as a properly validated Define.xml file can facilitate faster and smoother regulatory review. Incorporate these best practices into your workflow to ensure compliance and to enhance the quality of your submissions.

Have questions or additional insights on Define.xml generation and QC? Share your thoughts in the comments below!

Disclosure:

In the spirit of transparency and innovation, I want to share that some of the content on this blog is generated with the assistance of ChatGPT, an AI language model developed by OpenAI. While I use this tool to help brainstorm ideas and draft content, every post is carefully reviewed, edited, and personalized by me to ensure it aligns with my voice, values, and the needs of my readers. My goal is to provide you with accurate, valuable, and engaging content, and I believe that using AI as a creative aid helps achieve that. If you have any questions or feedback about this approach, feel free to reach out. Your trust and satisfaction are my top priorities.