
Advanced SAS Programming Techniques for SDTM Implementation

Date: November 3, 2024

In the realm of clinical trials data management, SDTM (Study Data Tabulation Model) implementation requires sophisticated programming techniques to ensure data accuracy and compliance. This article explores advanced SAS programming methods that can streamline SDTM dataset creation and validation.

1. Efficient Variable Derivation Using Hash Objects

Hash objects in SAS provide a powerful way to perform quick lookups and merges, especially useful when dealing with large SDTM datasets.

data work.ae;
    /* Define the host variables that the hash object will populate */
    if 0 then set sdtm.dm(keep=usubjid age sex race);
    if _n_ = 1 then do;
        declare hash h_dm(dataset: "sdtm.dm");
        h_dm.definekey("usubjid");
        h_dm.definedata("age", "sex", "race");
        h_dm.definedone();
    end;
    set raw.ae;
    call missing(age, sex, race); /* clear lookup results carried from the previous record */
    rc = h_dm.find();
    /* Continue processing */
run;

Pro Tip: The hash object is loaded into memory once and persists for the duration of the DATA step, so lookups avoid the sorting required by a traditional MERGE and are often faster for large datasets.

2. Standardizing Controlled Terminology with Format Catalogs

Creating and maintaining CDISC-compliant terminology is crucial for SDTM implementation.

proc format library=library.sdtm_formats;
value $severity
'MILD' = 'MILD'
'MOD' = 'MODERATE'
'MODERATE' = 'MODERATE'
'SEV' = 'SEVERE'
'SEVERE' = 'SEVERE'
other = 'UNKNOWN';
run;

options fmtsearch=(library.sdtm_formats); /* make the stored catalog visible to the PUT function */

data sdtm.ae;
set work.ae;
aesev = put(raw_severity, $severity.);
run;

3. Macro Systems for Dynamic SDTM Generation

Developing reusable macro systems can significantly improve efficiency and reduce errors in SDTM implementation.

%macro create_supp(domain=, vars=);
    /* Build SUPP-- records from the variables listed in &vars */
    data sdtm.supp&domain;
        set sdtm.&domain(keep=studyid usubjid &vars);
        length rdomain $2 qnam $8 qlabel $40 qval $200;
        rdomain = upcase("&domain");
        /* Generate one supplemental-qualifier record per variable */
        %do i = 1 %to %sysfunc(countw(&vars));
            %let var = %scan(&vars, &i);
            qnam   = upcase("&var");
            qlabel = vlabel(&var);
            qval   = strip(vvalue(&var));
            if not missing(qval) then output;
        %end;
        keep studyid rdomain usubjid qnam qlabel qval;
    run;
%mend create_supp;

4. Advanced Error Checking and Validation

Implementing robust error-checking mechanisms ensures data quality and compliance with SDTM standards.

%macro validate_domain(domain=);
proc sql noprint;
/* Check for duplicate records */
create table work.duplicates as
select *, count(*) as n_recs
from sdtm.&domain
group by usubjid, &domain.dtc
having count(*) > 1;
/* Verify required variables */
select name into :reqvars separated by ' '
from sashelp.vcolumn
where libname='SDTM' and memname=upcase("&domain")
and upcase(name) in ('USUBJID', 'DOMAIN', "%upcase(&domain)SEQ");
quit;
%mend validate_domain;
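
A typical call, assuming the AE domain has already been created in the SDTM library, would be:

%validate_domain(domain=ae)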

5. Handling Custom Domains and Extensions

Sometimes, standard SDTM domains need to be extended to accommodate study-specific requirements.

proc sql;
    create table sdtm.custom_domain as
    select a.usubjid,
           a.visit,
           b.svstdtc as startdt,
           b.svendtc as enddt
    from derived.custom_data as a
    left join sdtm.sv as b
        on a.usubjid = b.usubjid
       and a.visit = b.visit;
quit;

6. Optimizing Performance for Large Studies

When dealing with large studies, performance optimization becomes crucial:

  1. Use WHERE clauses instead of subsetting IF statements when possible (see the sketch after this list)
  2. Implement parallel processing for independent domains
  3. Optimize sort operations using PROC SORT NODUPKEY
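
A WHERE clause (or WHERE= data set option) filters observations in the engine before they reach the program data vector, whereas a subsetting IF reads every observation into the step first and discards the unwanted ones afterwards. A minimal sketch, assuming a raw laboratory dataset raw.lb with an LBTESTCD variable:

data work.lb_gluc;
    set raw.lb(where=(lbtestcd = 'GLUC')); /* filtered before reaching the PDV */
run;

data work.lb_gluc_if;
    set raw.lb;
    if lbtestcd = 'GLUC';                  /* every observation is read, then discarded */
run;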

The macro loop below illustrates the second point: it iterates over a list of independent domains and calls a domain-creation macro (here, an assumed %create_domain) for each one. True parallel execution would additionally require something like SAS/CONNECT RSUBMIT sessions or a grid environment; the loop itself simply keeps the domain list in one place.

options mprint mlogic symbolgen;
%let parallel_domains = ae cm eg lb mh vs;

%macro process_domains;
    %do i = 1 %to %sysfunc(countw(&parallel_domains));
        %let domain = %scan(&parallel_domains, &i);
        %create_domain(domain=&domain)
    %end;
%mend process_domains;
%process_domains

Best Practice: Always document your code thoroughly and include version control information for traceability.

Conclusion

Mastering these advanced SAS programming techniques can significantly improve the efficiency and quality of SDTM implementation. Remember to always validate your outputs against SDTM Implementation Guide requirements and maintain clear documentation of your programming decisions.

Harnessing the Power of CALL EXECUTE in SAS for Dynamic Code Execution

As SAS programmers, we often encounter situations where we need to execute a certain procedure or set of steps multiple times, typically based on different subsets of data. Manually writing out code for each instance can be time-consuming, but SAS offers a powerful tool to make this process more efficient: CALL EXECUTE.

What is CALL EXECUTE?

CALL EXECUTE is a SAS routine that allows you to dynamically generate and execute SAS code during a data step’s execution. Instead of hardcoding the logic for every individual case, CALL EXECUTE can generate the code on the fly and execute it as part of the same data step. This technique is invaluable when you have repetitive tasks across different datasets, procedures, or even report generation.

Basic Example: Dynamic PROC PRINT Execution

Let's say you have multiple datasets in the WORK library, and you want to run a PROC PRINT for each dataset. Instead of manually writing a PROC PRINT for each one, you can use CALL EXECUTE to automate this process:

proc sql;
    select cats('proc print data=', libname, '.', memname, '; run;')
        into :code_list separated by ' ' 
    from sashelp.vtable 
    where libname='WORK';
quit;

data _null_;
    call execute("&code_list");
run;

This code does the following:

  • The PROC SQL step queries the SAS dictionary table sashelp.vtable to generate a list of all datasets in the WORK library. It concatenates each dataset name into a PROC PRINT statement and stores them in the macro variable code_list.
  • The CALL EXECUTE routine inside the DATA _NULL_ step dynamically executes each PROC PRINT statement, printing each dataset without manual intervention.

Benefits of Using CALL EXECUTE

The ability to dynamically generate and execute code gives you tremendous flexibility. Here are some key benefits:

  • Automated Task Execution: Use CALL EXECUTE to run procedures on multiple datasets, making automation easier in iterative tasks like generating reports.
  • Reduced Code Duplication: Eliminate the need to manually write out repetitive code, making your programs cleaner and more maintainable.
  • Increased Flexibility: Dynamically adjust the logic based on changing data or parameters without modifying the core program.

Advanced Example: Conditional Execution of Procedures

In some cases, you might want to execute different procedures based on the content of the data. Here's an example where we execute PROC FREQ if a dataset contains a categorical variable, and PROC MEANS if it contains a numeric variable:

data _null_;
    set sashelp.vcolumn(where=(libname='WORK'));
    if type = 'char' then 
        call execute('proc freq data=work.' || trim(memname) || '; tables ' || name || '; run;');
    else if type = 'num' then 
        call execute('proc means data=work.' || trim(memname) || '; var ' || name || '; run;');
    run;

In this code:

  • The sashelp.vcolumn table provides information about the columns in each dataset, including the variable type (character or numeric).
  • Depending on the variable type, CALL EXECUTE runs either PROC FREQ for categorical data or PROC MEANS for numeric data.
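
One timing caveat is worth noting: when the generated text calls a macro rather than a procedure, it is usually safer to wrap the call in %NRSTR so that the macro executes after the DATA step finishes, instead of while CALL EXECUTE is still pushing code onto the input stack. A minimal sketch, assuming a hypothetical macro %print_ds that prints a single dataset:

data _null_;
    set sashelp.vtable(where=(libname='WORK'));
    /* %NRSTR defers macro execution until after this DATA step completes */
    call execute('%nrstr(%print_ds(ds=work.' || strip(memname) || '))');
run;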

Conclusion

Using CALL EXECUTE in SAS is an efficient way to dynamically generate and execute code, particularly in situations that involve repetitive tasks. Whether you’re working on large datasets or need to run different procedures conditionally, CALL EXECUTE can significantly simplify your workflow and reduce manual intervention. Mastering this tool will help make your SAS programming more efficient and flexible.

Have you used CALL EXECUTE in your SAS programs? Share your experiences in the comments below!

Finding Duplicate Records Across SAS Datasets in an Entire Library

Author: Sarath

Date: October 10, 2024

Introduction

In SAS programming, identifying and managing duplicate records within datasets is an essential part of data cleaning and quality control. However, when working with multiple datasets in an entire library, the task of finding duplicates becomes more complex. In this article, we will explore different ways to identify duplicate records in SAS datasets across an entire library using several approaches: PROC SORT, PROC SQL, and DATA STEP. Additionally, we will provide advanced techniques to streamline the process for multiple datasets in a library.

Why Identify Duplicates?

Duplicate records can cause significant issues in data analysis, leading to inaccurate results, inflated counts, or incorrect statistical conclusions. Detecting and addressing duplicates in your SAS datasets ensures data integrity and improves the reliability of your analysis. By finding and removing duplicates, you can avoid skewed reports and maintain high-quality datasets for regulatory or research purposes.

Methods to Find Duplicate Records in SAS

SAS offers several methods to identify duplicates. The most common approaches involve using PROC SORT with the NODUPKEY or NODUP options, PROC SQL with grouping and counting, and DATA STEP with conditional logic. Let's explore each approach and how they can be used for individual datasets as well as across an entire library.

1. Using PROC SORT

The PROC SORT procedure is one of the simplest and most common ways to identify duplicate records in a SAS dataset. You can use the NODUPKEY option to eliminate duplicates based on specific key variables. Here's an example:

proc sort data=mylib.dataset1 nodupkey;
    by key_variable1 key_variable2;
run;
        

This code removes duplicate records based on the values of key_variable1 and key_variable2 from dataset1. However, it doesn’t tell you which records were duplicates. To capture the discarded duplicates in a separate dataset for further review, add the DUPOUT= option:

proc sort data=mylib.dataset1 out=mylib.dataset1_dedup
          dupout=duplicates nodupkey;
    by key_variable1 key_variable2;
run;

2. Using PROC SQL

PROC SQL is a powerful way to find duplicate records using SQL queries. You can group data by key variables and use the COUNT function to identify duplicates:

proc sql;
    create table duplicates as
    select key_variable1, key_variable2, count(*) as freq
    from mylib.dataset1
    group by key_variable1, key_variable2
    having calculated freq > 1;
quit;
        

This code will create a table of duplicate records where the combination of key_variable1 and key_variable2 appears more than once in the dataset. PROC SQL is versatile, and you can easily extend it to work across multiple datasets or libraries.

3. Using DATA STEP

The DATA STEP method offers more flexibility when identifying duplicates. You can flag duplicate records and create a separate dataset containing only the duplicates:

data duplicates_found;
    set mylib.dataset1; /* the data must already be sorted by the key variables */
    by key_variable1 key_variable2;
    if first.key_variable2 and last.key_variable2 then delete;
run;
        

This example identifies duplicates using the BY statement together with the FIRST. and LAST. indicators of the last BY variable. If a record is both the first and last occurrence of a key combination, it is unique and is deleted; every record that remains belongs to a duplicate group. Note that the input dataset must be sorted by the key variables beforehand.

Advanced Techniques: Finding Duplicates in an Entire Library

When working with an entire library of datasets, manually checking each dataset for duplicates can be time-consuming. SAS provides efficient ways to automate this process, especially when dealing with a large number of datasets.

1. Using Macro Loops

A SAS macro can be used to loop through all datasets in a library and check for duplicates. Here's an example of how to automate this using PROC SQL and a macro:

%macro find_dups(lib=);
    /* Retrieve all dataset names in the library */
    proc sql noprint;
        select memname into :dslist separated by ' '
        from dictionary.tables
        where libname = upcase("&lib");
    quit;

    /* Loop through each dataset and find duplicates */
    %let count = 1;
    %let dsname = %scan(&dslist, &count);
    %do %while (&dsname ne );
        proc sql;
            create table duplicates_&dsname as
            select key_variable1, key_variable2, count(*) as freq
            from &lib..&dsname
            group by key_variable1, key_variable2
            having calculated freq > 1;
        quit;

        %let count = %eval(&count + 1);
        %let dsname = %scan(&dslist, &count);
    %end;
%mend find_dups;

/* Call the macro to find duplicates in all datasets */
%find_dups(lib=mylib);
        

This macro loops through every dataset in the library mylib, checks for duplicates based on key variables, and creates a separate dataset with duplicates for each dataset processed. This method saves time by automating the search across all datasets.

2. Using Dictionary Tables

Another approach to process datasets across an entire library is by leveraging SAS dictionary tables. Dictionary tables store metadata about your SAS environment, including information about datasets in a library. Here's an example of how to use dictionary tables to find duplicate records:

proc sql;
    select memname into :dslist separated by ' '
    from dictionary.tables
    where libname='MYLIB';
quit;

%macro find_duplications;
    %do i=1 %to %sysfunc(countw(&dslist));
        %let dsname=%scan(&dslist, &i);

        /* Route the discarded duplicate records to a separate dataset */
        proc sort data=mylib.&dsname out=nodup_&dsname
                  dupout=dup_&dsname nodupkey;
            by key_variable1 key_variable2;
        run;
    %end;
%mend;

%find_duplications;
        

This macro uses dictionary tables to get a list of datasets and then applies PROC SORT with the DUPOUT= option to each one. It produces a dataset for each original dataset that contains only the duplicate records.

Conclusion

Identifying duplicate records across multiple datasets in a library is a common task in data quality control. Using techniques like PROC SORT, PROC SQL, and DATA STEP, you can efficiently find and handle duplicates. For larger projects involving multiple datasets, leveraging macros and dictionary tables allows you to automate the process, saving time and reducing errors.

These methods not only help in managing duplicates but also ensure the integrity of your data, leading to more accurate and reliable results in your analysis. Incorporate these techniques into your workflow to improve your data management practices.

Have questions or additional tips on finding duplicates in SAS datasets? Share your thoughts in the comments below!

Comprehensive Guide to Define.xml Package Generation and QC Process

Author: Sarath

Date: October 10, 2024

Introduction

The Define.xml file, also known as the Case Report Tabulation Data Definition (CRT-DD), is a key component in regulatory submissions for clinical trials. It describes the metadata for the datasets submitted to regulatory agencies such as the FDA and EMA, providing transparency and traceability for clinical trial data. In this post, we’ll explore both the steps involved in generating the Define.xml package and the necessary Quality Control (QC) process to ensure its accuracy and compliance with regulatory requirements.

What is Define.xml and Why Is It Important?

The Define.xml file serves as the metadata backbone for clinical trial datasets submitted for regulatory review. It describes the structure and relationships of the datasets, variables, controlled terminologies, and derivations in the submission. Regulatory reviewers rely on the Define.xml file to understand the data, its origins, and how derived variables were created. A well-constructed Define.xml file ensures smooth data review and promotes transparency.

The Define.xml is mandatory for submissions following CDISC (Clinical Data Interchange Standards Consortium) standards, such as SDTM (Study Data Tabulation Model) and ADaM (Analysis Data Model) datasets.

Steps for Define.xml Package Generation

1. Metadata Preparation

The first step is to prepare the metadata for all datasets and variables included in the submission. This includes:

  • Dataset metadata: The names, labels, and descriptions of each dataset.
  • Variable metadata: Details for each variable, including its name, type (character or numeric), length, format, controlled terminologies (if applicable), and derivations.
  • Value-level metadata: When applicable, value-level metadata is necessary for variables that may take different attributes based on specific values.

This metadata is often compiled in spreadsheets or specialized data definition tables within your programming environment.

2. Controlled Terminology Setup

Controlled terminology plays a crucial role in ensuring that values used in datasets are standardized. For example, MedDRA (Medical Dictionary for Regulatory Activities) is commonly used for adverse event terms, while CDISC-controlled terminology is used for other data points. Ensure that your controlled terminology is up-to-date with the latest regulatory requirements.

3. Defining Derivation Rules

All derived variables should be clearly documented, including how they were calculated or derived from other variables in the dataset. This step ensures that the regulatory agency understands how complex variables were generated and can trace them back to their raw origins.

4. Generate Define.xml File Using Tools

Tools like Pinnacle 21 or OpenCDISC can be used to generate the Define.xml file from the prepared metadata. These tools automate the conversion of metadata into the XML format required by regulatory agencies. Here’s how the generation process typically works:

  • Input your metadata into the tool (often via Excel spreadsheets or metadata tables).
  • The tool generates the Define.xml file and any associated codelist files.
  • The output is an XML file that can be submitted along with the clinical datasets.

5. Assemble the Define.xml Package

The complete Define.xml package includes:

  • Define.xml file
  • Annotated CRF (Case Report Form)
  • Study Data Reviewer’s Guide (SDRG) and Analysis Data Reviewer’s Guide (ADRG), if applicable

Ensure all necessary documentation is compiled as part of the submission package.

Quality Control (QC) Process for Define.xml

Once the Define.xml file is generated, it must undergo a rigorous QC process to ensure compliance with CDISC standards and avoid issues during regulatory review. Below are the key steps in the QC process:

1. Validate Using Pinnacle 21

One of the most important QC steps is to validate the Define.xml file using a tool like Pinnacle 21. This tool checks your file against CDISC standards and provides a report highlighting any potential errors or warnings. Some common issues that are flagged during validation include:

  • Missing or incorrect metadata
  • Inconsistencies in variable attributes (e.g., variable length or type)
  • Unreferenced codelists or controlled terminologies

Always review the validation report carefully and resolve any issues before submission.

2. Cross-Check Metadata Against Raw Data

A crucial aspect of QC is to cross-check the metadata in the Define.xml file against the raw and derived datasets. This ensures that the variable names, labels, and formats specified in the metadata align with the actual datasets submitted. Common checks include:

  • Are the variable names and labels consistent between the datasets and the Define.xml file?
  • Do the controlled terminologies used match those in the datasets?
  • Are the derivations correctly documented and traceable?
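
Parts of this cross-check can be scripted. A minimal sketch, assuming the Define.xml metadata has been exported to a dataset work.define_vars with MEMNAME, NAME, LABEL, and LENGTH columns, and that the submission datasets live in a libref named SDTM:

/* Pull the actual variable attributes from the SDTM library */
proc contents data=sdtm._all_ noprint
              out=work.actual_vars(keep=memname name label length);
run;

/* Flag variables whose label or length differs from the Define.xml metadata */
proc sql;
    create table work.attr_mismatch as
    select a.memname, a.name,
           a.label  as actual_label,  d.label  as define_label,
           a.length as actual_length, d.length as define_length
    from work.actual_vars as a
    inner join work.define_vars as d
        on upcase(a.memname) = upcase(d.memname)
       and upcase(a.name)    = upcase(d.name)
    where a.label ne d.label or a.length ne d.length;
quit;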

3. Check for Completeness and Accuracy

Ensuring completeness is critical. Each dataset, variable, codelist, and derivation that is part of your submission must be documented in the Define.xml. Missing or incomplete metadata can lead to delays in regulatory review. During QC, verify the following:

  • Every dataset and variable is present in the Define.xml file.
  • All codelists are correctly referenced, and their values match the dataset contents.
  • Derived variables have clear and complete descriptions of how they were calculated.

4. Verify Value-Level Metadata (If Applicable)

For variables that require value-level metadata (e.g., variables that behave differently based on their values), verify that the detailed metadata is present and correct. Ensure that any conditions described for value-level metadata accurately reflect the dataset contents.

5. Manual Review of XML File

While automated tools like Pinnacle 21 are invaluable, it is also important to perform a manual review of the XML file. Open the Define.xml file in a text editor or XML viewer and check for any formatting issues, such as missing tags or improperly nested elements.

6. Documentation and Sign-Off

Once the QC process is complete and all issues have been resolved, document the QC activities. This can include a QC checklist or summary that describes the steps taken to validate the file. Obtain sign-off from team members or stakeholders to confirm that the Define.xml file is ready for submission.

Common Pitfalls and How to Avoid Them

Below are some common pitfalls encountered during Define.xml generation and QC, along with tips on how to avoid them:

  • Outdated Controlled Terminology: Ensure you’re using the most up-to-date versions of controlled terminologies (e.g., MedDRA, CDISC).
  • Inconsistent Metadata: Cross-check metadata between the Define.xml file and datasets to prevent mismatches.
  • Missing Documentation: Don’t overlook the need for additional documents like the Annotated CRF and Reviewer’s Guide.
  • Overlooking Value-Level Metadata: If required, always include value-level metadata and double-check its accuracy.
  • Skipping Manual Review: While validation tools are helpful, always conduct a manual review of the XML file to catch formatting issues that may not be flagged by automated tools.

Conclusion

Generating and validating a Define.xml package is a critical part of clinical trial submissions. By following a structured approach to both generation and QC, you can ensure your submission meets regulatory standards and avoid potential delays during the review process. Always use tools like Pinnacle 21 for validation, but don’t forget the importance of manual review and cross-checking metadata for completeness and accuracy.

Investing time in the QC process is essential for a successful submission, as a properly validated Define.xml file can facilitate faster and smoother regulatory review. Incorporate these best practices into your workflow to ensure compliance and to enhance the quality of your submissions.

Have questions or additional insights on Define.xml generation and QC? Share your thoughts in the comments below!

The Power of RETAIN Statement in SAS Programming: Advantages and Use Cases

Author: Sarath

Date: October 10, 2024

Introduction

The RETAIN statement in SAS is a powerful tool used to control the behavior of variables across iterations in a data step. Unlike standard SAS variables, which are reset at the beginning of each iteration, RETAIN allows you to preserve the value of a variable from one iteration to the next. In this blog post, we will explore the advantages and use cases of the RETAIN statement in SAS programming, including controlling variable order, and provide practical examples.

Advantages of the RETAIN Statement

  • Preserve Values Across Iterations: The primary advantage of using the RETAIN statement is its ability to retain values across data step iterations. This feature is particularly useful when creating cumulative sums, counters, or when you need to remember values from a previous observation.
  • Improve Performance: The RETAIN statement can improve the efficiency of a program by eliminating the need for complex MERGE or PROC SQL steps. It simplifies the logic for tasks that require comparing current and previous observations.
  • Enhance Code Readability: By using RETAIN, you can avoid writing multiple lines of code to carry forward values. This makes your code cleaner and easier to understand.
  • Control Variable Order: The RETAIN statement allows you to explicitly specify the order in which variables appear in the output dataset. This is particularly useful when the default order (based on the order in which variables are created) does not meet your needs.

Common Use Cases of the RETAIN Statement

1. Cumulative Sums

The RETAIN statement is often used to calculate cumulative sums. For example, let's say you have a dataset with daily sales, and you want to calculate the total sales up to each day:

data cumulative_sales;
    set daily_sales;
    retain total_sales 0;
    total_sales = total_sales + sales;
run;
        

In this example, RETAIN ensures that the value of total_sales is carried forward from one observation to the next, allowing us to accumulate the total sales for each day.

2. Carry Forward Last Non-Missing Value

Another common use case is carrying forward the last non-missing value across observations. Here's an example where you want to carry the last valid value of a variable forward:

data carry_forward;
    set mydata;
    retain last_value;
    if not missing(value) then last_value = value;
run;
        

In this code, the RETAIN statement ensures that the variable last_value keeps its value until a new non-missing value is encountered.

3. Sequential Numbering or Counters

The RETAIN statement can also be used for counting occurrences or assigning sequential numbers to observations based on certain conditions:

data numbering;
    set events;
    retain event_count 0;
    if event = 'Yes' then event_count + 1;
run;
        

In this example, event_count increments by 1 whenever the event occurs, creating a sequential count of events.
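
If the count should restart for each subject or group, the same pattern can be combined with BY-group processing by resetting the retained counter on the first record of each group. A minimal sketch, assuming the events dataset is sorted by a subject_id variable:

data numbering_by_subject;
    set events;
    by subject_id;
    retain event_count;
    if first.subject_id then event_count = 0; /* restart the counter for each subject */
    if event = 'Yes' then event_count + 1;
run;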

4. Controlling Variable Order in the Output Dataset

In SAS, the default variable order in the output dataset is based on the order in which the variables are created. However, in some cases, you may want to control the order of the variables explicitly. The RETAIN statement allows you to achieve this. Here's an example:

data control_order;
    retain id name age salary; /* Specifying variable order */
    set employee_data;
    salary = salary * 1.1; /* Example of updating a variable */
run;
        

In this example, the RETAIN statement is used to specify the order in which the variables id, name, age, and salary will appear in the output dataset. Even though the salary variable is updated later in the data step, it will appear last in the specified order.

When to Use RETAIN vs. Other Methods

While the RETAIN statement is useful, there are other techniques such as FIRST. and LAST. variables, or MERGE with BY statements, that may serve similar purposes. However, RETAIN is generally more efficient for simple tasks such as accumulating values, counting, or controlling variable order.

Conclusion

The RETAIN statement is a valuable feature in SAS programming that can simplify your code and improve efficiency. Whether you're calculating cumulative sums, carrying forward non-missing values, creating counters, or controlling variable order, understanding how to use RETAIN will help you develop more effective SAS programs. Incorporate it wisely into your data steps to optimize your workflows!

Have questions or additional examples? Feel free to leave a comment below!

Understanding ADAPT in the SDTM TS Domain: Adaptive vs Non-Adaptive Trials

The Study Data Tabulation Model (SDTM) plays a critical role in organizing and submitting clinical trial data. One of the parameters that regulatory agencies look for in the Trial Summary (TS) domain is the ADAPT parameter (TSPARMCD=ADAPT), which indicates whether the trial follows an adaptive design. In this blog post, we will explore the meaning of ADAPT and provide examples of adaptive and non-adaptive trials.

What is ADAPT in the TS Domain?

The ADAPT parameter identifies whether the clinical trial is adaptive (ADAPT=Y) or non-adaptive (ADAPT=N). An adaptive trial allows for modifications to the study design based on interim results, making the trial more flexible and often more efficient.

"Adaptive clinical trials allow for changes in design or hypotheses during the study based on accumulating data, without undermining the validity or integrity of the trial."

Example 1: Non-Adaptive Trial (ADAPT = N)

A non-adaptive trial follows a fixed protocol and does not allow for any changes during the study. Most traditional randomized controlled trials (RCTs) fall into this category. For example, a phase III trial that tests a drug against a placebo in a predefined number of patients without any modifications would be classified as non-adaptive.

STUDYID  TSPARMCD  TSVAL
ABC123   ADAPT     N

In this case, the study ABC123 is a non-adaptive trial with no pre-planned modifications allowed during the course of the trial.

Example 2: Adaptive Trial (ADAPT = Y)

An adaptive trial allows changes to be made during the study based on interim analyses. These changes might include modifying sample size, adjusting dosing regimens, or even dropping treatment arms. Adaptive trials are common in oncology and rare disease studies, where efficient trial design is crucial due to limited patient populations.

For example, a phase II oncology trial might allow for dose adjustments or early termination based on early data. In this case, the trial would be classified as adaptive.

STUDYID  TSPARMCD  TSVAL
DEF456   ADAPT     Y

The study DEF456 is an adaptive trial where the protocol allows for changes based on interim analysis.
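
A minimal sketch of how this TS record could be assembled in SAS (the variable lengths below are illustrative rather than prescriptive):

data ts_adapt;
    length studyid $8 domain $2 tsparmcd $8 tsparm $40 tsval $200;
    studyid  = 'DEF456';
    domain   = 'TS';
    tsseq    = 1;
    tsparmcd = 'ADAPT';
    tsparm   = 'Adaptive Design';
    tsval    = 'Y';
run;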

Key Considerations for Adaptive Trials

When implementing an adaptive trial, it's essential to plan for certain regulatory and statistical considerations:

  • Pre-Specified Rules: Adaptations must be pre-specified in the protocol and reviewed by regulatory bodies.
  • Interim Analyses: Interim analyses require statistical rigor to avoid bias or misleading results.
  • Regulatory Approval: Regulatory agencies such as the FDA and EMA provide specific guidelines for adaptive trials, which must be strictly followed.

When is TSVAL set to "Y" for TSPARMCD=ADAPT?

The TSVAL variable is set to "Y" (Yes) for TSPARMCD=ADAPT if the study incorporates an adaptive design. An adaptive design allows for certain changes during the trial without compromising its validity. Examples of common adaptive designs include:

  • Sample Size Re-estimation: Adjusting the sample size based on interim data to ensure adequate power.
  • Early Stopping for Efficacy or Futility: Halting the trial early based on strong interim results or low likelihood of success.
  • Dose Adjustment: Changing dose levels according to participant responses.
  • Group Sequential Design: Using planned interim analyses to decide if the trial should continue or be modified.

If any of these design aspects apply, TSVAL for TSPARMCD=ADAPT would be "Y". Otherwise, it would be set to "N" for non-adaptive, fixed designs.

Example TS Domain Table

Here’s an example representation in the TS domain:

TSPARMCD  TSPARM           TSVAL  Description
ADAPT     Adaptive Design  Y      Indicates that the study has an adaptive design approach.

In regulatory submissions, such as to the FDA or PMDA, defining adaptive design parameters helps reviewers understand study flexibility and methods for ensuring trial integrity.

Conclusion

Understanding whether a trial is adaptive or non-adaptive is crucial for interpreting clinical trial data. Adaptive trials offer flexibility and efficiency but come with additional regulatory and statistical challenges. The ADAPT parameter in the TS domain provides a quick way to identify whether a trial has an adaptive design, allowing for more informed data review and analysis.

Understanding SDTM Trial Summary Domain: ACTSUB vs Screen Failures

In the world of clinical data management, the Study Data Tabulation Model (SDTM) plays a vital role in organizing and submitting clinical trial data to regulatory agencies. One of the most essential domains in SDTM is the Trial Summary (TS) domain, which provides key information about the clinical trial itself.

In this blog post, we will explore the Actual Number of Subjects (ACTSUB) and how it differs from screen failures. We will also reference regulatory guidelines and SDTM Implementation Guides to ensure a deeper understanding.

What is the TS Domain?

The Trial Summary (TS) domain contains high-level information about the clinical trial. This includes essential data such as the number of subjects, the start and end dates of the trial, trial objectives, and much more. The TSPARMCD variable defines various parameters such as the number of subjects or study arms in the trial.

What is TSPARMCD=ACTSUB?

ACTSUB stands for the "Actual Number of Subjects" in a clinical trial. This variable represents the number of participants who actually started the treatment or intervention after passing the screening phase.

"The actual number of subjects refers to the total number of participants who were enrolled in the study and received at least one treatment or underwent a key study procedure."

This means that screen failures—subjects who were screened but did not qualify to proceed—are typically excluded from this count. Regulatory agencies such as the FDA and EMA expect only those subjects who participated in the study to be counted under ACTSUB.

How Are Screen Failures Captured in the TS Domain?

Screen failures are accounted for separately from ACTSUB in most cases. For instance, the TS domain may contain a different variable like TSPARMCD=SCRSUB, which captures the number of subjects who were screened. This would include those who did not pass the screening process.

Example Scenario: ACTSUB and Screen Failures

Let’s consider a hypothetical trial with 200 subjects:

  • 250 subjects were screened.
  • 50 of those subjects were screen failures (they did not meet eligibility criteria).
  • The remaining 200 subjects were enrolled in the trial and participated in the treatment.

In this scenario, TSPARMCD=ACTSUB would be recorded as 200, while TSPARMCD=SCRSUB would be recorded as 250 to include all screened subjects, both successful and failures.
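
A simple way to derive this count programmatically from the Demographics (DM) domain is sketched below, assuming that screen-failure subjects are flagged with ARMCD='SCRNFAIL' in your study:

proc sql noprint;
    select count(distinct usubjid) into :actsub trimmed
    from sdtm.dm
    where armcd ne 'SCRNFAIL';
quit;

%put NOTE: Actual number of subjects (ACTSUB) = &actsub;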


Understanding EC vs. EX Domains in SDTM: When to Use Each

In SDTM, the EC (Exposure as Collected) and EX (Exposure) domains are both used to capture data related to drug or therapy exposure, but they serve different purposes depending on how the exposure data is collected and whether the study is blinded or unblinded.

Key Updates from PharmaSUG Papers:

  • PharmaSUG 2017 Paper DS08 introduces the historical context of the EC domain, which was established in SDTMIG v3.2 to support the EX domain by providing detailed traceability for exposure data. EC helps capture deviations, titrations, and other variations from planned dosing, especially when the collected data doesn't match protocol-specified dosing.
  • PharmaSUG 2022 Paper DS121 emphasizes the importance of capturing dose modifications using the EC domain, which often occurs in oncology trials. By utilizing EC, sponsors can accurately document variations such as dose holds, eliminations, and reductions, which later assist in deriving the EX domain.
  • PharmaSUG 2018 Paper DS16 discusses the challenges in blinded studies, highlighting that the EC domain can be used to store blinded data until the study is unblinded, after which the EX domain can be derived. This paper also details the use of EC to capture missed doses that cannot be represented in EX.
  • PharmaSUG China 2021 Paper DS083 provides a detailed discussion of how to present exposure data in a reviewer-friendly manner. It also introduces two new domains from SDTMIG v3.3 — AG (Procedure Agents) and ML (Meals) — which, though not directly related to EC/EX, offer additional context for studies that involve substances administered during procedures or nutritional intake.
---

When to Use the EC and EX Domains

EC (Exposure as Collected) Domain:

  • Use EC when dose modifications such as elimination, hold, delay, reduction, or mid-cycle adjustments are expected due to treatment-related factors (e.g., toxicities in oncology trials).
  • EC is suitable for blinded studies to store collected exposure information until unblinding.
  • EC captures exact details such as missed doses, variations from protocol-specified units, and planned or scheduled exposure using the `ECMOOD` variable.

EX (Exposure) Domain:

  • Use EX to represent planned exposure that aligns with the study protocol. This includes the administration of investigational products in protocol-specified units.
  • EX captures the actual administered dose after unblinding and can also reflect doses of placebos in clinical trials.
---

Key Takeaways for EC and EX Domain Usage

  • Blinded Studies: EC can capture blinded doses, and once unblinded, the EX domain should reflect the actual doses administered.
  • Dose Modifications: EC captures any variations from planned dosing, including dose holds, eliminations, and adjustments.
  • Missed Doses: Use EC to document missed doses and the reasons for those missed doses using `ECOCCUR` and `SUPPEC` for reasons like adverse events.
  • Protocol-Specified Units: EC can capture doses in collected units (e.g., volume of a dosing solution), while EX converts them into protocol-specified units (e.g., mg/kg).
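
To illustrate the last point, here is a minimal sketch (not taken from the papers cited above) of converting a dose collected as a solution volume in EC into the protocol-specified mg/kg for EX. The 5 mg/mL concentration and the work.weights dataset (USUBJID plus a WEIGHT in kilograms, sorted by USUBJID like sdtm.ec) are assumptions made only for this example:

data work.ex_from_ec;
    merge sdtm.ec(in=inec) work.weights;
    by usubjid;
    if inec and ecdosu = 'mL' and weight > 0;
    exdose = (ecdose * 5) / weight; /* mL -> mg (at 5 mg/mL) -> mg/kg */
    exdosu = 'mg/kg';
run;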
---

Introduction to EC and EX Domains

The EC domain captures the exact exposure data as it is collected in the study. This is especially useful when exposure data varies between subjects, such as in cases of dose titrations, interruptions, or other adjustments. The key feature of the EC domain is its ability to reflect actual data, making it indispensable in complex trials where the administration schedule doesn’t always follow the protocol exactly.

For instance, if subjects are receiving doses that are adjusted based on their responses or lab results, or if subjects experience dose interruptions, the EC domain should be used to capture this variability. It provides an accurate picture of what really happened, even if the data does not align with the protocol-specified dose.

Example: Titration or Adjusted Dosing Scenario

In a trial where Drug B’s dose is titrated based on a subject's response, one subject might start at 25 mg and increase to 50 mg after 10 days. Another subject could remain at 25 mg due to adverse events, and a third subject might increase to 75 mg. These variations should be captured in the EC domain.

STUDYID  USUBJID  ECDOSE  ECDOSU  ECDOSFRM  ECSTDTC     ECENDTC     ECREASND
ABC123   001      25      mg      Tablet    2024-01-01  2024-01-10  Titration
ABC123   001      50      mg      Tablet    2024-01-11  2024-01-14
ABC123   002      25      mg      Tablet    2024-01-01  2024-01-15  Adverse Event

When to Use the EX Domain

The EX domain captures the planned exposure based on the study protocol. It is used when the actual exposure follows the protocol as intended. The EX domain should be used for trials where the dosing regimen is straightforward and subjects receive the planned doses at scheduled times.

For example, if a trial protocol specifies that subjects receive 50 mg of Drug A daily for 30 days, and all subjects follow this schedule without any variations, the EX domain can capture this data.

Example: Simple Dosing Scenario

In a study where Drug A is administered in a fixed dose of 50 mg daily, the EX domain captures the planned exposure:

STUDYID  USUBJID  EXTRT   EXDOSE  EXDOSU  EXROUTE  EXSTDTC
XYZ456   001      Drug A  50      mg      Oral     2024-02-01
XYZ456   002      Drug A  50      mg      Oral     2024-02-01

Using Both EC and EX Domains Together

In some cases, both domains can be used together to represent the planned vs. actual exposure. For instance, the EX domain captures the protocol-specified dose (e.g., 50 mg daily), while the EC domain captures deviations, such as dose interruptions or adjustments. This approach provides a complete picture of the exposure.

Example: Combined Use of EC and EX Domains

In a study where Drug D is administered as 50 mg daily but a subject misses doses due to personal reasons, the EX domain would capture the planned regimen, while the EC domain would record the missed doses.

EX Domain (Planned Dose):
STUDYID USUBJID EXTRT EXDOSE EXDOSU EXROUTE EXSTDTC
DEF789 001 Drug D 50 mg Oral 2024-03-01
EC Domain (Actual Doses with Missed Doses):
STUDYID USUBJID ECDOSE ECDOSU ECDOSFRM ECSTDTC ECENDTC ECREASND
DEF789 001 50 mg Tablet 2024-03-01 2024-03-05
DEF789 001 50 mg Tablet 2024-03-07 2024-03-30 Missed Dose

Additional Considerations for Submission

  • Do not duplicate EC and EX data unless necessary.
  • Use SUPPEC to provide additional reasons for missed or not-given doses in the EC domain.
  • Ensure proper representation of blinded and unblinded data in EC and EX to comply with regulatory expectations.
---

Conclusion

By leveraging the EC and EX domains appropriately, sponsors can ensure clear traceability of exposure data and provide regulatory reviewers with a complete and accurate story of how subjects were exposed to the study treatment. These domains, when used in tandem, help differentiate between collected and derived data, making it easier for reviewers to assess and understand study results.

Study Start Date in SDTM – Why Getting It Right Matters

The Study Start Date (SSTDTC) is a crucial element in the submission of clinical trial data, especially in meeting regulatory requirements. Since December 2014, the FDA has provided explicit guidance on defining and utilizing this data point, but many sponsors and service providers face challenges in its consistent application. Missteps in defining the Study Start Date can lead to technical rejection during submission reviews, delaying the regulatory process. This article explores the definition, importance, and proper implementation of the Study Start Date in SDTM (Study Data Tabulation Model) submissions, based on regulatory guidance and best practices.

FDA’s Definition of Study Start Date

The FDA, in its 2014 guidance, clarified that the Study Start Date for clinical trials is the earliest date of informed consent for any subject enrolled in the study. This approach ensures a data-driven, verifiable measure, using clinical trial data captured in the Demographics (DM) domain of SDTM. By anchoring the Study Start Date to the earliest informed consent, the FDA avoids manipulation of study timelines and enforces adherence to accepted data standards during study execution.[1]

Why is the Study Start Date Important?

The Study Start Date serves as a cornerstone in the clinical trial submission process. It plays a critical role in:

  • Compliance: The Study Start Date is used as a benchmark to assess the trial’s adherence to FDA’s data standards catalog. Standards implementation is tracked based on this reference date.
  • Consistency: A well-defined Study Start Date ensures uniformity across various study data elements, including SDTM domains, analysis datasets, and associated regulatory documentation.
  • Avoiding Rejections: Incorrect or inconsistent assignment of the Study Start Date can lead to technical rejection by the FDA during submission. Rule 1734, for instance, mandates that the Study Start Date be recorded in the Trial Summary (TS) domain, and failure to meet this criterion can result in a rejection of the submission package.[1]

Where to Record the Study Start Date

Accurate recording of the Study Start Date is essential for successful regulatory submissions. This date must appear in two key sections of the eCTD (electronic Common Technical Document) that are submitted to the FDA:

  1. SDTM TS Domain: The Study Start Date is stored in the variable TS.TSVAL where TSPARMCD = 'SSTDTC'. It must adhere to the ISO 8601 date format (YYYY-MM-DD); a sketch of building this record follows this list.
  2. Study Data Standardization Plan (SDSP): The Study Start Date is also included in the SDSP, which contains metadata about the study, along with other relevant details such as the use of FDA-endorsed data standards.[1]
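
A minimal sketch of assembling that TS record, assuming the earliest informed consent date has already been derived into a macro variable &mindate (as in the PROC SQL example later in this article) and using a hypothetical study identifier:

data ts_sstdtc;
    length studyid $8 domain $2 tsparmcd $8 tsparm $40 tsval $19;
    studyid  = 'ABC123';   /* hypothetical study identifier */
    domain   = 'TS';
    tsseq    = 1;
    tsparmcd = 'SSTDTC';
    tsparm   = 'Study Start Date';
    tsval    = "&mindate"; /* ISO 8601 date, e.g. 2023-04-15 */
run;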

Challenges in Defining the Study Start Date

One major challenge in defining the Study Start Date arises from the varied interpretations among stakeholders. For example:

  • Data Managers may consider the go-live date of the data collection system as the Study Start Date.
  • Safety Teams might prioritize the first dose date of the investigational product.
  • Clinical Operations could focus on the date of the first patient visit at the clinic.

However, the correct interpretation, as per FDA guidance, is the earliest informed consent date, ensuring a consistent and regulatory-compliant approach.[1]

How to Verify the Study Start Date

Verifying the Study Start Date requires careful examination of clinical data. The following steps can help:

  1. Examine the DM Domain: The DM.RFICDTC variable (date of informed consent) is cross-checked against the protocol and Statistical Analysis Plan (SAP) to ensure it aligns with enrolled subjects.
  2. Exclude Screen Failures: Screen failures must be excluded from the analysis as they do not contribute to the Study Start Date. Only enrolled subjects should be included in this determination.
  3. Programmatic Check: The following SAS code can be used to programmatically select the earliest informed consent date for the enrolled subjects:
    proc sql noprint;
       select min(rficdtc) from SDTM.DM
       where rficdtc is not null and armcd in ('TRTA', 'TRTB');
    quit;
    
                

Global Considerations

While the FDA’s definition is clear, other regulatory authorities such as the PMDA in Japan and the NMPA in China have slightly different approaches. For example, the PMDA evaluates standards based on the submission date rather than the Study Start Date. As more global regulators adopt machine-readable clinical data standards, alignment with FDA guidance may serve as a reference point for future harmonization efforts.[1]

Multiple Ways to Derive Study Start Date (SSTDTC) in SDTM

Several SAS-based methods can be used to derive the Study Start Date (SSTDTC) in SDTM. Below are a few approaches:

1. Using PROC SQL

PROC SQL can compute the minimum date directly from the dataset:

proc sql noprint;
    select min(rficdtc) into :mindate
    from SDTM.DM
    where rficdtc is not null and armcd ne 'SCRNFAIL';
quit;

    

2. Using DATA STEP with RETAIN

This approach retains the minimum date as the dataset is processed:

data earliest_consent;
    set SDTM.DM;
    where rficdtc is not missing and armcd ne 'SCRNFAIL';
    retain min_consent_date;
    if _N_ = 1 then min_consent_date = rficdtc;
    else if rficdtc < min_consent_date then min_consent_date = rficdtc;
run;

/* The accumulated minimum is carried on the last observation */
data study_start;
    set earliest_consent end=eof;
    if eof;
    keep min_consent_date;
run;

proc print data=study_start;
run;

    

3. Using PROC MEANS

Because RFICDTC is a character variable containing an ISO 8601 date, it must first be converted to a numeric SAS date before PROC MEANS can compute a minimum:

data dm_num;
    set SDTM.DM(where=(rficdtc is not missing and armcd ne 'SCRNFAIL'));
    rficdt = input(rficdtc, ??yymmdd10.); /* ISO 8601 text to numeric SAS date */
    format rficdt yymmdd10.;
run;

proc means data=dm_num min noprint;
    var rficdt;
    output out=min_consent_date(drop=_type_ _freq_) min=MinRFICDT;
run;

proc print data=min_consent_date;
run;

    

4. Using PROC SORT and DATA STEP

This approach involves sorting the dataset and extracting the first record:

proc sort data=SDTM.DM out=sorted_dm;
    by rficdtc;
    where rficdtc is not missing and armcd ne 'SCRNFAIL';
run;

data min_consent_date;
    set sorted_dm;
    if _N_ = 1 then output;
run;

proc print data=min_consent_date;
    var rficdtc;
run;

    

Conclusion

Accurately defining and recording the Study Start Date is essential for compliance with FDA regulations and avoiding submission rejections. Ensuring consistency across study data and metadata is key to a successful clinical trial submission process. Various methods in SAS, including SQL-based and procedural approaches, offer flexible options for deriving the study start date (SSTDTC) in SDTM.

References

Best Practices for Joining Additional Columns into an Existing Table Using PROC SQL

When working with large datasets, it's common to add new columns from another table to an existing table using SQL. However, many programmers encounter the challenge of recursive referencing in PROC SQL when attempting to create a new table that references itself. This blog post discusses the best practices for adding columns to an existing table using PROC SQL and provides alternative methods that avoid inefficiencies.

1. The Common Approach and Its Pitfall

Here's a simplified example of a common approach to adding columns via a LEFT JOIN:

PROC SQL;
CREATE TABLE WORK.main_table AS
SELECT main.*, a.newcol1, a.newcol2
FROM WORK.main_table main
LEFT JOIN WORK.addl_data a
ON main.id = a.id;
QUIT;

While this approach might seem straightforward, it leads to a warning: "CREATE TABLE statement recursively references the target table". This happens because you're trying to reference the main_table both as the source and the target table in the same query. Furthermore, if you're dealing with large datasets, creating a new table might take up too much server space.

2. Best Practice 1: Use a Temporary Table

A better approach is to use a temporary table to store the joined result and then replace the original table. Here’s how you can implement this:

PROC SQL;
   CREATE TABLE work.temp_table AS 
   SELECT main.*, a.newcol1, a.newcol2
   FROM WORK.main_table main
   LEFT JOIN WORK.addl_data a
   ON main.id = a.id;
QUIT;

PROC SQL;
   DROP TABLE work.main_table;
   CREATE TABLE work.main_table AS 
   SELECT * FROM work.temp_table;
QUIT;

PROC SQL;
   DROP TABLE work.temp_table;
QUIT;

This ensures that the original table is updated with the new columns without violating the recursive referencing rule. It also minimizes space usage since the temp_table will be dropped after the operation.

3. Best Practice 2: Use ALTER TABLE for Adding Columns

If you're simply adding new columns without updating existing ones, you can use the ALTER TABLE statement in combination with UPDATE to populate the new columns:

PROC SQL;
   ALTER TABLE WORK.main_table
   ADD newcol1 NUM, 
       newcol2 NUM;

   UPDATE WORK.main_table main
   SET newcol1 = (SELECT a.newcol1 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id),
       newcol2 = (SELECT a.newcol2 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id);
QUIT;

This approach avoids creating a new table altogether, and instead modifies the structure of the existing table.

4. Best Practice 3: Consider the DATA Step MERGE for Large Datasets

For very large datasets, a DATA step MERGE can sometimes be more efficient than PROC SQL. The MERGE statement allows you to combine datasets based on a common key variable, as shown below:

PROC SORT DATA=WORK.main_table; BY id; RUN;
PROC SORT DATA=WORK.addl_data; BY id; RUN;

DATA WORK.main_table;
   MERGE WORK.main_table (IN=in1)
         WORK.addl_data (IN=in2);
   BY id;
   IF in1; /* Keep only records from main_table */
RUN;

While some might find the MERGE approach less intuitive, it can be a powerful tool for handling large tables when combined with proper sorting of the datasets.

Conclusion

The best method for joining additional columns into an existing table depends on your specific needs, including dataset size and available server space. Using a temporary table or the ALTER TABLE method can be more efficient in certain situations, while the DATA step MERGE is a reliable fallback for large datasets.

By following these best practices, you can avoid common pitfalls and improve the performance of your SQL queries in SAS.

Mastering the SDTM Review Process: Comprehensive Insights with Real-World Examples

The process of ensuring compliance with Study Data Tabulation Model (SDTM) standards can be challenging due to the diverse requirements and guidelines that span across multiple sources. These include the SDTM Implementation Guide (SDTMIG), the domain-specific assumptions sections, and the FDA Study Data Technical Conformance Guide. While automated tools like Pinnacle 21 play a critical role in detecting many issues, they have limitations. This article provides an in-depth guide to conducting a thorough SDTM review, enhanced by real-world examples that highlight commonly observed pitfalls and solutions.

1. Understanding the Complexity of SDTM Review

One of the first challenges in SDTM review is recognizing that SDTM requirements are spread across different guidelines and manuals. Each source offers a unique perspective on compliance:

  • SDTMIG domain specifications: Provide detailed variable-level specifications.
  • SDTMIG domain assumptions: Offer clarifications for how variables should be populated.
  • FDA Study Data Technical Conformance Guide: Adds regulatory requirements for submitting SDTM data to health authorities.

Real-World Example: Misinterpreting Domain Assumptions

In a multi-site oncology trial, a programmer misunderstood the domain assumptions for the "Events" domains (such as AE – Adverse Events). The SDTMIG advises that adverse events should be reported based on their actual date of occurrence, but the programmer initially used the visit date, leading to incorrect representation of events.

2. Leveraging Pinnacle 21: What It Catches and What It Misses

Pinnacle 21 is a powerful tool for validating SDTM datasets, but it has limitations:

  • What it catches: Missing mandatory variables, incorrect metadata, and value-level issues (non-conformant values).
  • What it misses: Study-specific variables that should be excluded, domain-specific assumptions that must be manually reviewed.

Real-World Example: Inapplicable Variables Passing Pinnacle 21

In a dermatology study, the variable ARM (Treatment Arm) was populated for all subjects, including those in an observational cohort. Since observational subjects did not receive a treatment, this variable should have been blank. Pinnacle 21 didn’t flag this, but a manual review revealed the issue.

3. Key Findings in the Review Process

3.1 General Findings

  • Incorrect Population of Date Variables: Properly populating start and end dates (--STDTC, --ENDTC) is challenging.
  • Missing SUPPQUAL Links: Incomplete or incorrect links between parent domains and SUPPQUAL can lead to misinterpretation.

Real-World Example: Incorrect Dates in a Global Trial

In a global cardiology trial, visit start dates were incorrectly populated due to time zone differences between sites in the U.S. and Europe. A manual review of the date variables identified these inconsistencies and corrected them.
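
The SUPPQUAL linkage issue noted above can also be checked programmatically. A minimal sketch that flags SUPPAE records whose IDVAR/IDVARVAL do not resolve to a parent AE record, assuming IDVAR is AESEQ:

proc sql;
    create table work.orphan_suppae as
    select s.*
    from sdtm.suppae as s
    where s.idvar = 'AESEQ'
      and not exists (
          select 1
          from sdtm.ae as a
          where a.usubjid = s.usubjid
            and strip(put(a.aeseq, best.)) = strip(s.idvarval)
      );
quit;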

3.2 Domain-Specific Findings

  • Incorrect Usage of Age Units (AGEU): Misuse of AGEU in pediatric studies can lead to incorrect data representation.
  • Inconsistent Use of Controlled Terminology: Discrepancies in controlled terminology like MedDRA or WHO Drug Dictionary can cause significant issues.

Real-World Example: Incorrect AGEU in a Pediatric Study

In a pediatric vaccine trial, the AGEU variable was incorrectly populated with "YEARS" for infants under one year old, when it should have been "MONTHS." This was not flagged by Pinnacle 21 but was discovered during manual review.

4. Optimizing the SDTM Review Process

To conduct an effective SDTM review, follow these steps:

  • Review SDTM Specifications Early: Identify potential issues before SDTM datasets are created.
  • Analyze Pinnacle 21 Reports Critically: Don’t rely solely on automated checks—investigate warnings and study-specific variables manually.
  • Manual Domain Review: Ensure assumptions are met and variables are used correctly in specific domains.

5. Conclusion: Building a Holistic SDTM Review Process

By combining early manual review, critical analysis of automated checks, and a detailed review of domain-specific assumptions, programmers can significantly enhance the accuracy and compliance of SDTM datasets. The real-world examples provided highlight how even small errors can lead to significant downstream problems. A holistic SDTM review process not only saves time but also ensures higher data quality and compliance during regulatory submission.

"""
Revolutionizing SDTM Programming in Pharma with ChatGPT

Revolutionizing SDTM Programming in Pharma with ChatGPT

By Sarath

Introduction

In the pharmaceutical industry, standardizing clinical trial data through Study Data Tabulation Model (SDTM) programming is a critical task. The introduction of AI tools like ChatGPT has opened new opportunities for automating and enhancing the efficiency of SDTM programming. In this article, we will explore how ChatGPT can assist programmers in various SDTM-related tasks, from mapping datasets to performing quality checks, ultimately improving productivity and accuracy.

What is SDTM?

SDTM is a model created by the Clinical Data Interchange Standards Consortium (CDISC) to standardize the structure and format of clinical trial data. This model helps in organizing data for submission to regulatory bodies such as the FDA. SDTM programming involves mapping clinical trial datasets to SDTM-compliant formats, ensuring data quality, and validating that the data follows CDISC guidelines.

How ChatGPT Can Enhance SDTM Programming

ChatGPT can be a game-changer in SDTM programming, providing real-time support, automation, and solutions for common challenges. Here’s how it can be applied in various stages of the SDTM process:

  • Assisting with Mapping Complex Datasets: ChatGPT can provide real-time guidance and suggestions for mapping non-standard datasets to SDTM domains, helping programmers to ensure compliance with CDISC guidelines.
  • Generating Efficient SAS Code: ChatGPT can generate optimized SAS code for common SDTM tasks, such as transforming raw datasets, handling missing data, or applying complex business rules to ensure the data meets the regulatory standards.
  • Debugging SAS Code: ChatGPT can assist in identifying bugs, suggesting ways to debug code, and improving code readability with useful tips like employing the PUTLOG statement.
  • Automating Quality Control Checks: Performing quality checks on large datasets is essential in SDTM programming. ChatGPT can automate parts of this process by generating code for missing-variable checks, duplicate-observation removal, and domain-specific rule verification (a brief sketch follows this list).
  • Improving Code Readability: By suggesting best practices for writing clear and maintainable SAS code, ChatGPT can help reduce technical debt and make the code easier to review and debug, especially in collaborative settings.
  • Providing Learning Support for New Programmers: For beginners in SDTM programming, ChatGPT can explain complex concepts in simple terms, provide examples, and offer real-time solutions to questions related to SDTM domains, controlled terminology, and regulatory requirements.
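As an illustration of the kind of quality-control code an assistant might draft (the datasets and rules here are assumptions, not output actually produced by ChatGPT), the sketch below flags duplicate AE records and writes a PUTLOG message for records with a missing AETERM.

/* Keep only records whose key is duplicated (NOUNIQUEKEY requires SAS 9.3+) */
proc sort data=sdtm.ae out=work.ae_dups nouniquekey;
    by usubjid aeterm aestdtc;
run;

/* Write a log message for every AE record with a missing term */
data _null_;
    set sdtm.ae;
    if missing(aeterm) then
        putlog 'WARNING: missing AETERM for ' usubjid= aeseq=;
run;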

Practical Use Cases for ChatGPT in SDTM Programming

Let's look at a few examples where ChatGPT can offer tangible benefits in SDTM programming:

  • Handling the Demographics Domain (DM): ChatGPT can guide programmers through mapping raw datasets to the SDTM DM domain, offering suggestions for handling variables such as SUBJID, AGE, and SEX. It can also generate SAS code that adheres to CDISC standards and offer tips for validating the resulting data (see the sketch after this list).
  • Generating Define.XML Files: Defining metadata is critical for regulatory submission. ChatGPT can assist by generating SAS code for creating and validating Define.XML files using tools like Pinnacle 21, ensuring compliance with regulatory expectations.
  • Managing Controlled Terminology: Keeping up with the latest controlled terminology versions (e.g., MedDRA, SNOMED, UNII) is essential. ChatGPT can suggest updates for domain-specific controlled terminology and provide SAS code to automate its application in SDTM datasets.
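A minimal DM mapping sketch of the kind such a prompt might return is shown below. The raw dataset raw.dm0, its variable names (subjid, birthdt, sexcd), the study identifier, and the $sexfmt. format are all assumptions made for illustration.

data sdtm.dm;
    set raw.dm0;
    length studyid $12 domain $2 usubjid $30 sex $1 brthdtc $10;
    studyid = 'ABC123';                      /* assumed study identifier */
    domain  = 'DM';
    usubjid = catx('-', studyid, subjid);    /* unique subject identifier */
    sex     = put(sexcd, $sexfmt.);          /* site code mapped to CDISC CT (assumed format) */
    brthdtc = put(birthdt, is8601da.);       /* ISO 8601 birth date */
run;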

Limitations and Future Potential

While ChatGPT offers significant advantages, there are still some limitations. For instance, it lacks deep integration with SAS or Pinnacle 21, which means that users need to manually adapt ChatGPT’s suggestions to their specific environments. However, the future potential for ChatGPT to evolve into an even more intelligent assistant is immense. As AI technology advances, ChatGPT could become an essential tool for real-time error detection, domain mapping, and automating SDTM processes end-to-end.

Conclusion

ChatGPT has the potential to transform the way SDTM programming is done in the pharmaceutical industry. From guiding new programmers to automating repetitive tasks and assisting with complex coding challenges, this AI tool can significantly improve the efficiency and accuracy of SDTM workflows. As we continue to explore the capabilities of AI, the integration of tools like ChatGPT into programming environments will become an increasingly vital asset for organizations looking to streamline their clinical data management and regulatory submission processes.

Published by Sarath on [Insert Date]

Unleashing the Power of PROC DATASETS in SAS

The PROC DATASETS procedure is a versatile and efficient tool within SAS for managing datasets. Often described as the "Swiss Army Knife" of SAS procedures, it allows users to perform a variety of tasks such as renaming, deleting, modifying attributes, appending datasets, and much more, all while consuming fewer system resources compared to traditional data steps. In this article, we’ll explore key use cases, functionality, and examples of PROC DATASETS, illustrating why it should be part of every SAS programmer's toolkit.

1. Why Use PROC DATASETS?

Unlike procedures such as PROC APPEND, PROC CONTENTS, and PROC COPY, which each focus on a specific task, PROC DATASETS integrates the functionality of these procedures and more. Using PROC DATASETS avoids the need for multiple procedures and saves both time and system resources, since many of its operations (such as renaming datasets or modifying variable attributes) update only the metadata rather than reading and rewriting the entire dataset.

2. Basic Syntax of PROC DATASETS

The basic structure of PROC DATASETS is as follows:

PROC DATASETS LIBRARY=libref;
    statement(s);
RUN; QUIT;

Here, you specify the library containing the datasets you want to modify. Commands such as CHANGE, DELETE, APPEND, MODIFY, and RENAME follow within the procedure.

3. Use Case 1: Renaming Datasets and Variables

Renaming datasets and variables is a simple yet powerful capability of PROC DATASETS. Here's an example of how you can rename a dataset:

PROC DATASETS LIBRARY=mylib;
    CHANGE old_data=new_data;
RUN; QUIT;

To rename a variable within a dataset:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    RENAME old_var=new_var;
RUN; QUIT;

4. Use Case 2: Appending Datasets

The APPEND statement is a highly efficient alternative to using SET in a data step because it only reads the dataset being appended (the DATA= dataset), instead of reading both datasets.

PROC DATASETS LIBRARY=mylib;
    APPEND BASE=master_data DATA=new_data;
RUN; QUIT;

5. Use Case 3: Deleting Datasets

Deleting datasets or members within a library is simple with PROC DATASETS. You can delete individual datasets or use the KILL option to remove all members of a library:

PROC DATASETS LIBRARY=mylib;
    DELETE dataset_name;
RUN; QUIT;
PROC DATASETS LIBRARY=mylib KILL;
RUN; QUIT;

6. Use Case 4: Modifying Attributes

You can modify variable attributes such as labels, formats, and informats without rewriting the entire dataset:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    LABEL var_name='New Label';
    FORMAT var_name 8.2;
RUN; QUIT;

7. Advanced Operations with PROC DATASETS

7.1. Working with Audit Trails

You can use PROC DATASETS to manage audit trails, which track changes made to datasets. For instance, the following code creates an audit trail for a dataset:

PROC DATASETS LIBRARY=mylib;
    AUDIT dataset_name;
    INITIATE;
RUN; QUIT;
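Once an audit trail has been initiated, the logged changes can be read back with the TYPE=AUDIT data set option. A brief sketch (the library and dataset names are illustrative):

/* Read the audit file that records changes made to dataset_name */
proc print data=mylib.dataset_name(type=audit);
run;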

7.2. Managing Indexes

Indexes help retrieve subsets of data efficiently. You can create or delete indexes with PROC DATASETS:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    INDEX CREATE var_name;
RUN; QUIT;

7.3. Cascading File Renaming with the AGE Command

Another useful feature is the AGE statement, which cycles a group of related datasets through a sequence of names: the first dataset is renamed to the second name, the second to the third, and so on, and the dataset that held the last name is deleted. This is handy for keeping a rolling set of backups:

PROC DATASETS LIBRARY=mylib;
    AGE file1-file5;
RUN; QUIT;

Checking If a SAS Dataset is Sorted Using PROC DATASETS

In SAS, datasets often need to be sorted to facilitate various analytical operations. Sorting ensures that records are organized based on one or more variables. However, it’s important to know whether a dataset is already sorted before performing time-consuming operations like PROC SORT. Fortunately, SAS provides an efficient way to check whether a dataset is sorted by using the PROC DATASETS procedure.

Why Use PROC DATASETS to Check Sort Status?

PROC DATASETS is a powerful procedure that can manage and inspect datasets. It allows you to view metadata, including the SORTEDBY attribute, which tells you if the dataset has been sorted and by which variables. This method is faster and more efficient than unnecessarily re-sorting a dataset.

Step-by-Step Example

Let’s walk through an example where we use PROC DATASETS to check whether a dataset is sorted.

Sample SAS Code


/* Step 1: Use PROC DATASETS to inspect the dataset's metadata */
proc datasets lib=work nolist;
  contents data=your_dataset out=sorted_info(keep=name sortedby);
run;
quit;

/* Step 2: Print the output to see the SORTEDBY variable */
proc print data=sorted_info;
run;
    

Code Explanation

  • proc datasets lib=work nolist; - Specifies the library (in this case, WORK) and suppresses the list of files using the NOLIST option.
  • contents data=your_dataset out=sorted_info(keep=name sortedby); - Extracts the metadata for your_dataset and outputs the SORTEDBY information to a dataset named sorted_info.
  • proc print data=sorted_info; - Prints the dataset to view the SORTEDBY information.

Interpreting the Output

The output dataset sorted_info contains one row for each variable in your_dataset, with the following columns:

  • Name: The name of each variable in the dataset.
  • SortedBy: A numeric value giving that variable's position in the sort key (1 for the primary sort variable, 2 for the secondary, and so on). It is 0 or missing for variables that are not part of the sort key.

Example Output

Name   SortedBy
var1   1
var2   2

In this case, your_dataset is sorted by var1 and then var2. If SortedBy is 0 or missing for every variable, no sort order is recorded for the dataset.
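An alternative, purely programmatic route is to query the same metadata with the OPEN and ATTRC functions, which avoids creating an intermediate dataset. A small sketch (the dataset name is illustrative):

data _null_;
    dsid = open('work.your_dataset');
    if dsid then do;
        /* ATTRC returns the recorded sort key as a blank-delimited variable list */
        sortkey = attrc(dsid, 'SORTEDBY');
        if missing(sortkey) then putlog 'NOTE: no sort key recorded for the dataset.';
        else putlog 'NOTE: dataset is sorted by: ' sortkey;
        rc = close(dsid);
    end;
run;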

Handling Multiple Datasets

If you need to check multiple datasets in a library, you can point the CONTENTS statement at all members of the library and keep MEMNAME so you know which dataset each row describes.


proc datasets lib=work nolist;
  /* DATA=_ALL_ describes every member of the library; MEMNAME identifies the dataset */
  contents data=_all_ out=sorted_info(keep=memname name sortedby);
run;
quit;

/* Print the sorted_info dataset */
proc print data=sorted_info;
run;
    
Note: The sort information is recorded when a dataset is sorted with PROC SORT (or asserted with the SORTEDBY= data set option). If the dataset was rearranged by other means, or was never sorted explicitly, this metadata may not reflect the actual physical order of the records.

Conclusion

PROC DATASETS is an indispensable tool for SAS programmers. Its efficiency and versatility allow you to manage datasets with ease, from renaming and deleting to appending and modifying attributes. By leveraging its full potential, you can streamline your SAS workflows and significantly reduce processing times.


Disclosure:

In the spirit of transparency and innovation, I want to share that some of the content on this blog is generated with the assistance of ChatGPT, an AI language model developed by OpenAI. While I use this tool to help brainstorm ideas and draft content, every post is carefully reviewed, edited, and personalized by me to ensure it aligns with my voice, values, and the needs of my readers. My goal is to provide you with accurate, valuable, and engaging content, and I believe that using AI as a creative aid helps achieve that. If you have any questions or feedback about this approach, feel free to reach out. Your trust and satisfaction are my top priorities.