Discover More Tips and Techniques on This Blog

Understanding EC vs. EX Domains in SDTM: When to Use Each

In SDTM, the EC (Exposure as Collected) and EX (Exposure) domains are both used to capture data related to drug or therapy exposure, but they serve different purposes depending on how the exposure data is collected and whether the study is blinded or unblinded.

Key Updates from PharmaSUG Papers:

  • PharmaSUG 2017 Paper DS08 introduces the historical context of the EC domain, which was established in SDTMIG v3.2 to support the EX domain by providing detailed traceability for exposure data. EC helps capture deviations, titrations, and other variations from planned dosing, especially when the collected data doesn't match protocol-specified dosing.
  • PharmaSUG 2022 Paper DS121 emphasizes the importance of capturing dose modifications, which occur frequently in oncology trials, in the EC domain. By utilizing EC, sponsors can accurately document variations such as dose holds, eliminations, and reductions, which later assist in deriving the EX domain.
  • PharmaSUG 2018 Paper DS16 discusses the challenges in blinded studies, highlighting that the EC domain can be used to store blinded data until the study is unblinded, after which the EX domain can be derived. This paper also details the use of EC to capture missed doses that cannot be represented in EX.
  • PharmaSUG China 2021 Paper DS083 provides a detailed discussion of how to present exposure data in a reviewer-friendly manner. It also introduces two new domains from SDTMIG v3.3 — AG (Procedure Agents) and ML (Meals) — which, though not directly related to EC/EX, offer additional context for studies that involve substances administered during procedures or nutritional intake.
---

When to Use the EC and EX Domains

EC (Exposure as Collected) Domain:

  • Use EC when dose modifications such as elimination, hold, delay, reduction, or mid-cycle adjustments are expected due to treatment-related factors (e.g., toxicities in oncology trials).
  • EC is suitable for blinded studies to store collected exposure information until unblinding.
  • EC captures exact details such as missed doses and doses recorded in collected (rather than protocol-specified) units, and it distinguishes planned or scheduled records from those actually administered using the `ECMOOD` variable.
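As an illustration of `ECMOOD`, a planned dose and the administration actually collected on the CRF could be represented as two EC records (a simplified, illustrative sketch; only a few variables are shown and the values are invented):

| STUDYID | USUBJID | ECMOOD | ECDOSE | ECDOSU | ECOCCUR | ECSTDTC |
|---------|---------|-----------|--------|--------|---------|------------|
| ABC123 | 001 | SCHEDULED | 50 | mg | | 2024-01-01 |
| ABC123 | 001 | PERFORMED | 50 | mg | Y | 2024-01-01 |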

EX (Exposure) Domain:

  • Use EX to represent planned exposure that aligns with the study protocol. This includes the administration of investigational products in protocol-specified units.
  • EX captures the actual administered dose after unblinding and can also reflect doses of placebos in clinical trials.
---

Key Takeaways for EC and EX Domain Usage

  • Blinded Studies: EC can capture blinded doses, and once unblinded, the EX domain should reflect the actual doses administered.
  • Dose Modifications: EC captures any variations from planned dosing, including dose holds, eliminations, and adjustments.
  • Missed Doses: Use EC to document missed doses and the reasons for those missed doses using `ECOCCUR` and `SUPPEC` for reasons like adverse events.
  • Protocol-Specified Units: EC can capture doses in collected units (e.g., volume of a dosing solution), while EX converts them into protocol-specified units (e.g., mg/kg).
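For the unit conversion described in the last bullet, a minimal SAS sketch follows. The variable names are hypothetical: ECDOSE is assumed to hold the collected volume in mL, CONC the concentration of the dosing solution in mg/mL, and WEIGHT the body weight in kg, all merged onto one record.

data ex_from_ec;
    set ec_collected;   /* hypothetical merged dataset: collected dose plus concentration and body weight */
    /* convert the collected volume (mL) into the protocol-specified unit (mg/kg) */
    if nmiss(ecdose, conc, weight) = 0 and weight > 0 then do;
        exdose = round((ecdose * conc) / weight, 0.01);
        exdosu = 'mg/kg';
    end;
run;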
---

Introduction to EC and EX Domains

The EC domain captures the exact exposure data as it is collected in the study. This is especially useful when exposure data varies between subjects, such as in cases of dose titrations, interruptions, or other adjustments. The key feature of the EC domain is its ability to reflect actual data, making it indispensable in complex trials where the administration schedule doesn’t always follow the protocol exactly.

For instance, if subjects are receiving doses that are adjusted based on their responses or lab results, or if subjects experience dose interruptions, the EC domain should be used to capture this variability. It provides an accurate picture of what really happened, even if the data does not align with the protocol-specified dose.

Example: Titration or Adjusted Dosing Scenario

In a trial where Drug B’s dose is titrated based on a subject's response, one subject might start at 25 mg and increase to 50 mg after 10 days. Another subject could remain at 25 mg due to adverse events, and a third subject might increase to 75 mg. These variations should be captured in the EC domain.

| STUDYID | USUBJID | ECDOSE | ECDOSU | ECDOSFRM | ECSTDTC | ECENDTC | ECREASND |
|---------|---------|--------|--------|----------|------------|------------|---------------|
| ABC123 | 001 | 25 | mg | Tablet | 2024-01-01 | 2024-01-10 | Titration |
| ABC123 | 001 | 50 | mg | Tablet | 2024-01-11 | 2024-01-14 | |
| ABC123 | 002 | 25 | mg | Tablet | 2024-01-01 | 2024-01-15 | Adverse Event |

When to Use the EX Domain

The EX domain captures the planned exposure based on the study protocol. It is used when the actual exposure follows the protocol as intended. The EX domain should be used for trials where the dosing regimen is straightforward and subjects receive the planned doses at scheduled times.

For example, if a trial protocol specifies that subjects receive 50 mg of Drug A daily for 30 days, and all subjects follow this schedule without any variations, the EX domain can capture this data.

Example: Simple Dosing Scenario

In a study where Drug A is administered in a fixed dose of 50 mg daily, the EX domain captures the planned exposure:

| STUDYID | USUBJID | EXTRT | EXDOSE | EXDOSU | EXROUTE | EXSTDTC |
|---------|---------|--------|--------|--------|---------|------------|
| XYZ456 | 001 | Drug A | 50 | mg | Oral | 2024-02-01 |
| XYZ456 | 002 | Drug A | 50 | mg | Oral | 2024-02-01 |

Using Both EC and EX Domains Together

In some cases, both domains can be used together to represent the planned vs. actual exposure. For instance, the EX domain captures the protocol-specified dose (e.g., 50 mg daily), while the EC domain captures deviations, such as dose interruptions or adjustments. This approach provides a complete picture of the exposure.

Example: Combined Use of EC and EX Domains

In a study where Drug D is administered as 50 mg daily but a subject misses doses due to personal reasons, the EX domain would capture the planned regimen, while the EC domain would record the missed doses.

EX Domain (Planned Dose):
| STUDYID | USUBJID | EXTRT | EXDOSE | EXDOSU | EXROUTE | EXSTDTC |
|---------|---------|--------|--------|--------|---------|------------|
| DEF789 | 001 | Drug D | 50 | mg | Oral | 2024-03-01 |
EC Domain (Actual Doses with Missed Doses):
| STUDYID | USUBJID | ECDOSE | ECDOSU | ECDOSFRM | ECSTDTC | ECENDTC | ECREASND |
|---------|---------|--------|--------|----------|------------|------------|-------------|
| DEF789 | 001 | 50 | mg | Tablet | 2024-03-01 | 2024-03-05 | |
| DEF789 | 001 | 50 | mg | Tablet | 2024-03-07 | 2024-03-30 | Missed Dose |

Additional Considerations for Submission

  • Do not duplicate EC and EX data unless necessary.
  • Use SUPPEC to provide additional reasons for missed or not-given doses in the EC domain.
  • Ensure proper representation of blinded and unblinded data in EC and EX to comply with regulatory expectations.
---

Conclusion

By leveraging the **EC** and **EX** domains appropriately, sponsors can ensure clear traceability of exposure data and provide regulatory reviewers with a complete and accurate story of how subjects were exposed to the study treatment. These domains, when used in tandem, help differentiate between collected and derived data, making it easier for reviewers to assess and understand study results.

Study Start Date in SDTM – Why Getting It Right Matters

The Study Start Date (SSTDTC) is a crucial element in the submission of clinical trial data, especially in meeting regulatory requirements. Since December 2014, the FDA has provided explicit guidance on defining and utilizing this data point, but many sponsors and service providers face challenges in its consistent application. Missteps in defining the Study Start Date can lead to technical rejection during submission reviews, delaying the regulatory process. This article explores the definition, importance, and proper implementation of the Study Start Date in SDTM (Study Data Tabulation Model) submissions, based on regulatory guidance and best practices.

FDA’s Definition of Study Start Date

The FDA, in its 2014 guidance, clarified that the Study Start Date for clinical trials is the earliest date of informed consent for any subject enrolled in the study. This approach ensures a data-driven, verifiable measure, using clinical trial data captured in the Demographics (DM) domain of SDTM. By anchoring the Study Start Date to the earliest informed consent, the FDA avoids manipulation of study timelines and enforces adherence to accepted data standards during study execution.[1]

Why is the Study Start Date Important?

The Study Start Date serves as a cornerstone in the clinical trial submission process. It plays a critical role in:

  • Compliance: The Study Start Date is used as a benchmark to assess the trial’s adherence to FDA’s data standards catalog. Standards implementation is tracked based on this reference date.
  • Consistency: A well-defined Study Start Date ensures uniformity across various study data elements, including SDTM domains, analysis datasets, and associated regulatory documentation.
  • Avoiding Rejections: Incorrect or inconsistent assignment of the Study Start Date can lead to technical rejection by the FDA during submission. Rule 1734, for instance, mandates that the Study Start Date be recorded in the Trial Summary (TS) domain, and failure to meet this criterion can result in a rejection of the submission package.[1]

Where to Record the Study Start Date

Accurate recording of the Study Start Date is essential for successful regulatory submissions. This date must appear in two key sections of the eCTD (electronic Common Technical Document) that are submitted to the FDA:

  1. SDTM TS Domain: The Study Start Date is stored in the variable TS.TSVAL where TSPARMCD = 'SSTDTC'. It must adhere to the ISO 8601 date format (YYYY-MM-DD); a small example record is sketched after this list.
  2. Study Data Standardization Plan (SDSP): The Study Start Date is also included in the SDSP, which contains metadata about the study, along with other relevant details such as the use of FDA-endorsed data standards.[1]
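A minimal sketch of the corresponding TS record (the study identifier and date are illustrative assumptions):

    data ts_sstdtc;
        length studyid domain tsparmcd tsparm tsval $40;
        studyid  = 'ABC123';
        domain   = 'TS';
        tsseq    = 1;
        tsparmcd = 'SSTDTC';
        tsparm   = 'Study Start Date';
        tsval    = '2024-01-01';   /* earliest informed consent date, in ISO 8601 format */
    run;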

Challenges in Defining the Study Start Date

One major challenge in defining the Study Start Date arises from the varied interpretations among stakeholders. For example:

  • Data Managers may consider the go-live date of the data collection system as the Study Start Date.
  • Safety Teams might prioritize the first dose date of the investigational product.
  • Clinical Operations could focus on the date of the first patient visit at the clinic.

However, the correct interpretation, as per FDA guidance, is the earliest informed consent date, ensuring a consistent and regulatory-compliant approach.[1]

How to Verify the Study Start Date

Verifying the Study Start Date requires careful examination of clinical data. The following steps can help:

  1. Examine the DM Domain: The DM.RFICDTC variable (date of informed consent) is cross-checked against the protocol and Statistical Analysis Plan (SAP) to ensure it aligns with enrolled subjects.
  2. Exclude Screen Failures: Screen failures must be excluded from the analysis as they do not contribute to the Study Start Date. Only enrolled subjects should be included in this determination.
  3. Programmatic Check: The following SAS code can be used to programmatically select the earliest informed consent date for the enrolled subjects:
    proc sql noprint;
       select min(rficdtc) into :sstdtc trimmed
       from SDTM.DM
       where rficdtc is not null
         and armcd in ('TRTA', 'TRTB');   /* enrolled subjects only; arm codes are study-specific */
    quit;

    %put Study Start Date (SSTDTC): &sstdtc;

Global Considerations

While the FDA’s definition is clear, other regulatory authorities such as the PMDA in Japan and the NMPA in China have slightly different approaches. For example, the PMDA evaluates standards based on the submission date rather than the Study Start Date. As more global regulators adopt machine-readable clinical data standards, alignment with FDA guidance may serve as a reference point for future harmonization efforts.[1]

Multiple Ways to Derive Study Start Date (SSTDTC) in SDTM

Several SAS-based methods can be used to derive the Study Start Date (SSTDTC) in SDTM. Below are a few approaches:

1. Using PROC SQL

PROC SQL can compute the minimum date directly from the dataset:

proc sql noprint;
    select min(rficdtc) into :mindate
    from SDTM.DM
    where rficdtc is not null and armcd ne 'SCRNFAIL';
quit;

    

2. Using DATA STEP with RETAIN

This approach retains the minimum date as the dataset is processed:

data earliest_consent;
    set SDTM.DM end=eof;
    where rficdtc is not missing and armcd ne 'SCRNFAIL';
    retain min_consent_date;
    if _N_ = 1 then min_consent_date = rficdtc;
    else if rficdtc < min_consent_date then min_consent_date = rficdtc;
    if eof then output;   /* keep only the final record, which holds the overall minimum */
run;

proc print data=earliest_consent;
    var min_consent_date;
run;

3. Using PROC MEANS

PROC MEANS can be used to calculate the minimum date once the character RFICDTC value has been converted to a numeric SAS date:

/* RFICDTC is ISO 8601 text, so convert it to a numeric SAS date before PROC MEANS */
data dm_dates;
    set SDTM.DM(where=(rficdtc is not missing and armcd ne 'SCRNFAIL'));
    rficdt = input(rficdtc, yymmdd10.);
    format rficdt yymmdd10.;
run;

proc means data=dm_dates min noprint;
    var rficdt;
    output out=min_consent_date(drop=_type_ _freq_) min=MinRFICDT;
run;

proc print data=min_consent_date;
run;

4. Using PROC SORT and DATA STEP

This approach involves sorting the dataset and extracting the first record:

proc sort data=SDTM.DM out=sorted_dm;
    by rficdtc;
    where rficdtc is not missing and armcd ne 'SCRNFAIL';
run;

data min_consent_date;
    set sorted_dm;
    if _N_ = 1 then output;
run;

proc print data=min_consent_date;
    var rficdtc;
run;

    

Conclusion

Accurately defining and recording the Study Start Date is essential for compliance with FDA regulations and avoiding submission rejections. Ensuring consistency across study data and metadata is key to a successful clinical trial submission process. Various methods in SAS, including SQL-based and procedural approaches, offer flexible options for deriving the study start date (SSTDTC) in SDTM.

References

Best Practices for Joining Additional Columns into an Existing Table Using PROC SQL

When working with large datasets, it's common to add new columns from another table to an existing table using SQL. However, many programmers encounter the challenge of recursive referencing in PROC SQL when attempting to create a new table that references itself. This blog post discusses the best practices for adding columns to an existing table using PROC SQL and provides alternative methods that avoid inefficiencies.

1. The Common Approach and Its Pitfall

Here's a simplified example of a common approach to adding columns via a LEFT JOIN:

PROC SQL;
CREATE TABLE WORK.main_table AS
SELECT main.*, a.newcol1, a.newcol2
FROM WORK.main_table main
LEFT JOIN WORK.addl_data a
ON main.id = a.id;
QUIT;

While this approach might seem straightforward, it leads to a warning: "CREATE TABLE statement recursively references the target table". This happens because you're trying to reference the main_table both as the source and the target table in the same query. Furthermore, if you're dealing with large datasets, creating a new table might take up too much server space.

2. Best Practice 1: Use a Temporary Table

A better approach is to use a temporary table to store the joined result and then replace the original table. Here’s how you can implement this:

PROC SQL;
   CREATE TABLE work.temp_table AS 
   SELECT main.*, a.newcol1, a.newcol2
   FROM WORK.main_table main
   LEFT JOIN WORK.addl_data a
   ON main.id = a.id;
QUIT;

PROC SQL;
   DROP TABLE work.main_table;
   CREATE TABLE work.main_table AS 
   SELECT * FROM work.temp_table;
QUIT;

PROC SQL;
   DROP TABLE work.temp_table;
QUIT;

This ensures that the original table is updated with the new columns without violating the recursive referencing rule. It also minimizes space usage since the temp_table will be dropped after the operation.
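If creating the second full copy of a very large table is a concern, one alternative (a sketch, not part of the approach above) is to rename the temporary table in place with PROC DATASETS, which only updates metadata:

proc datasets lib=work nolist;
   delete main_table;              /* remove the original table */
   change temp_table=main_table;   /* rename the joined result to the original name */
run; quit;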

3. Best Practice 2: Use ALTER TABLE for Adding Columns

If you're simply adding new columns without updating existing ones, you can use the ALTER TABLE statement in combination with UPDATE to populate the new columns:

PROC SQL;
   ALTER TABLE WORK.main_table
   ADD newcol1 NUM, 
       newcol2 NUM;

   UPDATE WORK.main_table main
   SET newcol1 = (SELECT a.newcol1 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id),
       newcol2 = (SELECT a.newcol2 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id);
QUIT;

This approach avoids creating a new table altogether, and instead modifies the structure of the existing table.

4. Best Practice 3: Consider the DATA Step MERGE for Large Datasets

For very large datasets, a DATA step MERGE can sometimes be more efficient than PROC SQL. The MERGE statement allows you to combine datasets based on a common key variable, as shown below:

PROC SORT DATA=WORK.main_table; BY id; RUN;
PROC SORT DATA=WORK.addl_data; BY id; RUN;

DATA WORK.main_table;
   MERGE WORK.main_table (IN=in1)
         WORK.addl_data (IN=in2);
   BY id;
   IF in1; /* Keep only records from main_table */
RUN;

While some might find the MERGE approach less intuitive, it can be a powerful tool for handling large tables when combined with proper sorting of the datasets.

Conclusion

The best method for joining additional columns into an existing table depends on your specific needs, including dataset size and available server space. Using a temporary table or the ALTER TABLE method can be more efficient in certain situations, while the DATA step MERGE is a reliable fallback for large datasets.

By following these best practices, you can avoid common pitfalls and improve the performance of your SQL queries in SAS.

Comprehensive SDTM Review

Mastering the SDTM Review Process: Comprehensive Insights with Real-World Examples

The process of ensuring compliance with Study Data Tabulation Model (SDTM) standards can be challenging due to the diverse requirements and guidelines that span across multiple sources. These include the SDTM Implementation Guide (SDTMIG), the domain-specific assumptions sections, and the FDA Study Data Technical Conformance Guide. While automated tools like Pinnacle 21 play a critical role in detecting many issues, they have limitations. This article provides an in-depth guide to conducting a thorough SDTM review, enhanced by real-world examples that highlight commonly observed pitfalls and solutions.

1. Understanding the Complexity of SDTM Review

One of the first challenges in SDTM review is recognizing that SDTM requirements are spread across different guidelines and manuals. Each source offers a unique perspective on compliance:

  • SDTMIG domain specifications: Provide detailed variable-level specifications.
  • SDTMIG domain assumptions: Offer clarifications for how variables should be populated.
  • FDA Study Data Technical Conformance Guide: Adds regulatory requirements for submitting SDTM data to health authorities.

Real-World Example: Misinterpreting Domain Assumptions

In a multi-site oncology trial, a programmer misunderstood the domain assumptions for the "Events" domains (such as AE – Adverse Events). The SDTMIG advises that adverse events should be reported based on their actual date of occurrence, but the programmer initially used the visit date, leading to incorrect representation of events.

2. Leveraging Pinnacle 21: What It Catches and What It Misses

Pinnacle 21 is a powerful tool for validating SDTM datasets, but it has limitations:

  • What it catches: Missing mandatory variables, incorrect metadata, and value-level issues (non-conformant values).
  • What it misses: Study-specific variables that should be excluded, domain-specific assumptions that must be manually reviewed.

Real-World Example: Inapplicable Variables Passing Pinnacle 21

In a dermatology study, the variable ARM (Treatment Arm) was populated for all subjects, including those in an observational cohort. Since observational subjects did not receive a treatment, this variable should have been blank. Pinnacle 21 didn’t flag this, but a manual review revealed the issue.
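A programmatic review can surface this kind of issue before a line-by-line read. The sketch below assumes a hypothetical study-specific variable, COHORT, that identifies the observational subjects; it simply lists any such subject with a populated ARM for follow-up:

proc sql;
    select usubjid, armcd, arm
    from sdtm.dm
    where upcase(cohort) = 'OBSERVATIONAL' and not missing(arm);   /* COHORT is a hypothetical flag */
quit;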

3. Key Findings in the Review Process

3.1 General Findings

  • Incorrect Population of Date Variables: Properly populating start and end dates (--STDTC, --ENDTC) is challenging.
  • Missing SUPPQUAL Links: Incomplete or incorrect links between parent domains and SUPPQUAL can lead to misinterpretation.

Real-World Example: Incorrect Dates in a Global Trial

In a global cardiology trial, visit start dates were incorrectly populated due to time zone differences between sites in the U.S. and Europe. A manual review of the date variables identified these inconsistencies and corrected them.
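When the collected date and time parts are reliable, deriving --STDTC directly from them (in the site's local time, as ISO 8601 intends) avoids this kind of inconsistency. A minimal sketch, assuming hypothetical raw variables VISDAT (numeric SAS date) and VISTIM (numeric SAS time):

data vs_dtc;
    set raw_vs;   /* hypothetical raw dataset */
    length vsdtc $19;
    if not missing(visdat) and not missing(vistim) then
        vsdtc = put(dhms(visdat, 0, 0, vistim), e8601dt19.);   /* full date-time, e.g. 2024-01-15T09:30:00 */
    else if not missing(visdat) then
        vsdtc = put(visdat, e8601da10.);                       /* date only, e.g. 2024-01-15 */
run;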

3.2 Domain-Specific Findings

  • Incorrect Usage of Age Units (AGEU): Misuse of AGEU in pediatric studies can lead to incorrect data representation.
  • Inconsistent Use of Controlled Terminology: Discrepancies in controlled terminology like MedDRA or WHO Drug Dictionary can cause significant issues.

Real-World Example: Incorrect AGEU in a Pediatric Study

In a pediatric vaccine trial, the AGEU variable was incorrectly populated with "YEARS" for infants under one year old, when it should have been "MONTHS." This was not flagged by Pinnacle 21 but was discovered during manual review.
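A derivation that switches units for infants can prevent this. A minimal sketch, assuming complete birth dates in DM.BRTHDTC and age computed at the informed consent date (both assumptions; partial dates would need extra handling):

data dm_age;
    set sdtm.dm;
    length ageu $6;
    brthdt = input(brthdtc, yymmdd10.);
    rficdt = input(rficdtc, yymmdd10.);
    if nmiss(brthdt, rficdt) = 0 then do;
        if intck('year', brthdt, rficdt, 'continuous') >= 1 then do;
            age  = intck('year', brthdt, rficdt, 'continuous');   /* completed years */
            ageu = 'YEARS';
        end;
        else do;
            age  = intck('month', brthdt, rficdt, 'continuous');  /* completed months for infants */
            ageu = 'MONTHS';
        end;
    end;
run;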

4. Optimizing the SDTM Review Process

To conduct an effective SDTM review, follow these steps:

  • Review SDTM Specifications Early: Identify potential issues before SDTM datasets are created.
  • Analyze Pinnacle 21 Reports Critically: Don’t rely solely on automated checks—investigate warnings and study-specific variables manually.
  • Manual Domain Review: Ensure assumptions are met and variables are used correctly in specific domains.

5. Conclusion: Building a Holistic SDTM Review Process

By combining early manual review, critical analysis of automated checks, and a detailed review of domain-specific assumptions, programmers can significantly enhance the accuracy and compliance of SDTM datasets. The real-world examples provided highlight how even small errors can lead to significant downstream problems. A holistic SDTM review process not only saves time but also ensures higher data quality and compliance during regulatory submission.

"""
Revolutionizing SDTM Programming in Pharma with ChatGPT

Revolutionizing SDTM Programming in Pharma with ChatGPT

By Sarath

Introduction

In the pharmaceutical industry, standardizing clinical trial data through Study Data Tabulation Model (SDTM) programming is a critical task. The introduction of AI tools like ChatGPT has opened new opportunities for automating and enhancing the efficiency of SDTM programming. In this article, we will explore how ChatGPT can assist programmers in various SDTM-related tasks, from mapping datasets to performing quality checks, ultimately improving productivity and accuracy.

What is SDTM?

SDTM is a model created by the Clinical Data Interchange Standards Consortium (CDISC) to standardize the structure and format of clinical trial data. This model helps in organizing data for submission to regulatory bodies such as the FDA. SDTM programming involves mapping clinical trial datasets to SDTM-compliant formats, ensuring data quality, and validating that the data follows CDISC guidelines.

How ChatGPT Can Enhance SDTM Programming

ChatGPT can be a game-changer in SDTM programming, providing real-time support, automation, and solutions for common challenges. Here’s how it can be applied in various stages of the SDTM process:

  • Assisting with Mapping Complex Datasets: ChatGPT can provide real-time guidance and suggestions for mapping non-standard datasets to SDTM domains, helping programmers to ensure compliance with CDISC guidelines.
  • Generating Efficient SAS Code: ChatGPT can generate optimized SAS code for common SDTM tasks, such as transforming raw datasets, handling missing data, or applying complex business rules to ensure the data meets the regulatory standards.
  • Debugging SAS Code: ChatGPT can assist in identifying bugs, suggesting ways to debug code, and improving code readability with useful tips like employing the PUTLOG statement.
  • Automating Quality Control Checks: Performing quality checks on large datasets is essential in SDTM programming. ChatGPT can automate parts of this process by generating code for missing variable checks, duplicate observations removal, and ensuring that domain-specific rules are followed.
  • Improving Code Readability: By suggesting best practices for writing clear and maintainable SAS code, ChatGPT can help reduce technical debt and make the code easier to review and debug, especially in collaborative settings.
  • Providing Learning Support for New Programmers: For beginners in SDTM programming, ChatGPT can explain complex concepts in simple terms, provide examples, and offer real-time solutions to questions related to SDTM domains, controlled terminology, and regulatory requirements.

Practical Use Cases for ChatGPT in SDTM Programming

Let's look at a few examples where ChatGPT can offer tangible benefits in SDTM programming:

  • Handling the Demographics Domain (DM): ChatGPT can guide programmers through mapping raw datasets to the SDTM DM domain, offering suggestions for handling specific data types like SUBJID, AGE, and SEX. It can also generate SAS code that adheres to CDISC standards and offers tips for validating the resulting data.
  • Generating Define.XML Files: Defining metadata is critical for regulatory submission. ChatGPT can assist by generating SAS code for creating and validating Define.XML files using tools like Pinnacle 21, ensuring compliance with regulatory expectations.
  • Managing Controlled Terminology: Keeping up with the latest controlled terminology versions (e.g., MedDRA, SNOMED, UNII) is essential. ChatGPT can suggest updates for domain-specific controlled terminology and provide SAS code to automate its application in SDTM datasets.

Limitations and Future Potential

While ChatGPT offers significant advantages, there are still some limitations. For instance, it lacks deep integration with SAS or Pinnacle 21, which means that users need to manually adapt ChatGPT’s suggestions to their specific environments. However, the future potential for ChatGPT to evolve into an even more intelligent assistant is immense. As AI technology advances, ChatGPT could become an essential tool for real-time error detection, domain mapping, and automating SDTM processes end-to-end.

Conclusion

ChatGPT has the potential to transform the way SDTM programming is done in the pharmaceutical industry. From guiding new programmers to automating repetitive tasks and assisting with complex coding challenges, this AI tool can significantly improve the efficiency and accuracy of SDTM workflows. As we continue to explore the capabilities of AI, the integration of tools like ChatGPT into programming environments will become an increasingly vital asset for organizations looking to streamline their clinical data management and regulatory submission processes.


Unleashing the Power of PROC DATASETS in SAS

The PROC DATASETS procedure is a versatile and efficient tool within SAS for managing datasets. Often described as the "Swiss Army Knife" of SAS procedures, it allows users to perform a variety of tasks such as renaming, deleting, modifying attributes, appending datasets, and much more, all while consuming fewer system resources compared to traditional data steps. In this article, we’ll explore key use cases, functionality, and examples of PROC DATASETS, illustrating why it should be part of every SAS programmer's toolkit.

1. Why Use PROC DATASETS?

Unlike procedures like PROC APPEND, PROC CONTENTS, and PROC COPY, which focus on specific tasks, PROC DATASETS integrates the functionalities of these procedures and more. By using PROC DATASETS, you avoid the need for multiple procedures, saving both time and system resources since it only updates metadata instead of reading and rewriting the entire dataset.

2. Basic Syntax of PROC DATASETS

The basic structure of PROC DATASETS is as follows:

PROC DATASETS LIBRARY=libref <options>;
    <CHANGE | DELETE | APPEND | MODIFY | RENAME ... statements>;
RUN; QUIT;

Here, you specify the library containing the datasets you want to modify. Commands such as CHANGE, DELETE, APPEND, MODIFY, and RENAME follow within the procedure.

3. Use Case 1: Renaming Datasets and Variables

Renaming datasets and variables is a simple yet powerful capability of PROC DATASETS. Here's an example of how you can rename a dataset:

PROC DATASETS LIBRARY=mylib;
    CHANGE old_data=new_data;
RUN; QUIT;

To rename a variable within a dataset:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    RENAME old_var=new_var;
RUN; QUIT;

4. Use Case 2: Appending Datasets

The APPEND statement is a highly efficient alternative to using SET in a data step because it only reads the dataset being appended (the DATA= dataset), instead of reading both datasets.

PROC DATASETS LIBRARY=mylib;
    APPEND BASE=master_data DATA=new_data;
RUN; QUIT;

5. Use Case 3: Deleting Datasets

Deleting datasets or members within a library is simple with PROC DATASETS. You can delete individual datasets or use the KILL option to remove all members of a library:

PROC DATASETS LIBRARY=mylib;
    DELETE dataset_name;
RUN; QUIT;
PROC DATASETS LIBRARY=mylib KILL;
RUN; QUIT;

6. Use Case 4: Modifying Attributes

You can modify variable attributes such as labels, formats, and informats without rewriting the entire dataset:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    LABEL var_name='New Label';
    FORMAT var_name 8.2;
RUN; QUIT;

7. Advanced Operations with PROC DATASETS

7.1. Working with Audit Trails

You can use PROC DATASETS to manage audit trails, which track changes made to datasets. For instance, the following code creates an audit trail for a dataset:

PROC DATASETS LIBRARY=mylib;
    AUDIT dataset_name;
    INITIATE;
RUN; QUIT;

7.2. Managing Indexes

Indexes help retrieve subsets of data efficiently. You can create or delete indexes with PROC DATASETS:

PROC DATASETS LIBRARY=mylib;
    MODIFY dataset_name;
    INDEX CREATE var_name;
RUN; QUIT;

7.3. Cascading File Renaming with the AGE Command

Another useful feature is the AGE command, which renames a set of files in sequence:

PROC DATASETS LIBRARY=mylib;
    AGE file1-file5;
RUN; QUIT;

Checking If a SAS Dataset is Sorted Using PROC DATASETS

In SAS, datasets often need to be sorted to facilitate various analytical operations. Sorting ensures that records are organized based on one or more variables. However, it’s important to know whether a dataset is already sorted before performing time-consuming operations like PROC SORT. Fortunately, SAS provides an efficient way to check whether a dataset is sorted by using the PROC DATASETS procedure.

Why Use PROC DATASETS to Check Sort Status?

PROC DATASETS is a powerful procedure that can manage and inspect datasets. It allows you to view metadata, including the SORTEDBY attribute, which tells you if the dataset has been sorted and by which variables. This method is faster and more efficient than unnecessarily re-sorting a dataset.

Step-by-Step Example

Let’s walk through an example where we use PROC DATASETS to check whether a dataset is sorted.

Sample SAS Code


/* Step 1: Use PROC DATASETS to inspect the dataset's metadata */
proc datasets lib=work nolist;
  contents data=your_dataset out=sorted_info(keep=name sortedby);
run;
quit;

/* Step 2: Print the output to see the SORTEDBY variable */
proc print data=sorted_info;
run;
    

Code Explanation

  • proc datasets lib=work nolist; - Specifies the library (in this case, WORK) and suppresses the list of files using the NOLIST option.
  • contents data=your_dataset out=sorted_info(keep=name sortedby); - Extracts the metadata for your_dataset and outputs the SORTEDBY information to a dataset named sorted_info.
  • proc print data=sorted_info; - Prints the dataset to view the SORTEDBY information.

Interpreting the Output

The output dataset sorted_info contains one row per variable, with the following columns:

  • Name: The name of the variable.
  • SortedBy: The variable's position in the sort key (1 for the primary sort variable, 2 for the secondary, and so on). A value of 0 means the variable is not part of the sort key.

Example Output

| Name | SortedBy |
|------|----------|
| var1 | 1 |
| var2 | 2 |
| var3 | 0 |

In this case, the dataset is sorted by var1 and then var2. If SortedBy is 0 for every variable, the dataset carries no recorded sort order.
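Once the sort metadata is in a dataset, it can also drive a conditional sort so that PROC SORT only runs when it is actually needed. A minimal sketch, building on the sorted_info dataset created above and assuming we want your_dataset sorted by var1:

data _null_;
    set sorted_info end=eof;
    retain is_sorted 0;
    if upcase(name) = 'VAR1' and sortedby = 1 then is_sorted = 1;   /* var1 already leads the sort key */
    if eof then call symputx('is_sorted', is_sorted);
run;

%macro sort_if_needed;
    %if &is_sorted = 0 %then %do;
        proc sort data=your_dataset;
            by var1;
        run;
    %end;
%mend sort_if_needed;
%sort_if_needed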

Handling Multiple Datasets

If you need to check multiple datasets in a library, you can modify the PROC DATASETS step to inspect all datasets without specifying a particular dataset.


proc datasets lib=work nolist;
  contents data=_all_ out=sorted_info(keep=memname name sortedby) noprint;
run;
quit;

/* Print the sorted_info dataset */
proc print data=sorted_info;
run;
    
Note: The SORTEDBY attribute is only updated when a dataset is sorted using PROC SORT. If variables are added after sorting, or the dataset wasn't sorted explicitly, this attribute might not reflect the current sorting status.

Conclusion

PROC DATASETS is an indispensable tool for SAS programmers. Its efficiency and versatility allow you to manage datasets with ease, from renaming and deleting to appending and modifying attributes. By leveraging its full potential, you can streamline your SAS workflows and significantly reduce processing times.

Advanced SDTM Mapping Pitfalls and How to Avoid Them

Introduction

Mapping clinical data to SDTM domains is a complex process involving many technical and logical challenges. For experienced programmers, common issues often revolve around proper handling of controlled terminology, managing derived variables, ensuring consistency between domains, and maintaining relational integrity. This article explores some of the most common SDTM mapping pitfalls, with advanced solutions and SAS code examples, to help avoid regulatory submission errors.

Pitfall 1: Handling Derived Variables Incorrectly

One of the most common issues in SDTM mapping is incorrectly handling derived variables, which can lead to inaccuracies in key datasets such as EX (Exposure) and VS (Vital Signs).

Example:

A derived variable such as EXDUR (Exposure Duration) in the EX domain may not be properly calculated or may fail to align with the duration of the actual treatment.

Solution:

Use a SAS data step to derive exposure duration based on start and end dates (EXSTDTC and EXENDTC), ensuring proper ISO 8601 formatting and alignment with other datasets.


data ex_derived;
    set ex;
    /* Calculate the difference in days between EXSTDTC and EXENDTC (assumes date-only ISO 8601 values) */
    if not missing(EXSTDTC) and not missing(EXENDTC) then do;
        exdur_days = intck('day', input(EXSTDTC, yymmdd10.), input(EXENDTC, yymmdd10.));
        /* --DUR variables hold ISO 8601 durations, so express the result as e.g. P10D */
        EXDUR = cats('P', exdur_days, 'D');
    end;
run;

proc print data=ex_derived;
    title "Derived EXDUR Based on Start and End Dates";
run;
    

Pitfall 2: Inconsistent Mapping of Medical History (MH) and Adverse Events (AE)

A common challenge is ensuring consistency between the AE (Adverse Events) and MH (Medical History) domains when both involve related clinical data. Programmers often struggle to map similar events across these domains, resulting in duplicated or inconsistent information.

Example:

A medical history event such as "Diabetes" may appear in both the MH and AE domains, but without proper mapping, this can create discrepancies.

Solution:

Create validation checks that ensure consistency between MH and AE, flagging records where conditions in MH appear without an associated AE or where the same event is inconsistently coded.


proc sql;
    create table check_ae_mh as 
    select a.USUBJID, a.MHTERM as Event, b.AETERM as AE_Event
    from mh as a
    left join ae as b
    on a.USUBJID = b.USUBJID and a.MHTERM = b.AETERM
    where missing(b.AETERM);
quit;

proc print data=check_ae_mh;
    title "Medical History Terms Without Corresponding Adverse Events";
run;
    

Pitfall 3: Issues with Handling Visit Windows

Assigning the correct visit window (such as VISITNUM, VISIT, or VISITDY) is critical in SDTM domains such as VS (Vital Signs), LB (Laboratory Tests), and EG (ECG). A common issue is assigning observations to the wrong visit window when the visit dates don’t match the planned visit windows.

Example:

If a patient visit happens on Day 15, but the expected window is Day 14, misalignment in visit windows can occur, leading to incorrect visit assignment.

Solution:

Create visit windowing rules using advanced date logic, ensuring that observations are correctly assigned to the proper visit window.


data vs_visit_window;
    set vs;
    /* planned_day is assumed to be available on each record, e.g. merged from the Trial Visits (TV) domain */
    if abs(VISITDY - planned_day) <= 2 then VISITWINDOW = 'On Time';
    else if VISITDY < planned_day then VISITWINDOW = 'Early';
    else VISITWINDOW = 'Late';
run;

proc print data=vs_visit_window;
    where VISITWINDOW ne 'On Time';
    title "Vital Signs with Early or Late Visits";
run;
    

Pitfall 4: Incorrect Handling of Controlled Terminology (CT)

Another common issue is failing to properly implement and regularly update controlled terminology for variables such as AEDECOD (Adverse Event Dictionary Term) or CMDECOD (Concomitant Medication Dictionary Term). Regulatory submissions often fail due to outdated or incorrect terminology.

Example:

Incorrectly mapped terms in the Concomitant Medications (CM) domain to outdated controlled terminology (e.g., WHO Drug) can result in a failed submission.

Solution:

Use a dynamic lookup to validate the controlled terms against the latest version of the dictionaries, flagging any mismatches or outdated terms.


proc sql;
    create table cm_term_check as 
    select a.*, b.whodrug_term
    from cm as a
    left join whodrug as b
    on a.CMDECOD = b.whodrug_term
    where missing(b.whodrug_term);
quit;

proc print data=cm_term_check;
    title "Mismatched Terms in Concomitant Medications Domain";
run;
    

Pitfall 5: Incorrect RELREC (Related Records) Relationships

Establishing correct relationships between records in different domains is essential when creating the RELREC domain. Misaligned or incomplete relationships can cause inconsistencies in your submission and prevent accurate linking between datasets.

Example:

A common issue is incorrectly linking the AE (Adverse Events) domain to the LB (Laboratory Tests) domain, where the related records aren't properly structured.

Solution:

Validate the relational integrity of records between AE and LB using the RDOMAIN and IDVARVAL variables in the RELREC domain.


proc sql;
    create table relrec_check as 
    select a.*, b.LBSEQ
    from relrec as a
    left join lb as b
    /* IDVARVAL is character, so convert it before comparing it with the numeric LBSEQ */
    on a.USUBJID = b.USUBJID and input(a.IDVARVAL, 8.) = b.LBSEQ
    where a.RDOMAIN = 'LB' and a.IDVAR = 'LBSEQ' and missing(b.LBSEQ);
quit;

proc print data=relrec_check;
    title "Mismatched RELREC Relationships Between AE and LB Domains";
run;
    

Pitfall 6: Handling Variables in Supplemental Qualifiers (SUPPxx)

Supplemental qualifiers (SUPPxx) domains are often used for storing non-standard variables, but incorrect or incomplete linking between the main domain and the SUPPxx domain can lead to issues during submission.

Example:

In the AE domain, supplemental qualifiers such as additional severity information may not be properly linked, causing discrepancies.

Solution:

Check for missing or improperly linked records between AE and SUPPAE domains, ensuring that each supplemental qualifier is accurately referenced.


proc sql;
    create table suppae_check as 
    select a.*, b.AESEQ
    from suppae as a
    left join ae as b
    /* IDVARVAL is character, so convert it before comparing it with the numeric AESEQ */
    on a.USUBJID = b.USUBJID and input(a.IDVARVAL, 8.) = b.AESEQ
    where a.IDVAR = 'AESEQ' and missing(b.AESEQ);
quit;

proc print data=suppae_check;
    title "Unlinked SUPP Qualifiers in AE Domain";
run;
    

Conclusion

Addressing these advanced SDTM mapping pitfalls requires a combination of regular validation, correct handling of controlled terminology, and careful management of relationships across datasets. By implementing automated SAS routines, you can streamline the process and ensure compliance with regulatory standards, preventing costly delays in submission.

Advanced SDTM Programming Techniques for SAS Programmers

As an experienced SAS programmer working with the Study Data Tabulation Model (SDTM), it's crucial to stay updated with the latest programming techniques. Whether you're tasked with building SDTM domains from scratch or optimizing existing code, there are several advanced concepts that can improve your workflows and output. In this post, we’ll explore some techniques that can help you overcome common SDTM challenges and boost efficiency in handling clinical trial data.

1. Efficient Handling of Large Datasets

When dealing with large clinical datasets, speed and efficiency are key. One method to optimize SDTM domain generation is to reduce the data footprint by eliminating unnecessary variables and duplicative observations. Consider the following approaches:

Removing Duplicate Observations

Duplicate records can slow down the processing of datasets and cause inaccuracies in reporting. To remove duplicates, you can use the PROC SQL, DATA STEP, or PROC SORT methods. Here's a quick example using PROC SORT:

proc sort data=mydata nodupkey;
    by usubjid visitnum;
run;

This example ensures that only unique records for each usubjid and visitnum combination are retained, eliminating potential redundancy in your dataset.
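The DATA step alternative mentioned above relies on BY-group processing with FIRST. variables; a minimal sketch:

proc sort data=mydata out=mydata_sorted;
    by usubjid visitnum;
run;

data mydata_nodup;
    set mydata_sorted;
    by usubjid visitnum;
    if first.visitnum;   /* keep only the first record of each usubjid/visitnum group */
run;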

2. Mastering Macro Variables for Flexibility

Utilizing macro variables efficiently can significantly improve your code's flexibility, especially when working across multiple SDTM domains. Macro variables allow you to automate repetitive tasks and reduce the risk of human error. Here’s an example using macro variables to generate domain-specific reports:

%macro create_sdtm_report(domain);
    data &domain._report;
        set sdtm.&domain.;
        /* Apply domain-specific transformations */
    run;
%mend;

%create_sdtm_report(DM);
%create_sdtm_report(LB);

In this case, the macro dynamically generates SDTM reports for any domain by passing the domain name as a parameter, minimizing manual interventions.

3. Managing Demographics with the DC Domain

The Demographics as Collected (DC) domain often presents unique challenges, particularly when distinguishing it from the standard Demographics (DM) domain. While DM represents standardized data, DC focuses on the raw, collected demographic details. Here's an approach to manage these domains efficiently:

data dc_domain;
    set raw_data;
    /* Capture specific collected demographic data */
    where not missing(collected_age) and not missing(collected_gender);
run;

In this case, the code filters out any missing collected data to ensure the DC domain contains only records with complete demographic information.

4. Debugging SDTM Code with PUTLOG

Efficient debugging is crucial, especially when dealing with complex SDTM transformations. The PUTLOG statement in SAS is a simple yet powerful tool for tracking errors and debugging data issues.

data check;
    set sdtm.dm;
    if missing(usubjid) then putlog "ERROR: Missing USUBJID at obs " _n_;
run;

In this example, the PUTLOG statement flags records where the USUBJID variable is missing, making it easier to spot and address data quality issues during SDTM creation.

5. Advanced Array Processing for Repeated Measures

In certain domains like Vital Signs (VS) or Lab (LB), handling repeated measures for patients across multiple visits is common. Using arrays can help streamline this process. Here’s a basic example of using an array to process repeated lab measurements:

data lab_repeated;
    set lb_wide;   /* assumes lab results have been transposed to one row per subject/visit, e.g. lbstresn1-lbstresn3 */
    array lb_vals{3} lbstresn1-lbstresn3;
    do i = 1 to dim(lb_vals);
        if lb_vals{i} = . then putlog "WARNING: Missing numeric lab result at obs " _n_ "element " i;
    end;
run;

This code uses an array to loop through repeated lab test results, ensuring that missing values are flagged for review. Such array-based techniques are essential when processing large, multidimensional datasets in SDTM programming.

6. Best Practices for CDISC Validation and Compliance

To ensure SDTM datasets are CDISC compliant, it’s vital to validate your datasets using tools like Pinnacle 21 or OpenCDISC. These tools check compliance against SDTM standards, flagging any inconsistencies or issues.

Make sure to incorporate validation steps into your workflows regularly. This can be done by running the validation tool after each major dataset creation and by including clear annotations in your programming code to ensure traceability for audits.

Conclusion

Advanced SDTM programming requires a mix of technical expertise and strategic thinking. Whether you are optimizing large datasets, automating repetitive tasks with macros, or ensuring CDISC compliance, staying updated with these advanced techniques will enhance your efficiency and ensure high-quality deliverables in clinical trials. Remember, SDTM programming is not just about writing code—it's about delivering accurate, compliant, and reliable data for critical decision-making.

For more tips and tutorials, check out other PharmaSUG resources or SAS support!

Excluding Variables in PROC COMPARE Using the DROP Dataset Option

When comparing two datasets using PROC COMPARE in SAS, there may be cases where you want to exclude specific variables from the comparison. One efficient way to do this is by using the DROP dataset option. This option allows you to remove certain variables from consideration during the comparison process.

Using the DROP Dataset Option in PROC COMPARE

The DROP dataset option is applied directly to the dataset references in the BASE and COMPARE options of PROC COMPARE. When you use DROP, the specified variables are excluded from the comparison.

Here is the syntax for using the DROP option:

proc compare base=dataset1(drop=var_to_exclude) 
                compare=dataset2(drop=var_to_exclude);
    run;

In this example, var_to_exclude is the name of the variable you want to exclude from both datasets (dataset1 and dataset2) before the comparison is made.

Example: Excluding a Variable from the Comparison

Let's say you have two datasets, sales_2023 and sales_2024, and you want to compare them, but you want to exclude a variable called region from the comparison. Here is how you can do that:

proc compare base=sales_2023(drop=region) 
                compare=sales_2024(drop=region);
    run;

This code ensures that the variable region will not be included in the comparison, while all other variables in the two datasets will be compared.

Excluding Multiple Variables

You can also exclude multiple variables by listing them inside the DROP option, separated by spaces:

proc compare base=sales_2023(drop=region quarter) 
                compare=sales_2024(drop=region quarter);
    run;

In this case, both region and quarter will be excluded from the comparison.

Conclusion

The DROP dataset option is a powerful and simple tool for excluding variables from comparison in PROC COMPARE. It provides flexibility by allowing you to exclude one or more variables without needing to manually specify which variables should be included in the comparison.

By leveraging the DROP option, you can ensure that only relevant variables are considered when comparing datasets, making your SAS programming more efficient and focused.

Mastering PROC SQL: Best Practices for Seamlessly Adding Columns

PROC SQL in SAS is a powerful tool, especially when you need to add new columns to existing tables. Whether you're updating tables with new data or making calculated columns, following best practices ensures your code is efficient, maintainable, and easy to understand.

1. Use ALTER TABLE to Add Columns

The most straightforward way to add a column to a table is by using the ALTER TABLE statement. It allows you to add one or more columns without disrupting the existing data structure.

proc sql;
   alter table your_table_name
   add new_column1 num format=8. label='New Column 1',
       new_column2 char(20) label='New Column 2';
quit;
Tip: Always specify the data type (num for numeric, char(n) for character) and add useful formats and labels to keep your columns clear and organized.

2. Avoid Duplicating Column Names

Before adding columns, check if a column with the same name already exists. SAS will throw an error if you try to add a column that’s already in the table.

proc sql;
   describe table your_table_name;
quit;

Use the DESCRIBE statement to view the structure and ensure no duplicate column names exist.
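If you prefer a programmatic check over reading the DESCRIBE output in the log, the same information is available in the DICTIONARY.COLUMNS metadata view. A minimal sketch, assuming the table lives in the WORK library and the candidate column is new_column1:

proc sql noprint;
   select count(*) into :col_exists trimmed
   from dictionary.columns
   where libname = 'WORK'
     and memname = 'YOUR_TABLE_NAME'
     and upcase(name) = 'NEW_COLUMN1';
quit;

%put NOTE: Columns named NEW_COLUMN1 already in the table: &col_exists;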

3. Efficient Data Insertion

After adding a column, you may need to initialize it with default values. You can do this in a single step using the UPDATE statement to ensure all rows are filled efficiently:

proc sql;
   alter table your_table_name
   add new_column num format=8. label='New Numeric Column';

   update your_table_name
   set new_column = 0; /* Default value */
quit;

4. Use CREATE TABLE for Complex Column Addition

If you need to add columns based on existing data or calculations, it's often better to create a new table. This prevents any unintentional changes to the original table and allows for cleaner manipulation of data:

proc sql;
   create table new_table as
   select existing_column,
          existing_column * 2 as new_column format=8. label='Calculated Column',
          calculated new_column + 1 as adjusted_column
   from your_table_name;
quit;
Pro Tip: Use the CALCULATED keyword to refer to a column computed earlier in the same SELECT (or in the WHERE clause) instead of repeating its expression.

5. Consider Indexing for Performance

If the new column will be used in sorting or joins, adding an index can significantly boost performance. Here’s how to create an index on your new column:

proc sql;
   create index idx_new_column on your_table_name(new_column);
quit;

6. Use Comments for Clarity

Always document your SQL code! Adding comments helps future developers (or your future self) understand the purpose of each column addition:

proc sql;
   alter table your_table_name
   add new_column char(10) label='New Character Column'; /* Added for analysis purposes */
quit;

Final Thoughts

By following these best practices for adding columns via PROC SQL, you can keep your SAS programming efficient and clear. Whether you're maintaining legacy systems or building new reports, these tips will help you seamlessly integrate new data into your tables.

How to Address PROC COMPARE Reporting Same Values as Different in SAS

Working with large datasets in SAS often requires comparing data between two tables. The PROC COMPARE procedure is an essential tool for this task, but sometimes it reports values as different even when they appear to be identical. This issue can arise from various causes, such as numeric precision differences, rounding issues, or formatting inconsistencies. In this post, we will explore common causes of this issue and how to resolve them.

1. Numeric Precision Issues

SAS stores numeric values using floating-point precision, which can lead to small differences that aren't immediately visible. These differences may cause PROC COMPARE to report discrepancies even though the values seem the same.

Solution: Use the CRITERION or FUZZ option to define an acceptable tolerance for differences.


proc compare base=dataset1 compare=dataset2 criterion=0.00001;
run;
    

2. Rounding Differences

If values have been rounded differently in two datasets, PROC COMPARE may detect them as different. For example, one dataset may round to two decimal places, while the other doesn't.

Solution: Apply consistent rounding to both datasets before comparison.


data dataset1_rounded;
    set dataset1;
    value = round(value, 0.01); /* Round to two decimal places */
run;

data dataset2_rounded;
    set dataset2;
    value = round(value, 0.01); /* Same rounding precision */
run;

proc compare base=dataset1_rounded compare=dataset2_rounded;
run;
    

3. Formatting Differences

Sometimes the stored values are identical but different formats are attached to the variables. PROC COMPARE compares the unformatted (stored) values, so format differences appear in the "Variables with Different Attributes" listing rather than as value differences, but they can still clutter the report and obscure real findings.

Solution: Align or remove the formats before comparing, for example by clearing them with a FORMAT statement (shown here with PROC DATASETS so the data are not rewritten).


proc datasets lib=work nolist;
    modify dataset1;
    format _all_;   /* clear all format associations; repeat for dataset2 */
run; quit;
    

4. Character Value Differences (Case Sensitivity and Whitespace)

SAS is case-sensitive when comparing character variables. Leading or embedded whitespace can also cause PROC COMPARE to flag a difference (trailing blanks are ignored in character comparisons).

Solution: Standardize case with UPCASE and remove stray spaces with STRIP or COMPRESS (keep in mind that COMPRESS removes embedded blanks as well, not just leading and trailing ones).


data dataset1_clean;
    set dataset1;
    char_var = strip(upcase(char_var));   /* use compress() instead if embedded blanks should also be removed */
run;

data dataset2_clean;
    set dataset2;
    char_var = strip(upcase(char_var));
run;

proc compare base=dataset1_clean compare=dataset2_clean;
run;
    

5. Handling Different Variable Lengths

Character variables with different lengths may also trigger discrepancies in the comparison.

Solution: Ensure that corresponding variables have the same length in both datasets using LENGTH statements.
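A minimal sketch, assuming a character variable char_var that is $20 in one dataset and $25 in the other; declaring the common (larger) length before the SET statement makes the attributes match:

data dataset1_fixed;
    length char_var $25;   /* declare the larger of the two lengths before SET */
    set dataset1;
run;

data dataset2_fixed;
    length char_var $25;
    set dataset2;
run;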

Conclusion

By addressing issues related to numeric precision, rounding, formatting, and character data, you can reduce or eliminate discrepancies reported by PROC COMPARE in SAS. These solutions ensure more accurate and meaningful comparisons between datasets.

Feel free to leave a comment if you have additional tips or if you’ve encountered other challenges with PROC COMPARE in SAS!

5 Approaches to Identify Blank Columns in SAS Datasets

Author: Sarath

Date: September 5, 2024

When working with large datasets in SAS, it’s crucial to ensure the integrity of your data. One common task is identifying columns that are completely blank, meaning they contain no non-missing values. In this article, we'll explore five different methods to efficiently find blank columns in SAS datasets, along with examples for each approach.

1. Using PROC MEANS or PROC SUMMARY

The simplest way to identify blank numeric columns is by using PROC MEANS or PROC SUMMARY with the N and NMISS statistics: a column is completely blank when its non-missing count (N) is zero, or equivalently when NMISS equals the number of observations. Because these procedures only analyze numeric variables, pair them with PROC FREQ for character variables.


proc means data=your_dataset n nmiss;
    var _numeric_;  /* numeric variables only; PROC MEANS cannot analyze character variables */
run;

/* for character variables, the NLEVELS option reports missing and non-missing levels */
proc freq data=your_dataset nlevels;
    tables _character_ / noprint;
run;
    

This combination gives a concise summary of missing values for every column, numeric and character.

2. Using PROC FREQ

Another method is using PROC FREQ to check the frequency of missing values. This approach is suitable when you want to visually inspect the distribution of missing and non-missing values for each column.


proc freq data=your_dataset;
    tables _all_ / missing;
run;
    

In the output, look for variables with 100% missing values, which indicates that the column is blank.

3. DATA STEP Approach with Array

This approach uses a DATA STEP with arrays to iterate over every variable, keep track of whether each one ever contains a non-missing value, and report the completely blank columns after the last observation has been read. It's a bit more hands-on, but it allows for customization, such as restricting the check to specific variables.


data _null_;
    set your_dataset end=eof;
    array num_vars  {*} _numeric_;    /* all numeric variables */
    array char_vars {*} _character_;  /* all character variables */
    /* temporary flags: 1 = at least one non-missing value seen for that column */
    array num_seen  {999} _temporary_;
    array char_seen {999} _temporary_;
    length vname_ $32;

    do i = 1 to dim(num_vars);
        if num_vars{i} ne . then num_seen{i} = 1;
    end;
    do i = 1 to dim(char_vars);
        if char_vars{i} ne '' then char_seen{i} = 1;
    end;

    /* after the last observation, report any column that never had a value */
    if eof then do;
        do i = 1 to dim(num_vars);
            if num_seen{i} ne 1 then do;
                vname_ = vname(num_vars{i});
                putlog "NOTE: Completely blank numeric column: " vname_;
            end;
        end;
        do i = 1 to dim(char_vars);
            if char_seen{i} ne 1 then do;
                vname_ = vname(char_vars{i});
                putlog "NOTE: Completely blank character column: " vname_;
            end;
        end;
    end;
run;
    

This method is useful if you want to customize the logic based on specific variables or values.

4. Using PROC SQL

With PROC SQL, the COUNT(column) aggregate counts only non-missing values, so a count of zero identifies a completely blank column. The DICTIONARY.COLUMNS metadata view can supply the list of column names if you want to build the query dynamically (the macro in the next section does exactly that), and the SQL syntax will feel familiar to database programmers.


proc sql;
    /* COUNT(column) counts non-missing values; 0 means the column is completely blank */
    select count(your_column1) as n_col1,
           count(your_column2) as n_col2
    from work.your_dataset;
quit;
    

This is a quick and efficient way to find completely blank columns using SQL queries.

5. Using a Macro for Automation

If you need to automate the process of checking blank columns across multiple datasets or variables, writing a macro is the best solution. The macro below reads the variable names from the dataset's metadata and counts the non-missing values of each one; any variable whose count is zero is completely blank.


%macro find_blank_columns(dataset);
    /* get the list of variable names from the dataset's metadata */
    proc contents data=&dataset out=vars(keep=name) noprint; run;

    data _null_;
        set vars end=eof;
        call symputx(cats('var', _n_), name);
        if eof then call symputx('nvars', _n_);
    run;

    /* COUNT(var) returns the number of non-missing values; 0 identifies a blank column */
    proc sql;
        create table blank_column_counts as
        select
        %do i = 1 %to &nvars;
            count(&&var&i) as &&var&i
            %if &i < &nvars %then %do; , %end;
        %end;
        from &dataset;
    quit;

    proc print data=blank_column_counts;
        title "Non-missing counts per column (0 = completely blank)";
    run;
%mend find_blank_columns;

%find_blank_columns(your_dataset);
    

This macro can easily be modified to suit different datasets and reporting requirements, providing flexibility for more advanced users.

Conclusion

In this article, we've covered five different approaches to identifying completely blank columns in SAS datasets. Whether you prefer PROC MEANS, PROC FREQ, PROC SQL, or a customized DATA STEP solution, there’s a method here to suit your needs. For users working with large and complex datasets, automating this process using a macro can save time and ensure data integrity.

Explore these techniques and adapt them to your specific use case to streamline your data validation process in SAS.

Happy coding!

Effortlessly Upcase All Variables in SAS Using PROC DATASETS


When working with SAS datasets, ensuring consistency across variables, especially character variables, can be crucial. A common requirement is to upcase all character variables so that their values appear in uppercase. While several methods exist to achieve this, one of the most efficient and dynamic approaches uses the PROC DATASETS procedure to attach an uppercase display format to every character variable at once. In this article, we will look at how PROC DATASETS works and how you can use it to upcase all character variables in your dataset with minimal effort.

Understanding PROC DATASETS

The PROC DATASETS procedure is primarily used for managing SAS datasets within a library. It allows you to rename, delete, append, modify, and more, without the need to read or copy the data into the Program Data Vector (PDV). This makes it highly efficient, especially when you need to modify dataset attributes without touching the data itself.

For our specific task of upcasing variables, PROC DATASETS is useful because it allows us to apply a format to all character variables at once, without having to manually iterate over each variable. Keep in mind that a format controls how values are displayed and written out by procedures; the stored values themselves are not rewritten.

Step-by-Step: How to Upcase All Character Variables

1. Identify the Dataset

The first step is to identify the dataset that you want to modify. This dataset should already exist in your specified library. For this example, let’s assume our dataset is original_dataset located in the WORK library.

2. Use the MODIFY Statement in PROC DATASETS

To modify a dataset without reading its data, you can use the MODIFY statement inside PROC DATASETS. This step will tell SAS which dataset you want to apply changes to.

3. Apply an Uppercase Format to All Character Variables

Now, the magic of PROC DATASETS lies in its ability to apply formats to variable types globally. By using the FORMAT statement with _character_, you can apply the $upcase. format to every character variable in the dataset.

Complete Code Example

Here is the full SAS code that applies $upcase. format to all character variables:


proc datasets lib=work nolist;
    modify original_dataset;
    format _character_ $upcase.;
run;
quit;
    

Explanation of the Code

  • lib=work: Specifies the library where the dataset is located (in this case, the WORK library).
  • nolist: Suppresses the listing of datasets being modified, keeping the log cleaner.
  • modify original_dataset: Indicates that we want to modify the dataset named original_dataset.
  • format _character_ $upcase.: Applies the $upcase. format to all character variables, so their values are displayed (and written out by reporting procedures) in uppercase. The stored values themselves are not rewritten; if you need the underlying data converted, use one of the DATA step methods covered in the next article.
  • run; and quit;: These statements execute the procedure and exit PROC DATASETS.
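
If you want to convince yourself that the format changes only the displayed value, a minimal check such as the one below can help. It assumes original_dataset contains a character variable named SEX (a hypothetical name); VVALUE returns the value as it would be written with the assigned format, while an explicit format in the PUT statement overrides $UPCASE. and shows the stored value:


data _null_;
    set original_dataset(obs=1);
    length displayed $200;
    displayed = vvalue(sex);                /* formatted value: uppercase because of $UPCASE. */
    put 'Displayed value: ' displayed;
    put 'Stored value:    ' sex $char40.;   /* explicit format overrides the assigned one */
run;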

Advantages of Using PROC DATASETS for Upcasing

There are several advantages to using PROC DATASETS for upcasing character variables:

  • Efficiency: PROC DATASETS modifies the dataset in place without reading the data, making it faster and more efficient, especially for large datasets.
  • Dynamic Application: By using the _character_ keyword, you don’t need to list out each variable manually. It dynamically selects all character variables and applies the upcase format.
  • Minimal Code: Compared to other methods like loops or arrays, PROC DATASETS requires very little code to achieve the same result.
  • Works on Multiple Datasets: You can easily extend the code to handle multiple datasets by adding more MODIFY statements or by wrapping the step in a macro, as sketched just after this list.
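
For example, a small macro sketch along these lines (the dataset names demog, labs, and vitals are placeholders) applies the same format change to a whole list of datasets:


%macro upcase_formats(lib=work, dslist=);
    %local i ds;
    %do i = 1 %to %sysfunc(countw(&dslist));
        %let ds = %scan(&dslist, &i);
        proc datasets lib=&lib nolist;
            modify &ds;
            format _character_ $upcase.;
        run;
        quit;
    %end;
%mend upcase_formats;

%upcase_formats(lib=work, dslist=demog labs vitals);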

Conclusion

Upcasing all character variables in a SAS dataset can be achieved in many ways, but when an uppercase display is all you need, PROC DATASETS offers a streamlined, efficient, and elegant solution. Whether you're dealing with large datasets or want to avoid manually specifying each variable, this method will save you time and effort. Next time you need to perform this task, give PROC DATASETS a try and enjoy its simplicity.

If you have any questions or would like further clarification on using PROC DATASETS in SAS, feel free to leave a comment below!

Mastering SAS: 6 Ways to Upcase All Variables in Your Dataset

How to Upcase All Variables in a SAS Dataset


When working with character data in SAS, you may often need to ensure that all text variables are in uppercase. Fortunately, SAS provides several methods to upcase variables in a dataset, depending on your specific needs. In this article, we will explore various ways to achieve this task and provide code snippets for each method.

1. Using DATA Step with UPCASE Function

The UPCASE function is the simplest way to convert all character variables to uppercase using a DATA step.


data upcase_dataset;
    set original_dataset;
    array char_vars _character_; 
    do i = 1 to dim(char_vars);
        char_vars[i] = upcase(char_vars[i]);
    end;
    drop i;
run;
    

This approach loops through all character variables and applies the UPCASE function to each of them.

2. Using PROC SQL with the UPCASE Function

In this method, PROC SQL is used with an explicit UPCASE call for each character variable. While this approach is less dynamic, it works well for small datasets.


proc sql;
    create table upcase_dataset as
    select upcase(var1) as var1,
           upcase(var2) as var2,
           upcase(var3) as var3
    from original_dataset;
quit;
    

3. Using PROC DATASETS and FORMAT

With PROC DATASETS, you can attach the $UPCASE. display format to all character variables in one go. As noted in the previous article, this changes how the values are displayed rather than the stored values themselves.


proc datasets lib=work nolist;
    modify original_dataset;
    format _character_ $upcase.;
run;
quit;
    

4. Using a Macro with PROC SQL

This dynamic method uses a macro to automatically identify and upcase all character variables in the dataset. This is useful for larger datasets or when the number of character variables is unknown.


%macro upcase_all_vars(ds);
    proc sql noprint;
        select name into :char_vars separated by ' '
        from dictionary.columns
        where libname='WORK' and memname=upcase("&ds") and type='char';
    quit;

    data upcase_dataset;
        set &ds;
        %let count = %sysfunc(countw(&char_vars));
        %do i = 1 %to &count;
            %let var = %scan(&char_vars, &i);
            &var = upcase(&var);
        %end;
    run;
%mend;

%upcase_all_vars(original_dataset);
    

5. Using PROC FCMP

This approach involves creating a custom function using PROC FCMP and then applying it across all character variables.


proc fcmp outlib=work.functions.dataset_utils;
    function upcase_all(var $) $ 200;  /* $ 200 sets the returned length; adjust to your longest value */
        return (upcase(var));
    endsub;
run;

options cmplib=work.functions;

data upcase_dataset;
    set original_dataset;
    array char_vars _character_;
    do i = 1 to dim(char_vars);
        char_vars[i] = upcase_all(char_vars[i]);
    end;
    drop i;
run;
    

6. Using HASH Object

A more advanced, and admittedly more contrived, approach is to load the data into a SAS hash object and walk it with a hash iterator, upcasing each character variable as the rows are retrieved. The hash object does not make the upcasing itself any faster; it is mainly useful when you also want an in-memory copy of the data for lookups. The sketch below assumes the dataset contains a variable named ID that uniquely identifies each row.


data upcase_dataset;
    /* Define the host variables so the hash can map the dataset variables onto the PDV */
    if 0 then set original_dataset;

    /* Load the data into a hash object keyed on ID (a hypothetical unique identifier) */
    declare hash h(dataset: 'original_dataset', ordered: 'yes');
    h.defineKey('id');
    h.defineData(all: 'yes');
    h.defineDone();
    declare hiter it('h');

    array char_vars _character_;

    /* Walk the hash, upcase every character variable, and write each row out */
    rc = it.first();
    do while (rc = 0);
        do i = 1 to dim(char_vars);
            char_vars[i] = upcase(char_vars[i]);
        end;
        output;
        rc = it.next();
    end;
    drop i rc;
    stop;
run;
    

Conclusion

Each of the methods discussed above has its own strengths, depending on the size of the dataset and whether the variables are known in advance. Whether you're working with small or large datasets, SAS offers a variety of ways to upcase all variables. By selecting the most appropriate approach, you can simplify your workflow and ensure consistency across your data.

If you have any questions or need further assistance, feel free to leave a comment below!

Dynamic Macro Creation in SAS: Enhancing Automation and Flexibility


Dynamic macro creation is a powerful technique in SAS that allows you to generate macro variables and macros based on data content or logic at runtime. This not only simplifies repetitive tasks but also provides a way to dynamically control program flow. In this article, we’ll cover various scenarios and provide multiple examples where dynamic macro creation can be beneficial.

Why Use Dynamic Macros?

  • Automation: Automate repetitive processes by generating macro variables based on dataset values.
  • Flexibility: Dynamic macros adjust based on the changing content of your datasets.
  • Efficient Code: Using dynamic macros reduces redundancy and ensures that your code adapts to different data structures without manual intervention.

Scenario 1: Generating Macros Based on Dataset Variables

Imagine a scenario where you have a dataset and need to dynamically create macros to store variable names or their values. This is useful for automating variable processing tasks, such as generating reports, manipulating data, or performing analyses.

Example 1: Creating Macros for Variable Names

/* Create one macro variable for each variable (column) in the dataset */
proc sql noprint;
   select name
   into :var1-
   from dictionary.columns
   where libname='SASHELP' and memname='CLASS';
quit;

%macro display_vars;
   %do i=1 %to &sqlobs;
      %put &&var&i;
   %end;
%mend display_vars;

%display_vars;

    

Explanation: This code reads the variable names of the sashelp.class dataset from the DICTIONARY.COLUMNS metadata table and stores them in the macro variables var1, var2, and so on. The macro display_vars then prints each variable name, allowing flexible processing of variables without knowing them in advance.

Scenario 2: Automating Data Processing Based on Unique Values

Let’s say you have multiple categories or groups within your data, and you need to run a set of analyses or create reports for each unique group. You can dynamically generate macros for each category, making the process scalable and automated.

Example 2: Creating Macros for Unique Categories

/* Create macros for each unique 'sex' category in the dataset */
proc sql noprint;
   select distinct sex
   into :sex1-
   from sashelp.class;
quit;

%macro analyze_sex;
   %do i=1 %to &sqlobs;
      proc print data=sashelp.class;
         where sex = "&&sex&i";
         title "Listing for Sex = &&sex&i";
      run;
   %end;
%mend analyze_sex;

%analyze_sex;

    

Explanation: This example dynamically creates a macro for each unique sex value and applies a filter in PROC PRINT for each group, generating reports for each distinct value in the dataset.

Scenario 3: Dynamically Generating Conditional Code

In some cases, you need to execute different code based on the values of certain variables or the content of a dataset. Dynamic macro creation helps generate conditional logic on the fly.

Example 3: Conditional Code Generation Based on Data Content

/* Identify numeric variables in the dataset and generate macro code */
proc sql noprint;
   select name
   into :numvar1-
   from dictionary.columns
   where libname='SASHELP' and memname='CLASS' and type='num';
quit;

%macro analyze_numeric_vars;
   %do i=1 %to &sqlobs;
      proc means data=sashelp.class;
         var &&numvar&i;
         title "Analysis of &&numvar&i";
      run;
   %end;
%mend analyze_numeric_vars;

%analyze_numeric_vars;

    

Explanation: This code identifies numeric variables in the dataset and dynamically creates macro variables for each. The macro analyze_numeric_vars runs PROC MEANS for each numeric variable, adjusting to any changes in the dataset structure.

Scenario 4: Dynamic Report Generation

Dynamic macros are helpful in generating dynamic reports or exporting data where the structure or content changes frequently. You can use dynamic macros to control file names, report titles, or export paths.

Example 4: Dynamic Report Titles

/* Generate macros for each unique 'name' value and create reports */
proc sql noprint;
   select distinct name
   into :name1-
   from sashelp.class;
quit;

%macro create_reports;
   %do i=1 %to &sqlobs;
      proc print data=sashelp.class;
         where name = "&&name&i";
         title "Report for &&name&i";
      run;
   %end;
%mend create_reports;

%create_reports;

    

Explanation: This code dynamically creates a report for each unique name in the dataset, with the report’s title reflecting the name being processed. The code adapts to changes in the dataset, automating the report generation process.
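
The same idea extends to dynamic file names and export paths. As a rough sketch (the /outputs folder and the use of PROC EXPORT are illustrative additions, not part of the original example), you could write one CSV file per name using the macro variables created above:

%macro export_reports;
   %do i = 1 %to &sqlobs;
      proc export data=sashelp.class(where=(name="&&name&i"))
                  outfile="/outputs/report_&&name&i..csv"
                  dbms=csv replace;
      run;
   %end;
%mend export_reports;

%export_reports;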

Best Practices for Dynamic Macro Creation

  • Use PROC SQL with INTO Clauses: This is the most efficient way to generate dynamic macro variables from dataset content.
  • Limit the Number of Macro Variables: Macro variables live in memory, so avoid generating an excessive number of them (for example, one per row of a very large dataset); aggregate or filter first where possible.
  • Use &sqlobs: The &sqlobs macro variable is useful for counting the number of records or unique values, ensuring the loop runs the correct number of times.
  • Avoid Hardcoding: Whenever possible, rely on dynamic macros instead of hardcoding variable names or values to make your code flexible and adaptable.
  • Error Handling: Build in checks so that loops are skipped gracefully when a query returns no rows or an expected macro variable was not created; a minimal guard is sketched after this list.
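
As a minimal sketch of such a guard (the macro name safe_loop and the &&var&i variables are illustrative, assuming they were created by a preceding PROC SQL INTO clause), check &sqlobs before looping:

%macro safe_loop;
   %if &sqlobs = 0 %then %do;
      %put WARNING: The previous query returned no rows, so there is nothing to process.;
      %return;
   %end;
   %do i = 1 %to &sqlobs;
      %put Processing &&var&i;
   %end;
%mend safe_loop;

%safe_loop;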

Conclusion

Dynamic macro creation in SAS provides a robust and flexible way to automate repetitive tasks, process data efficiently, and adjust code dynamically based on dataset content. By generating macros based on variables or values within a dataset, you can create dynamic, scalable solutions for various SAS programming challenges.

Comparing VISIT and VISITNUM Values Across SDTM Datasets and the TV Domain

Extracting and Comparing Unique VISIT and VISITNUM Values from SDTM Datasets

Author: [Your Name]

Date: [Creation Date]

In clinical trials, the VISIT and VISITNUM variables are key identifiers for subject visits. Ensuring that all datasets have consistent visit data and that it aligns with the planned visits recorded in the TV (Trial Visits) domain is crucial for accurate data analysis. This post presents a SAS macro that automates the extraction of unique VISIT and VISITNUM values across all SDTM datasets in a library and compares them to those found in the TV domain.

Program Overview

The SAS macro program:

  • Extracts unique VISIT and VISITNUM values from all SDTM datasets in the specified library.
  • Compares these values against those recorded in the TV domain.
  • Highlights any discrepancies between the SDTM datasets and the TV domain.

Macro Code

Here’s the macro that performs the task:

%macro compare_visit(libname=);

    /* Step 1: Get the unique VISIT and VISITNUM values from the TV domain */
    proc sql;
        create table tv_visit as
        select distinct VISIT, VISITNUM 
        from &libname..TV
        where VISIT is not missing and VISITNUM is not missing;
    quit;

    /* Step 2: Get the list of datasets in the library containing both VISIT and VISITNUM */
    proc sql noprint;
        select memname 
        into :dslist separated by ' '
        from sashelp.vcolumn
        where libname = upcase("&libname")
          and upcase(name) in ('VISIT', 'VISITNUM')
        group by memname
        having count(distinct name) = 2; /* Ensure both VISIT and VISITNUM are present */
    quit;

    /* Step 3: Check if any datasets were found */
    %if &sqlobs = 0 %then %do;
        %put No datasets in &libname contain both VISIT and VISITNUM variables.;
    %end;
    %else %do;
        %put The following datasets contain both VISIT and VISITNUM variables: &dslist;

        /* Initialize an empty dataset for combined VISIT and VISITNUM values */
        data combined_visits;
            length Dataset $32 VISIT $200 VISITNUM 8;
            stop;
        run;

        /* Step 4: Loop through each dataset */
        %let ds_count = %sysfunc(countw(&dslist));
        %do i = 1 %to &ds_count;
            %let dsname = %scan(&dslist, &i);

            /* Extract unique VISIT and VISITNUM values, excluding UNSCHEDULED visits */
            proc sql;
                create table visit_&dsname as
                select distinct "&&dsname" as Dataset, VISIT, VISITNUM
                from &libname..&&dsname
                where VISIT is not missing and VISITNUM is not missing
                  and VISIT not like 'UNSCH%'; /* Exclude UNSCHEDULED visits */
            quit;

            /* Append to the combined dataset */
            proc append base=combined_visits data=visit_&dsname force;
            run;
        %end;

        /* Step 5: Compare combined VISIT/VISITNUM with TV domain */
        proc sql;
            create table visit_comparison as
            select a.*, b.Dataset as In_SDTM_Dataset
            from tv_visit a
            left join combined_visits b
            on a.VISIT = b.VISIT and a.VISITNUM = b.VISITNUM
            order by VISITNUM, VISIT;
        quit;

        /* Step 6: Display the comparison results */
        proc print data=visit_comparison;
        title "Comparison of VISIT/VISITNUM between TV and SDTM Datasets (Excluding Unscheduled Visits)";
        run;
    %end;

%mend compare_visit;

/* Run the macro by specifying your SDTM library name */
%compare_visit(libname=sdtm);

How the Macro Works

This macro performs the following steps:

  1. It first extracts all unique VISIT and VISITNUM values from the TV domain.
  2. It then identifies all datasets in the specified library that contain the VISIT and VISITNUM variables by querying the metadata table SASHELP.VCOLUMN.
  3. For each identified dataset, the macro extracts the distinct VISIT and VISITNUM values and appends them into a consolidated dataset.
  4. Finally, it compares the combined results from the SDTM datasets against the values in the TV domain and displays any discrepancies.

Use Case

This macro is especially useful when checking that all actual visits recorded in the SDTM datasets align with the planned visits documented in the TV domain. Ensuring consistency between these values is essential for accurate clinical trial reporting and analysis.

Example of Use:

%compare_visit(libname=sdtm);

In this example, the macro will search for VISIT and VISITNUM variables in the SDTM datasets located in the sdtm library and compare them with the values in the TV domain.

Conclusion

By automating the process of extracting and comparing VISIT and VISITNUM values, this macro simplifies what could otherwise be a tedious and error-prone task. It ensures that all visit data is consistent and complete, aligning the planned and actual visits in the SDTM datasets.

Feel free to adapt this macro to meet your specific needs in clinical trials data management!

Disclosure:

In the spirit of transparency and innovation, I want to share that some of the content on this blog is generated with the assistance of ChatGPT, an AI language model developed by OpenAI. While I use this tool to help brainstorm ideas and draft content, every post is carefully reviewed, edited, and personalized by me to ensure it aligns with my voice, values, and the needs of my readers. My goal is to provide you with accurate, valuable, and engaging content, and I believe that using AI as a creative aid helps achieve that. If you have any questions or feedback about this approach, feel free to reach out. Your trust and satisfaction are my top priorities.