Tuesday, December 17, 2024

Quick Tip: See SAS Dataset Labels Without Opening the Data

When working with SAS datasets, checking the dataset label without actually opening the data can be very useful. Whether you're debugging or documenting, this small trick can save you time and effort!

1. Use PROC CONTENTS

PROC CONTENTS is the most common and straightforward way to view the dataset label.

proc contents data=yourlib.yourdataset;
run;

The dataset label will appear in the output as the field: Data Set Label.

2. Query DICTIONARY.TABLES or SASHELP.VTABLE

For a programmatic approach, use the DICTIONARY.TABLES table or SASHELP.VTABLE view to query dataset metadata.

Example Using PROC SQL

proc sql;
   select memname, memlabel 
   from dictionary.tables
   where libname='YOURLIB' and memname='YOURDATASET';
quit;

Example Using SASHELP.VTABLE

proc print data=sashelp.vtable;
   where libname='YOURLIB' and memname='YOURDATASET';
   var memname memlabel;
run;

Both methods will show the dataset's name and its label.

3. Use PROC DATASETS

For advanced users, PROC DATASETS can display dataset attributes, including labels:

proc datasets library=yourlib;
   contents data=yourdataset;
run;
quit;

Why Is This Helpful?

  • Quickly check dataset metadata without loading or viewing the data.
  • Useful for documenting datasets during large projects.
  • Helpful in automation scripts for SAS programming (see the sketch below).
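For the automation use case, here is a minimal sketch (library, dataset, and macro variable names are placeholders) that reads the label into a macro variable so later steps can reuse it, for example in a TITLE statement:

proc sql noprint;
   select memlabel into :dslabel trimmed
   from dictionary.tables
   where libname='YOURLIB' and memname='YOURDATASET';
quit;

%put NOTE: Dataset label is: &dslabel;

title "Listing for %superq(dslabel)";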

Conclusion

Using these methods, you can easily view SAS dataset labels without opening the data. Whether you prefer PROC CONTENTS, querying metadata tables, or PROC DATASETS, the choice depends on your workflow.

Happy coding! If you found this tip useful, don’t forget to share it with your fellow SAS programmers.

Looking for more SAS programming tricks? Stay tuned for more posts on Rupee Stories!

Thursday, December 12, 2024

Detecting Non-Printable Characters in SDTM Datasets Using SAS

Non-printable characters in datasets can lead to errors and inconsistencies, especially in the highly regulated environment of clinical trials. This blog post demonstrates how to create a SAS program that identifies non-printable characters in all SDTM datasets within a library and generates a comprehensive report.

Why Detect Non-Printable Characters?

Non-printable characters, such as ASCII values below 32 or above 126, can cause issues during data validation, regulatory submissions, and downstream processing. Detecting them early ensures the quality and compliance of your SDTM datasets.

The SAS Program

The following SAS program processes all SDTM datasets in a library and generates a combined report of non-printable characters, including:

  • Dataset name: The dataset where the issue exists.
  • Variable name: The variable containing non-printable characters.
  • Row number: The row where the non-printable character is found.
  • Positions: The exact position(s) of non-printable characters with their ASCII values.

Program Code

%macro find_nonprintable_all(libname=, output=);

    /* Get a list of all datasets in the library */
    proc sql noprint;
        select memname into :dsetlist separated by ' '
        from dictionary.tables
        where libname = upcase("&libname");
    quit;

    /* Create a combined report */
    data &output.;
        length _dataset $32 _varname $32 _rownum 8 _position_list $500;
        retain _dataset _varname _rownum _position_list;
        call missing(_dataset, _varname, _rownum, _position_list);
        stop;
    run;

    /* Loop through each dataset and find non-printable characters */
    %let count = %sysfunc(countw(&dsetlist));
    %do i = 1 %to &count;
        %let dset = %scan(&dsetlist, &i);

        data temp_report;
            set &libname..&dset.;
            /* Declare the array before adding the report variables so that
               _character_ resolves to the dataset's character variables only */
            array charvars {*} _character_;
            length _dataset $32 _varname $32 _rownum 8 _position_list $500;
            _dataset = "&dset";

            do i = 1 to dim(charvars);
                _varname = vname(charvars[i]);
                _rownum = _n_;
                _position_list = '';

                /* Check each character in the variable */
                do j = 1 to length(charvars[i]);
                    charval = substr(charvars[i], j, 1);
                    ascii_val = rank(charval);

                    /* Flag non-printable characters (outside ASCII 32-126) */
                    if ascii_val < 32 or ascii_val > 126 then do;
                        _position_list = catx(', ', _position_list,
                                              "Position=" || strip(put(j, best.)) ||
                                              " (ASCII=" || strip(put(ascii_val, best.)) || ")");
                    end;
                end;

                /* Output if any non-printable characters are found */
                if not missing(_position_list) then output;
            end;

            drop i j charval ascii_val;
        run;

        /* Append to the combined report */
        proc append base=&output. data=temp_report force;
        run;

        /* Clean up temporary dataset */
        proc datasets lib=work nolist;
            delete temp_report;
        quit;
    %end;

%mend;

/* Example usage */
%find_nonprintable_all(libname=sdtm, output=nonprintable_combined_report);

/* Review the combined report */
proc print data=nonprintable_combined_report noobs;
    title "Non-Printable Characters Report for All Datasets in the Library";
run;
        

How It Works

The program processes each dataset in the specified library, examines all character variables for non-printable characters, and records their positions in a combined report.

Output

The final report contains the following columns:

  • _dataset: Name of the dataset.
  • _varname: Name of the variable.
  • _rownum: Row number.
  • _position_list: Details of non-printable character positions and ASCII values.

Conclusion

Using this SAS program, you can proactively identify and address non-printable characters in SDTM datasets, ensuring data integrity and compliance. Feel free to adapt this program for your specific needs.

Friday, December 6, 2024

SDTM aCRF Annotation Checklist

By Sarath Annapareddy

Introduction

Creating an SDTM Annotated Case Report Form (aCRF) is a critical step in clinical trial data submission. It ensures that data collected in the CRF maps correctly to SDTM domains, adhering to regulatory and CDISC standards. This checklist serves as a guide to creating a high-quality SDTM aCRF ready for regulatory submission.

1. General Formatting

  • Ensure the aCRF uses the latest SDTM IG version relevant to the study.
  • The document should be clean, legible, and free of overlapping annotations.
  • Page numbers in the aCRF should align with the actual CRF pages.
  • Annotations must be in English, clear, and consistently formatted.
  • Use color coding to differentiate domain mappings, derived variables, and special-purpose annotations.

2. Domain-Level Annotations

  • Annotate each field on the CRF with the corresponding SDTM variable name (e.g., AGE, LBTEST).
  • Ensure every field includes an appropriate domain prefix (e.g., DM, AE).
  • Unmapped fields should be labeled with "Not Mapped" or "NM".
  • Ensure proper usage of variable cases (e.g., all uppercase for SDTM variable names).

3. Data Collection Fields

  • Map demographic fields (e.g., SEX, RACE) to the `DM` domain.
  • Adverse event fields (e.g., event name, severity) should map to the `AE` domain.
  • Laboratory test results and units should map to the `LB` domain.
  • Exposure data (e.g., drug start/stop dates) must align with the `EX` domain.
  • Use the `DV` domain for protocol deviations and the `MH` domain for medical history.

4. Special-Purpose Domains

  • SUPPQUAL should be used for non-standard variables.
  • RELREC annotations are required for defining relationships between domains.
  • Free-text comments should map to the `CO` domain.
  • Trial design fields should map to domains like `TA`, `TV`, and `TS`.

5. Derived and Computed Variables

  • Derived variables must be clearly labeled (e.g., "DERIVED" or "CALCULATED").
  • Ensure annotations for variables like BMI reference all contributing fields (e.g., height and weight).
  • Visit variables (e.g., VISITNUM, VISITDY) should align with RFSTDTC.

6. Date and Time Variables

  • All date fields must follow ISO 8601 format (e.g., AESTDTC, EXSTDTC).
  • Derived date variables like VISITDY should be calculated relative to RFSTDTC.

7. Validation and Quality Control

  • Validate the aCRF against the finalized SDTM datasets.
  • Ensure alignment with the Define.xml document.
  • Conduct reviews by the programming and data management teams.
  • Perform a completeness check to ensure no fields are left unannotated.

8. Regulatory Submission Readiness

  • Ensure compliance with the requirements of regulatory authorities (e.g., FDA, PMDA).
  • Submit the aCRF in a searchable, bookmarked PDF format.
  • Verify that all color-coded annotations are visible in grayscale for printed versions.
  • Include a cover page with the study title, protocol number, and version.

Conclusion

A well-annotated SDTM aCRF is crucial for successful regulatory submissions. By following this checklist, you can ensure your aCRF meets compliance requirements and demonstrates traceability between the CRF, datasets, and Define.xml. This meticulous process not only ensures regulatory approval but also enhances the credibility of your clinical trial data.

© 2024 Rupee Stories. All rights reserved.

Common P21 SDTM Compliance Issues and How to Resolve Them

By Sarath Annapareddy

Introduction

Pinnacle 21 (P21) is a cornerstone for validating SDTM datasets against CDISC standards. Its checks ensure compliance with regulatory requirements set by the FDA, PMDA, and other authorities. However, resolving the issues flagged by P21 can be challenging, especially for beginners. This post dives into common P21 compliance issues and provides actionable solutions with examples.

1. Missing or Invalid Controlled Terminology

Issue: P21 flags variables like LBTESTCD or SEX as non-compliant with CDISC Controlled Terminology (CT). This happens when values in your datasets are outdated or invalid.

Solution: Update your CT files regularly from CDISC’s website. Use validation scripts to cross-check your datasets against the CT list.

Example:

data lab_final;
    /* Both input datasets must be sorted by LBTESTCD before this merge */
    merge lab_data (in=a)
          cdisc_ct (in=b);
    by LBTESTCD;
    if a and not b then put "WARNING: Invalid value for LBTESTCD=" LBTESTCD;
run;
            

This code validates lab data against CDISC Controlled Terminology and flags invalid entries.

2. Missing Required Variables

Issue: P21 highlights missing essential variables such as USUBJID, DOMAIN, and VISITNUM. Missing these variables can result in non-compliance.

Solution: Create validation macros in SAS to check for the presence of required variables. Always refer to the SDTM IG for domain-specific requirements.

Example:

%macro check_vars(dataset, vars, nreq);
    proc sql noprint;
        select count(*)
        into :nfound trimmed
        from dictionary.columns
        where libname="WORK" and memname=upcase("&dataset")
              and upcase(name) in (&vars);
    quit;
    %if &nfound < &nreq %then %put ERROR: Required variables missing from &dataset!;
%mend;
%check_vars(lab_data, %str("USUBJID","DOMAIN","VISITNUM"), 3);
            

3. Inconsistent Dates and Timestamps

Issue: Non-compliance with ISO 8601 date format is a recurring issue. Variables such as AESTDTC or VISITDY may have incorrect formats or incomplete components.

Solution: Convert dates to ISO format during mapping and ensure consistent formats across datasets using SAS functions like PUT and INPUT.

Example:

data ae_final;
    set ae_raw;
    if not missing(AESTDT) then AESTDTC = put(AESTDT, E8601DA.);
run;
            

4. Duplicate Records

Issue: Duplicate records are flagged when unique combinations of keys (like USUBJID and VISITNUM) appear multiple times in a domain.

Solution: Implement deduplication techniques in SAS and ensure proper use of keys during dataset creation.

Example:

/* SV (Subject Visits) is naturally keyed by USUBJID and VISITNUM */
proc sort data=sv nodupkey;
    by USUBJID VISITNUM;
run;
            

5. Incomplete Traceability

Issue: P21 flags issues when derived variables or supplemental qualifiers lack proper traceability.

Solution: Clearly document derivations in your dataset specifications and use RELREC or SUPPQUAL datasets for maintaining traceability.

Example:

data suppae;
    set ae;
    where AESER = "Y";
    length RDOMAIN $2 IDVAR $8 IDVARVAL $200 QNAM $8 QLABEL $40 QVAL $200;
    RDOMAIN = "AE";
    IDVAR = "AESEQ";
    IDVARVAL = strip(put(AESEQ, best.));
    QNAM = "AESER";
    QLABEL = "Serious Event";
    QVAL = AESER;
    keep STUDYID RDOMAIN USUBJID IDVAR IDVARVAL QNAM QLABEL QVAL;
run;
            

6. Inconsistent Metadata

Issue: P21 reports mismatches between Define.xml and dataset metadata.

Solution: Automate Define.xml generation using tools like Pinnacle 21 Enterprise. Manually cross-check metadata during QC.
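As a starting point for the manual cross-check, the sketch below compares variable labels in the SDTM library against a dataset assumed to hold the Define.xml variable metadata (here called define_meta with columns MEMNAME, NAME, and LABEL; all names are placeholders):

proc sql;
    create table meta_mismatch as
    select a.memname, a.name,
           a.label as dataset_label,
           b.label as define_label
    from dictionary.columns as a
         inner join define_meta as b
             on upcase(a.memname) = upcase(b.memname)
            and upcase(a.name) = upcase(b.name)
    where a.libname = 'SDTM'
      and strip(a.label) ne strip(b.label);
quit;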

7. Invalid Links in RELREC

Issue: RELREC relationships do not align with the protocol-defined data structure.

Solution: Double-check all relationships during dataset creation and validate RELREC against its source domains.
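One practical check is to flag RELREC rows that have no matching record in their source domain. The sketch below assumes AE relationships keyed by AESEQ; library and dataset names are placeholders:

proc sql;
    create table relrec_orphans as
    select r.*
    from sdtm.relrec as r
    where r.rdomain = 'AE'
      and r.idvar = 'AESEQ'
      and not exists
          (select 1
           from sdtm.ae as a
           where a.usubjid = r.usubjid
             and strip(put(a.aeseq, best.)) = strip(r.idvarval));
quit;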

Conclusion

Resolving P21 compliance issues requires both a strategic approach and practical programming skills. By addressing these common problems, you can ensure your datasets are regulatory-compliant, saving time and avoiding costly re-submissions.

© 2024 Rupee Stories. All rights reserved.

Tuesday, December 3, 2024

Comprehensive QC Checklist for Define.xml and cSDRG

Ensuring Quality and Compliance for FDA and PMDA SDTM Submissions

Introduction

The **Define.xml** and **Clinical Study Data Reviewer’s Guide (cSDRG)** are critical components of SDTM submissions to regulatory agencies like the FDA and PMDA. These documents help reviewers understand the structure, content, and traceability of the datasets submitted. A robust QC process ensures compliance with agency requirements, minimizes errors, and enhances submission success. This blog outlines a detailed manual QC checklist for both Define.xml and cSDRG, emphasizing key differences between FDA and PMDA requirements.

Define.xml QC Checklist

1. Metadata Verification

  • Verify all datasets listed in Define.xml are included in the submission package.
  • Check that all variable metadata (e.g., variable names, labels, types, and lengths) matches the SDTM datasets.
  • Ensure consistency between controlled terminology values and the CDISC Controlled Terminology files.
  • Confirm all mandatory fields (e.g., Origin, Value Level Metadata, Comments) are correctly populated.

2. Controlled Terminology

  • Ensure variables like AEDECOD, LBTESTCD, and CMTRT align with the latest CDISC Controlled Terminology.
  • Check NCI Codelist codes for correctness and proper linkage to variables.
  • Verify that SUPPQUAL domains reference appropriate `QNAM` and `QVAL` values.

3. Links and Traceability

  • Ensure all hyperlinks in Define.xml (e.g., links to codelists, Value Level Metadata, and external documents) are functional.
  • Verify traceability for derived variables to source data or algorithms.

4. Value Level Metadata

  • Check that Value Level Metadata is used for variables with differing attributes (e.g., QVAL in SUPPQUAL).
  • Validate metadata application to specific values, ensuring alignment with dataset content.

5. Technical Validation

  • Run Define.xml through Pinnacle 21 or a similar validation tool to identify errors or warnings.
  • Validate XML structure against the CDISC Define-XML schema (e.g., UTF-8 encoding).

6. Documentation

  • Ensure accurate descriptions in the Comments section for clarity and traceability.
  • Check consistency between Define.xml and cSDRG descriptions.

cSDRG QC Checklist

1. Content Consistency

  • Ensure alignment with Define.xml in terms of datasets, variables, and controlled terminology.
  • Verify consistency with CDISC guidelines for SDRG structure and content.

2. Document Structure

  • Ensure all required sections are present:
    • Study Design Overview
    • Dataset-Specific Considerations
    • Traceability and Data Processing
    • Controlled Terminology
  • Verify the inclusion of Acronyms and Abbreviations.

3. Dataset-Level Review

  • Check that all datasets referenced in cSDRG are included in the Define.xml and the submission package.
  • Verify clear descriptions of dataset-specific issues (e.g., imputed values, derived variables).

4. Traceability and Data Processing

  • Ensure documentation of traceability from raw data to SDTM datasets.
  • Validate derivation rules for key variables.

5. Controlled Terminology

  • Ensure controlled terminology usage aligns with Define.xml.
  • Document any deviations or extensions to standard controlled terminology.

6. Reviewer-Focused Content

  • Provide explanations for unusual scenarios (e.g., partial/missing dates, adverse event relationships).
  • Tailor descriptions to a reviewer’s perspective for clarity and usability.

7. Formatting and Usability

  • Ensure consistent fonts, headings, and numbering throughout the document.
  • Verify hyperlinks and table of contents functionality in the PDF format.

FDA vs. PMDA Considerations

While FDA and PMDA share many requirements, there are some critical differences:

  • Encoding: FDA requires UTF-8; PMDA also requires UTF-8, with particular attention to Japanese character encoding.
  • Validation Tools: FDA submissions are checked with Pinnacle 21 Community/Enterprise; PMDA uses Pinnacle 21 with PMDA-specific rules.
  • Trial Summary (TS): FDA focuses on mandatory fields; PMDA places greater emphasis on PMDA-specific fields.
  • Language: FDA accepts English; PMDA works in English and Japanese.

Conclusion

Ensuring high-quality Define.xml and cSDRG documents is crucial for successful regulatory submissions to FDA and PMDA. Adhering to the detailed QC checklists outlined above will help identify and address issues early, saving time and reducing the risk of rejection. Tailoring your approach to the specific requirements of each agency ensures a smooth review process and enhances submission success rates.

Differences in Data Quality Checks Between FDA and PMDA

Introduction

Submitting SDTM datasets to regulatory authorities like the FDA (U.S. Food and Drug Administration) and PMDA (Japan's Pharmaceuticals and Medical Devices Agency) involves rigorous data quality checks. While both agencies adhere to CDISC standards, their submission guidelines and expectations differ in certain aspects. This blog explores the key differences in data quality checks for FDA and PMDA submissions.

Similarities in Data Quality Checks

Both FDA and PMDA share several common expectations for SDTM datasets:

  • Adherence to CDISC Standards: Both agencies require compliance with the SDTM Implementation Guide (SDTM-IG).
  • Controlled Terminology (CT): Variables such as AEDECOD and LBTESTCD must align with CDISC CT.
  • Traceability: Ensures that derived datasets and analysis results can be traced back to the raw data.
  • Define.xml Validation: Both agencies expect a complete and validated Define.xml file for metadata documentation.

Differences in Data Quality Checks

The FDA and PMDA have distinct preferences and requirements that need careful attention.

Aspect-wise Comparison

  • Validation Tools: FDA primarily uses Pinnacle 21 Community or Enterprise for validation, with emphasis on "Reject" and "Error" findings. PMDA relies on Pinnacle 21, but PMDA-specific validation rules are stricter, with additional checks on Japanese language and character encoding (e.g., UTF-8).
  • Validation Rules: FDA focuses on U.S.-specific regulatory rules and requires adherence to SDTM-IG versions commonly used in the U.S. PMDA requires alignment with Japanese-specific validation rules, with more emphasis on Trial Summary (TS) and demographic consistency.
  • Trial Summary (TS) Domain: FDA expects a complete TS domain but is less stringent on content beyond mandatory fields. PMDA places greater importance on the TS domain, especially for regulatory codes specific to Japan.
  • Japanese Subjects: FDA places less emphasis on Japanese-specific requirements. PMDA requires additional checks for Japanese subjects, such as proper handling of kanji characters.

1. Data Validation and Tools

FDA:

  • Relies on specific validation tools like Pinnacle 21 Community/Enterprise to check data compliance.
  • FDA has stringent validator rules listed in their Study Data Technical Conformance Guide.
  • Focus is on ensuring conformance to CDISC standards such as SDTM, ADaM, and Define.xml.

PMDA:

  • Uses a custom validation framework with a focus on Study Data Validation Rules outlined in PMDA guidelines.
  • PMDA also emphasizes conformance but requires additional steps for documenting electronic data submissions.

2. Submission File Formats and Organization

FDA:

  • Requires datasets in SAS Transport Format (.xpt).
  • Submission files need to adhere to the eCTD format.
  • Technical specifications like split datasets (e.g., DM datasets with large record counts) need clear organization.

PMDA:

  • Aligns with the same .xpt requirement but often asks for additional metadata and dataset-specific documentation.
  • Detailed instructions on submission through the PMDA Gateway System.
  • PMDA requires notification submissions and extensive Q&A clarifications on data contents.

3. Controlled Terminologies and Dictionaries

FDA:

  • Requires compliance with the latest MedDRA and WHODrug versions.
  • MedDRA coding consistency is emphasized for all terms and values.

PMDA:

  • Accepts MedDRA and WHODrug but requires detailed mapping between collected data and coded terms.
  • Has additional checks for Japanese coding conventions and translations.

4. Define.xml

FDA:

  • Emphasizes alignment between dataset variables, labels, and metadata.
  • Requires accurate representations of origins (e.g., CRF, Derived).

PMDA:

  • Additional scrutiny on variable origins and alignment with Japanese electronic standards.
  • PMDA often requires clarifications for variables derived from external sources or referenced across multiple studies.

5. Reviewer’s Guides (cSDRG and ADRG)

FDA:

  • Provides guidance through templates like the cSDRG and ADRG.
  • Focus on study-level explanations for data inconsistencies, derivations, and non-standard elements.

PMDA:

  • Requires more detailed explanations in cSDRG and ADRG, especially regarding:
    • Variables annotated as Not Submitted.
    • Handling of adjudication or screen failure data.

6. Data Quality Focus

FDA:

  • Prioritizes ensuring datasets conform to the FDA Technical Specifications.
  • Consistency across study datasets within a submission is critical.

PMDA:

  • Prioritizes consistency between variables and detailed documentation of derivations.
  • More focused on mapping between raw data and analysis-ready datasets.

7. Study Tagging Files (STF)

FDA:

  • Requires STF to categorize and link datasets, programs, and metadata documents in the submission.

PMDA:

  • Similar to the FDA but emphasizes alignment between the STF and Japanese Gateway system submission requirements.

Regulatory Submission Context

Historical Context: The FDA and PMDA have embraced CDISC standards to enhance global harmonization, ensuring data transparency and reproducibility in clinical trials.

Key Objectives: Both agencies aim to ensure data integrity, accuracy, and traceability, facilitating efficient review processes and better regulatory oversight.

Specific Guidance from FDA and PMDA

FDA: The FDA emphasizes adherence to the Study Data Technical Conformance Guide and Data Standards Catalog to align submissions with their expectations.

PMDA: PMDA focuses on their Notifications on Electronic Study Data and their FAQs for addressing specific queries regarding Japanese regulatory requirements.

Operational Challenges

  • Language Considerations: Handling multi-language data, such as English and Japanese, introduces encoding and translation challenges, particularly for kanji characters.
  • Validation Tools Usage: Differences in Pinnacle 21 Community vs. Enterprise versions can create discrepancies in validation reports.

Lessons from Common Errors

Data Compliance Errors: Issues such as incomplete Define.xml, inconsistent controlled terminology, and incorrect TS domain entries are common pitfalls.

Mitigation Strategies: Conduct comprehensive pre-submission reviews, cross-checking both FDA and PMDA guidelines to preempt rejections.

Summary of Key Considerations

  • Validation Tools: FDA relies on Pinnacle 21; PMDA applies PMDA-specific validation rules.
  • Submission System: FDA uses eCTD; PMDA uses the PMDA Gateway.
  • Focus: FDA emphasizes conformance to CDISC standards; PMDA emphasizes metadata and mapping clarifications.
  • Dictionaries: FDA requires MedDRA/WHODrug; PMDA requires MedDRA/WHODrug plus Japanese translations.
  • Define.xml: FDA focuses on CRF origins and labels; PMDA expects additional variable origin documentation.
  • Reviewer’s Guide: FDA looks for general inconsistencies and derivations; PMDA looks for non-standard elements and adjudication.

Conclusion

While FDA and PMDA share a common foundation in CDISC standards, their data quality expectations have nuanced differences. Understanding these distinctions is critical for ensuring smooth submissions. By tailoring your SDTM programming and validation processes to address these unique requirements, you can enhance your submission success rate and streamline regulatory review.

Advanced SDTM Programming Tips

Streamline Your SDTM Development with Expert Techniques

Tip 1: Automating SUPPQUAL Domain Creation

The SUPPQUAL (Supplemental Qualifiers) domain can be automated using SAS macros to handle additional variables in a systematic way. Refer to the macro example in the dedicated SUPPQUAL automation post on this blog to simplify your SUPPQUAL generation process.

Tip 2: Handling Date Imputation

Many SDTM domains require complete dates, but raw data often contains partial or missing dates. Use the following code snippet for date imputation:

                
data imputed_dates;
    set raw_data;
    /* DATE is assumed to be a character ISO 8601 value (e.g., 2024-05 or 2024);
       write the imputed result to a new, long-enough character variable */
    length date_full $10;
    /* Impute missing day to the first day of the month */
    if length(strip(date)) = 7 then date_full = cats(date, '-01');
    /* Impute missing month and day to January 1st */
    else if length(strip(date)) = 4 then date_full = cats(date, '-01-01');
    else date_full = strip(date);
run;
                
            

Tip: Always document the imputation logic and ensure it aligns with the study protocol.

Tip 3: Dynamic Variable Label Assignment

Avoid hardcoding labels when creating SDTM domains. Use metadata-driven programming for consistency:

                
data AE;
    set raw_ae;
    attrib
        AESTDTC label="Start Date/Time of Adverse Event"
        AEENDTC label="End Date/Time of Adverse Event";
run;
                
            

Tip: Store labels in a metadata file (e.g., Excel or CSV) and read them dynamically in your program.
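As a rough illustration of that metadata-driven approach, the sketch below assumes a metadata dataset named var_meta with columns DOMAIN, VARNAME, and VARLABEL (for example, imported from the specification spreadsheet) and builds the LABEL statement dynamically:

proc sql noprint;
    select catx('=', varname, quote(strip(varlabel)))
        into :label_list separated by ' '
    from var_meta
    where upcase(domain) = 'AE';
quit;

proc datasets lib=work nolist;
    modify ae;
        label &label_list;
    run;
quit;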

Tip 4: Efficient Use of Pinnacle 21 Outputs

Pinnacle 21 validation reports can be overwhelming. Focus on the following key areas:

  • Major Errors: Address structural and required variable issues first.
  • Traceability: Ensure SUPPQUAL variables and parent records are linked correctly.
  • Controlled Terminology: Verify values against the CDISC CT library to avoid deviations.

Tip: Use Excel formulas or conditional formatting to prioritize findings in Pinnacle 21 reports.

Tip 5: Debugging Complex Mapping Issues

When debugging mapping logic, use PUTLOG statements strategically:

                
data SDTM_AE;
    set raw_ae;
    if missing(AEDECOD) then putlog "WARNING: Missing AEDECOD for USUBJID=" USUBJID;
run;
                
            

Tip: Use PUTLOG with conditions to reduce unnecessary log clutter.

Tip 6: Mapping RELREC Domain

The RELREC domain is used to define relationships between datasets. Automate its creation using a data-driven approach:

                
data RELREC;
    set parent_data;
    length RDOMAIN $2 IDVAR $8 IDVARVAL $200 RELID $8;
    /* One RELID per related record pair; output one RELREC row per side of the link */
    RELID = cats("REL", _n_);
    RDOMAIN = "AE"; IDVAR = "AESEQ"; IDVARVAL = strip(put(AESEQ, best.)); output;
    RDOMAIN = "CM"; IDVAR = "CMSEQ"; IDVARVAL = strip(put(CMSEQ, best.)); output;
    keep STUDYID USUBJID RDOMAIN IDVAR IDVARVAL RELID;
run;
                
            

Tip: Validate RELREC with Pinnacle 21 to ensure all relationships are correctly represented.

Tip 7: Using PROC DATASETS for Efficiency

Leverage PROC DATASETS for efficient dataset management:

                
                
proc datasets lib=work nolist;
    modify AE;
        label AESTDTC = "Start Date/Time of Adverse Event"
              AEENDTC = "End Date/Time of Adverse Event";
    run;
quit;
                
            

Tip: Use PROC DATASETS to modify attributes like labels, formats, and lengths without rewriting the dataset.

Tip 8: Deriving Epoch Variables

EPOCH is a critical variable in SDTM domains, representing the study period during which an event occurred. Automate its derivation as follows:

                
data AE;
    set AE;
    length EPOCH $10;
    /* TRTSDTC/TRTEDTC are assumed to have been merged on from the treatment reference dates;
       ISO 8601 character dates of equal precision compare correctly as text */
    if AESTDTC >= TRTSDTC and AESTDTC <= TRTEDTC then EPOCH = "TREATMENT";
    else if AESTDTC < TRTSDTC then EPOCH = "SCREENING";
    else if AESTDTC > TRTEDTC then EPOCH = "FOLLOW-UP";
run;
                
            

Tip: Ensure EPOCH values are consistent with the study design and align with other SDTM domains like EX and SV.

Tip 9: Validating VISITNUM and VISIT Variables

VISITNUM and VISIT are critical for aligning events with planned visits. Use a reference table for consistency:

                
proc sql;
    create table validated_data as
    select a.*, b.VISIT
    from raw_data a
    left join visit_reference b
    on a.VISITNUM = b.VISITNUM;
quit;
                
            

Tip: Cross-check derived VISITNUM and VISIT values against the Trial Design domains (e.g., TV and TA).

Tip 10: Generating Define.XML Annotations

Define.XML is a crucial deliverable for SDTM datasets. Use metadata to dynamically create annotations:

                
data define_annotations;
    set metadata;
    xml_annotation = cats("<ItemDef OID='IT.", name, "' Name='", name, 
                          "' Label='", label, "' DataType='", type, "'/>");
run;

proc print data=define_annotations noobs; run;
                
            

Tip: Validate the Define.XML file using tools like Pinnacle 21 or XML validators to ensure compliance.

Written by Sarath Annapareddy | For more SDTM tips, stay tuned!

Advanced SDTM Programming Tip: Automating SUPPQUAL Domain Creation

Optimize Your SDTM Workflows with Efficient Automation Techniques

Introduction to SUPPQUAL Automation

The SUPPQUAL (Supplemental Qualifiers) domain is used to store additional information that cannot fit within a standard SDTM domain. Manually creating the SUPPQUAL domain can be time-consuming and error-prone, especially for large datasets. In this article, we’ll explore an advanced tip to automate its creation using SAS macros.

Use Case: Adding Supplemental Qualifiers to a Domain

Imagine you have an SDTM AE domain (Adverse Events) and need to capture additional details like the investigator’s comments or assessment methods that are not part of the standard AE domain.

Code Example: Automating SUPPQUAL Domain

                
/* Macro to Create SUPPQUAL Domain */
%macro create_suppqual(domain=, idvar=, qnam_list=);
    %let domain_upper = %upcase(&domain);
    %let suppqual = SUPP&domain_upper;

    data &suppqual;
        set &domain;
        length RDOMAIN $8 IDVAR $8 QNAM $8 QLABEL $40 QVAL $200;
        array qvars{*} &qnam_list;
        do i = 1 to dim(qvars);
            if not missing(qvars{i}) then do;
                RDOMAIN = "&domain_upper";
                USUBJID = USUBJID;
                IDVAR = "&idvar";
                IDVARVAL = &idvar;
                QNAM = vname(qvars{i});
                QLABEL = put(QNAM, $40.);
                QVAL = strip(put(qvars{i}, $200.));
                output;
            end;
        end;
        drop i &qnam_list;
    run;

    /* Sort SUPPQUAL for submission readiness */
    proc sort data=&suppqual;
        by USUBJID RDOMAIN IDVAR IDVARVAL QNAM;
    run;
%mend;

/* Example Usage: Automating SUPPAE */
%create_suppqual(domain=AE, idvar=AETERM, qnam_list=AECOMMENT AEASSESS);
                
            

Explanation of the Code

  • RDOMAIN: Captures the parent domain name (e.g., AE).
  • array qvars{*}: Iterates through the list of supplemental qualifiers provided as macro parameters.
  • IDVAR: Represents the key variable in the parent domain (e.g., AETERM).
  • QLABEL: Automatically assigns a label to the qualifier variable.
  • QVAL: Stores the actual value of the supplemental qualifier.

Advantages of This Approach

  • Eliminates manual effort in creating SUPPQUAL domains.
  • Highly reusable and scalable across different domains.
  • Ensures consistency in handling supplemental qualifiers.

Pro Tip: Validation and Quality Control

Always validate the output SUPPQUAL dataset against CDISC compliance rules using tools like Pinnacle 21. Ensure that all required columns and relationships are correctly populated.

Written by Sarath Annapareddy | For more SDTM tips, stay connected!

Advanced SAS Programming Tip: Using HASH Objects

Unlock the Power of SAS for Efficient Data Manipulation

Introduction to HASH Objects

In SAS, HASH objects provide an efficient way to perform in-memory data lookups and merge operations, especially when dealing with large datasets. Unlike traditional joins using PROC SQL or the MERGE statement, HASH objects can significantly reduce computational overhead.

Use Case: Matching and Merging Large Datasets

Suppose you have two datasets: a master dataset containing millions of records and a lookup dataset with unique key-value pairs. The goal is to merge these datasets without compromising performance.

Code Example: Using HASH Objects

                
/* Define the master and lookup datasets */
data master;
    input ID $ Value1 $ Value2 $;
    datalines;
A001 X1 Y1
A002 X2 Y2
A003 X3 Y3
;
run;

data lookup;
    input ID $ LookupValue $;
    datalines;
A001 L1
A002 L2
A003 L3
;
run;

/* Use HASH object to merge datasets */
data merged;
    if _n_ = 1 then do;
        /* Make LookupValue available in the PDV before defining the hash data item */
        if 0 then set lookup;
        declare hash h(dataset: "lookup");
        h.defineKey("ID");
        h.defineData("LookupValue");
        h.defineDone();
    end;

    set master;
    if h.find() = 0 then output;
run;

/* Display the merged data */
proc print data=merged;
run;
                
            

Explanation of the Code

  • declare hash h: Creates a HASH object and loads the lookup dataset into memory.
  • h.defineKey: Specifies the key variable (ID) for the lookup.
  • h.defineData: Identifies the variable to retrieve from the lookup dataset.
  • h.find(): Searches for a match in the HASH object and retrieves the data if found.

Advantages of HASH Objects

  • Faster lookups compared to traditional joins, especially with large datasets.
  • In-memory operations reduce I/O overhead.
  • Provides greater flexibility for advanced operations.

Written by Sarath Annapareddy | For more SAS tips, stay tuned!

Advanced SAS Programming Tip: Mastering Macro Variables

Unleash the power of SAS with this advanced technique.

Introduction

Macro variables are a powerful tool in SAS that allow you to dynamically generate code. By understanding and effectively using macro variables, you can write more efficient and flexible SAS programs.

The Basics of Macro Variables

A macro variable is a placeholder that is replaced with its value during macro processing. You define a macro variable using the %LET statement and reference it by prefixing its name with an ampersand (for example, &varname).

Advanced Techniques

1. Conditional Logic

You can use the %IF-%THEN-%ELSE statements to create conditional logic within your macro code. This allows you to dynamically generate code based on specific conditions.
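A minimal sketch of conditional macro logic (the macro name and parameters are illustrative):

%macro report(data=, print_flag=Y);
    %if %upcase(&print_flag) = Y %then %do;
        proc print data=&data (obs=10);
        run;
    %end;
    %else %do;
        %put NOTE: Printing suppressed for &data;
    %end;
%mend report;

%report(data=sashelp.class, print_flag=N);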

2. Iterative Processing

The %DO loop can be used to iterate over a range of values or a list of items. This is useful for repetitive tasks, such as generating multiple datasets or reports.
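For example, a %DO loop can generate the same step for a series of datasets (the yearly dataset names below are hypothetical):

%macro print_years(start=, end=);
    %do yr = &start %to &end;
        title "Listing for study year &yr";
        proc print data=work.visits_&yr (obs=5);
        run;
    %end;
    title;
%mend print_years;

%print_years(start=2021, end=2023);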

3. Custom Macro Functions

You can create your own custom macro functions to encapsulate complex logic and reuse it throughout your code. This can help to improve code readability and maintainability.
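A classic example is a small function-style macro that returns the number of observations in a dataset, sketched below:

%macro nobs(ds);
    %local dsid n rc;
    %let dsid = %sysfunc(open(&ds));
    %let n = %sysfunc(attrn(&dsid, nlobs));
    %let rc = %sysfunc(close(&dsid));
    &n
%mend nobs;

%put NOTE: sashelp.class has %nobs(sashelp.class) observations.;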

Example: Dynamically Generating SQL Queries

Here's a simple example of how to use macro variables to dynamically generate SQL queries:

%let table_name = my_data;
%let where_clause = age > 30;

proc sql;
    select *
    from &table_name
    where &where_clause;
quit;

Conclusion

By mastering macro variables, you can take your SAS programming skills to the next level. Experiment with these techniques to create more powerful and efficient SAS programs.

© Sarath

Tuesday, November 19, 2024

Understanding CMENRTPT vs CMENRF in SDTM

By Sarath

Introduction

When working with the Concomitant Medication (CM) domain in SDTM, it's crucial to understand how timing variables like CMENRTPT (End Relative to Reference Time Point) and CMENRF (End Relative to Reference Period) differ and when to use each.

What is CMENRTPT?

CMENRTPT indicates the relationship between the end of the medication and a specific time point, such as the start of treatment or a significant event (e.g., surgery).

  • Controlled Terminology: BEFORE, AFTER, ONGOING, COINCIDENT
  • Example: If a medication was stopped before surgery, CMENRTPT = "BEFORE".

What is CMENRF?

CMENRF describes whether the medication ended before, during, or after a specific study period, such as Screening, Treatment, or Follow-up.

  • Controlled Terminology: BEFORE, DURING, AFTER
  • Example: If a medication ended during the treatment phase, CMENRF = "DURING".

Key Differences

  • Focus: CMENRTPT describes the relationship to a specific time point; CMENRF describes the relationship to a study period.
  • Granularity: CMENRTPT is more precise; CMENRF is broader.
  • Typical Use Case: CMENRTPT links to an event such as surgery; CMENRF contextualizes the end of the medication within a study phase.

Example Scenario

Scenario: A medication was stopped two days after surgery, which occurred during the screening phase (see the sketch below).

  • CMENRTPT: AFTER (relative to surgery)
  • CMENRF: BEFORE (relative to the start of the treatment phase)
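A minimal sketch of how this scenario might be represented in the CM domain (values are illustrative only):

data cm_example;
    length USUBJID $20 CMTRT $40 CMENTPT $40 CMENRTPT $10 CMENRF $10;
    USUBJID = "STUDY01-001";
    CMTRT = "MEDICATION A";
    CMENTPT = "SURGERY";    /* reference time point paired with CMENRTPT */
    CMENRTPT = "AFTER";     /* medication ended after the surgery time point */
    CMENRF = "BEFORE";      /* medication ended before the treatment period */
run;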

Conclusion

By understanding the differences between CMENRTPT and CMENRF, clinical programmers can ensure accurate and standardized representation of concomitant medication timing in SDTM datasets. Always refer to the study protocol and CDISC standards to select the appropriate variable for your data.

© 2024 Sarath. All rights reserved.

Resolving the SAS EG Transcoding Error

Addressing the "Character Data Lost During Transcoding" Issue in SAS EG

Author: Sarath

Date: November 19, 2024

Introduction

While working in SAS Enterprise Guide (SAS EG), you may encounter the error: "Some character data was lost during transcoding in the dataset." This issue typically arises when character data contains unsupported characters or is truncated due to insufficient column lengths. In this blog post, we'll explore the root causes and provide step-by-step solutions.

Common Causes

  • Unsupported Characters: The data contains special or non-ASCII characters not representable in the session encoding.
  • Truncation: Character variables are too short to store the full data, leading to loss of information.
  • Encoding Mismatch: The dataset's encoding differs from the SAS session's encoding.

Step-by-Step Solutions

1. Check Encoding

Identify the encoding of your SAS session and dataset:

proc options option=encoding; run;
proc contents data=tempdata.cm; run;

2. Identify Problematic Characters

Review a sample of the dataset to locate non-representable or truncated characters:

proc print data=tempdata.cm (obs=50); run;

3. Use a Compatible Encoding

Adjust the encoding for your session or dataset. For example, specify UTF-8 if working with multilingual data:

libname tempdata 'path-to-data' inencoding='utf-8';

4. Increase Column Lengths

Expand the length of character variables to avoid truncation:

data tempdata.cm;
    set tempdata.cm;
    length newvar $200; /* Adjust length */
    newvar = oldvar;
run;

5. Transcode the Dataset

Convert the dataset into a compatible encoding:

libname tempdata_in 'path-to-input-data' inencoding='utf-8';
libname tempdata_out 'path-to-output-data' outencoding='utf-8';

data tempdata_out.cm;
    set tempdata_in.cm;
run;

6. Modify Encoding with PROC DATASETS

Repair the dataset’s encoding directly:

proc datasets lib=tempdata;
    modify cm / correctencoding='utf-8';
quit;

7. Clean the Data

Handle non-printable or invalid characters using functions like KCOMPRESS or KSTRIP.
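For example, COMPRESS with the 'k' (keep) and 'w' (printable) modifiers retains only printable characters; the dataset and variable names below are placeholders:

data tempdata.cm_clean;
    set tempdata.cm;
    /* Keep only printable characters in a cleaned copy of the variable */
    cmtrt_clean = compress(cmtrt, , 'kw');
run;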

Best Practices

  • Ensure consistent encoding between your data sources and SAS session.
  • Use UTF-8 encoding for handling multilingual data.
  • Allocate sufficient column lengths for character variables during data transformation.

Conclusion

Resolving transcoding errors in SAS EG requires identifying the root cause and applying the appropriate solution. By following the steps outlined above, you can ensure your character data is correctly processed without loss or truncation.

Have questions or insights? Share them in the comments below!

Sunday, November 3, 2024

Advanced SAS Programming Techniques for SDTM Implementation

Date: November 3, 2024

In the realm of clinical trials data management, SDTM (Study Data Tabulation Model) implementation requires sophisticated programming techniques to ensure data accuracy and compliance. This article explores advanced SAS programming methods that can streamline SDTM dataset creation and validation.

1. Efficient Variable Derivation Using Hash Objects

Hash objects in SAS provide a powerful way to perform quick lookups and merges, especially useful when dealing with large SDTM datasets.

data work.ae;
    if _n_ = 1 then do;
        /* Define the hash host variables in the PDV before loading the hash */
        if 0 then set sdtm.dm(keep=usubjid age sex race);
        declare hash h_dm(dataset: "sdtm.dm");
        h_dm.definekey("usubjid");
        h_dm.definedata("age", "sex", "race");
        h_dm.definedone();
    end;
    set raw.ae;
    rc = h_dm.find();
    /* Continue processing */
run;

Pro Tip: Hash objects remain in memory throughout the DATA step, making them more efficient than traditional merge operations for large datasets.

2. Standardizing Controlled Terminology with Format Catalogs

Creating and maintaining CDISC-compliant terminology is crucial for SDTM implementation.

proc format library=library.sdtm_formats;
    value $severity
        'MILD'     = 'MILD'
        'MOD'      = 'MODERATE'
        'MODERATE' = 'MODERATE'
        'SEV'      = 'SEVERE'
        'SEVERE'   = 'SEVERE'
        other      = 'UNKNOWN';
run;

data sdtm.ae;
    set work.ae;
    aesev = put(raw_severity, $severity.);
run;

3. Macro Systems for Dynamic SDTM Generation

Developing reusable macro systems can significantly improve efficiency and reduce errors in SDTM implementation.

%macro create_supp(domain=, vars=);
    proc sql noprint;
        select distinct usubjid, &vars
            into :subjids separated by ',',
                 :values separated by ','
        from sdtm.&domain;
    quit;

    data sdtm.supp&domain;
        set sdtm.&domain(keep=usubjid &vars);
        length qnam $8 qlabel $40 qval $200;
        /* Generate supplemental qualifiers */
    run;
%mend create_supp;

4. Advanced Error Checking and Validation

Implementing robust error-checking mechanisms ensures data quality and compliance with SDTM standards.

%macro validate_domain(domain=);
    proc sql noprint;
        /* Check for duplicate records */
        create table work.duplicates as
        select *, count(*) as count
        from sdtm.&domain
        group by usubjid, &domain.dtc
        having count > 1;

        /* Verify required variables */
        select name into :reqvars separated by ' '
        from sashelp.vcolumn
        where libname='SDTM' and memname=upcase("&domain")
              and name in ('USUBJID', 'DOMAIN', "%upcase(&domain)SEQ");
    quit;
%mend validate_domain;

5. Handling Custom Domains and Extensions

Sometimes, standard SDTM domains need to be extended to accommodate study-specific requirements.

proc sql;
    create table sdtm.custom_domain as
    select a.usubjid,
           a.visit,
           b.svstdtc,
           b.svendtc
    from derived.custom_data as a
         left join sdtm.sv as b
             on a.usubjid = b.usubjid
            and a.visit = b.visit;
quit;

6. Optimizing Performance for Large Studies

When dealing with large studies, performance optimization becomes crucial:

  1. Use WHERE clauses instead of IF statements when possible
  2. Implement parallel processing for independent domains
  3. Optimize sort operations using PROC SORT NODUPKEY

options mprint mlogic symbolgen;
%let parallel_domains = ae cm eg lb mh vs;

%macro process_domains;
    %do i = 1 %to %sysfunc(countw(&parallel_domains));
        %let domain = %scan(&parallel_domains, &i);
        /* %create_domain is assumed to be defined elsewhere; for true parallel
           execution, wrap each call in SAS/CONNECT RSUBMIT blocks or SYSTASK commands */
        %create_domain(domain=&domain)
    %end;
%mend process_domains;

Best Practice: Always document your code thoroughly and include version control information for traceability.

Conclusion

Mastering these advanced SAS programming techniques can significantly improve the efficiency and quality of SDTM implementation. Remember to always validate your outputs against SDTM Implementation Guide requirements and maintain clear documentation of your programming decisions.

Monday, October 21, 2024

Harnessing the Power of CALL EXECUTE in SAS for Dynamic Code Execution

As SAS programmers, we often encounter situations where we need to execute a certain procedure or set of steps multiple times, typically based on different subsets of data. Manually writing out code for each instance can be time-consuming, but SAS offers a powerful tool to make this process more efficient: CALL EXECUTE.

What is CALL EXECUTE?

CALL EXECUTE is a SAS routine that allows you to dynamically generate and execute SAS code during a data step’s execution. Instead of hardcoding the logic for every individual case, CALL EXECUTE can generate the code on the fly and execute it as part of the same data step. This technique is invaluable when you have repetitive tasks across different datasets, procedures, or even report generation.

Basic Example: Dynamic PROC PRINT Execution

Let's say you have multiple datasets in the WORK library, and you want to run a PROC PRINT for each dataset. Instead of manually writing a PROC PRINT for each one, you can use CALL EXECUTE to automate this process:

proc sql;
    select cats('proc print data=', libname, '.', memname, '; run;')
        into :code_list separated by ' ' 
    from sashelp.vtable 
    where libname='WORK';
quit;

data _null_;
    call execute("&code_list");
run;

This code does the following:

  • The PROC SQL step queries the SAS dictionary table sashelp.vtable to generate a list of all datasets in the WORK library. It concatenates each dataset name into a PROC PRINT statement and stores them in the macro variable code_list.
  • The CALL EXECUTE routine inside the DATA _NULL_ step dynamically executes each PROC PRINT statement, printing each dataset without manual intervention.

Benefits of Using CALL EXECUTE

The ability to dynamically generate and execute code gives you tremendous flexibility. Here are some key benefits:

  • Automated Task Execution: Use CALL EXECUTE to run procedures on multiple datasets, making automation easier in iterative tasks like generating reports.
  • Reduced Code Duplication: Eliminate the need to manually write out repetitive code, making your programs cleaner and more maintainable.
  • Increased Flexibility: Dynamically adjust the logic based on changing data or parameters without modifying the core program.

Advanced Example: Conditional Execution of Procedures

In some cases, you might want to execute different procedures based on the content of the data. Here's an example where we execute PROC FREQ if a dataset contains a categorical variable, and PROC MEANS if it contains a numeric variable:

data _null_;
    set sashelp.vcolumn(where=(libname='WORK'));
    if type = 'char' then 
        call execute('proc freq data=work.' || trim(memname) || '; tables ' || name || '; run;');
    else if type = 'num' then 
        call execute('proc means data=work.' || trim(memname) || '; var ' || name || '; run;');
    run;

In this code:

  • The sashelp.vcolumn table provides information about the columns in each dataset, including the variable type (character or numeric).
  • Depending on the variable type, CALL EXECUTE runs either PROC FREQ for categorical data or PROC MEANS for numeric data.

Conclusion

Using CALL EXECUTE in SAS is an efficient way to dynamically generate and execute code, particularly in situations that involve repetitive tasks. Whether you’re working on large datasets or need to run different procedures conditionally, CALL EXECUTE can significantly simplify your workflow and reduce manual intervention. Mastering this tool will help make your SAS programming more efficient and flexible.

Have you used CALL EXECUTE in your SAS programs? Share your experiences in the comments below!

Thursday, October 10, 2024

Finding Duplicate Records Across SAS Datasets in an Entire Library

Author: Sarath

Date: October 10, 2024

Introduction

In SAS programming, identifying and managing duplicate records within datasets is an essential part of data cleaning and quality control. However, when working with multiple datasets in an entire library, the task of finding duplicates becomes more complex. In this article, we will explore different ways to identify duplicate records in SAS datasets across an entire library using several approaches: PROC SORT, PROC SQL, and DATA STEP. Additionally, we will provide advanced techniques to streamline the process for multiple datasets in a library.

Why Identify Duplicates?

Duplicate records can cause significant issues in data analysis, leading to inaccurate results, inflated counts, or incorrect statistical conclusions. Detecting and addressing duplicates in your SAS datasets ensures data integrity and improves the reliability of your analysis. By finding and removing duplicates, you can avoid skewed reports and maintain high-quality datasets for regulatory or research purposes.

Methods to Find Duplicate Records in SAS

SAS offers several methods to identify duplicates. The most common approaches involve using PROC SORT with the NODUPKEY or NODUP options, PROC SQL with grouping and counting, and DATA STEP with conditional logic. Let's explore each approach and how they can be used for individual datasets as well as across an entire library.

1. Using PROC SORT

The PROC SORT procedure is one of the simplest and most common ways to identify duplicate records in a SAS dataset. You can use the NODUPKEY option to eliminate duplicates based on specific key variables. Here's an example:

proc sort data=mylib.dataset1 nodupkey;
    by key_variable1 key_variable2;
run;
        

This code will remove duplicate records based on the values of key_variable1 and key_variable2 from the dataset dataset1. However, it doesn’t tell you which records were duplicates. To capture the removed duplicate records in a separate dataset for further analysis, add the DUPOUT= option:

proc sort data=mylib.dataset1 out=nodups dupout=duplicates_found nodupkey;
    by key_variable1 key_variable2;
run;
        

2. Using PROC SQL

PROC SQL is a powerful way to find duplicate records using SQL queries. You can group data by key variables and use the COUNT function to identify duplicates:

proc sql;
    create table duplicates as
    select key_variable1, key_variable2, count(*) as freq
    from mylib.dataset1
    group by key_variable1, key_variable2
    having freq > 1;
quit;
        

This code will create a table of duplicate records where the combination of key_variable1 and key_variable2 appears more than once in the dataset. PROC SQL is versatile, and you can easily extend it to work across multiple datasets or libraries.

3. Using DATA STEP

The DATA STEP method offers more flexibility when identifying duplicates. You can flag duplicate records and create a separate dataset containing only the duplicates:

data duplicates_found;
    set mylib.dataset1;            /* Must be sorted by the BY variables first */
    by key_variable1 key_variable2;
    if first.key_variable2 and last.key_variable2 then delete;
run;
        

This example flags duplicates using the BY statement and the FIRST. and LAST. indicators. If a record is the first and last occurrence of a key combination, it is unique, and thus deleted. All other records are duplicates.

Advanced Techniques: Finding Duplicates in an Entire Library

When working with an entire library of datasets, manually checking each dataset for duplicates can be time-consuming. SAS provides efficient ways to automate this process, especially when dealing with a large number of datasets.

1. Using Macro Loops

A SAS macro can be used to loop through all datasets in a library and check for duplicates. Here's an example of how to automate this using PROC SQL and a macro:

%macro find_dups(lib=);
    /* Retrieve all dataset names in the library */
    proc sql noprint;
        select memname into :dslist separated by ' '
        from dictionary.tables
        where libname = upcase("&lib");
    quit;

    /* Loop through each dataset and find duplicates */
    %let count = 1;
    %let dsname = %scan(&dslist, &count);
    %do %while (&dsname ne );
        proc sql;
            create table duplicates_&dsname as
            select key_variable1, key_variable2, count(*) as freq
            from &lib..&dsname
            group by key_variable1, key_variable2
            having freq > 1;
        quit;

        %let count = %eval(&count + 1);
        %let dsname = %scan(&dslist, &count);
    %end;
%mend find_dups;

/* Call the macro to find duplicates in all datasets */
%find_dups(lib=mylib);
        

This macro loops through every dataset in the library mylib, checks for duplicates based on key variables, and creates a separate dataset with duplicates for each dataset processed. This method saves time by automating the search across all datasets.

2. Using Dictionary Tables

Another approach to process datasets across an entire library is by leveraging SAS dictionary tables. Dictionary tables store metadata about your SAS environment, including information about datasets in a library. Here's an example of how to use dictionary tables to find duplicate records:

proc sql;
    select memname into :dslist separated by ' '
    from dictionary.tables
    where libname='MYLIB';
quit;

%macro find_duplications;
    %do i=1 %to %sysfunc(countw(&dslist));
        %let dsname=%scan(&dslist, &i);

        /* DUPOUT= writes the removed duplicate records to a separate dataset */
        proc sort data=mylib.&dsname
                  out=nodup_&dsname
                  dupout=check_dup_&dsname
                  nodupkey;
            by key_variable1 key_variable2;
        run;
    %end;
%mend;

%find_duplications;
        

This macro uses dictionary tables to get a list of datasets and then applies PROC SORT with the DUPOUT= option to each one. It produces, for each original dataset, a companion dataset that contains only the duplicate records.

Conclusion

Identifying duplicate records across multiple datasets in a library is a common task in data quality control. Using techniques like PROC SORT, PROC SQL, and DATA STEP, you can efficiently find and handle duplicates. For larger projects involving multiple datasets, leveraging macros and dictionary tables allows you to automate the process, saving time and reducing errors.

These methods not only help in managing duplicates but also ensure the integrity of your data, leading to more accurate and reliable results in your analysis. Incorporate these techniques into your workflow to improve your data management practices.

Have questions or additional tips on finding duplicates in SAS datasets? Share your thoughts in the comments below!

Comprehensive Guide to Define.xml Package Generation and QC Process

Author: Sarath

Date: October 10, 2024

Introduction

The Define.xml file, also known as the Case Report Tabulation Data Definition (CRT-DD), is a key component in regulatory submissions for clinical trials. It describes the metadata for the datasets submitted to regulatory agencies such as the FDA and EMA, providing transparency and traceability for clinical trial data. In this post, we’ll explore both the steps involved in generating the Define.xml package and the necessary Quality Control (QC) process to ensure its accuracy and compliance with regulatory requirements.

What is Define.xml and Why Is It Important?

The Define.xml file serves as the metadata backbone for clinical trial datasets submitted for regulatory review. It describes the structure and relationships of the datasets, variables, controlled terminologies, and derivations in the submission. Regulatory reviewers rely on the Define.xml file to understand the data, its origins, and how derived variables were created. A well-constructed Define.xml file ensures smooth data review and promotes transparency.

The Define.xml is mandatory for submissions following CDISC (Clinical Data Interchange Standards Consortium) standards, such as SDTM (Study Data Tabulation Model) and ADaM (Analysis Data Model) datasets.

Steps for Define.xml Package Generation

1. Metadata Preparation

The first step is to prepare the metadata for all datasets and variables included in the submission. This includes:

  • Dataset metadata: The names, labels, and descriptions of each dataset.
  • Variable metadata: Details for each variable, including its name, type (character or numeric), length, format, controlled terminologies (if applicable), and derivations.
  • Value-level metadata: When applicable, value-level metadata is necessary for variables that may take different attributes based on specific values.

This metadata is often compiled in spreadsheets or specialized data definition tables within your programming environment.
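
If you build those data definition tables in SAS itself, the dictionary tables can give you a head start. The sketch below is a minimal example, not a production spec generator: it assumes the submission datasets sit in a libref named SDTM and dumps their variable-level attributes into a working table called DEFINE_VARMETA, which you can then enrich with origins, derivations, and codelist assignments.

/* Minimal sketch: pull variable-level attributes from DICTIONARY.COLUMNS.  */
/* The libref SDTM and the output table name DEFINE_VARMETA are assumptions */
/* for illustration only.                                                    */
proc sql;
    create table define_varmeta as
    select memname as dataset,
           name    as variable,
           type,                       /* 'char' or 'num'                 */
           length,
           label,
           format
    from dictionary.columns
    where libname = 'SDTM'             /* dictionary values are uppercase */
    order by memname, varnum;
quit;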

2. Controlled Terminology Setup

Controlled terminology plays a crucial role in ensuring that values used in datasets are standardized. For example, MedDRA (Medical Dictionary for Regulatory Activities) is commonly used for adverse event terms, while CDISC-controlled terminology is used for other data points. Ensure that your controlled terminology is up-to-date with the latest regulatory requirements.
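
As a rough illustration of that kind of check, the sketch below assumes the controlled terminology has already been loaded into a SAS dataset CT.CODELISTS (columns CODELIST and CODEDVALUE) and compares the SEX variable in SDTM.DM against the SEX codelist; all of these names are placeholders, not a standard layout.

/* Hedged sketch of a terminology conformance check. CT.CODELISTS, its columns, */
/* and the SDTM.DM / SEX example are assumptions for illustration.              */
proc sql;
    create table ct_violations as
    select 'DM'  as dataset  length=8,
           'SEX' as variable length=8,
           sex   as value
    from sdtm.dm
    where not missing(sex)
      and sex not in (select codedvalue
                      from ct.codelists
                      where codelist = 'SEX');
quit;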

3. Defining Derivation Rules

All derived variables should be clearly documented, including how they were calculated or derived from other variables in the dataset. This step ensures that the regulatory agency understands how complex variables were generated and can trace them back to their raw origins.
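
One practical habit, shown in the hedged sketch below, is to keep the derivation text that will go into the Define.xml method or comment next to the code that implements it, so the two cannot drift apart. The dataset ADSL and the variables TRTSDT, TRTEDT, and TRTDURD are illustrative assumptions.

/* Illustrative only: the comment mirrors the derivation text that would be */
/* documented in the Define.xml method for TRTDURD. Names are assumptions.  */
data adsl2;
    set adsl;
    /* Define.xml method text (example): "Treatment duration (days) =       */
    /* TRTEDT - TRTSDT + 1, derived only when both dates are non-missing."  */
    if nmiss(trtsdt, trtedt) = 0 then trtdurd = trtedt - trtsdt + 1;
run;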

4. Generate Define.xml File Using Tools

Tools such as Pinnacle 21 (formerly OpenCDISC) can be used to generate the Define.xml file from the prepared metadata. These tools automate the conversion of metadata into the XML format required by regulatory agencies. Here’s how the generation process typically works:

  • Input your metadata into the tool (often via Excel spreadsheets or metadata tables).
  • The tool generates the Define.xml file and any associated codelist files.
  • The output is an XML file that can be submitted along with the clinical datasets.

5. Assemble the Define.xml Package

The complete Define.xml package includes:

  • Define.xml file
  • Annotated CRF (Case Report Form)
  • Study Data Reviewer’s Guide (SDRG) and Analysis Data Reviewer’s Guide (ADRG), if applicable

Ensure all necessary documentation is compiled as part of the submission package.

Quality Control (QC) Process for Define.xml

Once the Define.xml file is generated, it must undergo a rigorous QC process to ensure compliance with CDISC standards and avoid issues during regulatory review. Below are the key steps in the QC process:

1. Validate Using Pinnacle 21

One of the most important QC steps is to validate the Define.xml file using a tool like Pinnacle 21. This tool checks your file against CDISC standards and provides a report highlighting any potential errors or warnings. Some common issues that are flagged during validation include:

  • Missing or incorrect metadata
  • Inconsistencies in variable attributes (e.g., variable length or type)
  • Unreferenced codelists or controlled terminologies

Always review the validation report carefully and resolve any issues before submission.

2. Cross-Check Metadata Against Raw Data

A crucial aspect of QC is to cross-check the metadata in the Define.xml file against the raw and derived datasets. This ensures that the variable names, labels, and formats specified in the metadata align with the actual datasets submitted. Common checks include the following (a small SAS sketch of such a cross-check appears after the list):

  • Are the variable names and labels consistent between the datasets and the Define.xml file?
  • Do the controlled terminologies used match those in the datasets?
  • Are the derivations correctly documented and traceable?
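
Here is a minimal sketch of such a cross-check. It assumes the specification has been imported into a SAS dataset WORK.SPEC_VARMETA with columns DATASET, VARIABLE, LABEL, TYPE, and LENGTH, and that the submitted datasets live in a libref called SDTM; only attribute mismatches for variables present on both sides are reported.

/* Hedged sketch: flag label/type/length mismatches between the imported   */
/* define spec (WORK.SPEC_VARMETA, assumed layout) and the actual datasets. */
proc sql;
    create table attr_mismatches as
    select s.dataset, s.variable,
           s.label  as spec_label,  d.label  as data_label,
           s.type   as spec_type,   d.type   as data_type,
           s.length as spec_length, d.length as data_length
    from spec_varmeta as s
         inner join dictionary.columns as d
         on  d.libname = 'SDTM'
         and upcase(s.dataset)  = d.memname
         and upcase(s.variable) = d.name
    where s.label ne d.label
       or upcase(s.type) ne upcase(d.type)
       or s.length ne d.length;
quit;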

3. Check for Completeness and Accuracy

Ensuring completeness is critical. Each dataset, variable, codelist, and derivation that is part of your submission must be documented in the Define.xml. Missing or incomplete metadata can lead to delays in regulatory review. During QC, verify the following (a dataset-level completeness sketch follows the list):

  • Every dataset and variable is present in the Define.xml file.
  • All codelists are correctly referenced, and their values match the dataset contents.
  • Derived variables have clear and complete descriptions of how they were calculated.
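
As a starting point, the dataset-level sketch below assumes the list of datasets documented in the spec has been imported into WORK.SPEC_DATASETS (one row per dataset, column DATASET) and that the submission datasets live in the SDTM libref; it reports datasets that exist on one side but not the other. A similar comparison at the variable and codelist level is usually needed as well.

/* Hedged sketch of a dataset-level completeness check. WORK.SPEC_DATASETS */
/* and the SDTM libref are assumptions for illustration.                   */
proc sql;
    /* In the library but not documented in the spec */
    create table missing_from_spec as
    select memname as dataset
    from dictionary.tables
    where libname = 'SDTM' and memtype = 'DATA'
      and memname not in (select upcase(dataset) from spec_datasets);

    /* Documented in the spec but not present in the library */
    create table missing_from_library as
    select dataset
    from spec_datasets
    where upcase(dataset) not in
          (select memname from dictionary.tables
           where libname = 'SDTM' and memtype = 'DATA');
quit;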

4. Verify Value-Level Metadata (If Applicable)

For variables that require value-level metadata (e.g., variables that behave differently based on their values), verify that the detailed metadata is present and correct. Ensure that any conditions described for value-level metadata accurately reflect the dataset contents.

5. Manual Review of XML File

While automated tools like Pinnacle 21 are invaluable, it is also important to perform a manual review of the XML file. Open the Define.xml file in a text editor or XML viewer and check for any formatting issues, such as missing tags or improperly nested elements.

6. Documentation and Sign-Off

Once the QC process is complete and all issues have been resolved, document the QC activities. This can include a QC checklist or summary that describes the steps taken to validate the file. Obtain sign-off from team members or stakeholders to confirm that the Define.xml file is ready for submission.

Common Pitfalls and How to Avoid Them

Below are some common pitfalls encountered during Define.xml generation and QC, along with tips on how to avoid them:

  • Outdated Controlled Terminology: Ensure you’re using the most up-to-date versions of controlled terminologies (e.g., MedDRA, CDISC).
  • Inconsistent Metadata: Cross-check metadata between the Define.xml file and datasets to prevent mismatches.
  • Missing Documentation: Don’t overlook the need for additional documents like the Annotated CRF and Reviewer’s Guide.
  • Overlooking Value-Level Metadata: If required, always include value-level metadata and double-check its accuracy.
  • Skipping Manual Review: While validation tools are helpful, always conduct a manual review of the XML file to catch formatting issues that may not be flagged by automated tools.

Conclusion

Generating and validating a Define.xml package is a critical part of clinical trial submissions. By following a structured approach to both generation and QC, you can ensure your submission meets regulatory standards and avoid potential delays during the review process. Always use tools like Pinnacle 21 for validation, but don’t forget the importance of manual review and cross-checking metadata for completeness and accuracy.

Investing time in the QC process is essential for a successful submission, as a properly validated Define.xml file can facilitate faster and smoother regulatory review. Incorporate these best practices into your workflow to ensure compliance and to enhance the quality of your submissions.

Have questions or additional insights on Define.xml generation and QC? Share your thoughts in the comments below!
