Sunday, September 1, 2024

Understanding RFXSTDTC and RFSTDTC in the Demographics (DM) Domain

Introduction

In the context of clinical trials, accurately capturing key dates related to subject participation is critical for understanding the timeline of the study. The SDTM (Study Data Tabulation Model) Demographics (DM) domain includes several variables that record these key dates, two of the most important being RFXSTDTC and RFSTDTC. Although they may seem similar, these variables have distinct meanings and uses. This article explains the difference between RFXSTDTC and RFSTDTC, with detailed examples to illustrate their appropriate use.

Definitions

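RFSTDTC (Subject Reference Start Date/Time)

RFSTDTC is the sponsor-defined reference start date/time for the subject, and it serves as the anchor for study day (--DY) calculations. The SDTM Implementation Guide notes that it is usually equivalent to the date/time of first exposure to study treatment, but a sponsor may define it as another milestone, such as the date of randomization, the first study-specific procedure, or informed consent, depending on the study design. The examples in this article use the randomization date.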

RFXSTDTC (Date/Time of First Study Treatment)

RFXSTDTC captures the date and time when the subject received their first dose of the study treatment. This date is specifically linked to the intervention being tested in the study and marks the beginning of the subject’s exposure to the treatment.

Detailed Example

Let’s consider a clinical trial where subjects are required to give informed consent, undergo randomization, and then receive the study treatment. The timeline for each subject might look like this:

Subject ID | Informed Consent Date | Randomization Date | First Study Drug Dose Date | RFSTDTC    | RFXSTDTC
001        | 2024-01-01            | 2024-01-05         | 2024-01-10                 | 2024-01-05 | 2024-01-10
002        | 2024-01-02            | 2024-01-06         | 2024-01-08                 | 2024-01-06 | 2024-01-08
003        | 2024-01-03            | 2024-01-07         | 2024-01-12                 | 2024-01-07 | 2024-01-12

Explanation

  • Subject 001:
    • RFSTDTC = 2024-01-05: This date represents when the subject was randomized, marking the official start of their participation in the study.
    • RFXSTDTC = 2024-01-10: This date indicates when the subject received their first dose of the study drug.
  • Subject 002:
    • RFSTDTC = 2024-01-06: The date of randomization, indicating the start of study participation.
    • RFXSTDTC = 2024-01-08: The date when the subject first received the study drug.
  • Subject 003:
    • RFSTDTC = 2024-01-07: The randomization date, marking the start of the subject’s participation.
    • RFXSTDTC = 2024-01-12: The date when the subject received the first dose of the study drug.
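
As a minimal sketch of how RFXSTDTC might be derived in SAS, the step below takes the earliest exposure start date per subject from a small, made-up EX dataset. The dataset name ex, its records, and the assumption that every EXSTDTC is a complete ISO 8601 date (so that character order matches chronological order) are illustrative and not taken from a real study.

```sas
/* Hypothetical exposure records: one row per dose */
data ex;
    input usubjid $ exstdtc :$10.;
    datalines;
001 2024-01-10
001 2024-01-17
002 2024-01-08
002 2024-01-15
003 2024-01-12
;
run;

/* Earliest dose per subject; MIN also works on character columns in PROC SQL */
proc sql;
    create table rfx as
    select usubjid,
           min(exstdtc) as rfxstdtc length=10
    from ex
    where not missing(exstdtc)
    group by usubjid
    order by usubjid;
quit;

proc print data=rfx noobs;
run;
```

RFSTDTC would be populated separately, from whichever milestone the sponsor has designated as the reference start (the randomization date in this example).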

Key Differences

The key difference between RFSTDTC and RFXSTDTC lies in what they represent:

  • RFSTDTC is focused on the start of the subject’s participation in the study, often marked by randomization or the first study-specific procedure.
  • RFXSTDTC specifically tracks when the subject first receives the study treatment, marking the start of their exposure to the intervention being tested.

Why This Distinction Matters

Accurately capturing these dates is crucial for the integrity of the study data. The distinction between RFSTDTC and RFXSTDTC helps in:

  • Analyzing Study Timelines: Researchers can distinguish between when a subject officially became part of the study and when they actually started receiving treatment.
  • Regulatory Compliance: Accurate records of participation and treatment initiation are critical for meeting regulatory requirements and ensuring the study's validity.
  • Study Integrity: Differentiating between these dates allows for precise tracking of subject progress and adherence to the study protocol.

Conclusion

Understanding the difference between RFSTDTC and RFXSTDTC is essential for correctly managing and analyzing clinical trial data. While both variables are related to key dates in a subject’s journey through the trial, they capture different aspects of participation and treatment. Proper use of these variables ensures that the study’s timeline is accurately documented, contributing to the overall integrity and reliability of the clinical trial data.

SAS Enterprise Guide (SAS EG) Tips and Techniques

Introduction

SAS Enterprise Guide (SAS EG) is a graphical user interface that gives users access to much of SAS's functionality without writing code. It’s particularly useful for those who prefer a point-and-click approach to data manipulation, analysis, and reporting. In this report, we’ll explore several tips and techniques to maximize your productivity with SAS EG, accompanied by examples that show how to apply them in practice.

Tip 1: Utilize Task Templates for Repetitive Work

Task templates in SAS EG allow you to save frequently used settings and tasks, which can be reused in future projects. This is particularly useful when you have standardized procedures that you need to apply across multiple datasets or projects.

Example:

Suppose you regularly produce frequency distributions for different datasets. Instead of setting up the task from scratch each time, you can create a task template:

  1. Set up a frequency distribution task for your dataset.
  2. Configure the task options, such as choosing the variables and formatting the output.
  3. Right-click on the task in the process flow and select Save as Task Template.
  4. In the future, you can simply apply this template to any dataset, saving you time and ensuring consistency.

Tip 2: Automate Processes with Project Flows

Project flows in SAS EG allow you to automate complex sequences of tasks, making your analysis more efficient and less prone to errors. By linking tasks in a process flow, you can ensure that they are executed in the correct order without manual intervention.

Example:

Imagine you have a series of data preparation tasks, followed by analysis and reporting:

  1. Import raw data into SAS EG.
  2. Clean and transform the data (e.g., handle missing values, standardize formats).
  3. Perform statistical analysis (e.g., regression analysis, summary statistics).
  4. Generate reports and export results to Excel or PDF.

By creating a project flow, you can automate this entire process:

  1. Drag and drop the tasks into the process flow window.
  2. Link the tasks by clicking and dragging from one task to the next.
  3. Set the dependencies so that tasks execute in the correct order.
  4. Run the entire flow with a single click.

Tip 3: Leverage Query Builder for Complex Data Manipulation

The Query Builder in SAS EG is a powerful tool for data manipulation, allowing you to filter, sort, join, and summarize data without writing SQL code. It’s especially useful for those who may not be familiar with SQL or prefer a visual interface.

Example:

Suppose you need to merge two datasets, filter the merged dataset to include only certain records, and then summarize the data:

  1. Open the Query Builder and add both datasets to the workspace.
  2. Create a join by dragging and dropping the key variable from one dataset to the corresponding variable in the other dataset.
  3. Use the Filter tab to specify the conditions for including records in the output (e.g., only include records where the value of a variable is greater than 100).
  4. Use the Summary tab to group the data by a categorical variable and calculate summary statistics (e.g., mean, sum, count).
  5. Click Run to execute the query and view the results.

Tip 4: Customize Output with Style Templates

SAS EG allows you to apply custom style templates to your output, enabling you to create professional-looking reports that align with your organization's branding or specific formatting requirements.

Example:

To apply a custom style template to your output:

  1. Run your analysis task (e.g., summary statistics, regression).
  2. In the task’s Results tab, select the Style option.
  3. Choose a predefined style template from the dropdown menu, or create your own by clicking Manage Styles and defining custom settings (e.g., fonts, colors, borders).
  4. Run the task to generate the output with your chosen style applied.

This is especially useful when generating reports for different audiences or when you need to adhere to specific corporate guidelines.

Tip 5: Optimize Performance with Data Management Best Practices

Managing large datasets efficiently is crucial in SAS EG, particularly when working with complex analyses or resource-intensive tasks. By following data management best practices, you can significantly improve performance.

Example:

Consider a scenario where you need to process a large dataset:

  • Use Data Filters: Apply filters early in your data preparation process to reduce the number of records processed in subsequent tasks. This can be done directly in the Query Builder or as part of a data step.
  • Sort Data Efficiently: Sorting large datasets can be time-consuming. Use indexed variables for sorting when possible, and avoid unnecessary sorting operations.
  • Limit Data Movement: Minimize the movement of data between different environments (e.g., from a remote server to a local machine). Instead, process data in place whenever possible.
  • Use Summary Statistics: Instead of processing entire datasets, calculate and store summary statistics when possible, reducing the need to repeatedly process large volumes of raw data.
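
The first and last of these practices can be sketched in code. The example below uses the SASHELP.CARS sample table that ships with SAS; the output dataset names are hypothetical. It filters rows and columns while the data is being read, then stores a small summary table for reuse.

```sas
/* Filter rows and columns as the data is read, rather than in a later step */
data work.suv_subset;
    set sashelp.cars(keep=make model type msrp
                     where=(type = 'SUV'));
run;

/* Store summary statistics once instead of re-reading the detail data */
proc means data=work.suv_subset noprint nway;
    class make;
    var msrp;
    output out=work.suv_summary(drop=_type_ _freq_)
           mean=mean_msrp n=n_models;
run;
```

Because the WHERE= and KEEP= dataset options are applied on input, later steps only ever see the reduced table.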

Tip 6: Debugging with the SAS Log

The SAS log is an invaluable tool for debugging your SAS EG projects. It provides detailed information about the execution of your tasks, including any errors, warnings, or notes.

Example:

When you encounter an unexpected result or error:

  1. Click on the Log tab in SAS EG after running a task.
  2. Review the log for any error messages (lines beginning with ERROR:, typically shown in red) or warnings (lines beginning with WARNING:).
  3. Pay attention to the line numbers and the code snippets provided in the log to identify the source of the issue.
  4. Use the PUTLOG statement in your DATA step code to write custom messages to the log for more detailed debugging information.
  5. Resolve the issue based on the information provided in the log, and rerun the task to confirm that the problem is fixed.
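
A minimal sketch of step 4, using the SASHELP.CLASS sample table: the age cutoff and the output dataset name are arbitrary choices for illustration.

```sas
data work.teens;
    set sashelp.class;
    /* Write a custom note for each record that fails the check */
    if age < 13 then
        putlog "NOTE: Dropping record " _n_= name= age=;
    else output;
run;
```

Starting the message with NOTE:, WARNING:, or ERROR: makes the log display it in the same style as the corresponding system messages, which makes custom diagnostics easier to spot.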

Tip 7: Create Macros to Automate Repetitive Tasks

Macros in SAS EG can help automate repetitive tasks and make your projects more efficient. By creating reusable code snippets, you can simplify complex processes and reduce the likelihood of errors.

Example:

Suppose you need to apply the same data transformation across multiple datasets:

  1. Create a macro that performs the desired transformation (e.g., log transformation of a variable).
  2. Call the macro for each dataset, passing the dataset name as a parameter.
  3. Run the macro across all datasets with a single command, ensuring consistent transformations across the board.

Example code:


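/* Creates <dataset>_transformed with a natural-log version of VARIABLE */
%macro transform_data(dataset);
    data &dataset._transformed;
        set &dataset;
        /* LOG is undefined for zero and negative values, so leave those missing */
        if variable > 0 then log_variable = log(variable);
    run;
%mend transform_data;

%transform_data(dataset1);
%transform_data(dataset2);
%transform_data(dataset3);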

Conclusion

SAS Enterprise Guide is a powerful tool that offers a wide range of features to enhance your data analysis and reporting capabilities. By applying the tips and techniques outlined in this report, you can streamline your workflows, improve efficiency, and produce high-quality results. Whether you’re automating repetitive tasks, optimizing performance, or customizing your output, these strategies will help you make the most of SAS EG.

Developing the DC (Demographics as Collected) SDTM Domain

Developing the DC (Demographics as Collected) SDTM Domain: Tips, Techniques, Challenges, and Best Practices

Introduction

The DC (Demographics as Collected) domain is a specialized SDTM (Study Data Tabulation Model) domain designed to capture demographic data as it was originally collected in clinical trials. This domain is particularly valuable in studies that involve multiple screenings or re-enrollments, where maintaining the integrity of the collected data is crucial. Unlike the DM (Demographics) domain, which provides a standardized summary of demographic data, the DC domain preserves the raw, unstandardized data, ensuring accurate data representation.

This article delves into the key aspects of developing the DC domain, including the challenges, best practices, and the critical differences between the DC and DM domains. We also integrate insights from industry papers and guidelines, including an in-depth look at the FDA's recommendations on handling multiple screenings, the implications of SDTMIG 3.3 DM assumptions, and linking SUBJID to USUBJID, to provide a comprehensive guide.

Understanding the DC Domain Structure

The DC domain captures raw demographic data with variables such as:

  • DCSEQ: Sequence Number
  • DCTESTCD: Collected Test Code
  • DCTEST: Collected Test Name
  • DCORRES: Original Result
  • DCORRESU: Original Units
  • DCSTRESC: Standardized Result in Character
  • DCSTRESN: Standardized Result in Numeric
  • DCSTRESU: Standardized Units
  • VISITNUM: Visit Number
  • VISIT: Visit Name
  • DCDTC: Date/Time of Collection

Challenges in Developing the DC Domain

1. Handling Rescreened Subjects

Challenge: Allowing subjects to undergo multiple screenings or re-enrollments in a study can complicate data management, especially when deciding how to represent each instance of participation.

Solution: The FDA recommends that the primary enrollment should be included in the DM domain, while additional screenings or enrollments should be recorded in a custom domain with a structure similar to DM, such as the DC domain. The DC domain should capture each screening or enrollment instance using unique subject identifiers (SUBJID), while maintaining a single unique subject identifier (USUBJID) across all instances.

2. Inconsistent Data Collection Methods

Challenge: Variations in data collection methods across sites or time points can lead to inconsistencies in the collected data.

Solution: Implement standardized protocols and training across sites to ensure consistency. The DC domain can capture these variations without standardization, preserving the raw data for future analysis.

3. Mapping Raw Data to SDTM Variables

Challenge: Accurately mapping raw demographic data to SDTM variables can be complex, particularly when data is collected in non-standard formats.

Solution: Utilize automated mapping tools and validate mappings through manual review. The DC domain should capture the data as collected, with minimal transformations.

4. Managing Validation Issues

Challenge: Custom domains like DC may trigger validation warnings in SDTM, especially when SUBJID is used across multiple domains.

Solution: Document the rationale for using the DC domain in the clinical Study Data Reviewer’s Guide (cSDRG) and prepare to address any validation warnings with clear explanations.

FDA Recommendations on Multiple Screenings

  1. Inclusion in DM and Custom Domains: The FDA recommends that the DM domain should include only the primary screening or enrollment of a subject. If a subject undergoes multiple screenings or enrollments, the primary instance should be captured in the DM domain, while additional screenings or enrollments should be included in a custom domain like the DC domain, which has a similar structure to DM.
  2. Handling Screen Failures: For subjects who fail the initial screening and are subsequently rescreened, the primary screening failure should be included in the DM domain, while the rescreening attempts are recorded in the DC domain. This ensures that all screening attempts are documented and available for analysis, while maintaining a clear distinction between successful enrollments and failures.
  3. Use of SUBJID Across Domains: The FDA also recommends using the SUBJID variable in related domains beyond DM, even if this causes validation warnings. This approach is crucial for linking all participation instances of a subject, especially in cases of multiple screenings or enrollments. It allows for a comprehensive view of the subject's participation history within the study.
  4. Alignment with Global Standards: The FDA's recommendations may differ from those of other regulatory bodies, such as the Japan Pharmaceuticals and Medical Devices Agency (PMDA) and the China National Medical Products Administration (NMPA). This discrepancy can present challenges when preparing submissions for multiple regulatory authorities. In such cases, careful documentation and clear communication with the relevant regulatory bodies are essential.
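
The relationship between SUBJID and USUBJID described above can be sketched with a small, entirely hypothetical DC dataset. The study identifier, subject identifiers, dates, and the choice to list every screening instance in one table are assumptions made for illustration only.

```sas
/* Hypothetical DC records: one row per screening or enrollment instance */
data dc;
    length studyid $8 domain $2 usubjid $20 subjid $8 rficdtc $10;
    input studyid $ usubjid $ subjid $ rficdtc $;
    domain = 'DC';
    datalines;
ABC123 ABC123-0001 S0001 2024-01-03
ABC123 ABC123-0001 S0042 2024-03-11
ABC123 ABC123-0002 S0002 2024-01-05
;
run;

/* Subjects with more than one screening or enrollment instance */
proc sql;
    select usubjid, count(distinct subjid) as n_instances
    from dc
    group by usubjid
    having count(distinct subjid) > 1;
quit;
```

Here one USUBJID carries two SUBJID values, which is the pattern that lets reviewers trace a rescreened subject across instances.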

SDTMIG 3.3 DM Assumptions

  1. Primary Enrollment in DM: The DM domain should include only the primary enrollment of a subject. For subjects with multiple enrollments, additional records should be included in a custom domain, such as DC.
  2. RFICDTC Correspondence: The variable RFICDTC (Date/Time of Informed Consent) in the DM domain should correspond to the date of the first informed consent protocol milestone recorded in the DS domain. If there are multiple informed consents, the first one is used in DM.
  3. RFXSTDTC and RFXENDTC Usage: The variables RFXSTDTC (Date/Time of First Study Treatment) and RFXENDTC (Date/Time of Last Study Treatment) represent the date/time of the first and last study exposure, respectively. These are used in the DM domain to accurately reflect the subject’s exposure timeline.
  4. Handling Multiple Screenings: For subjects who undergo multiple screenings but are not subsequently enrolled, the primary screening should be included in DM, with additional screenings captured in a custom domain like DC. This approach ensures that DM reflects only the most relevant participation instance.

Contrast Between DC and DM Domains

Understanding the distinction between the DC and DM domains is crucial for correctly mapping data:

  • DC Domain (Demographics as Collected):
    • Purpose: Captures demographic data exactly as it was collected, without standardization or imputation. It is particularly useful for studies involving multiple screenings or enrollments.
    • Data Types: Raw, unprocessed data that reflects the original data entry, including all collected demographic characteristics such as age, sex, race, and ethnicity.
    • Example: If age was collected as "45" and sex as "Male," these values would be recorded exactly as they are in the DC domain, with the units as collected.
  • DM Domain (Demographics):
    • Purpose: Provides a standardized, baseline snapshot of demographic data for each subject, used in analysis and reporting. The DM domain is typically a derived subset of the DC domain.
    • Data Types: Standardized data, often derived or transformed from raw data. It may include derived variables such as age calculated at the screening date, or standardized values for sex and race.
    • Example: In the DM domain, the age might be presented as 45, calculated based on a reference date, and the collected value "Male" would be mapped to the CDISC controlled terminology value "M".

Variable   | DC Domain                               | DM Domain
Age        | DCORRES = "45", DCORRESU = "Years"      | AGE = 45 (derived from birthdate)
Sex        | DCORRES = "Male"                        | SEX = "M"
Race       | DCORRES = "Caucasian"                   | RACE = "WHITE"
Visit Name | VISIT = "Screening"                     | Not applicable
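
One way to express the collected-to-standardized mapping is with user-defined formats. This is a sketch under stated assumptions: the format names, the variable names collected_sex and collected_race, and the decision to leave unmapped values blank for manual review are all hypothetical.

```sas
proc format;
    /* Collected text -> submission value; unmapped values are left blank for review */
    value $sexct  (default=1)  'Male' = 'M'  'Female' = 'F'  other = ' ';
    value $racect (default=40) 'Caucasian' = 'WHITE'         other = ' ';
run;

data dm_values;
    length collected_sex collected_race $20 sex $1 race $40;
    collected_sex  = 'Male';
    collected_race = 'Caucasian';
    sex  = put(collected_sex,  $sexct.);
    race = put(collected_race, $racect.);
    put sex= race=;
run;
```

Keeping the mapping in formats, rather than in IF/THEN logic, puts every collected-to-standard decision in one place where it can be reviewed and documented.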

Best Practices for Developing the DC Domain

  • Ensure Accurate Mapping of Source Data: Validate that raw data is accurately mapped to DC domain variables, paying particular attention to variable types and units.
  • Use Controlled Terminology Where Applicable: Ensure DCTESTCD and DCTEST align with CDISC controlled terminology. If terms are missing or ambiguous, document any decisions made.
  • Handle Missing Data Appropriately: Follow SDTM conventions for representing missing data. Document any assumptions or imputations made in the process.
  • Implement Proper Version Control: Track changes to the DC domain throughout the study with clear versioning and documentation.
  • Visualize Data with Tables and Graphs: Use tools like SAS to visualize demographic data, allowing for easier identification of errors and outliers.
  • Validate the DC Domain: Regularly validate your domain using tools like Pinnacle 21 and manual checks to ensure compliance with SDTM standards.
  • Document Everything: Maintain thorough documentation for every step, from data collection to final SDTM mapping.

Conclusion

The development of the DC domain is not just a routine task—it is a critical step in ensuring the integrity and accuracy of your study’s demographic data. By understanding the challenges and differences between the DC and DM domains, and by implementing the tips and techniques discussed, you can ensure that your DC domain is accurate, compliant, and ready for submission.

Next Steps:

  • Assess your current processes for developing the DC domain.
  • Implement the strategies outlined to enhance accuracy and consistency.
  • Train your team on the distinctions between the DC and DM domains to avoid common pitfalls.

References

  1. Matta, V., Jajam, S., & Peddibhotla, L. (2021). Rescreened Subjects, Data Collection and Standard Domains Mapping. Covance Inc.
  2. Zhou, X., Xie, L., Hu, Q., & Ma, S. (2023). Exploration on Demographic as Collected (DC) Domain to Handle Multiple Screenings in SDTM. BeiGene, Inc.

Efficient Directory Management in SAS: Copying Directories

Mastering Directory Management in SAS: A Guide to Copying Directories

In data management and processing, efficiently handling directories is crucial. Whether you're consolidating project files or reorganizing data storage, copying directories from one folder to another can streamline your workflow. In this blog post, we'll explore a powerful SAS script that automates this task, ensuring you can manage your directories with ease and precision.

Objective

The goal of this SAS script is to copy all directories from a source folder to a target folder. This can be particularly useful for tasks such as archiving, backup, or restructuring data storage. Below, we provide a comprehensive breakdown of the SAS code used to achieve this.

SAS Code for Copying Directories

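/* Windows only: relies on the DIR and XCOPY commands and requires the XCMD option */
%let source=C:\data\projects\2024\Research\Files;
%let target=C:\data\projects\2024\Research\Backup;

/* List the subdirectories (/ad) of the source folder, names only (/b) */
data source;
  infile "dir /b /ad ""&source""" pipe truncover;
  input fname $256.;
run;

/* List the subdirectories of the target folder */
data target;
  infile "dir /b /ad ""&target""" pipe truncover;
  input fname $256.;
run;

/* Keep source directories that do not yet exist in the target */
proc sql noprint;
  create table newfiles as
    select * from source
    where upcase(fname) not in (select upcase(fname) from target);
quit;

/* Copy each new directory tree and echo the command output to the log */
data _null_;
  set newfiles;
  length cmd $1000;
  cmd = catx(' ', 'xcopy', quote(catx('\', "&source", fname)),
             quote(catx('\', "&target", fname)), '/e /i /y');
  infile dummy pipe filevar=cmd end=eof;
  do while (not eof);
    input;
    put _infile_;
  end;
run;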

How It Works

This SAS script performs several key operations to ensure that directories are copied effectively from the source folder to the target folder:

  1. Define Source and Target Folders: The script begins by specifying the source and target folder paths using macro variables. This makes it easy to adjust the paths as needed.
  2. List Directories in Source and Target: Two data steps list the contents of the source and target folders. Each uses an infile statement with the pipe device type to run the Windows dir command. The /b switch returns bare names; adding the /ad switch restricts the listing to directories, and without it files are listed as well.
  3. Identify New Directories: A PROC SQL step compares the directory names in the source and target folders. It creates a new dataset newfiles containing directories that are present in the source but not in the target folder.
  4. Copy Directories: Finally, a data step builds one operating system command per new directory with the catx function and runs it through an infile statement with a pipe, echoing the command's output to the SAS log. Copying a directory together with its contents requires a recursive command such as xcopy with the /e and /i switches; the plain copy command copies files only.
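
Because the script depends on running operating system commands through a pipe, it only works when the XCMD system option is enabled, which many server installations turn off. A small check such as the following can be run first; the macro name check_xcmd is hypothetical.

```sas
%macro check_xcmd;
    /* GETOPTION returns XCMD or NOXCMD */
    %if %sysfunc(getoption(xcmd)) = NOXCMD %then
        %put WARNING: XCMD is off, so PIPE and X commands will fail in this session.;
    %else
        %put NOTE: XCMD is on and operating system commands are allowed.;
%mend check_xcmd;

%check_xcmd
```

XCMD can only be set at SAS startup, so if the check reports NOXCMD the option has to be changed by an administrator rather than in your program.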

Usage Example

To use this script, replace the source and target paths with your desired directories. The script will automatically handle the rest, ensuring that all directories in the source that do not already exist in the target are copied over.

%let source=C:\path\to\source\folder ;
%let target=C:\path\to\target\folder ;
/* Run the script as shown above */

Conclusion

Efficiently managing directories is essential for data organization and project management. This SAS script provides a robust solution for copying directories from one folder to another, helping you keep your data well-structured and accessible. By incorporating this script into your workflow, you can automate the process of directory management and focus on more critical aspects of your projects.

Feel free to customize the script to fit your specific needs, and happy coding!

SAS Macro for Directory Management

Efficient Directory Management in SAS: A Custom Macro

Managing directories effectively is crucial for organizing and handling large volumes of files in SAS. In this article, we'll walk through a custom SAS macro that helps you identify all folders within a specified directory. This macro is particularly useful for managing directory structures in complex projects.

Macro Overview

The get_folders macro is designed to list the contents of a specified directory so that its folders can be identified. It verifies that the directory exists, retrieves the names of all items within it, and outputs this information in a readable format. Below is the complete SAS code for this macro:

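%macro get_folders(dir);
    /*
       Macro: get_folders
       Purpose: Lists the members of a directory and flags which ones are folders.
       Usage: Pass the path without quotation marks.
       Source: Custom macro developed for directory management in SAS.
       Date: September 2024
    */
    %local filrf subrf rc did memcnt i name sdid;

    /* CHECK FOR EXISTENCE OF DIRECTORY PATH */
    %if %sysfunc(fileexist(&dir)) %then %do;

        /* ASSIGNS THE FILEREF OF MYDIR TO THE DIRECTORY AND OPENS THE DIRECTORY */
        %let filrf=mydir;
        %let subrf=mysub;
        %let rc=%sysfunc(filename(filrf,&dir));
        %let did=%sysfunc(dopen(&filrf));

        /* DOPEN RETURNS 0 WHEN THE PATH IS NOT A DIRECTORY OR CANNOT BE OPENED */
        %if &did = 0 %then %do;
            %put ERROR: Unable to open &dir as a directory;
            %return;
        %end;

        /* RETURNS THE NUMBER OF MEMBERS IN THE DIRECTORY */
        %let memcnt=%sysfunc(dnum(&did));
        %put rc=&rc did=&did memcnt=&memcnt;

        data dir_contents;
            length member_name $ 256 is_folder 8;
            /* LOOPS THROUGH ENTIRE DIRECTORY; A MEMBER THAT DOPEN CAN OPEN IS A FOLDER */
            %do i = 1 %to &memcnt;
                %let name=%qsysfunc(dread(&did,&i));
                %let rc=%sysfunc(filename(subrf,&dir/&name));
                %let sdid=%sysfunc(dopen(&subrf));
                member_name="&name";
                is_folder=%eval(&sdid > 0);
                output;
                %if &sdid > 0 %then %let rc=%sysfunc(dclose(&sdid));
            %end;
            stop; /* PREVENTS AN EMPTY ROW WHEN THE DIRECTORY HAS NO MEMBERS */
        run;

        title "CONTENTS OF FOLDER &dir";
        proc print data=dir_contents;
        run;
        title;

        /* CLOSES THE DIRECTORY AND CLEARS THE FILEREFS */
        %let rc=%sysfunc(dclose(&did));
        %let rc=%sysfunc(filename(filrf));
        %let rc=%sysfunc(filename(subrf));
    %end;
    %else %do;
        %put ERROR: Folder &dir Not Found;
    %end;
%mend get_folders;

/* Example usage of the macro (the path is passed without quotation marks) */
%get_folders(/example/directory/path);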

How It Works

Here's a step-by-step breakdown of the macro:

  1. Check Directory Existence: The macro first checks if the specified directory exists using the %sysfunc(fileexist) function. If the directory does not exist, an error message is displayed.
  2. File Reference and Directory Opening: If the directory exists, a file reference is assigned, and the directory is opened using the %sysfunc(filename) and %sysfunc(dopen) functions.
  3. Count Directory Members: The macro retrieves the number of items (folders or files) in the directory with %sysfunc(dnum).
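  4. Retrieve and Output Folder Names: Using a data step, the macro loops through each item in the directory, retrieves its name with %qsysfunc(dread), and outputs this information to a dataset. DREAD returns files as well as folders; a member can be recognized as a folder by checking whether DOPEN succeeds on it.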
  5. Display Contents: The contents of the dataset are printed using PROC PRINT.
  6. Close the Directory: Finally, the directory is closed with %sysfunc(dclose).

Usage Example

To use this macro, simply call it with the directory path you want to scan:

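%get_folders(/example/directory/path);

This will list the members of the specified directory, making it easier to manage and organize your files. Pass the path without quotation marks: macro arguments are treated as text, so any quotes would become part of the path.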

Conclusion

The get_folders macro is a powerful tool for directory management in SAS. By incorporating this macro into your workflow, you can streamline the process of identifying and organizing folders within your projects. Feel free to modify and adapt the macro to suit your specific needs.

Happy coding!

SAS Functions: SOUNDEX, COMPGED, and Their Alternatives

Introduction

In SAS, the SOUNDEX and COMPGED functions are powerful tools for text comparison, particularly when dealing with names or textual data that may have variations. In addition, the SPEDIS function measures spelling distance, and SOUNDEX codes can be compared directly, which is the role the DIFFERENCE function plays in some SQL dialects. This article explores these approaches, provides examples, and compares their uses.

The SOUNDEX Function

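The SOUNDEX function converts a character string into a phonetic code. This helps in matching names that sound similar but may be spelled differently. The code consists of the first letter of the string followed by digits that represent the remaining consonant sounds. Unlike some Soundex implementations, the SAS function does not pad or truncate the result to four characters.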

Syntax

SOUNDEX(string)

Where string is the character string you want to encode.

Example

data names;
    input name $20.;
    soundex_code = soundex(name);
    datalines;
John
Jon
Smith
Smythe
;
run;

proc print data=names;
run;

In this example, "John" and "Jon" have the same SOUNDEX code, reflecting their similar pronunciation. "Smith" and "Smythe" also share a code, because the algorithm ignores vowels and the letters H, W, and Y after the first character.

The COMPGED Function

The COMPGED function measures the similarity between two strings using the Generalized Edit Distance algorithm. This function is useful for fuzzy matching, especially when dealing with misspelled or slightly varied text.

Syntax

COMPGED(string1, string2)

Where string1 and string2 are the strings to compare.

Example

data comparisons;
    string1 = 'John';
    string2 = 'Jon';
    /* Generalized edit distance; 0 means the strings are identical */
    distance = compged(string1, string2);
run;

proc print data=comparisons;
run;

The COMPGED function returns a numerical value representing the edit distance between "John" and "Jon". Lower values indicate higher similarity.
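
In practice COMPGED is often used to shortlist candidate matches between two lists. The sketch below joins every reference name to every entered name and keeps the pairs under a cutoff. The dataset names, the sample names, and the cutoff of 200 are assumptions chosen for illustration; a suitable cutoff depends on your data.

```sas
data reference;
    input ref_name :$20.;
    datalines;
John
Smith
;
run;

data entered;
    input entered_name :$20.;
    datalines;
Jon
Smythe
Johnson
;
run;

/* Pair every reference name with every entered name and keep close matches */
proc sql;
    create table candidate_matches as
    select r.ref_name,
           e.entered_name,
           compged(r.ref_name, e.entered_name) as ged
    from reference as r, entered as e
    where calculated ged <= 200   /* illustrative cutoff, not a recommended value */
    order by r.ref_name, ged;
quit;
```

A Cartesian join like this grows quickly with table size, so on large lists it is usually combined with a blocking condition, such as requiring the first letter to match.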

Alternative Functions

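Comparing SOUNDEX Codes (the DIFFERENCE Approach)

DIFFERENCE is a function in some SQL dialects (Transact-SQL, for example) that scores how closely the SOUNDEX values of two strings agree. It is not a Base SAS function. In SAS, the same phonetic comparison can be made by comparing SOUNDEX codes directly, or by using the sounds-like operator (=*) in a WHERE clause or in PROC SQL.

Syntax

SOUNDEX(string1) = SOUNDEX(string2)

Where string1 and string2 are the strings to compare. The expression evaluates to 1 when the codes match and 0 otherwise.

Example

data soundex_comparison;
    /* The colon modifier reads space-delimited values of up to 20 characters */
    input name1 :$20. name2 :$20.;
    same_code = (soundex(name1) = soundex(name2));
    datalines;
John Jon
Smith Smythe
;
run;

proc print data=soundex_comparison;
run;

In this example, the SOUNDEX codes of "John" and "Jon", and of "Smith" and "Smythe", are compared. A value of 1 in same_code means the two names share a phonetic code.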

The SPEDIS Function

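The SPEDIS function returns the spelling distance between two words: the cost of the operations (such as inserting, deleting, or swapping letters) needed to convert the keyword into the query, normalized by the length of the query. It does not use SOUNDEX encoding. The measure is asymmetric, so SPEDIS(a, b) and SPEDIS(b, a) can differ. This function is useful for matching names with variations in spelling.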

Syntax

SPEDIS(string1, string2)

Where string1 and string2 are the strings to compare.

Example

data spedis_comparison;
    string1 = 'John';
    string2 = 'Jon';
    /* Spelling distance; 0 means the strings are identical */
    spedis_score = spedis(string1, string2);
run;

proc print data=spedis_comparison;
run;

The SPEDIS function returns a score reflecting the similarity between "John" and "Jon". A lower score indicates higher similarity, similar to COMPGED, but with a different approach to similarity measurement.

Comparison of Functions

Here’s a quick comparison of these functions:

  • SOUNDEX: Encodes a string into a phonetic code. Useful for phonetic matching, but limited to sounds and does not consider spelling variations.
  • COMPGED: Uses the Generalized Edit Distance algorithm to measure string similarity. Suitable for fuzzy matching with spelling variations.
  • DIFFERENCE: Compares the phonetic similarity of two strings based on their SOUNDEX values. Provides a direct measure of phonetic similarity.
  • SPEDIS: Measures the asymmetric spelling distance between a query and a keyword, normalized by the length of the query. Useful for matching names with spelling variations when a length-adjusted score is preferred.
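To see how these measures behave on the same input, the following sketch applies SOUNDEX, COMPGED, and SPEDIS to a few name pairs side by side. The dataset name and the sample names are illustrative assumptions. DIFFERENCE is left out; comparing the two SOUNDEX codes directly covers the same idea.

```sas
/* Illustrative data: the names below are examples only */
data function_comparison;
    input name1 :$20. name2 :$20.;
    /* Phonetic codes and a simple agreement flag */
    soundex1 = soundex(name1);
    soundex2 = soundex(name2);
    soundex_match = (soundex1 = soundex2);
    /* Generalized edit distance with default costs */
    ged = compged(name1, name2);
    /* Asymmetric spelling distance: name1 is the query, name2 the keyword */
    spd = spedis(name1, name2);
    datalines;
John Jon
Smith Smythe
Catherine Kathryn
;
run;

proc print data=function_comparison;
run;
```

Reviewing the columns together shows where a phonetic match and an edit-distance match disagree, which is often the deciding factor when choosing a function.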

Conclusion

The SOUNDEX and COMPGED functions are valuable tools for text comparison in SAS. By understanding their characteristics and how they compare to other functions like DIFFERENCE and SPEDIS, you can choose the most appropriate method for your specific text matching needs. Each function offers unique advantages depending on the nature of the text data and the type of comparison required.

Using SUPPQUAL for Specifying Natural Key Variables in Define.XML

Author: Sarath

Introduction

Define.XML plays a critical role in specifying dataset metadata, particularly in the context of clinical trial data. One important aspect of define.xml is the identification of natural keys, which ensure the uniqueness of records and define the sort order for datasets.

Using SUPPQUAL for Natural Keys

SUPPQUAL, or Supplemental Qualifiers, is a structure used in SDTM/SEND datasets to capture additional attributes related to study data that are not part of the standard domains. In certain cases, the standard SDTM/SEND variables may not be sufficient to fully describe the structure of collected study data. In these cases, SUPPQUAL variables can be utilized as part of the natural key to ensure complete and accurate dataset representation.

Example Scenarios

Consider a scenario where multiple records exist for a single subject in a dataset, with additional details captured in SUPPQUAL. If the standard variables (e.g., USUBJID, VISITNUM, --TESTCD) do not uniquely identify a record, SUPPQUAL variables such as QNAM or QVAL can be incorporated to achieve uniqueness.

Strategies for Incorporating SUPPQUAL Variables

When incorporating SUPPQUAL variables into the natural key, it is important to:

  • Select SUPPQUAL variables that are consistently populated and relevant to the uniqueness of the records.
  • Ensure that the selected SUPPQUAL variables contribute to the overall sort order and are aligned with the study's data structure.
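Before committing to a key, it is worth checking that the candidate combination really is unique. The sketch below does that in SAS under stated assumptions: the datasets lb and supplb, their values, and the qualifier name LBCOND are all hypothetical, and STUDYID and RDOMAIN are omitted to keep the example short.

```sas
/* Hypothetical parent domain: two records share USUBJID, VISITNUM, and LBTESTCD */
data lb;
    input usubjid :$10. visitnum lbtestcd :$8. lbseq;
    datalines;
S001 1 GLUC 1
S001 1 GLUC 2
;
run;

/* Hypothetical supplemental qualifiers; LBCOND is an invented QNAM */
data supplb;
    input usubjid :$10. idvar :$8. idvarval :$8. qnam :$8. qval :$20.;
    datalines;
S001 LBSEQ 1 LBCOND FASTING
S001 LBSEQ 2 LBCOND FED
;
run;

proc sql;
    /* Attach the qualifier value to each parent record through IDVAR/IDVARVAL */
    create table lb_keyed as
    select p.*, s.qval as lbcond
    from lb as p
    left join supplb as s
      on  p.usubjid = s.usubjid
      and s.idvar   = 'LBSEQ'
      and s.qnam    = 'LBCOND'
      and p.lbseq   = input(s.idvarval, 8.);

    /* Any rows in KEY_DUPS mean the candidate key is still not unique */
    create table key_dups as
    select usubjid, visitnum, lbtestcd, lbcond, count(*) as n_records
    from lb_keyed
    group by usubjid, visitnum, lbtestcd, lbcond
    having count(*) > 1;
quit;
```

With this sample data the two records differ only in the qualifier value, so key_dups should come back empty once LBCOND is part of the key. Dropping lbcond from the GROUP BY shows the duplication that the standard variables alone leave behind.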

Documenting SUPPQUAL Natural Keys in Define.XML

Documenting SUPPQUAL variables in define.xml requires careful attention to detail. Here is a step-by-step guide:

  1. Identify the SUPPQUAL variables that need to be included in the natural key.
  2. In the ItemGroupDef section of define.xml, give the ItemRef element for each of these variables a KeySequence attribute that states its position in the key.
  3. Provide clear documentation in the ItemDef section, describing the role of each SUPPQUAL variable in the natural key.

Example XML snippet:

<ItemGroupDef OID="IG.SUPPQUAL" Name="SUPPQUAL" Repeating="Yes" IsReferenceData="No" Purpose="Tabulation">
    <!-- KeySequence gives each variable's position in the natural key; Mandatory is a required ItemRef attribute -->
    <ItemRef ItemOID="IT.USUBJID" OrderNumber="1" Mandatory="Yes" KeySequence="1"/>
    <ItemRef ItemOID="IT.RDOMAIN" OrderNumber="2" Mandatory="Yes" KeySequence="2"/>
    <ItemRef ItemOID="IT.IDVARVAL" OrderNumber="3" Mandatory="No" KeySequence="3"/>
    <ItemRef ItemOID="IT.QNAM" OrderNumber="4" Mandatory="Yes" KeySequence="4"/>
</ItemGroupDef>

Conclusion

Using SUPPQUAL variables as part of the natural key in define.xml can be a powerful strategy for ensuring accurate and comprehensive dataset documentation. By carefully selecting and documenting these variables, you can enhance the quality and integrity of your clinical trial data.
