StudySAS Blog: Mastering Clinical Data Management with SAS

Welcome to StudySAS, your ultimate guide to clinical data management using SAS. We cover essential topics like SDTM, CDISC standards, and Define.XML, alongside advanced PROC SQL and SAS Macros techniques. Whether you're enhancing your programming efficiency or ensuring compliance with industry standards, StudySAS offers practical tips and insights to elevate your clinical research expertise. Join us and stay ahead in the evolving world of clinical data.

How to avoid data set merging problems when common BY variable has different lengths?

When you merge two datasets on a common BY variable and that variable has a different length in each dataset, the merge can produce unexpected results. If you use SAS 9.2, as I do, the DATA step informs you with the following warning:

WARNING: Multiple lengths were specified for the BY variable ****** by input data sets. This may cause unexpected results.

It is good that, starting with SAS 9.2, the DATA step issues a warning to inform the programmer. In earlier versions, this potential disaster is difficult to notice.

When we see this WARNING in the SAS log, we might be inclined to ignore it because it is just a warning, never realizing the potential danger. Instead, we should investigate it, because SAS will do exactly what it states: it may cause unexpected results. In some cases the datasets will not merge at all, and in others they will merge only partially.

Let’s look at the following example.

data table1;

length visit $13; * LENGTH IS 13;

visit = "CYCLE 1 DAY 1";

visitnum = 1;

run;

data table2;

length visit $14; * LENGTH IS 14;

visit = "CYCLE 1 DAY 10";

visitnum = 3;

run;

proc sort data=table1; by visit;run;

proc sort data=table2; by visit;run;

TABLE 1;

VISIT	VISITNUM
CYCLE 1 DAY 1	1

TABLE 2;

VISIT	VISITNUM
CYCLE 1 DAY 10	3

*Merge 2 datasets together with common BY variable with different lengths;

data table1_2;

merge table1 table2;

by visit;

run;

*Here is the LOG FILE;

2714

2715 data table1_2;

2716 merge table1 table2;

2717 by visit;

2718 run;

WARNING: Multiple lengths were specified for the BY variable VISIT by input data sets. This may cause unexpected results.

NOTE: There were 1 observations read from the data set WORK.TABLE1.

NOTE: There were 1 observations read from the data set WORK.TABLE2.

NOTE: The data set WORK.TABLE1_2 has 1 observations and 2 variables.

NOTE: DATA statement used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

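As a result of the different lengths, SAS adds only one record to the output dataset rather than 2. The length of VISIT in the output dataset is taken from the first dataset listed in the MERGE statement (13 here), so the TABLE2 value "CYCLE 1 DAY 10" is truncated to "CYCLE 1 DAY 1" and falls into the same BY group as the TABLE1 record.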

*WRONG OUTPUT CREATED;

VISIT	VISITNUM
CYCLE 1 DAY 1	3

*CORRECT OUTPUT SHOULD BE;

VISIT	VISITNUM
CYCLE 1 DAY 1	1
CYCLE 1 DAY 10	3

*To avoid this potential disaster, I have used Proc SQL to create a macro variable that holds the maximum length of the VISIT variable across all the datasets in the WORK library.;

proc sql;

select max(length) into :maxlength

from sashelp.vcolumn

where libname='WORK'

and upcase(name)="VISIT"; /* NAME keeps the case used when the variable was created, so compare in upper case */

quit;

*The length of VISIT is 13 in TABLE1 and 14 in TABLE2. Since I now know the maximum length, I use it here;

data table1_2;

length visit $ &maxlength;

merge table1 table2;

by visit;

run;

*THIS RESULTS IN CORRECT OUTPUT;

VISIT	VISITNUM
CYCLE 1 DAY 1	1
CYCLE 1 DAY 10	3
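
Before merging, it can also help to list the length of the BY variable in each input dataset so that a mismatch is visible up front. The query below is a minimal sketch that assumes the TABLE1 and TABLE2 datasets from the example above exist in the WORK library; it reads the DICTIONARY.COLUMNS table, which holds the same information as SASHELP.VCOLUMN.

```sas
/* List the length of VISIT in each input dataset before merging */
proc sql;
  select memname, name, type, length
    from dictionary.columns
    where libname = 'WORK'
      and memname in ('TABLE1', 'TABLE2')
      and upcase(name) = 'VISIT';
quit;
```

If the LENGTH column shows different values for the two datasets, a LENGTH statement placed before the MERGE statement, as shown above, avoids the truncation.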

Basic Differences Between Proc MEANS and Proc SUMMARY

Proc MEANS and Proc SUMMARY are two procedures used to compute descriptive statistics for numeric variables. They are essentially the same, but there are two differences between them.

1) By default, Proc MEANS produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not.

2) Without a VAR statement, Proc SUMMARY produces only a simple count of observations, whereas Proc MEANS by default computes the descriptive statistics for all the numeric variables.

Here is an excerpt from a post which details the differences:

Proc SUMMARY and Proc MEANS are essentially the same procedure. Both procedures compute descriptive statistics. The main difference concerns the default type of output they produce. Proc MEANS by default produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not. Inclusion of the print option on the Proc SUMMARY statement will output results to the output window.

The second difference between the two procedures is reflected in the omission of the VAR statement. When all variables in the data set are character, the same output, a simple count of observations, is produced for each procedure. However, when some variables in the dataset are numeric, Proc MEANS analyses all numeric variables not listed in any of the other statements and produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum).
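
The two differences described above can be seen with a short sketch. It uses SASHELP.CLASS, a sample dataset shipped with SAS, purely for illustration; the output dataset name STATS is an arbitrary choice.

```sas
/* Proc MEANS: prints N, Mean, Std Dev, Minimum and Maximum */
/* for every numeric variable, even without a VAR statement */
proc means data=sashelp.class;
run;

/* Proc SUMMARY: prints nothing by default, so the results */
/* are usually sent to an output dataset instead           */
proc summary data=sashelp.class;
  var height weight;
  output out=stats mean= std= / autoname;
run;

/* Adding PRINT makes Proc SUMMARY behave like Proc MEANS */
proc summary data=sashelp.class print;
  var height weight;
run;

/* Without VAR, Proc SUMMARY reports only a count of observations */
proc summary data=sashelp.class print;
run;
```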

Exploring the Analysis Data Model – ADaM Datasets

Today, I stumbled upon a blog which is interesting and resourceful. I liked the article so much that I want to share it with all my friends here.

Actual Article:

The Analysis Data Model (ADaM) is a standard released by the Clinical Data Interchange Standards Consortium (CDISC) and has quickly become widely used in the submission of clinical trial information. ADaM has very close ties to another of CDISC's released standards, the Study Data Tabulation Model (SDTM).

The main difference between these two CDISC standards is the way in which the data is displayed. SDTM provides a standard for the creation and mapping of collected data from raw sources, whereas ADaM provides a standard for the creation of analysis-ready data, often using SDTM data as the source.

The purpose of the analysis-ready ADaM data is to provide the programmer with a means to create tables, listings and figures with minimal time and effort whilst ensuring a clear level of traceability in the derived values. This is a key factor of ADaM data as there is a need for a clear and unambiguous flow from the study tabulation data to the analysis data which supports the statistical analyses performed in a clinical study.

CDISC states the following key principles for Analysis Datasets:

facilitate clear and unambiguous communication and provide a level of traceability
be useable by currently available tools
be linked to machine-readable metadata
be analysis-ready

To perform statistical analysis on a study, data may be required from many domains, such as labs, adverse events, demographics and subject characteristics. Bringing this data into ADaM datasets and performing any complex derivations required for display endpoints means that no further data manipulation is required to produce statistical outputs.

When creating the ADaM datasets, the requirements of the analyses must be taken into consideration. This will ensure the desired number of datasets is produced; at the very least, a subject-level dataset is required. Some of the data will be duplicated between domains, for example Age and Gender data. This is acceptable as this will aid the output creation or data review.

The naming convention for the datasets will follow “ADxxxx”, where the “xxxx” part will be sponsor-defined, for example AE for adverse events and LB for laboratory results. The subject-level dataset, which will be discussed later, will be named “ADSL”. For the ADaM variables, the naming conventions should follow the standardized variable names defined in the ADaM Implementation Guide (ADaMIG). Any variables from the SDTM which are used directly in the ADaM dataset should keep the same variable name to avoid confusion. Sponsor-defined variable names will be given to any other analysis variables. Following these conventions will provide clarity for the reviewer.

As previously mentioned, a key requirement for ADaM data is a subject-level analysis dataset. This dataset and its documentation are always required – even if no other data is submitted. The subject-level dataset, or “ADSL” as it is named within ADaM conventions, contains a record for each subject with variables which display key information for subject disposition, demographic, and baseline characteristics.

Other variables within ADSL will contain planned or actual treatment group information as well as key dates and times of the subject's participation in the study. Not all variables within ADSL may be used directly for analysis, but they could be used in conjunction with other datasets for display or grouping purposes, or possibly included simply as variables of interest for review.

To conclude, the CDISC summary of ADSL is as follows: “The critical variables in ADSL will include those that are either descriptive, known to affect the subject’s response to drug (in terms of either efficacy or safety), used as strata for randomization, or identify the subject or event as belonging to specific subgroups (e.g. population flags). For example, subjects may be randomized after being stratified by age group because it is believed that younger subjects respond differently to the study drug. In this situation, a subject’s age category would be considered a critical variable for a study and included in ADSL.”
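
To make the idea of a subject-level dataset more concrete, here is a minimal sketch of how a few ADSL variables might be derived from demographics data. Everything in it is illustrative: the two DM records are made up, the age cut-off of 65 is arbitrary, and the rule used for the safety flag is an assumption, not a CDISC requirement. AGEGR1, TRT01P and SAFFL are variable names from the ADaMIG.

```sas
/* Hypothetical SDTM-like DM data, for illustration only */
data dm;
  length studyid $8 usubjid $12 sex $1 arm $20 rfstdtc $10;
  input studyid $ usubjid $ age sex $ arm $ rfstdtc $;
  datalines;
ABC1004 ABC1004-001 34 M DRUG 2012-07-13
ABC1004 ABC1004-002 67 F PLACEBO 2012-07-15
;
run;

/* One record per subject, with SDTM variable names kept as they are */
data adsl;
  set dm;
  length agegr1 $5 trt01p $20 saffl $1;

  /* Age category (assumed cut-off), usable as a subgroup variable */
  if missing(age) then agegr1 = ' ';
  else if age < 65 then agegr1 = '<65';
  else agegr1 = '>=65';

  /* Planned treatment, copied here from the SDTM ARM value */
  trt01p = arm;

  /* Assumed rule: in the safety population if a reference start date exists */
  if not missing(rfstdtc) then saffl = 'Y';
  else saffl = 'N';
run;
```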

I hope you guys liked it.

Sarath

How to use MISSING(), NMISS() and the CMISS() functions

SAS provides several functions to test for missing values, but in this post we will focus on the MISSING(), CMISS() and NMISS() functions. The NMISS() function is reserved for numeric variables. The MISSING() and CMISS() functions can be used with either character or numeric variables. The CMISS() and NMISS() functions count the number of arguments with missing values, whereas the MISSING() function checks whether or not a single variable is missing. Together, these functions provide a simple approach to checking for missing values, and they let you write fewer lines of code by avoiding large IF statements when you need to check several variables at the same time.

The MISSING() function is very useful when you need to check whether a variable has a missing value but are not sure whether it is character or numeric. MISSING works for either character or numeric variables, and it also checks for the special numeric missing values (.A, .B, .C, ._ etc.). The MISSING() function returns a numeric result: 1 if the value is missing and 0 if it is present. In a condition, MISSING(varname) is the same as MISSING(varname)=1, and MISSING(varname)=0 is true when the data point is present.

The MISSING function is particularly useful if you use special missing values since 'if varname=.' will not identify all missing values in such cases.

NOTE: The representation of a missing value in SAS differs between numeric and character variables. A single period (.) represents the numeric missing value. A single blank enclosed in single or double quotes (' ' or " ") represents the character missing value. A single period followed by a single letter or an underscore (e.g. .A, .B, .Z, ._) represents a special numeric missing value. Please note that special missing values are available for numeric variables only.
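
The short DATA step below is a sketch of why MISSING() is safer than a comparison with a single period when special missing values may be present. The dataset and variable names are made up for the illustration.

```sas
data check_missing;
  x = .A;            /* special numeric missing value */
  c = ' ';           /* character missing value       */

  eq_dot  = (x = .);      /* 0: .A is not equal to the ordinary missing value */
  fn_num  = missing(x);   /* 1: MISSING() recognizes special missing values   */
  fn_char = missing(c);   /* 1: MISSING() also works for character variables  */

  put eq_dot= fn_num= fn_char=;
run;
```

The log shows eq_dot=0 fn_num=1 fn_char=1, so a test written as 'if x=.' would have treated this record as non-missing.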

The NMISS() function counts the number of arguments with missing values in the specified list of numeric variables. It is very useful when you want to make sure that at least one variable in the list is not missing.

CMISS() is available with SAS 9.2 and SAS Enterprise Guide 4.3 and is similar to the NMISS() function. The only difference is that it counts the number of arguments that are missing for both character and numeric variables.

The NMISS() function returns the number of argument variables which have missing values. NMISS works with multiple numeric values, whereas MISSING works with only one value that can be either numeric or character.

Examples:

* count the number of the variables A, B, and C which have missing values;

count=nmiss(A, B, C);

count=nmiss(of A B C);

* count the number of the variables from Var1 to Var10 which have missing values;

count=nmiss(of var1-var10);

Examples:

x1=nmiss(1,0,.,2,5,.); /* returns 2 */
x2=nmiss(1,0); /* returns 0 */
x3=nmiss(of x1-x2); /* returns 0 */
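
For comparison, here is a small sketch of CMISS() applied to a mix of character and numeric variables, next to NMISS() on the numeric ones only. The variable names and values are invented for the example.

```sas
data _null_;
  name   = ' ';   /* character, missing */
  sex    = 'M';   /* character, present */
  age    = .;     /* numeric, missing   */
  weight = 60;    /* numeric, present   */

  n_all = cmiss(name, sex, age, weight);  /* 2: name and age are missing */
  n_num = nmiss(age, weight);             /* 1: only age is missing      */

  put n_all= n_num=;
run;
```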

For more details, refer to "Using the CMISS, NMISS and MISSING Functions".

For more details regarding the special missing values, please also refer to Special Missing Values in SAS (http://studysas.blogspot.com/2010/04/special-missing-values.html).

References:

1) Missing values in SAS (http://www.pauldickman.com/teaching/sas/missing.php)

2) MISSING! - Understanding and Making the Most of Missing Data: SUGI 31: Suzanne M. Humphreys, PRA International, Victoria, BC (Canada).

3) Special Missing Values in SAS (http://studysas.blogspot.com/2010/04/special-missing-values.html)

4) Usage Note 36480 KNOWLEDGE BASE / SAMPLES & SAS NOTES from support.sas.com

5) SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition.

6) Carpenter's Guide to Innovative SAS Techniques, Art Carpenter (Page:99)

Studyday calculation ( --DY Variable in SDTM)

USE OF THE “STUDY DAY” VARIABLES

The permissible Study Day variables (--DY, --STDY, and --ENDY) describe the relative day of the observation starting with the reference date as Day 1. They are determined by comparing the date portion of the respective date/time variables (--DTC, --STDTC, and --ENDTC) to the date portion of the Subject Reference Start Date (RFSTDTC from the Demographics domain).

The Subject Reference Start Date (RFSTDTC) is designated as Study Day 1. The Study Day value is incremented by 1 for each date following RFSTDTC. Dates prior to RFSTDTC are decremented by 1, with the date preceding RFSTDTC designated as Study Day -1 (there is no Study Day 0). This algorithm for determining Study Day is consistent with how people typically describe sequential days relative to a fixed reference point, but creates problems if used for mathematical calculations because it does not allow for a Day 0. As such, Study Day is not suited for use in subsequent numerical computations, such as calculating duration. The raw date values should be used rather than Study Day in those calculations.

Reference: Study Data Tabulation Model Implementation Guide v3.1.2 (Page No 40).

You will find that you need to create --DY and/or --STDY/--ENDY variables in almost all the SDTM domains. Because the derivation is the same each time, it makes sense to write a macro and use it across all the domains.

/****************************************************************
* Study Number            : ABCD_0123
* Sponsor Protocol Number : ABC1004
* Program Name            : studyday.sas
* Program Location        : X:\PROJECT\DEPT\ABC1004\Progs\macros
* Description             : StudyDAY Macro
* Program Author          : Sarath Annapareddy
* Creation Date           : 13-Jul-2012
* Macro Parameters:
*   dsn    : Dataset that contains the --DTC variable.
*   var    : --DTC variable for which the study day is calculated.
*   dy     : Name of the study day variable to create (e.g. AESTDY).
*   rfdate : Reference date variable (e.g. RFSTDTC from DM).
*
* Notes: Macro must be used outside the DATA step.
****************************************************************/
/************* Setup Section ************/

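%macro make_studyday(dsn,var,dy,rfdate);

/* Get the reference start date from the DM dataset */
proc sort data=interim.dm out=dm(keep=usubjid rfstdtc);
by usubjid;
run;

proc sort data=&dsn;
by usubjid;
run;

data &dsn;
merge &dsn (in=a) dm;
by usubjid;
if a;
/* Numeric dates from the date portion (first 10 characters) of each --DTC value. */
/* Only complete dates (YYYY-MM-DD) are converted, so partial dates stay missing. */
if length(&rfdate)>=10 then &rfdate._n=input(substr(&rfdate,1,10),anydtdte10.);
if length(&var)>=10 then &var._n=input(substr(&var,1,10),anydtdte10.);

/* Study day derivation: 1 is added on or after the reference date, so there is no Day 0 */
if nmiss(&var._n,&rfdate._n)=0 then &dy=&var._n-&rfdate._n+(&var._n>=&rfdate._n);
run;
%mend make_studyday;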

A sample macro call of this SAS macro for the Adverse Events (AE) domain might look like this:

%make_studyday(ae,aestdtc,aestdy,rfstdtc);
%make_studyday(ae,aeendtc,aeendy,rfstdtc);
%make_studyday(ae,aedtc,aedy,rfstdtc);

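For pre-dose (the event/visit date is before the first dose date):

studyday = the event/visit date - first dose date

For post-dose (the event/visit date is on or after the first dose date):

studyday = the event/visit date - first dose date + 1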
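
As a quick check of these two formulas, the sketch below computes the study day for the five dates around a reference date. The reference date of 13-Jul-2012 is made up for the illustration; the expression is the same one used in the macro.

```sas
data _null_;
  rfstdt = '13JUL2012'd;   /* hypothetical first dose date */
  do evtdt = '11JUL2012'd to '15JUL2012'd;
    /* The comparison returns 1 on or after the reference date, 0 before it */
    studyday = evtdt - rfstdt + (evtdt >= rfstdt);
    put evtdt= date9. studyday=;
  end;
run;
```

The study days written to the log are -2, -1, 1, 2 and 3, which shows the jump from Day -1 to Day 1 with no Day 0.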

Disclosure:

In the spirit of transparency and innovation, I want to share that some of the content on this blog is generated with the assistance of ChatGPT, an AI language model developed by OpenAI. While I use this tool to help brainstorm ideas and draft content, every post is carefully reviewed, edited, and personalized by me to ensure it aligns with my voice, values, and the needs of my readers. My goal is to provide you with accurate, valuable, and engaging content, and I believe that using AI as a creative aid helps achieve that. If you have any questions or feedback about this approach, feel free to reach out. Your trust and satisfaction are my top priorities.
