Thursday, November 14, 2013

How to avoid data set merging problems when common BY variable has different lengths?

When merging 2 datasets with a common by-variable and when that common variable has different variable length, the merge process produce unexpected results.  If you use SAS 9.2 version like me, then SAS Data step will inform you with the following warning:

WARNING: Multiple lengths were specified for the BY variable ****** by input data sets. This may cause unexpected results.

It is good that at least starting SAS 9.2 version, data step issues a Warning message to inform the programmer. But if you use before versions, it is difficult to notice this potential disaster.  

When you see this WARNING message in the SAS log, we might be inclined to ignore this warning because we think this is just a WARNING never realizing the potential danger. When you see this message in the LOG we should be thinking about this instead of avoiding because SAS will do exactly what it states: it may cause unexpected results. In some cases merge won’t even happen between datasets and sometimes the partial merge between the datasets.

Let’s look at the following example.
data table1;
          length visit $13; * LENGTH IS 13;
          visit = "CYCLE 1 DAY 1";
          visitnum = 1;
run;

data table2;
          length visit $14; * LENGTH IS 14;
          visit = "CYCLE 1 DAY 10";
          visitnum = 3;
run;

proc sort data=table1;      by visit;run;
proc sort data=table2;      by visit;run;

TABLE 1;
VISIT
VISITNUM
CYCLE 1 DAY 1
1

TABLE 2;
VISIT
VISITNUM
CYCLE 1 DAY 10
3

*Merge 2 datasets together with common BY variable with different lengths;
data table12;
          merge table1 table2;
          by visit;
run;

*Here is the LOG FILE;

2714
2715  data table1_2;
2716            merge table1 table2;
2717            by visit;
2718  run;

WARNING: Multiple lengths were specified for the BY variable VISIT by input data sets. This may cause unexpected results.
NOTE: There were 1 observations read from the data set WORK.TABLE1.
NOTE: There were 1 observations read from the data set WORK.TABLE2.
NOTE: The data set WORK.TABLE1_2 has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

As a result of different lengths, SAS adds only one record to the output dataset rather than 2.

*WRONG OUTPUT CREATED;
VISIT
VISITNUM
CYCLE 1 DAY 1
3

*CORRECT OUTPUT SHOULD BE;
VISIT
VISITNUM
CYCLE 1 DAY 1
1
CYCLE 1 DAY 10
3


*To avoid this potential disaster, I’ve used Proc SQL and then created a macro variable with the maximum length of Visit variable in all the datasets in the WORK directory.;

proc sql;
          select max(length) into :maxlength
          from sashelp.vcolumn
          where libname='WORK'
          and name="VISIT";
quit;

*Visit length form TABLE1 is 13 and from TABLE2 is 14, so since I know the maximum length I will used that in here;

data table1_2;
          length visit $ &maxlength;
          merge table1 table2;
          by visit;
run;

*THIS RESULTS IN CORRECT OUTPUT;
VISIT
VISITNUM
CYCLE 1 DAY 1
1
CYCLE 1 DAY 10
3


Thursday, July 25, 2013

Basic Differences Between Proc MEANS and Proc SUMMARY

Though Proc Means and Proc Summary are 2 different procedures essentially used to compute descriptive statistics of numeric variables, but there are differences between these two. ( 
1)  By default, Proc MEANS  produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not.  
2) Proc Summary only produces the descriptive statistics for the variables that are specified in the VAR statement, where as Proc Means by default, computes the descriptive statistics of the numeric variables even without the VAR statement.
Here is a post which details the differences:
Excerpt
Proc SUMMARY and Proc MEANS are essentially the same procedure.  Both procedures compute descriptive statistics.  The main difference concerns the default type of output they produce.  Proc MEANS by default produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not.  Inclusion of the print option on the Proc SUMMARY statement will output results to the output window.
The second difference between the two procedures is reflected in the omission of the VAR statement.  When all variables in the data set are character the same output: a simple count of observations, is produced for each procedure.  However, when some variables in the dataset are numeric, Proc MEANS analyses all numeric variables not listed in any of the other statements and produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum). 


Thursday, March 14, 2013

Exploring the Analysis Data Model – ADaM Datasets

Today, I stumbled upon a blog which is interesting and resourceful.  I liked the article so much so want to share with all my friends here.

Here is the direct link for the post to download or to review:
Actual Article:


The Analysis Data Model (ADaM) is a standard released by the Clinical Data Interchange Standards Consortium (CDISC) and has quickly become widely used in the submission of clinical trial information. ADaM has very close ties to another of CDISCs released standards, Study Data Tabulation Model (SDTM).
The main difference between these two CDISC standards is the way in which the data is displayed. SDTM provides a standard for the creation and mapping of collected data from Raw sources, where as ADAM provides a standard for the creation of analysis-ready data, often using SDTM data as the source.
The purpose of the analysis-ready ADaM data is to provide the programmer with a means to create tables, listings and figures with minimal time and effort whilst ensuring a clear level of traceability in the derived values. This is a key factor of ADaM data as there is a need for a clear and unambiguous flow from the study tabulation data to the analysis data which supports the statistical analyses performed in a clinical study.
CDISC state the following key principles for Analysis Datasets:
  • facilitate clear and unambiguous communication and provide a level of traceability 
  • be useable by currently available tools 
  • be linked to machine-readable metadata 
  • be analysis-ready
To perform statistical analysis on a study, data maybe required from many domains, such as labs, adverse events, demographics and subject characteristics. Bringing this data into ADaM datasets and performing any complex derivations required for display endpoints means that no further data manipulation is required to produce statistical outputs.

When creating the ADaM datasets the requirements of the analyses must be taken into consideration. This will ensure the desired numbers of datasets are produced – at the very least; a subject level dataset is required. Some of the data will be duplicated between domains, for example Age and Gender data. This is acceptable as this will aid the output creation or data review.

The naming convention for the datasets will follow “ADxxxx”, where the “xxxx” part will be sponsor-defined - AE for adverse events, LB for Laboratory results for example. The subject-level dataset, which will be discussed later, will be named “ADSL”. For the ADaM variables, the naming conventions should follow the standardized variable names defined in the ADaM Implementation Guide (ADaMIG). Any variables from the SDTM which are used directly in the ADaM dataset should keep the same variable name to avoid confusion. Sponsor-defined variable names will be given to any other analysis variables. Following these conventions will provide clarity for the reviewer.

As previously mentioned, a key requirement for ADaM data is a subject-level analysis dataset. This dataset and its documentation are always required – even if no other data is submitted. The subject-level dataset, or “ADSL” as it is named within ADaM conventions, contains a record for each subject with variables which display key information for subject disposition, demographic, and baseline characteristics. 

Other variables within ADSL will contain planned or actual treatment group information as well key dates and times of the subjects study participation on the study. Not all variables within ADSL may be used directly for analysis but could be used in conjunction with other datasets for display or grouping purposes or possibly included simply as variables of interest for review.

To conclude, the CDISC summary of ADSL is as follows: “The critical variables in ADSL will include those that are either descriptive, known to affect the subject’s response to drug (in terms of either efficacy or safety), used as strata for randomization, or identify the subject or event as belonging to specific subgroups (e.g. population flags). For example, subjects may be randomized after being stratified by age group because it is believed that younger subjects respond differently to the study drug. In this situation, a subject’s age category would be considered a critical variable for a study and included in ADSL.

I hope you guys liked it.

Sarath

Saturday, January 26, 2013

How to use MISSING(), NMISS() and the CMISS() functions



SAS provides several functions to test for missing values but in this post we will focus on MISSING(), CMISS() and NMISS() functions. The NMISS() function is reserved for numeric variables. The MISSING() and CMISS() functions can be used with either character or numeric variables.  The CMISS() and NMISS() functions are designed by SAS to count the number of arguments with missing values whereas the MISSING function checks whether or not a variable is missing. The MISSING(), CMISS(), and  NMISS() functions provide a simple approach to check for missing values and these functions let you write few lines of code by avoiding large if-statements when you need to check for missing values in several values at the same time.

MISSING() function is very useful when you need to check any variable has a missing value or not, but not sure if it’s a character or numeric? MISSING function works for either character or numeric variables and it also checks for the special numeric missing values (.A, .B,.C ._ etc)as well. The MISSING() function produces a numeric result (0 or 1) if the data point is present or missing. MISSING(varname) is the same as MISSING(varname)=1.  MISSING(varname)=0 specifies when the data point is present.

The MISSING function is particularly useful if you use special missing values since 'if varname=.' will not identify all missing values in such cases.

NOTE: Missing value is not consistent in SAS as it changes between numeric and character variables. A single period (.) represents the Numeric missing value. A single blank enclosed in single or double quotes (' ' or “  ” ) represents the Character missing value. A single period followed by a single letter or an underscore (ex: .A, .B, .Z, ._) represents Special numeric missing values. Please note that these special missing values available for numeric variables only.

The NMISS() function will count the number of arguments with missing values in the specified list of numeric variables. NMISS() function is very useful if you want to make sure if at least one variable in the list is not missing.

The CMISS() is available with SAS 9.2 and SAS Enterprise Guide 4.3 and is similar to the NMISS() function. The only difference is that it counts the number arguments that are missing for both character and numeric variables.

The NMISS() function returns the number of argument variables which have missing values. NMISS works with multiple numeric values, whereas MISSING works with only one value that can be either numeric or character.

Examples:
* count the number of the variables A, B, and C which have missing values;
count=nmiss(A, B, C);
count=nmiss(of A B C);

* count the number of the variables from Var1 to Var10 which have missing values;
count=nmiss(of var1-var10);


Examples:
x1=nmiss(1,0,.,2,5,.);
2
x2=nmiss(1,0);
0
x3=nmiss(of x1-x2);
0

For more details refer to this page. (USING the CMISS, NMISS and MISSING FUNCTIONS)
For more details regarding the special missing values, please also refer to Special Missing Values in SAS (http://studysas.blogspot.com/2010/04/special-missing-values.html).

References:
1)     Missing values in SAS (http://www.pauldickman.com/teaching/sas/missing.php);
2)     MISSING! - Understanding and Making the Most of Missing Data: SUGI 31: Suzanne M. Humphreys, PRA International, Victoria, BC (Canada).
3)     Special Missing Values in SAS (http://studysas.blogspot.com/2010/04/special-missing-values.html)
4)     Usage Note 36480 KNOWLEDGE BASE / SAMPLES & SAS NOTES from support.sas.com
5)     SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition.
6)     Carpenter's Guide to Innovative SAS Techniques, Art Carpenter (Page:99)




Saturday, January 12, 2013

Studyday calculation ( --DY Variable in SDTM)




USE OF THE “STUDY DAY” VARIABLES

The permissible Study Day variables (--DY, --STDY, and --ENDY) describe the relative day of the observation starting with the reference date as Day 1. They are determined by comparing the date portion of the respective date/time variables (--DTC, --STDTC, and --ENDTC) to the date portion of the Subject Reference Start Date (RFSTDTC from the Demographics domain).

The Subject Reference Start Date (RFSTDTC) is designated as Study Day 1. The Study Day value is incremented by 1 for each date following RFSTDTC. Dates prior to RFSTDTC are decremented by 1, with the date preceding RFSTDTC designated as Study Day -1 (there is no Study Day 0). This algorithm for determining Study Day is consistent with how people typically describe sequential days relative to a fixed reference point, but creates problems if used for mathematical calculations because it does not allow for a Day 0. As such, Study Day is not suited for use in subsequent numerical computations, such as calculating duration. The raw date values should be used rather than Study Day in those calculations.

Reference: Study Data Tabulation Model Implementation Guide v3.1.2 (Page No 40).

You will find that  you need to create --DY and or --STDY /--ENDY varianles in almost all the SDTM domains. Because the process of the derivation is same, it makes sense to create a macro code and use it across all the domains...

/****************************************************************

*Study Number              :ABCD_0123
*Sponsor Protocol Number   : ABC1004
*Program Name              : studyday.sas
*Program Location          : X:\PROJECT\DEPT\ABC1004\Progs\macros
*Description               : StudyDAY Macro 
*Program Author            : Sarath Annapareddy
*Creation Date             : 13-Jul-2012
*Macro Parameters: 
  rfdate: --DTC variable used to calculate  Study day variable.
  var   : --DTC variable used to calculate the  Study day  to.
  dy    : Prefix of the Study day variable
 dsn    : Dataset in which the --DTC variable used to calculate the  Study day  to exists.

*Notes: Macro must be used outside the datastep.

****************************************************************;
/*************            Setup Section            ************/
/**************************************************************/




%macro make_studyday(dsn,var,dy,rfdate);

*Getting the Baseline or Reference start date from DM dataset;
proc sort data=interim.dm out=dm(keep=usubjid rfstdtc);  
by usubjid;
run;

proc sort data=&dsn;
      by usubjid;
run;

data &dsn;
        merge &dsn (in=a) dm;
    by usubjid;
    if a;
/*Numeric date variable;*/
       &rfdate._n=input(substr(&rfdate,1,10),anydtdte10.);
       &var._n=input(substr(&var,1,10),anydtdte10.);

/*Study day derivation;*/
if nmiss(&var._n,&rfdate._n)=0 then &dy=&var._n-&rfdate._n+(&var._n>=&rfdate._n); 
run;
%mend;



A sample macro call of this SAS macro for the Adverse Events (AE) domain might look like this:

%make_studyday(ae,aestdtc,aestdy,rfstdtc);
%make_studyday(ae,aeendtc,aeendy,rfstdtc);
%make_Gstudyday(ae,aedtc,aedy,rfstdtc);





For pre-dose:
         studyday= the event/visit date – first dose date
For post-dose:
           studyday= the event/visit date – first dose date + 1







Thursday, November 29, 2012

Creating Custom or Non-Standard CDISC SDTM Domains

Here is the nice article about creating custom SDTM domains.........

Creating Custom or Non-Standard CDISC SDTM Domains


Within the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM), standard domains are split into four main types: special purpose, relationships, trial design and general observation classes. General observation classes cover the majority of observations collected during a study and can be divided among three general classes:
  • The Interventions class captures investigational, therapeutic and other treatments that are administered to the subject (with some actual or expected physiological effect) either as specified by the study protocol (e.g., “exposure”), coincident with the study assessment period (e.g., “concomitant medications”), or other substances self-administered by the subject (such as alcohol, tobacco, or caffeine).
  • The Events class captures planned protocol milestones such as randomization and study completion, and occurrences, conditions, or incidents independent of planned study evaluations occurring during the trial (e.g., adverse events) or prior to the trial (e.g., medical history).
  • The Findings class captures the observations resulting from planned evaluations to address specific tests or questions such as laboratory tests, ECG testing, and questions listed on questionnaires.
When creating a custom domain, one should first confirm that there are no published domains available that the data can fit with. This can be done by checking against the reserved domain codes listed in the appendices of the SDTM implementation guide or checking the CDISC website for any recently published domains. The following list of points are not acceptable when creating custom domains:
  • If there is a common topic where the nature of the data is the same as another published domain.
  • If the custom domain is due to separation based on time.
  • If the data have been collected or are going to be used for different reasons. For example, if a lab parameter is collected for efficacy purposes the data must be represented in the LB domain and not in a custom ‘efficacy’ domain.
  • Data that were collected on separate CRF modules or pages and may fit into an existing domain.
  • If it is necessary to represent relationships between data that are hierarchical in nature. The use of RELREC can be utilized instead,
Once it is confirmed that the data does not fit with any published domains, it should be determined which of the three general observation classes best fits the topic of the data. The custom domain must fit in to one of the three general observation classes. The next step is to determine a two-letter domain code for the custom domain. Note that this should not be the same as any already published or in discussion domain code. The domain codes X-, Y- and Z- are reserved for sponsor use, where the hyphen may be replaced by any letter or number. This domain code will be the name of the domain and will also be used to replace all prefixes of variables for the class. The following steps can then be followed to create the custom domain:
  1. Select and include the required Identifier variables (STUDYID, DOMAIN, USUBJID and --SEQ) and any permissible Identifier variables (--GRPID, --REFID and --SPID).
  2. Include the Topic variable from the identified general observation class (--TRT for interventions, --TERM for events and --TESTCD for Findings).
  3. Select and include the relevant Qualifier variables from the identified general observation class only. These can be found in Section 2.2.1, 2.2.2 and 2.2.3 of the Study Data Tabulation Model.
  4. Select and include the applicable Timing variables. These can be found in Section 2.2.5 of the Study Data Tabulation Model and relate to all general observation classes.
  5. Set the order of the variables within the domain: identifiers must be followed by topic variables, qualifiers and finally timing variables. The variables must then be ordered within these roles to match the order of variables given in Sections 2.2.1, 2.2.2, 2.2.3, 2.2.4 and 2.2.5 of the Study Data Tabulation Model. The variable order in the define.xml must also match the order of the variables within the domain.
    6. Adjust the labels of the variables only as appropriate to properly convey the meaning in the context of the data being submitted in the newly created domain. Use title case for all labels.
  6. Ensure that appropriate standard variables are being properly applied by comparing the use of variables in standard domains.
  7. Ensure that there are no sponsor-defined variables added to the domain. Any sponsor-defined variables should be in a Supplemental Qualifier dataset.
  8. Variable attributes within the domain and Supplemental Qualifier dataset must conform to the SAS Version 5 transport file conventions. For example, variable names must be no longer than 8 characters, variables labels must be no longer than 40 characters and data value lengths must be no longer than 200 characters. Also, where possible the domain should be less than 400 MB in size, otherwise one should contact their review division before splitting domains as they may accept domains with a larger file size.

References:
Study Data Tabulation Model, Version 1.2; CDISC Submission Data Standards Team.
Study Data Tabulation Model Implementation Guide: Human Clinical Trials, Version 3.1.2; CDISC Submission Data Standards Team.

Source: www.Info.quanticate.com


Wednesday, June 20, 2012

ENCODING=Dataset Option

Let me explain the reason writing this post….

My coworker was having problem reading in a SAS dataset that he got from the Sponsor. It was a SAS dataset encoded with UTF-8 and other coding related stuff.
When he tried to get in the rawdata using Libname statement

libname rawdata  /sas/SAS913/SASDATA/CLIENT /ABC123/raw’;
data datasetname;
set rawdata.datasetname;
run;

When he runs the SAS code above, SAS stops at the current block, and returns an error that looks like this:

ERROR: Some character data was lost during transcoding in the dataset RAWDATA.DATSETNAME.

NOTE: The data step has been abnormally terminated.

NOTE: The SAS System stopped processing this step because of errors.

NOTE: SAS set option OBS=0 and will continue to check statements. This may cause NOTE: No observations in data set.

NOTE: There were 20314 observations read from the data set RAWDATA.DATSETNAME.

WARNING: The data set WORK.DATASETNAME may be incomplete.  When this step was stopped there were 20314 observations and

         67 variables.

NOTE: DATA statement used (Total process time):

      real time           0.53 seconds

      cpu time            0.46 seconds

When he asked me why SAS stops in the middle, we were quick in taking the help of GOOGLE because we never saw this kind of ERROR message in the log. Unfortunately, the GOOGLE showed us so many links which has all the technical details. There were few options we saw in those links and nothing worked.  So after so many trials, we stumbled upon a way or we can say the solution, using ASCIIANY as the encoding option in the LIBNAME statement.
 

libname rawdata  /sas/SAS913/SASDATA/CLIENT /ABC123/raw’ inencoding=asciiany;



If you have only one dataset to use or you know the name of the dataset which has the encoding problem you could use the simple datastep too. Here is how…

data datasetname;set rawdata.datasetname (encoding='asciiany');
run;

If you refer to the SAS reference Guide, you will see this, which explains how this option works….



ENCODING= ANY | ASCIIANY | EBCDICANY | encoding-value

ANY
specifies that no transcoding occurs.
Note: ANY is a synonym for binary. Because the data is binary, the actual encoding is irrelevant.
ASCIIANY
specifies that no transcoding occurs when the mixed encodings are ASCII encodings.
Transcoding normally occurs when SAS detects that the session encoding and data set encoding are different. ASCIIANY enables you to create a data set that SAS will not transcode if the SAS session that accesses the data set has a session that encoding value of ASCII. If you transfer the data set to a machine that uses EBCDIC encoding, transcoding occurs.
EBCDICANY
specifies that no transcoding occurs when the mixed encodings are EBCDIC encodings.
The value for ENCODING= indicates that the SAS data set has a different encoding from the current session encoding. When you read data from a data set, SAS transcodes the data from the specified encoding to the session encoding. When you write data to a data set, SAS transcodes the data from the session encoding to the specified encoding.

For more details refer to the documentation….

Learn how to view SAS dataset labels without opening the dataset directly in a SAS session. Easy methods and examples included!

Quick Tip: See SAS Dataset Labels Without Opening the Data Quick Tip: See SAS Dataset Labels With...