Friday, August 30, 2024

10 Essential SAS Programming Tips for Boosting Your Efficiency

10 Essential SAS Programming Tips for Boosting Your Efficiency

As a SAS programmer, you're always looking for ways to streamline your code, improve efficiency, and enhance the readability of your programs. Whether you're new to SAS or a seasoned pro, these tips will help you optimize your workflows and make the most out of your programming efforts.

Here are ten essential SAS programming tips to elevate your coding skills:

  1. Harness the Power of PROC SQL for Efficient Data Manipulation
    PROC SQL can be a game-changer when it comes to handling complex data manipulations. It allows you to merge datasets, filter records, and create summary statistics all within a few lines of code, making your data processing more concise and effective.

        proc sql;
           select Name, mean(Salary) as Avg_Salary
           from employees
           group by Department
           having Avg_Salary > 50000;
        quit;
        
  2. Simplify Repetitive Tasks with ARRAY
    Repetitive calculations or transformations across multiple variables can clutter your code. Using an ARRAY simplifies these tasks, allowing you to apply changes to multiple variables in a structured and clean manner.

        data new_data;
           set original_data;
           array scores[5] score1-score5;
           do i = 1 to 5;
              scores[i] = scores[i] * 1.1;  /* Applying a 10% increase to all scores */
           end;
        run;
        
  3. Create Dynamic Macro Variables with CALL SYMPUT and CALL SYMPUTX
    Macro variables can make your SAS programs more flexible and reusable. CALL SYMPUT and CALL SYMPUTX allow you to create these variables dynamically during data steps, with CALL SYMPUTX offering the added benefit of trimming spaces.

        data _null_;
           set employees;
           call symputx('emp_count', _n_);
        run;
    
        %put &emp_count;
        
  4. Optimize Subsetting with WHERE Statements
    When subsetting data, WHERE statements are generally more efficient than IF statements. WHERE conditions filter data at the point of reading, which reduces the amount of data loaded into memory, speeding up processing times.

        data subset;
           set employees(where=(Salary > 50000));
        run;
        
  5. Streamline Data Recoding with PROC FORMAT
    PROC FORMAT is an incredibly versatile tool for recoding and grouping values. It enhances your data processing capabilities and improves code readability by allowing you to define and reuse custom formats.

        proc format;
           value salary_fmt
              low - 50000 = 'Low'
              50001 - 100000 = 'Medium'
              100001 - high = 'High';
        run;
    
        proc freq data=employees;
           tables Salary / format=salary_fmt.;
        run;
        
  6. Profile Your Data with PROC CONTENTS and PROC FREQ
    Before diving into analysis, it's crucial to understand the structure and distribution of your data. PROC CONTENTS gives you a detailed overview, while PROC FREQ provides insights into the distribution of categorical variables, helping you identify any data anomalies early on.

        proc contents data=employees; run;
    
        proc freq data=employees;
           tables Department / missing;
        run;
        
  7. Efficiently Manage Variables with KEEP and DROP Statements
    To enhance performance and reduce dataset sizes, selectively keep or drop variables during your data steps. This practice is especially useful when working with large datasets where memory efficiency is crucial.

        data smaller_set;
           set large_set(keep=Name Department Salary);
        run;
        
  8. Concatenate Datasets Seamlessly with PROC APPEND
    When you need to combine datasets, PROC APPEND is often more efficient than using multiple data steps. It appends one dataset to another without re-reading the original data, making it ideal for large datasets.

        proc append base=master_data data=new_data;
        run;
        
  9. Automate Repetitive Tasks with Macro Programming
    Macro programming can dramatically reduce the amount of repetitive code in your SAS programs. By creating macros for commonly used processes, you can maintain consistency and save time, especially when working with similar tasks across multiple datasets.

        %macro process_data(year);
           data processed_&year;
              set raw_data_&year;
              /* Processing steps */
           run;
        %mend process_data;
    
        %process_data(2023);
        %process_data(2024);
        
  10. Debug Efficiently Using SAS OPTIONS
    Debugging is an essential part of the development process. SAS provides several system options like OPTIONS MPRINT;, OPTIONS SYMBOLGEN;, and OPTIONS MLOGIC; that allow you to trace the execution of your code, resolve errors, and understand the values of macro variables.

        options mprint symbolgen mlogic;
        

Tuesday, April 28, 2015

Separating Unique and Duplicate Observations Using PROC SORT in SAS 9.3 and Newer Versions

Today, I stumbled upon a post where the author talks about a new options that are available in SAS 9.3 and later versions. These options (NOUNIQUEKEYS and UNIQUEOUT)  that allows sorting and then finding the duplicate records to be done in one step using PROC SORT.

Direct Link: 

Separating Unique and Duplicate Observations Using PROC SORT in SAS 9.3 and Newer Versions

Christopher J. Bost published a paper in SAS Global Forum 2013 regarding the same option.


Thursday, November 20, 2014

FDA's Official List of Validation Rules for SDTM compliance

Yesterday, FDA published its first official list of validation rules for CDISC SDTM. These long awaited rules cover both conformance and quality requirements, as described in the FDA Study Data Technical Conformance Guide. Conformance validation rules help ensure that the data conform to the standards, while quality checks help ensure the data support meaningful analysis.

For Official list of rules, here is the direct link for the FDA website: http://www.fda.gov/forindustry/datastandards/studydatastandards/default.htm

The FDA is asking sponsors to validate their study data before submission using these published validation rules and either correct any validation issues or explain, why they could not be corrected, in the Study Data Reviewer's Guide. This recommended pre-submission validation step is intended to minimize the presence of validation issues at the time of submission.

Open CDISC is offering a webinar on the official list of validation rules. They are offering 2 sessions, 1 is on Tuesday the December 2 and second one is on Wed, Dec 3.


FDA Official Validation Rules for Submission Data

Wednesday, December 3 at 2:00 PM EST


Click on the above links to register for the webinar.

Best,

Sarath

Thursday, November 14, 2013

How to avoid data set merging problems when common BY variable has different lengths?

When merging 2 datasets with a common by-variable and when that common variable has different variable length, the merge process produce unexpected results.  If you use SAS 9.2 version like me, then SAS Data step will inform you with the following warning:

WARNING: Multiple lengths were specified for the BY variable ****** by input data sets. This may cause unexpected results.

It is good that at least starting SAS 9.2 version, data step issues a Warning message to inform the programmer. But if you use before versions, it is difficult to notice this potential disaster.  

When you see this WARNING message in the SAS log, we might be inclined to ignore this warning because we think this is just a WARNING never realizing the potential danger. When you see this message in the LOG we should be thinking about this instead of avoiding because SAS will do exactly what it states: it may cause unexpected results. In some cases merge won’t even happen between datasets and sometimes the partial merge between the datasets.

Let’s look at the following example.
data table1;
          length visit $13; * LENGTH IS 13;
          visit = "CYCLE 1 DAY 1";
          visitnum = 1;
run;

data table2;
          length visit $14; * LENGTH IS 14;
          visit = "CYCLE 1 DAY 10";
          visitnum = 3;
run;

proc sort data=table1;      by visit;run;
proc sort data=table2;      by visit;run;

TABLE 1;
VISIT
VISITNUM
CYCLE 1 DAY 1
1

TABLE 2;
VISIT
VISITNUM
CYCLE 1 DAY 10
3

*Merge 2 datasets together with common BY variable with different lengths;
data table12;
          merge table1 table2;
          by visit;
run;

*Here is the LOG FILE;

2714
2715  data table1_2;
2716            merge table1 table2;
2717            by visit;
2718  run;

WARNING: Multiple lengths were specified for the BY variable VISIT by input data sets. This may cause unexpected results.
NOTE: There were 1 observations read from the data set WORK.TABLE1.
NOTE: There were 1 observations read from the data set WORK.TABLE2.
NOTE: The data set WORK.TABLE1_2 has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

As a result of different lengths, SAS adds only one record to the output dataset rather than 2.

*WRONG OUTPUT CREATED;
VISIT
VISITNUM
CYCLE 1 DAY 1
3

*CORRECT OUTPUT SHOULD BE;
VISIT
VISITNUM
CYCLE 1 DAY 1
1
CYCLE 1 DAY 10
3


*To avoid this potential disaster, I’ve used Proc SQL and then created a macro variable with the maximum length of Visit variable in all the datasets in the WORK directory.;

proc sql;
          select max(length) into :maxlength
          from sashelp.vcolumn
          where libname='WORK'
          and name="VISIT";
quit;

*Visit length form TABLE1 is 13 and from TABLE2 is 14, so since I know the maximum length I will used that in here;

data table1_2;
          length visit $ &maxlength;
          merge table1 table2;
          by visit;
run;

*THIS RESULTS IN CORRECT OUTPUT;
VISIT
VISITNUM
CYCLE 1 DAY 1
1
CYCLE 1 DAY 10
3


Thursday, July 25, 2013

Basic Differences Between Proc MEANS and Proc SUMMARY

Though Proc Means and Proc Summary are 2 different procedures essentially used to compute descriptive statistics of numeric variables, but there are differences between these two. ( 
1)  By default, Proc MEANS  produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not.  
2) Proc Summary only produces the descriptive statistics for the variables that are specified in the VAR statement, where as Proc Means by default, computes the descriptive statistics of the numeric variables even without the VAR statement.
Here is a post which details the differences:
Excerpt
Proc SUMMARY and Proc MEANS are essentially the same procedure.  Both procedures compute descriptive statistics.  The main difference concerns the default type of output they produce.  Proc MEANS by default produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not.  Inclusion of the print option on the Proc SUMMARY statement will output results to the output window.
The second difference between the two procedures is reflected in the omission of the VAR statement.  When all variables in the data set are character the same output: a simple count of observations, is produced for each procedure.  However, when some variables in the dataset are numeric, Proc MEANS analyses all numeric variables not listed in any of the other statements and produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum). 


Thursday, March 14, 2013

Exploring the Analysis Data Model – ADaM Datasets

Today, I stumbled upon a blog which is interesting and resourceful.  I liked the article so much so want to share with all my friends here.

Here is the direct link for the post to download or to review:
Actual Article:


The Analysis Data Model (ADaM) is a standard released by the Clinical Data Interchange Standards Consortium (CDISC) and has quickly become widely used in the submission of clinical trial information. ADaM has very close ties to another of CDISCs released standards, Study Data Tabulation Model (SDTM).
The main difference between these two CDISC standards is the way in which the data is displayed. SDTM provides a standard for the creation and mapping of collected data from Raw sources, where as ADAM provides a standard for the creation of analysis-ready data, often using SDTM data as the source.
The purpose of the analysis-ready ADaM data is to provide the programmer with a means to create tables, listings and figures with minimal time and effort whilst ensuring a clear level of traceability in the derived values. This is a key factor of ADaM data as there is a need for a clear and unambiguous flow from the study tabulation data to the analysis data which supports the statistical analyses performed in a clinical study.
CDISC state the following key principles for Analysis Datasets:
  • facilitate clear and unambiguous communication and provide a level of traceability 
  • be useable by currently available tools 
  • be linked to machine-readable metadata 
  • be analysis-ready
To perform statistical analysis on a study, data maybe required from many domains, such as labs, adverse events, demographics and subject characteristics. Bringing this data into ADaM datasets and performing any complex derivations required for display endpoints means that no further data manipulation is required to produce statistical outputs.

When creating the ADaM datasets the requirements of the analyses must be taken into consideration. This will ensure the desired numbers of datasets are produced – at the very least; a subject level dataset is required. Some of the data will be duplicated between domains, for example Age and Gender data. This is acceptable as this will aid the output creation or data review.

The naming convention for the datasets will follow “ADxxxx”, where the “xxxx” part will be sponsor-defined - AE for adverse events, LB for Laboratory results for example. The subject-level dataset, which will be discussed later, will be named “ADSL”. For the ADaM variables, the naming conventions should follow the standardized variable names defined in the ADaM Implementation Guide (ADaMIG). Any variables from the SDTM which are used directly in the ADaM dataset should keep the same variable name to avoid confusion. Sponsor-defined variable names will be given to any other analysis variables. Following these conventions will provide clarity for the reviewer.

As previously mentioned, a key requirement for ADaM data is a subject-level analysis dataset. This dataset and its documentation are always required – even if no other data is submitted. The subject-level dataset, or “ADSL” as it is named within ADaM conventions, contains a record for each subject with variables which display key information for subject disposition, demographic, and baseline characteristics. 

Other variables within ADSL will contain planned or actual treatment group information as well key dates and times of the subjects study participation on the study. Not all variables within ADSL may be used directly for analysis but could be used in conjunction with other datasets for display or grouping purposes or possibly included simply as variables of interest for review.

To conclude, the CDISC summary of ADSL is as follows: “The critical variables in ADSL will include those that are either descriptive, known to affect the subject’s response to drug (in terms of either efficacy or safety), used as strata for randomization, or identify the subject or event as belonging to specific subgroups (e.g. population flags). For example, subjects may be randomized after being stratified by age group because it is believed that younger subjects respond differently to the study drug. In this situation, a subject’s age category would be considered a critical variable for a study and included in ADSL.

I hope you guys liked it.

Sarath

Saturday, January 26, 2013

How to use MISSING(), NMISS() and the CMISS() functions



SAS provides several functions to test for missing values but in this post we will focus on MISSING(), CMISS() and NMISS() functions. The NMISS() function is reserved for numeric variables. The MISSING() and CMISS() functions can be used with either character or numeric variables.  The CMISS() and NMISS() functions are designed by SAS to count the number of arguments with missing values whereas the MISSING function checks whether or not a variable is missing. The MISSING(), CMISS(), and  NMISS() functions provide a simple approach to check for missing values and these functions let you write few lines of code by avoiding large if-statements when you need to check for missing values in several values at the same time.

MISSING() function is very useful when you need to check any variable has a missing value or not, but not sure if it’s a character or numeric? MISSING function works for either character or numeric variables and it also checks for the special numeric missing values (.A, .B,.C ._ etc)as well. The MISSING() function produces a numeric result (0 or 1) if the data point is present or missing. MISSING(varname) is the same as MISSING(varname)=1.  MISSING(varname)=0 specifies when the data point is present.

The MISSING function is particularly useful if you use special missing values since 'if varname=.' will not identify all missing values in such cases.

NOTE: Missing value is not consistent in SAS as it changes between numeric and character variables. A single period (.) represents the Numeric missing value. A single blank enclosed in single or double quotes (' ' or “  ” ) represents the Character missing value. A single period followed by a single letter or an underscore (ex: .A, .B, .Z, ._) represents Special numeric missing values. Please note that these special missing values available for numeric variables only.

The NMISS() function will count the number of arguments with missing values in the specified list of numeric variables. NMISS() function is very useful if you want to make sure if at least one variable in the list is not missing.

The CMISS() is available with SAS 9.2 and SAS Enterprise Guide 4.3 and is similar to the NMISS() function. The only difference is that it counts the number arguments that are missing for both character and numeric variables.

The NMISS() function returns the number of argument variables which have missing values. NMISS works with multiple numeric values, whereas MISSING works with only one value that can be either numeric or character.

Examples:
* count the number of the variables A, B, and C which have missing values;
count=nmiss(A, B, C);
count=nmiss(of A B C);

* count the number of the variables from Var1 to Var10 which have missing values;
count=nmiss(of var1-var10);


Examples:
x1=nmiss(1,0,.,2,5,.);
2
x2=nmiss(1,0);
0
x3=nmiss(of x1-x2);
0

For more details refer to this page. (USING the CMISS, NMISS and MISSING FUNCTIONS)
For more details regarding the special missing values, please also refer to Special Missing Values in SAS (http://studysas.blogspot.com/2010/04/special-missing-values.html).

References:
1)     Missing values in SAS (http://www.pauldickman.com/teaching/sas/missing.php);
2)     MISSING! - Understanding and Making the Most of Missing Data: SUGI 31: Suzanne M. Humphreys, PRA International, Victoria, BC (Canada).
3)     Special Missing Values in SAS (http://studysas.blogspot.com/2010/04/special-missing-values.html)
4)     Usage Note 36480 KNOWLEDGE BASE / SAMPLES & SAS NOTES from support.sas.com
5)     SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition.
6)     Carpenter's Guide to Innovative SAS Techniques, Art Carpenter (Page:99)