Search This Blog

Loading...

Thursday, November 14, 2013

How to avoid data set merging problems when common BY variable has different lengths?

When merging 2 datasets with a common by-variable and when that common variable has different variable length, the merge process produce unexpected results.  If you use SAS 9.2 version like me, then SAS Data step will inform you with the following warning:

WARNING: Multiple lengths were specified for the BY variable ****** by input data sets. This may cause unexpected results.

It is good that at least starting SAS 9.2 version, data step issues a Warning message to inform the programmer. But if you use before versions, it is difficult to notice this potential disaster.  

When you see this WARNING message in the SAS log, we might be inclined to ignore this warning because we think this is just a WARNING never realizing the potential danger. When you see this message in the LOG we should be thinking about this instead of avoiding because SAS will do exactly what it states: it may cause unexpected results. In some cases merge won’t even happen between datasets and sometimes the partial merge between the datasets.

Let’s look at the following example.
data table1;
          length visit $13; * LENGTH IS 13;
          visit = "CYCLE 1 DAY 1";
          visitnum = 1;
run;

data table2;
          length visit $14; * LENGTH IS 14;
          visit = "CYCLE 1 DAY 10";
          visitnum = 3;
run;

proc sort data=table1;      by visit;run;
proc sort data=table2;      by visit;run;

TABLE 1;
VISIT
VISITNUM
CYCLE 1 DAY 1
1

TABLE 2;
VISIT
VISITNUM
CYCLE 1 DAY 10
3

*Merge 2 datasets together with common BY variable with different lengths;
data table12;
          merge table1 table2;
          by visit;
run;

*Here is the LOG FILE;

2714
2715  data table1_2;
2716            merge table1 table2;
2717            by visit;
2718  run;

WARNING: Multiple lengths were specified for the BY variable VISIT by input data sets. This may cause unexpected results.
NOTE: There were 1 observations read from the data set WORK.TABLE1.
NOTE: There were 1 observations read from the data set WORK.TABLE2.
NOTE: The data set WORK.TABLE1_2 has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

As a result of different lengths, SAS adds only one record to the output dataset rather than 2.

*WRONG OUTPUT CREATED;
VISIT
VISITNUM
CYCLE 1 DAY 1
3

*CORRECT OUTPUT SHOULD BE;
VISIT
VISITNUM
CYCLE 1 DAY 1
1
CYCLE 1 DAY 10
3


*To avoid this potential disaster, I’ve used Proc SQL and then created a macro variable with the maximum length of Visit variable in all the datasets in the WORK directory.;

proc sql;
          select max(length) into :maxlength
          from sashelp.vcolumn
          where libname='WORK'
          and name="VISIT";
quit;

*Visit length form TABLE1 is 13 and from TABLE2 is 14, so since I know the maximum length I will used that in here;

data table1_2;
          length visit $ &maxlength;
          merge table1 table2;
          by visit;
run;

*THIS RESULTS IN CORRECT OUTPUT;
VISIT
VISITNUM
CYCLE 1 DAY 1
1
CYCLE 1 DAY 10
3


continue reading "How to avoid data set merging problems when common BY variable has different lengths?"

Thursday, July 25, 2013

Basic Differences Between Proc MEANS and Proc SUMMARY

Though Proc Means and Proc Summary are 2 different procedures essentially used to compute descriptive statistics of numeric variables, but there are differences between these two. ( 
1)  By default, Proc MEANS  produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not.  
2) Proc Summary only produces the descriptive statistics for the variables that are specified in the VAR statement, where as Proc Means by default, computes the descriptive statistics of the numeric variables even without the VAR statement.
Here is a post which details the differences:
Excerpt
Proc SUMMARY and Proc MEANS are essentially the same procedure.  Both procedures compute descriptive statistics.  The main difference concerns the default type of output they produce.  Proc MEANS by default produces printed output in the LISTING window or other open destination whereas Proc SUMMARY does not.  Inclusion of the print option on the Proc SUMMARY statement will output results to the output window.
The second difference between the two procedures is reflected in the omission of the VAR statement.  When all variables in the data set are character the same output: a simple count of observations, is produced for each procedure.  However, when some variables in the dataset are numeric, Proc MEANS analyses all numeric variables not listed in any of the other statements and produces default statistics for these variables (N, Mean, Standard Deviation, Minimum and Maximum). 


continue reading "Basic Differences Between Proc MEANS and Proc SUMMARY"
ShareThis