How to avoid data set merging problems when common BY variable has different lengths?
When merging 2 datasets with a common by-variable and when
that common variable has different variable length, the merge process produce
unexpected results. If you use SAS 9.2
version like me, then SAS Data step will inform you with the following warning:
WARNING: Multiple lengths were specified for the BY
variable ****** by input data sets. This may cause unexpected results.
It is good that at least starting SAS 9.2 version, data step
issues a Warning message to inform the programmer. But if you use before
versions, it is difficult to notice this potential disaster.
When you see this WARNING message in the SAS log, we might
be inclined to ignore this warning because we think this is just a WARNING
never realizing the potential danger. When you see this message in the LOG we
should be thinking about this instead of avoiding because SAS will do exactly what
it states: it may cause unexpected
results. In some cases merge won’t even happen between datasets and sometimes
the partial merge between the datasets.
Let’s look at the following example.
data
table1;
length
visit $13; * LENGTH IS 13;
visit = "CYCLE
1 DAY 1";
visitnum = 1;
run;
data
table2;
length
visit $14; * LENGTH IS 14;
visit = "CYCLE
1 DAY 10";
visitnum = 3;
run;
proc
sort data=table1; by
visit;run;
proc
sort data=table2; by
visit;run;
TABLE 1;
VISIT
|
VISITNUM
|
CYCLE 1 DAY 1
|
1
|
TABLE 2;
VISIT
|
VISITNUM
|
CYCLE 1 DAY 10
|
3
|
*Merge 2
datasets together with common BY variable with different lengths;
data
table12;
merge
table1 table2;
by
visit;
run;
*Here is the LOG FILE;
2714
2715 data table1_2;
2716 merge table1 table2;
2717 by visit;
2718 run;
WARNING: Multiple lengths were specified for the
BY variable VISIT by input data sets. This may cause unexpected results.
NOTE:
There were 1 observations read from the data set WORK.TABLE1.
NOTE:
There were 1 observations read from the data set WORK.TABLE2.
NOTE:
The data set WORK.TABLE1_2 has 1 observations and 2 variables.
NOTE:
DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
As a
result of different lengths, SAS adds only one record to the output dataset
rather than 2.
*WRONG OUTPUT CREATED;
VISIT
|
VISITNUM
|
CYCLE 1 DAY 1
|
3
|
*CORRECT OUTPUT SHOULD BE;
VISIT
|
VISITNUM
|
CYCLE 1 DAY 1
|
1
|
CYCLE 1 DAY 10
|
3
|
*To avoid this
potential disaster, I’ve used Proc SQL and then created a macro variable with
the maximum length of Visit variable in all the datasets in the WORK directory.;
proc
sql;
select
max(length) into :maxlength
from
sashelp.vcolumn
where
libname='WORK'
and
name="VISIT";
quit;
*Visit length form
TABLE1 is 13 and from TABLE2 is 14, so since I know the maximum length I will
used that in here;
data
table1_2;
length
visit $ &maxlength;
merge
table1 table2;
by
visit;
run;
*THIS RESULTS IN CORRECT OUTPUT;
VISIT
|
VISITNUM
|
CYCLE 1 DAY 1
|
1
|
CYCLE 1 DAY 10
|
3
|