Tuesday, February 3, 2009

How to remove the duplicate observations from the dataset using PROC SQL, DATASTEP/PROC SQL/or PROC SORT approaches ?

How to remove the duplicate observations in the dataset using PROC SQL, DATASTEP/PROC SQL/or PROC SORT etc?

Before using a particular step to remove the duplicate observations, we should understand that the duplicate records present are pertaining to the key variables like usubjid, treatment, patientno. etc which are unique or exact duplicates (duplicates with respect to all the variables in the dataset).


If the observations are exact duplicates with respect to all the variables in the dataset, we can remove the exact duplicates by:

Using the noduprecs option in the PROC SORT with a by _all_ statement:

proc sort data=dsn noduprecs;
by _all_;
run;

NODUPRECS compares all the variables in the data set and delete exact duplicates.

PROC SQL approach:

Proc SQL noprint;
create table unique as select distinct (*) from dsn;
quit;


Adding Asterisk means that we are telling SAS to identify distinct/unique observations with respect to all variables in the proposed dataset.

If the observations arenʼt the exact duplicates but they are duplicates with respect to some of the key variables in the dataset (ex: usubjid, studyid, patientid, visit etc) then we can remove the duplicates by using a:

PROC SQL approach:

proc SQL noprint;
create table unique as select distinct (usubjid) from dsn;
quit;


by considering usubjid as the unique variable, we are asking SAS to give us the one observation for each unique usubjid.

The same can be done by another approach i.e use proc sort:

Proc sort data=dsn nodupkey;
by usubjid;
run;


NODUPKEY compares only the variables in the data set and delete the duplicate observations pertaining to key variables.

PROC FREQ approach:

Proc freq data=dsn noprint;
tables usubjid/out=unique (keep=usubjid count where=(count=1));
run;


Noprint option is required because we donʼt want the procedure to print all the unique observations. We just want a dataset with all the unique observations.

Using Datastep approach:
This code keeps only unique observations.

proc sort data=dsn out=temp;
by usubjid;
run;
data unique;
set temp;
by usubjid;
if not first.usubjid and last.usubjid;
run;

If not first.usubjid and last.usubjid , SAS will check the number of observations for each usubjid (key variable) and if any usubjid has any duplicates then SAS will not include them in the output dataset(unique);

data nodups;
set temp;
by usubjid;
if first.usubjid;
run;

This will.....delete all the duplicate observations... I mean to say... it keeps only first (one)observation for each usubjid variable....

Note: unique and nodups datasets aren't the same.....

*When you want to know how many of them are duplicate observations in the dataset use the following code;

proc sql;

select count(1) as dupobs from (select * from final.aes group by patno having count(patno)>1);
quit;

Open the output folder to see...




7 comments:

Xenurb said...

Hi Sarath,

what is the best way to do update/insert with large data?

Thanks

Anonymous said...

Hi Sarath,
I think the solution to remove duplicates using SAS data step will not work because
: if not first.usubjid and last.usubjid;
will actually give you duplicates
My answer is use : if first.usubjid this condition will give you only the unique observations.

RG said...

If dataset is sorted then data step will be the fastest otherwise go by proc sort.

Amrit said...

Proc SQL noprint;
create table unique as select distinct (*) from dsn;
quit;

should be as follows
Proc SQL noprint;
create table unique as select distinct * from dsn;
quit;

Nicole C said...

Hello all,
This has been really helpful! I had a related question. What if you have a single file that data was double-entered into (there are 2 rows for each individual) and you want to separate it into 2 new files, each with one copy of each individual? Basically, I want to dedup on a single variable ID, but keep the dups in a separate file instead of just deleting them. Any suggestions?

sarath said...

Hi Nicole, Use DUPOUT option to create a new dataset with only duplicate records.

proc sort data=hasdups nodupkey dupout=dupsonly;
by vars;
run;

Post a Comment

ShareThis