Before using a particular step to remove the duplicate observations, we should understand that the duplicate records present are pertaining to the key variables like usubjid, treatment, patientno. etc which are unique or exact duplicates (duplicates with respect to all the variables in the dataset).
If the observations are exact duplicates with respect to all the variables in the dataset, we can remove the exact duplicates by:
Using the noduprecs option in the PROC SORT with a by _all_ statement:
proc sort data=dsn noduprecs;
NODUPRECS compares all the variables in the data set and delete exact duplicates.
PROC SQL approach:
Proc SQL noprint;
create table unique as select distinct (*) from dsn;
Adding Asterisk means that we are telling SAS to identify distinct/unique observations with respect to all variables in the proposed dataset.
If the observations arenʼt the exact duplicates but they are duplicates with respect to some of the key variables in the dataset (ex: usubjid, studyid, patientid, visit etc) then we can remove the duplicates by using a:
PROC SQL approach:
proc SQL noprint;
create table unique as select distinct (usubjid) from dsn;
by considering usubjid as the unique variable, we are asking SAS to give us the one observation for each unique usubjid.
The same can be done by another approach i.e use proc sort:
Proc sort data=dsn nodupkey;
NODUPKEY compares only the variables in the data set and delete the duplicate observations pertaining to key variables.
PROC FREQ approach:
Proc freq data=dsn noprint;
tables usubjid/out=unique (keep=usubjid count where=(count=1));
Noprint option is required because we donʼt want the procedure to print all the unique observations. We just want a dataset with all the unique observations.
Using Datastep approach:
This code keeps only unique observations.
proc sort data=dsn out=temp;
if not first.usubjid and last.usubjid;
If not first.usubjid and last.usubjid , SAS will check the number of observations for each usubjid (key variable) and if any usubjid has any duplicates then SAS will not include them in the output dataset(unique);
This will.....delete all the duplicate observations... I mean to say... it keeps only first (one)observation for each usubjid variable....
Note: unique and nodups datasets aren't the same.....
*When you want to know how many of them are duplicate observations in the dataset use the following code;
select count(1) as dupobs from (select * from final.aes group by patno having count(patno)>1);
Open the output folder to see...