Friday, March 27, 2009

Proc Sort NODUP vs NODUPKEY

The SORT Procedure (Proc Sort): Options




We can do many things using PROC SORT like

create a new data setsubset the data rename/ drop/ keep variablesformat, or label your variables etc



Options Available with Proc Sort:

  • OUT= OPTION
  • DESCENDING OPTION
  • DROP=, KEEP=, AND RENAME= OPTIONS
  •  FORMAT AND LABEL STATEMENTS
  • WHERE= OPTION OR WHERE STATEMENT
  • FIRSTOBS= AND OBS= OPTIONS
  • NODUPRECS AND NODUPKEY OPTIONS
  • DUPOUT


A common interview question for SAS jobs is "What is the difference between proc sort nodup and proc sort nodupkey?". The answer the interviewer is expecting is usually "proc sort nodup gets rid of duplicate records with the same sort key but proc sort nodupkey gets rid of other records with the same sort key". However, this is not correct.




Common Programming Mistake with Proc Sort NODUPRECS -


Equivalent of NODUPKEY in PROC SQL

Ian Whitlock Explains...

NODUPKEY is like FIRST. processing. Both depend on order which is an alien

concept to SQL. SQL depends on information stored as data, not in variable
names and not in order.

If for example you had the variables GROUP and SEQ where SEQ is a sequence
number within group, then you could use a GROUP BY GROUP clause with HAVING
SEQ=1. In short there must be something in the data values that indicates
which records you want.

The suggestion to use DISTINCT works when all records having the same value
of GROUP have all other relevant variables with equal values. This would be
equivalent to the NODUP option in PROC SORT.







0 comments:

Post a Comment

ShareThis