Tuesday, February 3, 2009

Mastering Duplicates Removal in SAS: A Comprehensive Guide to Using PROC SQL, DATA STEP, and PROC SORT

Removing Duplicate Observations in SAS: A Comprehensive Guide

Removing Duplicate Observations in SAS: A Comprehensive Guide

In data analysis, it's common to encounter datasets with duplicate records that need to be cleaned up. SAS offers several methods to remove these duplicates, each with its strengths and suitable scenarios. This article explores three primary methods for removing duplicate observations: using PROC SQL, the DATA STEP, and PROC SORT. We will provide detailed examples and discuss when to use each method.

Understanding Duplicate Observations

Before diving into the methods, let's clarify what we mean by duplicate observations. Duplicates can occur in different forms:

  • Exact Duplicates: All variables across two or more observations have identical values.
  • Key-Based Duplicates: Observations are considered duplicates based on the values of specific key variables (e.g., ID, Date).

The method you choose to remove duplicates depends on whether you are dealing with exact duplicates or key-based duplicates.

Approach 1: Removing Duplicates with PROC SQL

PROC SQL is a versatile tool in SAS, allowing you to execute SQL queries to manipulate and analyze data. When removing duplicates, you can use the SELECT DISTINCT statement or apply more complex conditions.

Example 1: Removing Exact Duplicates

proc sql;
    create table no_duplicates as
    select distinct *
    from original_data;
quit;

This code removes all exact duplicates, creating a new dataset no_duplicates that contains only unique records. The SELECT DISTINCT * statement ensures that every unique combination of variable values is retained only once.

Example 2: Removing Duplicates Based on Key Variables

proc sql;
    create table no_duplicates as
    select distinct ID, Name, Age
    from original_data;
quit;

Here, duplicates are removed based on the combination of the ID, Name, and Age variables. This is useful when you want to keep unique records for specific key variables, ignoring other variables in the dataset.

Advantages of PROC SQL:

  • Flexibility: PROC SQL can handle complex queries, allowing you to remove duplicates based on multiple or complex criteria.
  • Powerful Filtering: SQL allows you to apply conditions and filters easily, making it easier to control the exact duplicates you want to remove.

Disadvantages of PROC SQL:

  • Performance: The SELECT DISTINCT statement can be slower with very large datasets, as it requires scanning the entire dataset to identify unique records.
  • Complexity: SQL syntax may be less intuitive for those who are more comfortable with traditional SAS programming.

Approach 2: Removing Duplicates with the DATA STEP

The DATA STEP in SAS provides a programmatic approach to removing duplicates, giving you fine-grained control over the process. This method typically involves sorting the dataset first and then using conditional logic to remove duplicates.

Example 1: Removing Exact Duplicates

To remove exact duplicates, you must first sort the data by all variables and then use the DATA STEP to retain only the first occurrence of each observation.

proc sort data=original_data noduprecs out=sorted_data;
    by _all_;
run;

data no_duplicates;
    set sorted_data;
run;

The noduprecs option in PROC SORT removes exact duplicate records. The sorted and deduplicated dataset is then saved as no_duplicates.

Example 2: Removing Duplicates Based on Key Variables

If you want to remove duplicates based on specific key variables, you can sort the data by those variables and use the first. or last. functions in the DATA STEP to control which duplicates are kept.

proc sort data=original_data;
    by ID;
run;

data no_duplicates;
    set original_data;
    by ID;
    if first.ID;
run;

In this example, the dataset is first sorted by the ID variable. The first.ID statement ensures that only the first occurrence of each ID is kept, removing any subsequent duplicates.

Advantages of the DATA STEP:

  • Fine-Grained Control: The DATA STEP allows you to apply custom logic to the deduplication process, such as retaining the first or last occurrence based on additional criteria.
  • Efficiency: When dealing with large datasets, this method can be more efficient, especially if you need to apply complex logic.

Disadvantages of the DATA STEP:

  • Manual Sorting: You need to sort the data before removing duplicates, adding an extra step to the process.
  • Complexity: The logic required to remove duplicates can be more complex and less intuitive than using PROC SORT.

Approach 3: Removing Duplicates with PROC SORT

PROC SORT is one of the simplest and most commonly used methods for removing duplicates in SAS. This approach sorts the data and can automatically remove duplicates during the sorting process.

Example 1: Removing Exact Duplicates

proc sort data=original_data noduprecs out=no_duplicates;
    by _all_;
run;

Here, PROC SORT with the noduprecs option removes exact duplicates. The by _all_ statement ensures that the sort is applied to all variables, making the deduplication based on the entire record.

Example 2: Removing Duplicates Based on Key Variables

proc sort data=original_data nodupkey out=no_duplicates;
    by ID;
run;

In this case, PROC SORT uses the nodupkey option to remove duplicates based on the ID variable. The out= option specifies that the sorted and deduplicated data should be saved to the no_duplicates dataset.

Advantages of PROC SORT:

  • Simplicity: PROC SORT is straightforward and easy to use, requiring minimal code to remove duplicates.
  • Efficiency: PROC SORT is optimized for sorting and deduplication, making it very fast, especially for large datasets.

Disadvantages of PROC SORT:

  • Limited Flexibility: PROC SORT can only remove duplicates based on sorted keys, which might not be suitable for more complex deduplication needs.
  • No Complex Logic: Unlike the DATA STEP, PROC SORT does not allow you to apply custom logic or conditions during the deduplication process.

Comparison Summary

Each method for removing duplicates in SAS has its strengths and weaknesses:

  • Use PROC SQL when you need flexibility and the ability to apply complex conditions for deduplication, especially when working within a SQL-based framework.
  • Use the DATA STEP if you require precise control over the deduplication process and need to apply custom logic to determine which duplicates to keep.
  • Use PROC SORT for its simplicity and efficiency when dealing with large datasets, particularly when you only need to remove duplicates based on simple keys.

Conclusion

Removing duplicates is a crucial step in data cleaning and preparation, and SAS provides multiple tools to accomplish this task. By understanding the differences between PROC SQL, the DATA STEP, and PROC SORT, you can choose the most appropriate method for your specific data processing needs. Whether you need flexibility, control, or efficiency, SAS offers the right approach to ensure your data is clean and ready for analysis.

Monday, February 2, 2009

How to scan more than 20 records to determine variable attributes

Usage Note 1075: How to scan more than 20 records to determine variable attributes in EFI

In Versions 7 and 8 of the SAS System, by default the Import Wizard, PROC IMPORT and the External File Interface (EFI) scan 20 records to determine variable attributes when reading delimited text files.

Changing the default setting can only be done for EFI in Version 7, Release 8 and Release 8.1. Beginning in Release 8.2 changing the default setting is applicable to the Import Wizard, PROC IMPORT and EFI.

Follow the steps below to change the default behavior:

1) Type regedit on the command line (white box with a check mark)

2) When the Registry Editor window opens, double click on the PRODUCTS icon

3) Double click on the BASE icon

4) Double click on the EFI icon

5) In the window on the right the Contents of EFI will be populated with EFI options

6) Double click on GuessingRows

7) When the new window opens with the old value of 20, delete it, enter the new value, and click on OK.

8) Close the Registry Editor window

9) Invoke the Import Wizard, PROC IMPORT or EFI to use the new GuessingRows value

The new value entered for GuessingRows will remain until you change it.

Beginning in SAS 9.1 there is a new statement, GUESSINGROWS=, that can be specified with PROC IMPORT. By specifying the GUESSINGROWS= statement with PROC IMPORT, you do not have to change the GuessingRows value in the SAS Registry. Also beginning in SAS 9.1 you can specify the Number of Rows to Guess when using the Import Wizard in the SAS

Import: Delimited File Options Window or when using EFI in the Options for Import Window.

source:http://support.sas.com/kb/1/075.html


Saturday, January 3, 2009

SAS sample programs

SAS sample code programs:

Macro for sorting the variables:

How to convert character date values into numeric date values using DATASTEP/PROC SQL and ARRAYS:

How to detect missing values using Arrays:

First. & Last. Variables:

How to determine the last observation in the dataset:

How to determine whether a numeric or character value exists within a group of variables:

Lag Function: How to obtain information from the previous observation:

How to create a new dataset from multiple datasets based on the sorted order:

Dynamically generate SET statement to combine multiple datasets:

How to determine which dataset contributed an observation:

How to determine if a variable exists in a dataset or not:

How to Read Delimited Text Files into SAS:

How to keep only even number observations in the dataset:

How to verify the existence of an external file:

Accurately calculating age in one line code:

How to use INDEX/INDEXC functions:

Finding the number of observations in the dataset:

How to capitalize first letter of the every word:/ PROPCASE FUNCTION:

How to use the SCAN function:

Concatenation functions in SAS 9.0:

SOUNDEX Function:

IFC and IFN functions:new IF functions & IF-THEN-ELSE vs SELECT:

How to remove the duplicate observations from the dataset using PROC SQL, DATASTEP/PROC SQL/or PROC SORT approaches ?

How can I count number of observations per subject in a data set?

How to save the log file or what is PROC PRINTTO procedure

How to calculate number of years and number of days between 2 dates;

How to customize page numbers in RTF output

How to create a comma separated file (.csv) of a SAS dataset?

Change all missing values of all variables into zeros/putting zeros in place of missing values for variables

Friday, January 2, 2009

BASE SAS CERTIFICATION SUMMARY FUNCTIONS

Certification Summary---Functions:
sourec:http://wiki.binghamton.edu/index.php/Certification_Summary---Functions

SAS Functions can be used to convert data and to manipulate the values of character variables. Functions are written by specifying the function name, then it's arguments in parentheses. Arguments can include variables, constants, or expressions. Although arguments are typically separated by commas, they can also be specified as variable lists or arrays.

Contents[hide]

1 YEAR, QTR, MONTH and DAY Functions

2 WEEKDAY Function

3 MDY Function

4 DATE and TODAY Function

5 SUBSTR Function

6 INDEX Function

7 UPCASE Function

8 LOWCASE Function

9 INT Function

10 ROUND Function

11 TRIM Function

12 Some Sample Certification Examples

13 Points to remember


SAS Tip: How to round the numbers:

It's often a requirement to round the values of one variable to a precision defined in a different variable - either another data set variable, or a macro variable. The easiest way to achieve this is:


cooked = round(raw, 10**-dp);

In this example, the required number of decimal places is set in the dataset variable dp. The ** operator is used to convert from a number of decimal places to a suitable rounding unit for the round() function: for example, when dp is set to 3, 10**-dp becomes 10**-3, which is 10-3 or 0.001.

SAS sample programs

Reading/Writing Files
Making a fixed format file
Making a SAS Cport file
Reading a SAS Cport file
Reading multiple raw data files, Version 8
Reading multiple raw data files (version 6.x)
Using a SAS macro to "set" multiple files

Other
Imputing the median
Checking for duplicate Ids
Macro to compute a rolling standard deviation
Changing the length of a character variable
Replacing strings
Concatenating string variables using CAT functions
Simple macro to do repeated procs
Eliminate useless variables
Matching husbands and wives
Creating a wide table from a long dataset using PROC TABULATE
How can I "fill down" a variable?
Creating a long list of variable names based on an abbreviated one
Filling in Missing Values Over Time
Dummy Coding a Categorical Variable Using a Macro Program
A few SAS macro programs for renaming variables dynamically

source: http://oregonstate.edu/dept/statistics/sasclass/examples.htm

Creating SAS datasets
Read a SAS dataset
Create a SAS dataset from raw data
List input
Column input
Formated input
Mixed input
Include data in program
Create permanent SAS dataset Working with SAS datasets,
Assignment statements
Functions
IF-THEN-ELSE, SELECT and LABEL statements
Subsetting data
DO-WHILE loop
DO-UNTIL loop Procedures
PROC PRINT and PROC SORT
PROC MEANS, PROC SORT and PROC PRINT
PROC FREQ
PROC UNIVARIATE
PROC REG
PROC GLM
PROC GPLOT


source: http://www.stattutorials.com/SASDATA/

o SAS Program Files
o PROCMEANS1.SAS
o PROCMEANS2.SAS
o PROCMEANS3.SAS
o PROCMEANS4.SAS
o PROCMEANS5.SAS (output)
o PROCUNI.SAS (univariate)
o PROCUNI2.SAS (advanced univariate)
o PROCUNI3.SAS (advanced univariate)
o PROCFREQ1.SAS (Frequency table)
o PROCFREQ2.SAS (Data from summarize counts)
o PROCFREQ3.SAS (Goodness of fit)
o PROCFREQ4.SAS (two-way table)
o PROCFREQ5.SAS (2x2 from summary data)
o PROCCORR1.SAS (correlation)
o PROCCORR2.SAS (matrix of scatterplots)
o PROCTTEST1.SAS (two-group t-test)
o PROCTTEST2.SAS (paired t-test)
o PROCANOVA1.SAS (One Way ANOVA, also PROC GLM)
o PROCGLM2.SAS (Repeated Measures ANOVA)
o PROCGLM2a.SAS (Repeated Measures ANOVA)
o PROC-LIFE-1.SAS (Survival Analysis PROC LIFETEST)
o BLAND-ALTMAN.SAS (Bland-Altman Analysis)

o ODS Examples
o ODS1.SAS (Example output without ODS)
o ODS2.SAS (Simple ODS invocation)
o ODS3.SAS (ODS using Science Style (RTF))
o ODS3A.SAS (Same ODS to HTML)
o ODS3B.SAS (Same to PDF)
o ODS4.SAS (t-Test ODS output)
o ODS5.SAS (Simple GCHART/ No ODS)
o ODS5A.SAS (Drill-down bar chart example)
o ODS6.SAS (2x2 Crosstab)
o ODS6A.SAS (2x2 Crosstab with TRACE)
o ODS6B.SAS (2x2 Crosstab/Selected tables)
o ODS7.SAS (Scatterlot matrix/Correlations)
o ODS8.SAS (Regression with ODS graphics output)
o ODS9.SAS (GLM with graphics output/boxplots)
o ODS10.SAS (Discover output names of components)
o ODS10A.SAS (Output ODS data to file)
o ODS10B.SAS (Merge and use ODS output data)
o SASLibrary.pdf (How to create a SAS Library -- required for some of the examples)


o SAS Data Files
o SOMEDATA.SAS7BDAT
o SBPDATA.SAS7BDAT
o LIFE.SAS7BDAT (for LIFETEST)

Here is the lsit  and brief description of available projects. Everyone should do the first 4 projects.


Project 1 An introduction to the SAS operating environment.

Project 2 The basic SAS data step with input of data directly through the cards statement; use of labels, the sort procedure and print procedure; the means procedure.

Project 3 Reading data from ASCII files; computing new variables in the data step; the means procedure.

Project 4 Modifying existing SAS data sets using set; using loops in the data step; the ttest procedure.

Project 5 Column-wise input; analysis of categorical data using chi-square tests.

Project 6 Updating existing SAS data sets with new data.

Project 7 Basics of presentation quality graphics with proc gplot and proc g3d.

Project 8 Basic one factor analysis of variance using proc GLM.

Project 9 Advanced analysis of variance, custom hypothesis tests, and other features of proc GLM.

Project 10 Multivariate analysis of variance using proc GLM.

Project 11 Basic Box-Jenkins modeling of univariate time series analysis using proc arima (time domain).

Project 12 Some aspects of frequency domain analysis of time series using proc spectra.

Project 13 Discriminant analysis with proc discrim.

Project 14 Reading data from dBase and DIF files; using dBase and DIF files instead of actual SAS datasets.

Project 15 Using arrays, first and last, and processing dates. Repeated measures analysis.

Source: http://javeeh.net

macro for sorting the the datasets

MACRO FOR SORTING:

Rather than using the Proc Sort procedure all the time..... you can just use the following macro....

and call it when you req... to sort any SAS dataset.....

EXAMPLE1:
%macro srt(dtn,keyvar);
proc sort data=&dtn;
by &keyvar;
run;
%mend srt;


%srt(ie,PT IEORRES);

*the above step will tell SAS to sort the IE dataset with the by variables PT and IEORRES respectively.


EXAMPLE2:
*Sometimes we need to create an output dataset during the sorting process i.e in the Proc sort step in such a case use the below macro to do the same;

%MACRO SORT(IN=,VAR=,OUT=);
PROC SORT DATA=&IN OUT=&OUT;
by &VAR;
RUN;
%MEND SORT;

%SORT(IN=CEC1,VAR=PT,OUT=CEC2);
%SORT(IN=DERIVE.HEADER,VAR=PT,OUT=HEADER1);


EXAMPLE3:

*Sometimes we need to use the NODUPKEY option to delete the duplicate observations in the dataset in such a case use the below macro to do the same;

%MACRO SORT(IN,VAR,OUT,OPTN);

PROC SORT DATA=&IN OUT=&OUT &OPTN ;

by &VAR;

RUN;

%MEND SORT;

%SORT(IN=AE, AE1,USIBJID AEBODYSYS AESEV, NODUPKEY);







How to convert numeric date values into character and from character date values into numeric using DATASTEP, PROC SQL and ARRAYS

1) Converting character date values into numeric:

/*I) Using the DATASTEP:*/

1)
Data dates; input cdate $9.; cards;
16-apr-07
01-01-07
02-jun-07
13-sep-07
;
run;

Data Convert; set dates;
Date = input( cdate , ANYDTDTE9.); format date date7.;
run;


2)
Data dates;
input cdate $9.; cards;
16-apr-2007
01-01-2007
02-jun-2007
13-sep-2007
;
run;

Data Convert;
set dates;
Date = input( cdate , ANYDTDTE11.);
format date date9.;
run;

*II) Using Proc SQL;
*Numeric date variable can be converted to character date variable by using the PUT function within PROC SQL.;

proc sql;
create table date_char as
select PUT(date,date9.)as ndate
from date_num; quit;

*Character date variable can be converted to numeric date variable by using the INPUT function within PROC SQL.;
Proc sql;
create table date_num as
select INPUT (date,mmddyy10.)as ndate
format=mmddyy10.
from date_char;
quit;
Or

Proc sql;
create table date_num as
select INPUT (date,date9.)as ndate format=date9.
from date_char;
quit;
III) Using Arrays;

*using arrays to convert character date variables aestdtc and aeendtc into numeric variables:

data c_date;
set date;
array cha_date { 2 } $ 10 aeendtc aestdtc; array num_date { 2 } aeendt aestdt;
do i = 1 to 2;
num_date{ I } = input(cha_date{ i }, anydtdte10.);
end;
drop
aeendtc aestdtc;
run;


*Using Arrays converting variables from numeric to character;
data new; set old (rename=(b1=a1 b2=a2 b3=a3 b4=a4 b5=a5));
array aa $ b1 b2 b3 b4 b5;
array bb a1 a2 a3 a4 a5; do I=1 to dim(aa);
aa (i)= put(bb(i),8.0);
end;
drop I a1 a2 a3 a4 a5 a6; run;

*Using Arrays converting variables from character to numeric;
*By using the INPUT instead of PUT we can convert character variables into numeric;

data new ;
set old (rename=(b1=a1 b2=a2 b3=a3 b4=a4 b5=a5));
array aa b1 b2 b3 b4 b5;
array bb $ a1 a2 a3 a4 a5;
do I=1 to dim (aa);
aa (i)=input(bb(i),8.0);
end;
drop I a1 a2 a3 a4 a5 a6;
run;


Note: Dont forget to use $ sign inside the array statement when converting the vars from Num to char or Char to num.