Monday, February 9, 2009
Options VALIDVARNAME=UPCASE
VALIDVARNAME= option is generally used in SAS whenever we want to control the SAS variable names in the dataset.
VALIDVARNAME= V7 UPCASE ANY
The default option will be VALIDVARNAME=V7 until we specify as UPPERCASE or ANY.
When we mention options VALIDVARNAME=V7, that means we are telling SAS to change the name of the Database column (etc EXCEL sheet column) to valid SAS name with certain rules keeping in mind.
Here are those rules that SAS needs to follow, when it changes the DBMS column name to valid SAS name.
Only 32 mixed case (lower or uppercase) characters are allowed in each variable.
Names should be starting with an underscore or an alphabet (either uppercase or lower case character).
Invalid characters in the DBMS column (ex. $) should be changed to underscores.
See the SAS Language Reference: Dictionary to get more details about the rules.
VALIDVARNAME=UPCASE
When we mention options VALIDVARNAME=UPPERCASE we are telling SAS to change the column name of the Database column to uppercase variables irrespective of type of variables in the DBMS column.
And whenever we want the same kind of characters in SAS dataset which are in the DBMS column (ex .(=) sign and the Asterisk(*) or the forward slash(\) we have to mention options
VALIDVARNAME=ANY
If we do, this will allows any characters which are in the DBMS column to be kept in the SAS dataset.
To understand the concept better here I am giving the example:
Example
The following example shows how the Pass-Through Facility works with
VALIDVARNAME=UPPERCASE.
options validvarname=uppercase;
proc sql;
connect to oracle as tables(user=USERID orapw=passward path=’INSTANCE’);
create table lab as
select lab_rslt, lab_test
from connection to oracle
(select "laboratory result$", "laboratory test$"
from DBMStable);
quit;
When we check the Output we observe that the variables in the DBMS column is changed to upper case as well as V7 (default option) converts those variables into UPPERCASE variables. Ex: " laboratory result$" becomes LAB_RSLT and " laboratory test$" becomes LAB_TEST.
Friday, February 6, 2009
How to merge data sets with a common variable?
For example: col1-col10, dsn1-dsn 7 , or data1 to data10 with common variable of ID.
Considering we have 10 datsets and all of them having the same prefix data;
%macro mymerge (n);
data merged;
merge %do i = 1 % to &n;
data&i %end; ; /* this additional ';' is necessary, the first ';' is for the "%end", while the second ';' is for "Merge"*/;
by id;
run;
%mend;
%mymerge(10)
Merging the data sets using macro code
For example: col1-col10 dsn1-dsn 7 data1 to data6 with common variable of Usubjid.
here is the example, I have 7 datasets i need to merge and each of them having the common variable(usubjid) to merge, and all the datasets having the same prefix dsn(dsn1 to dsn7).
%macro allmerge (n);
data combine;
merge
%do i = 1 % to &n;
dsn&i
%end;
; /* this additional ';' is necessary, the first ';' is for
the "%end", while the second ';' is for "Merge" */
by usubjid;
run;
%mend;
%allmerge (7)
Resolving and Using &&var&i Macro Variables
%let n=05;
%let dsn05=Client; %put &dsn&n; *Resolves to study05;
* Second Scan Resolves to Client;
* Second Scan Resolves to &Client05; *with an error message in the log file....;
/*We assume that macro variable resolution process proceeds from right to left...*/
/*Infact it isn't. */
/*This macro variable will not get resolved, because &client5. macro variable isn't there in the symbol table.; */
%let two=three;
%let three=Check;
%put &one;
%put &&one;
%put &&&one;
%put &&&&one;
%put &&&&&one;
%put &&&&&&one;
%put &&&&&&&one;
%put &&&&&&&&one;
%put &&&&&&&&&one;
%put &&&&&&&&&&one;
Answer:
%put &one; two
%put &&one; two
%put &&&one; three
%put &&&&one; two
%put &&&&&one; three
%put &&&&&&one; three
%put &&&&&&&one; Check
%put &&&&&&&&one; two
%put &&&&&&&&&one; three
%put &&&&&&&&&&one; three
Thursday, February 5, 2009
How can I count number of observations per subject in a data set?
By doing some minor modification we can calculate observations count per subject per visit also. (Just include visit variable in the BY variable list in PROC sort and First. variable list in datastep with SET statement).
For example:
data dsn;
input patid implants;
datalines;
1 3
1 1
1 2
1 1
2 1
2 2
3 1
4 2
3 1
4 5
2 3
1 6
;
run;
proc sort data=dsn;
by patid;
run;
data dsn1;
set dsn;
by patid;
cnt+1;
if first.patid then cnt=1;
run;
proc sort data=dsn1;
by patid descending cnt;
run;
data dsn2;
set dsn1;
by patid;
retain totcnt;
if first.patid then totcnt=cnt;
output;
run;
proc print data=dsn2;
run;
Output:
Tuesday, February 3, 2009
Mastering Duplicates Removal in SAS: A Comprehensive Guide to Using PROC SQL, DATA STEP, and PROC SORT
Removing Duplicate Observations in SAS: A Comprehensive Guide
In data analysis, it's common to encounter datasets with duplicate records that need to be cleaned up. SAS offers several methods to remove these duplicates, each with its strengths and suitable scenarios. This article explores three primary methods for removing duplicate observations: using PROC SQL
, the DATA STEP
, and PROC SORT
. We will provide detailed examples and discuss when to use each method.
Understanding Duplicate Observations
Before diving into the methods, let's clarify what we mean by duplicate observations. Duplicates can occur in different forms:
- Exact Duplicates: All variables across two or more observations have identical values.
- Key-Based Duplicates: Observations are considered duplicates based on the values of specific key variables (e.g., ID, Date).
The method you choose to remove duplicates depends on whether you are dealing with exact duplicates or key-based duplicates.
Approach 1: Removing Duplicates with PROC SQL
PROC SQL
is a versatile tool in SAS, allowing you to execute SQL queries to manipulate and analyze data. When removing duplicates, you can use the SELECT DISTINCT
statement or apply more complex conditions.
Example 1: Removing Exact Duplicates
proc sql;
create table no_duplicates as
select distinct *
from original_data;
quit;
This code removes all exact duplicates, creating a new dataset no_duplicates
that contains only unique records. The SELECT DISTINCT *
statement ensures that every unique combination of variable values is retained only once.
Example 2: Removing Duplicates Based on Key Variables
proc sql;
create table no_duplicates as
select distinct ID, Name, Age
from original_data;
quit;
Here, duplicates are removed based on the combination of the ID
, Name
, and Age
variables. This is useful when you want to keep unique records for specific key variables, ignoring other variables in the dataset.
Advantages of PROC SQL:
- Flexibility:
PROC SQL
can handle complex queries, allowing you to remove duplicates based on multiple or complex criteria. - Powerful Filtering: SQL allows you to apply conditions and filters easily, making it easier to control the exact duplicates you want to remove.
Disadvantages of PROC SQL:
- Performance: The
SELECT DISTINCT
statement can be slower with very large datasets, as it requires scanning the entire dataset to identify unique records. - Complexity: SQL syntax may be less intuitive for those who are more comfortable with traditional SAS programming.
Approach 2: Removing Duplicates with the DATA STEP
The DATA STEP
in SAS provides a programmatic approach to removing duplicates, giving you fine-grained control over the process. This method typically involves sorting the dataset first and then using conditional logic to remove duplicates.
Example 1: Removing Exact Duplicates
To remove exact duplicates, you must first sort the data by all variables and then use the DATA STEP
to retain only the first occurrence of each observation.
proc sort data=original_data noduprecs out=sorted_data;
by _all_;
run;
data no_duplicates;
set sorted_data;
run;
The noduprecs
option in PROC SORT
removes exact duplicate records. The sorted and deduplicated dataset is then saved as no_duplicates
.
Example 2: Removing Duplicates Based on Key Variables
If you want to remove duplicates based on specific key variables, you can sort the data by those variables and use the first.
or last.
functions in the DATA STEP
to control which duplicates are kept.
proc sort data=original_data;
by ID;
run;
data no_duplicates;
set original_data;
by ID;
if first.ID;
run;
In this example, the dataset is first sorted by the ID
variable. The first.ID
statement ensures that only the first occurrence of each ID
is kept, removing any subsequent duplicates.
Advantages of the DATA STEP:
- Fine-Grained Control: The
DATA STEP
allows you to apply custom logic to the deduplication process, such as retaining the first or last occurrence based on additional criteria. - Efficiency: When dealing with large datasets, this method can be more efficient, especially if you need to apply complex logic.
Disadvantages of the DATA STEP:
- Manual Sorting: You need to sort the data before removing duplicates, adding an extra step to the process.
- Complexity: The logic required to remove duplicates can be more complex and less intuitive than using
PROC SORT
.
Approach 3: Removing Duplicates with PROC SORT
PROC SORT
is one of the simplest and most commonly used methods for removing duplicates in SAS. This approach sorts the data and can automatically remove duplicates during the sorting process.
Example 1: Removing Exact Duplicates
proc sort data=original_data noduprecs out=no_duplicates;
by _all_;
run;
Here, PROC SORT
with the noduprecs
option removes exact duplicates. The by _all_
statement ensures that the sort is applied to all variables, making the deduplication based on the entire record.
Example 2: Removing Duplicates Based on Key Variables
proc sort data=original_data nodupkey out=no_duplicates;
by ID;
run;
In this case, PROC SORT
uses the nodupkey
option to remove duplicates based on the ID
variable. The out=
option specifies that the sorted and deduplicated data should be saved to the no_duplicates
dataset.
Advantages of PROC SORT:
- Simplicity:
PROC SORT
is straightforward and easy to use, requiring minimal code to remove duplicates. - Efficiency:
PROC SORT
is optimized for sorting and deduplication, making it very fast, especially for large datasets.
Disadvantages of PROC SORT:
- Limited Flexibility:
PROC SORT
can only remove duplicates based on sorted keys, which might not be suitable for more complex deduplication needs. - No Complex Logic: Unlike the
DATA STEP
,PROC SORT
does not allow you to apply custom logic or conditions during the deduplication process.
Comparison Summary
Each method for removing duplicates in SAS has its strengths and weaknesses:
- Use
PROC SQL
when you need flexibility and the ability to apply complex conditions for deduplication, especially when working within a SQL-based framework. - Use the
DATA STEP
if you require precise control over the deduplication process and need to apply custom logic to determine which duplicates to keep. - Use
PROC SORT
for its simplicity and efficiency when dealing with large datasets, particularly when you only need to remove duplicates based on simple keys.
Conclusion
Removing duplicates is a crucial step in data cleaning and preparation, and SAS provides multiple tools to accomplish this task. By understanding the differences between PROC SQL
, the DATA STEP
, and PROC SORT
, you can choose the most appropriate method for your specific data processing needs. Whether you need flexibility, control, or efficiency, SAS offers the right approach to ensure your data is clean and ready for analysis.
Monday, February 2, 2009
How to scan more than 20 records to determine variable attributes
In Versions 7 and 8 of the SAS System, by default the Import Wizard, PROC IMPORT and the External File Interface (EFI) scan 20 records to determine variable attributes when reading delimited text files.
Changing the default setting can only be done for EFI in Version 7, Release 8 and Release 8.1. Beginning in Release 8.2 changing the default setting is applicable to the Import Wizard, PROC IMPORT and EFI.
Follow the steps below to change the default behavior:
1) Type regedit on the command line (white box with a check mark)
2) When the Registry Editor window opens, double click on the PRODUCTS icon
3) Double click on the BASE icon
4) Double click on the EFI icon
5) In the window on the right the Contents of EFI will be populated with EFI options
6) Double click on GuessingRows
7) When the new window opens with the old value of 20, delete it, enter the new value, and click on OK.
8) Close the Registry Editor window
9) Invoke the Import Wizard, PROC IMPORT or EFI to use the new GuessingRows value
The new value entered for GuessingRows will remain until you change it.
Beginning in SAS 9.1 there is a new statement, GUESSINGROWS=, that can be specified with PROC IMPORT. By specifying the GUESSINGROWS= statement with PROC IMPORT, you do not have to change the GuessingRows value in the SAS Registry. Also beginning in SAS 9.1 you can specify the Number of Rows to Guess when using the Import Wizard in the SAS
Import: Delimited File Options Window or when using EFI in the Options for Import Window.
source:http://support.sas.com/kb/1/075.html