Discover More Tips and Techniques on This Blog

Showing posts with label Proc SQL. Show all posts
Showing posts with label Proc SQL. Show all posts
Best Practices for Joining Additional Columns into an Existing Table Using PROC SQL

Best Practices for Joining Additional Columns into an Existing Table Using PROC SQL

When working with large datasets, it's common to add new columns from another table to an existing table using SQL. However, many programmers encounter the challenge of recursive referencing in PROC SQL when attempting to create a new table that references itself. This blog post discusses the best practices for adding columns to an existing table using PROC SQL and provides alternative methods that avoid inefficiencies.

1. The Common Approach and Its Pitfall

Here's a simplified example of a common approach to adding columns via a LEFT JOIN:

PROC SQL;
CREATE TABLE WORK.main_table AS
SELECT main.*, a.newcol1, a.newcol2
FROM WORK.main_table main
LEFT JOIN WORK.addl_data a
ON main.id = a.id;
QUIT;

While this approach might seem straightforward, it leads to a warning: "CREATE TABLE statement recursively references the target table". This happens because you're trying to reference the main_table both as the source and the target table in the same query. Furthermore, if you're dealing with large datasets, creating a new table might take up too much server space.

2. Best Practice 1: Use a Temporary Table

A better approach is to use a temporary table to store the joined result and then replace the original table. Here’s how you can implement this:

PROC SQL;
   CREATE TABLE work.temp_table AS 
   SELECT main.*, a.newcol1, a.newcol2
   FROM WORK.main_table main
   LEFT JOIN WORK.addl_data a
   ON main.id = a.id;
QUIT;

PROC SQL;
   DROP TABLE work.main_table;
   CREATE TABLE work.main_table AS 
   SELECT * FROM work.temp_table;
QUIT;

PROC SQL;
   DROP TABLE work.temp_table;
QUIT;

This ensures that the original table is updated with the new columns without violating the recursive referencing rule. It also minimizes space usage since the temp_table will be dropped after the operation.

3. Best Practice 2: Use ALTER TABLE for Adding Columns

If you're simply adding new columns without updating existing ones, you can use the ALTER TABLE statement in combination with UPDATE to populate the new columns:

PROC SQL;
   ALTER TABLE WORK.main_table
   ADD newcol1 NUM, 
       newcol2 NUM;

   UPDATE WORK.main_table main
   SET newcol1 = (SELECT a.newcol1 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id),
       newcol2 = (SELECT a.newcol2 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id);
QUIT;

This approach avoids creating a new table altogether, and instead modifies the structure of the existing table.

4. Best Practice 3: Consider the DATA Step MERGE for Large Datasets

For very large datasets, a DATA step MERGE can sometimes be more efficient than PROC SQL. The MERGE statement allows you to combine datasets based on a common key variable, as shown below:

PROC SORT DATA=WORK.main_table; BY id; RUN;
PROC SORT DATA=WORK.addl_data; BY id; RUN;

DATA WORK.main_table;
   MERGE WORK.main_table (IN=in1)
         WORK.addl_data (IN=in2);
   BY id;
   IF in1; /* Keep only records from main_table */
RUN;

While some might find the MERGE approach less intuitive, it can be a powerful tool for handling large tables when combined with proper sorting of the datasets.

Conclusion

The best method for joining additional columns into an existing table depends on your specific needs, including dataset size and available server space. Using a temporary table or the ALTER TABLE method can be more efficient in certain situations, while the DATA step MERGE is a reliable fallback for large datasets.

By following these best practices, you can avoid common pitfalls and improve the performance of your SQL queries in SAS.

Mastering PROC SQL: Best Practices for Seamlessly Adding Columns

Mastering PROC SQL: Best Practices for Seamlessly Adding Columns

Mastering PROC SQL: Best Practices for Seamlessly Adding Columns

PROC SQL in SAS is a powerful tool, especially when you need to add new columns to existing tables. Whether you're updating tables with new data or making calculated columns, following best practices ensures your code is efficient, maintainable, and easy to understand.

1. Use ALTER TABLE to Add Columns

The most straightforward way to add a column to a table is by using the ALTER TABLE statement. It allows you to add one or more columns without disrupting the existing data structure.

proc sql;
   alter table your_table_name
   add new_column1 num format=8. label='New Column 1',
       new_column2 char(20) label='New Column 2';
quit;
Tip: Always specify the data type (num for numeric, char(n) for character) and add useful formats and labels to keep your columns clear and organized.

2. Avoid Duplicating Column Names

Before adding columns, check if a column with the same name already exists. SAS will throw an error if you try to add a column that’s already in the table.

proc sql;
   describe table your_table_name;
quit;

Use the DESCRIBE statement to view the structure and ensure no duplicate column names exist.

3. Efficient Data Insertion

After adding a column, you may need to initialize it with default values. You can do this in a single step using the UPDATE statement to ensure all rows are filled efficiently:

proc sql;
   alter table your_table_name
   add new_column num format=8. label='New Numeric Column';

   update your_table_name
   set new_column = 0; /* Default value */
quit;

4. Use CREATE TABLE for Complex Column Addition

If you need to add columns based on existing data or calculations, it's often better to create a new table. This prevents any unintentional changes to the original table and allows for cleaner manipulation of data:

proc sql;
   create table new_table as
   select existing_column,
          calculated(new_column) as new_column format=8. label='Calculated Column'
   from your_table_name;
quit;
Pro Tip: Use calculated to create columns based on other columns in the same query.

5. Consider Indexing for Performance

If the new column will be used in sorting or joins, adding an index can significantly boost performance. Here’s how to create an index on your new column:

proc sql;
   create index idx_new_column on your_table_name(new_column);
quit;

6. Use Comments for Clarity

Always document your SQL code! Adding comments helps future developers (or your future self) understand the purpose of each column addition:

proc sql;
   alter table your_table_name
   add new_column char(10) label='New Character Column'; /* Added for analysis purposes */
quit;

Final Thoughts

By following these best practices for adding columns via PROC SQL, you can keep your SAS programming efficient and clear. Whether you're maintaining legacy systems or building new reports, these tips will help you seamlessly integrate new data into your tables.

IFC and IFN functions: New IF functions:

Objective: To Reduce the amount of typing required achieving an objective

Syntax: IFN (condition, true, false, missing): ‘N’ stands for Numeric
IFN returns a numeric value. It returns the true, false or missing value depending on whether the condition is true, false or missing.

Syntax: IFC (condition, true, false, missing): ‘C’ stands for character
IFC function has four parameters:
1) a logical expression
2) character value returned when true
3) value returned when false
4) value retuned when missing, which is optional.

IFC (logical-expression, Character-value-returned-when-true, Character-value-returned-when-false, Character-value-returned-when-missing);
IFC returns a character value. It returns the true, false or missing value depending on whether the condition is true, false or missing.

Example: Assign a value to the VISIT variable (new) as per the VTYPE variable value.
We can certainly achieve this task in diff. ways.. here are they...

data old;
input sitesub $ vtype vdate $;
cards;
01-303 1.4 12/23/2005
01-304 1.5 09/03/2005
01-305 1.4 10/09/2005
01-306 1.5 11/17/2005
01-307 1.5 05/29/2005
01-308 . 04/30/2005
;
run;

1) * Using Proc Format:
proc format;
value vt
1.4='Baseline'
1.5='Retreat'
.='Missing'
;
run;
data new;
set old;
length visit $20;
visit=put(vtype,vt.);
run;

2) *Using the IF-THEN/ELSE statements;
data new;
set old;
length visit $20;
if vtype=1.4 then visit='Baseline';
else if vtype=1.5 then visit='Retreat';
else if vtype=. then visit='Missing';
run;

3) *Using the Proc SQL;
Proc sql;
Create table new4 as
Select *,
case
when vtype=1.4 then 'Baseline'
when vtype=1.5 then 'Retreat'
else 'Missing'
end as visit
from old;
Quit;

All three above methods required significant amount of typing when we compared with the below ones......
*Using the IFC function in Datastep;
data new;
set old;
length visit $20;
visit=ifc(vtype=1.4,'Baseline','Retreat','Missing');
run;


*Using the IFC function in Proc SQl;
Proc sql;
Create table new as
Select *,
Ifc(vtype=1.4,'Baseline','Retreat','Missing') as visit
from old;
Quit;

In a DATA step, if the IFC function returns a value to a variable that has not previously been assigned a length, then that variable is given a length of 200 bytes.

Note: IFN and IFN functions cannot be used if we want to assign more than 3 values (including missing) to a variable.

Proc Compare/Dictionary.Coulmns/Dictionary.Tables.: Program used to compare the SAS datasets in two directories

Here is the new Proc compare.sas program, I have developed ....to compare all the datasets in 2 directories(testing and production) at once and to quick check any mismatches.

Proc compare only check if there is any mismatches between the datasets in 2 directories. If any, it reports otherwise it will give us a note saying that:

Note: No unequal Values were found. All values compared are exactly equal.

See the proc compare snap shot:


What if any dataset has the length more than 8, and what if any variable length more than 40 and what if the dataset name has more than 8 characters etc... Proc Compare doesn't address this issue.

I have developed the following program to address this issue. It’s a mandatory that we need to follow certain requirements when we are preparing for an electronic submission to the FDA.

The following are some of the QC checks FDA requirements:
1) The length of a dataset name & variable name shouldn’t be more than 8 characters.
2) The length data set label and a variable label shouldn’t be more than 40 characters.

This following program will give the SAS programmer a basic idea of how to check the dataset and variable attributes using the metadata (dictionary.columns and dictionary.tables) using Proc SQL. This program will save us some critical time.


Here are the details this program will give us…
1) Compare the variable attributes and prints the differences (length, format and informats) between production and testing directories .
2) Compare the labels, no. of observations and no. of variables in the datasets and prints if there is any differences between testing and production directories.
3) Checks the data set label and its length and prints if any dataset name GT 8 and dataset label GT 40.
4) Checks the variable label and their lengths of a dataset and prints if any dataset name GT 8 and dataset label GT 40.
5) Checks length of (character)variables and prints them if any variable has GT 200 in length;

****************************************************************;
*** Program: proccompare.sas ***;
*** Version: 1.0 ***;
*** Client: ABC Pharmaceuticals, Inc. ***;
*** Protocol: ABC-2009 ***;
*** Programmer: Sarath Annapareddy ***;
*** Date: Mar 31st 2009 ***;
*** Purpose: Program used to compare the attributes *** lengths,labels,formats and ***;
*** informats) of datasets in production and testing libraries***;
*** Program also used to check the length of variables in each***; *** dataset. ***;
*****************************************************************;

libname test 'H:\company\client\Testing\#####\###########\### datasets';
libname prodn 'H:\company\client\Testing\#####\###########\ ### datasets';

*creating the proc contents like output with Proc Sql;
proc sql noprint;
create table _test as
select memname label='Dataset Name',
name label='Variable',
type label='Type',
length as length,
label,format label='Format',
informat label='Informat'
from dictionary.columns
where indexw("TEST",libname)
order by memname, name;
create table _test1 as
select distinct libname,memname,memlabel,nobs,nvar
from dictionary.tables
where (indexw("TEST",libname));
quit;

*creating the proc contents like output with Proc Sql;
proc sql noprint;
create table _prodn as
select memname label='Dataset Name',
name label='Variable',
type label='Type',
length as length,
label,format label='Format',
informat label='Informat'
from dictionary.columns
where indexw("PRODN",libname)
order by memname, name;
create table _prodnl as
select distinct libname,memname,memlabel,nobs,nvar
from dictionary.tables
where (indexw("PRODN",libname));
quit;

*Run proc compare to check variable attributes in prodn and test directories;
ods listing close;
ods rtf style=style.rtf file="Compare_vars_Out.rtf";
proc compare data=_prodn compare=_test;
id memname name label;
run;
ods rtf close;
ods listing;

*Run proc compare to check labels, no. of obs and no. of variables of the datasets;
ods listing close;
ods rtf style=style.rtf file="Compare_dataset_Out.rtf";
proc compare data=_prodnl(drop=libname) compare=_test1(drop=libname);
run;

ods rtf close;
ods listing;

*Check analysis data set name, label and their lengths;
ods listing close;
ods rtf style=style.rtf file="variable_length_check.rtf";
proc sql noprint;
create table v_length as
select memname label='Dataset Name', length(memname) as nam_lnth, memlabel label='Variable',
length(memlabel) as lab_lnth from dictionary.tables
where libname="PRODN" and (length(memname)>8 or length(memlabel)>40);
quit;
ods rtf close;
ods listing;

ods listing close;
ods rtf style=style.rtf file="label_length_check.rtf";
*Check variable name, label and their lengths;
proc sql noprint;
create table l_length as
select memname label='Dataset Name', name label='Variable', length(name) as var_lnth, label,
length(label)as lab_lnth from dictionary.columns
where libname="PRODN" and (length(name)>8 or length(label)>40);
quit;

ods rtf close;
ods listing;

*Check length of character variable values that were defined GT 200;
ods listing close;
ods rtf style=style.rtf file="variables_gt_ 200_length.rtf";
proc sql noprint;
create table longvar as
select memname, name, length
from dictionary.columns
where libname="PRODN" and length > 200;
quit;

ods rtf close;
ods listing;




TS-DOC TS-440 - How can I use PROC COMPARE to produce a report ... -

How to determine the executing program name and path programatically

Sometimes, we need to keep the name and path of the executing program in the FOOTNOTE of the generated table or listings.

I have always created a macro variable using the %let statement and, then I called the Macro variable in the footnote statement to get the name of the program. Eventhough it is simple.. it may not be suitable when we write application which need to self document...

Here is another technique which can be used to retrieve the pathname and filename (last executed) .....

To get the last opened filename:

proc sql noprint;
select scan(xpath,-1,'\') into :progname from sashelp.vextfl
where upcase(xpath) like '%.SAS';
quit;

%put &progname;

result: filename.sas

or

data _null_;
set sashelp.vextfl(where=(upcase(xpath) like '%.SAS'));
call symput('program', scan(xpath,-1,'\'));
run;


%put &program;

result: filename.sas
To get path name ... then .. use the following code;


proc sql noprint;
select xpath into :progname
from sashelp.vextfl where upcase(xpath) like '%.SAS';
quit;

%put &progname;

Result:
H:\Company\Client\Study\ABC304\Programs\filename.sas

or

data _null_;
set sashelp.vextfl
(where=(upcase(xpath) like '%.SAS'));
call symput('program', xpath));
run;


%put &program;

Result:
H:\Company\Client\Study\ABC304\Programs\filename.sas

Here is another coolway to retrieve the filename and path of last executed program;

If you are using Windows operating system, we can access the executing program and its path using two environmental variables SAS_EXECFILENAME and SAS_EXECFILEPATH.

To access the values of these two variables use %SYSGET macro function.
The below macro calls(pathfind and filefind) gets u the name of the executing program and its path:

%macro pathfind;
%sysget(SAS_EXECFILEPATH)
%mend pathfind;
%put %pathfind;

%macro filefind;
%sysget(SAS_EXECFILENAME)
%mend filefind;
%put %filefind;

PROC SQL basics, tips and techniques and sample code programs

Proc SQL:


Power of SAS SQL:

• SQL looks at datasets differently from SAS
– SAS looks at a dataset one record at a time, using an implied loop that moves from the first record to the last
– SQL looks at all the records, as a single object
• Because of this difference SQL can easily do few things that are more difficult to do in SAS
• There are a number of built in functions in SQL that can be used in a select statement
• Because of how SQL handles a dataset, these functions work over the entire dataset
• Functions:

– Count: Counts Values
– Sum: Sums Values
– Max: Identifies the largest value
– Min: Identifies the smallest value
– Mean: Averages the values

Read more at www.cognigencorp.com/perspective/tipsNtricks.pub/1/PROC%20SQL%20Talk_12_.ppt

SAS SQL

Introduction to Proc SQL

AN INTRODUCTION TO PROC SQL®

PROC SQL: When and How to Use It?

Proc SQL – A Primer for SAS Programmers

Understanding PROC SQL

Creating Macro Variables with Proc SQL

DATA Step vs. PROC SQL: What’s a neophyte to do?


Effective Use of SQL in SAS Programming

Variable List Short-Cuts in PROC SQL

Ten Best PROC SQL Tips and Techniques

Undocumented and Hard-to-Find PROC SQL C2 AE_Features

Variable List Short-Cuts in PROC SQL

Exploring the World of PROC SQL® Joins

Using PROC SQL to Create Ad Hoc Reports

CREATING MACRO VARIABLES VIA PROC SQL

PROC SQL for DATA Step Die-hards

Calculating Changes and Differences Using PROC SQL —With Clinical Data Examples

Validating Data Via PROC SQL

Mastering Duplicates Removal in SAS: A Comprehensive Guide to Using PROC SQL, DATA STEP, and PROC SORT

Removing Duplicate Observations in SAS: A Comprehensive Guide

Removing Duplicate Observations in SAS: A Comprehensive Guide

In data analysis, it's common to encounter datasets with duplicate records that need to be cleaned up. SAS offers several methods to remove these duplicates, each with its strengths and suitable scenarios. This article explores three primary methods for removing duplicate observations: using PROC SQL, the DATA STEP, and PROC SORT. We will provide detailed examples and discuss when to use each method.

Understanding Duplicate Observations

Before diving into the methods, let's clarify what we mean by duplicate observations. Duplicates can occur in different forms:

  • Exact Duplicates: All variables across two or more observations have identical values.
  • Key-Based Duplicates: Observations are considered duplicates based on the values of specific key variables (e.g., ID, Date).

The method you choose to remove duplicates depends on whether you are dealing with exact duplicates or key-based duplicates.

Approach 1: Removing Duplicates with PROC SQL

PROC SQL is a versatile tool in SAS, allowing you to execute SQL queries to manipulate and analyze data. When removing duplicates, you can use the SELECT DISTINCT statement or apply more complex conditions.

Example 1: Removing Exact Duplicates

proc sql;
    create table no_duplicates as
    select distinct *
    from original_data;
quit;

This code removes all exact duplicates, creating a new dataset no_duplicates that contains only unique records. The SELECT DISTINCT * statement ensures that every unique combination of variable values is retained only once.

Example 2: Removing Duplicates Based on Key Variables

proc sql;
    create table no_duplicates as
    select distinct ID, Name, Age
    from original_data;
quit;

Here, duplicates are removed based on the combination of the ID, Name, and Age variables. This is useful when you want to keep unique records for specific key variables, ignoring other variables in the dataset.

Advantages of PROC SQL:

  • Flexibility: PROC SQL can handle complex queries, allowing you to remove duplicates based on multiple or complex criteria.
  • Powerful Filtering: SQL allows you to apply conditions and filters easily, making it easier to control the exact duplicates you want to remove.

Disadvantages of PROC SQL:

  • Performance: The SELECT DISTINCT statement can be slower with very large datasets, as it requires scanning the entire dataset to identify unique records.
  • Complexity: SQL syntax may be less intuitive for those who are more comfortable with traditional SAS programming.

Approach 2: Removing Duplicates with the DATA STEP

The DATA STEP in SAS provides a programmatic approach to removing duplicates, giving you fine-grained control over the process. This method typically involves sorting the dataset first and then using conditional logic to remove duplicates.

Example 1: Removing Exact Duplicates

To remove exact duplicates, you must first sort the data by all variables and then use the DATA STEP to retain only the first occurrence of each observation.

proc sort data=original_data noduprecs out=sorted_data;
    by _all_;
run;

data no_duplicates;
    set sorted_data;
run;

The noduprecs option in PROC SORT removes exact duplicate records. The sorted and deduplicated dataset is then saved as no_duplicates.

Example 2: Removing Duplicates Based on Key Variables

If you want to remove duplicates based on specific key variables, you can sort the data by those variables and use the first. or last. functions in the DATA STEP to control which duplicates are kept.

proc sort data=original_data;
    by ID;
run;

data no_duplicates;
    set original_data;
    by ID;
    if first.ID;
run;

In this example, the dataset is first sorted by the ID variable. The first.ID statement ensures that only the first occurrence of each ID is kept, removing any subsequent duplicates.

Advantages of the DATA STEP:

  • Fine-Grained Control: The DATA STEP allows you to apply custom logic to the deduplication process, such as retaining the first or last occurrence based on additional criteria.
  • Efficiency: When dealing with large datasets, this method can be more efficient, especially if you need to apply complex logic.

Disadvantages of the DATA STEP:

  • Manual Sorting: You need to sort the data before removing duplicates, adding an extra step to the process.
  • Complexity: The logic required to remove duplicates can be more complex and less intuitive than using PROC SORT.

Approach 3: Removing Duplicates with PROC SORT

PROC SORT is one of the simplest and most commonly used methods for removing duplicates in SAS. This approach sorts the data and can automatically remove duplicates during the sorting process.

Example 1: Removing Exact Duplicates

proc sort data=original_data noduprecs out=no_duplicates;
    by _all_;
run;

Here, PROC SORT with the noduprecs option removes exact duplicates. The by _all_ statement ensures that the sort is applied to all variables, making the deduplication based on the entire record.

Example 2: Removing Duplicates Based on Key Variables

proc sort data=original_data nodupkey out=no_duplicates;
    by ID;
run;

In this case, PROC SORT uses the nodupkey option to remove duplicates based on the ID variable. The out= option specifies that the sorted and deduplicated data should be saved to the no_duplicates dataset.

Advantages of PROC SORT:

  • Simplicity: PROC SORT is straightforward and easy to use, requiring minimal code to remove duplicates.
  • Efficiency: PROC SORT is optimized for sorting and deduplication, making it very fast, especially for large datasets.

Disadvantages of PROC SORT:

  • Limited Flexibility: PROC SORT can only remove duplicates based on sorted keys, which might not be suitable for more complex deduplication needs.
  • No Complex Logic: Unlike the DATA STEP, PROC SORT does not allow you to apply custom logic or conditions during the deduplication process.

Comparison Summary

Each method for removing duplicates in SAS has its strengths and weaknesses:

  • Use PROC SQL when you need flexibility and the ability to apply complex conditions for deduplication, especially when working within a SQL-based framework.
  • Use the DATA STEP if you require precise control over the deduplication process and need to apply custom logic to determine which duplicates to keep.
  • Use PROC SORT for its simplicity and efficiency when dealing with large datasets, particularly when you only need to remove duplicates based on simple keys.

Conclusion

Removing duplicates is a crucial step in data cleaning and preparation, and SAS provides multiple tools to accomplish this task. By understanding the differences between PROC SQL, the DATA STEP, and PROC SORT, you can choose the most appropriate method for your specific data processing needs. Whether you need flexibility, control, or efficiency, SAS offers the right approach to ensure your data is clean and ready for analysis.

PROC SQL vs. DATA STEP in SAS: A Comprehensive Comparison of Syntax, Strengths, and Use Cases

PROC SQL vs. DATA STEP in SAS: A Comprehensive Comparison of Syntax and Use Cases

PROC SQL vs. DATA STEP in SAS: A Comprehensive Comparison of Syntax and Use Cases

SAS provides powerful tools for data manipulation, with `PROC SQL` and `DATA STEP` being two of the most commonly used approaches. Each has its own strengths, syntax, and use cases, making them suitable for different types of tasks. This report provides a detailed comparison of `PROC SQL` and `DATA STEP` to help you understand when and how to use each approach effectively.

Understanding PROC SQL and DATA STEP

Before diving into the comparison, let's briefly introduce `PROC SQL` and `DATA STEP`:

  • PROC SQL: A procedure that enables you to use SQL (Structured Query Language) within SAS to query, manipulate, and manage data. It is particularly powerful for operations that involve multiple tables or require complex querying.
  • DATA STEP: A foundational part of SAS programming, allowing you to create, transform, and analyze datasets. `DATA STEP` is ideal for row-wise processing, data transformation, and straightforward data manipulation tasks.

Comparing Syntax: PROC SQL vs. DATA STEP

Below are several common data manipulation tasks, with examples of how they are performed using both `PROC SQL` and `DATA STEP`.

1. Creating a New Dataset

Creating a new dataset is a fundamental task in SAS. Both `PROC SQL` and `DATA STEP` can accomplish this, but their approaches differ.

PROC SQL:

proc sql;
    create table new_data as
    select *
    from old_data;
quit;

DATA STEP:

data new_data;
    set old_data;
run;

Comparison: The `DATA STEP` uses the `SET` statement to reference the source dataset, which is more direct. `PROC SQL`, on the other hand, uses the `CREATE TABLE` statement combined with a `SELECT` statement, which can be more powerful when selecting specific columns or applying complex logic.

2. Filtering Data

Filtering allows you to create a subset of data based on specific criteria.

PROC SQL:

proc sql;
    create table filtered_data as
    select *
    from old_data
    where age > 30;
quit;

DATA STEP:

data filtered_data;
    set old_data;
    if age > 30;
run;

Comparison: Both approaches allow for straightforward data filtering. In `PROC SQL`, filtering is done within the `WHERE` clause of the `SELECT` statement, while in `DATA STEP`, filtering is achieved using the `IF` statement. Both methods are efficient, though `PROC SQL` can be more concise for complex filtering logic.

3. Joining Datasets

Joining datasets is essential when you need to combine information from two or more tables. This is a common scenario in relational databases.

PROC SQL:

proc sql;
    create table joined_data as
    select a.*, b.*
    from dataset_a as a
    inner join dataset_b as b
    on a.id = b.id;
quit;

DATA STEP:

proc sort data=dataset_a;
    by id;
run;

proc sort data=dataset_b;
    by id;
run;

data joined_data;
    merge dataset_a(in=a) dataset_b(in=b);
    by id;
    if a and b;
run;

Comparison: `PROC SQL` is more intuitive for performing joins, offering flexibility with different types of joins (INNER, LEFT, RIGHT, FULL). The `DATA STEP` requires the datasets to be pre-sorted and uses the `MERGE` statement, which is effective but more manual. `PROC SQL` is generally preferred for complex joins.

4. Aggregating Data

Aggregation, such as calculating sums, averages, or counts, is another common task in data analysis.

PROC SQL:

proc sql;
    create table summary_data as
    select id, mean(value) as avg_value
    from old_data
    group by id;
quit;

DATA STEP:

proc sort data=old_data;
    by id;
run;

data summary_data;
    set old_data;
    by id;
    retain sum_value count_value;
    if first.id then do;
        sum_value = 0;
        count_value = 0;
    end;
    sum_value + value;
    count_value + 1;
    if last.id then do;
        avg_value = sum_value / count_value;
        output;
    end;
run;

Comparison: `PROC SQL` simplifies aggregation with the `GROUP BY` clause, making it easier to perform summary statistics. The `DATA STEP` approach requires more steps, including sorting and manually calculating the aggregates using `RETAIN` and `FIRST./LAST.` logic. `PROC SQL` is generally more efficient for this type of task.

5. Subsetting Data

Subsetting involves creating a smaller dataset that meets specific criteria from a larger dataset.

PROC SQL:

proc sql;
    create table subset_data as
    select *
    from old_data
    where sex = 'M' and age > 25;
quit;

DATA STEP:

data subset_data;
    set old_data;
    if sex = 'M' and age > 25;
run;

Comparison: The syntax for subsetting data is quite similar between `PROC SQL` and `DATA STEP`. Both approaches are efficient, with `PROC SQL` using the `WHERE` clause and `DATA STEP` using the `IF` statement to achieve the same result.

Strengths and Weaknesses of Each Approach

Both `PROC SQL` and `DATA STEP` have distinct strengths and weaknesses, depending on the task at hand:

PROC SQL

  • Strengths:
    • Ideal for complex queries and manipulations involving multiple datasets.
    • Powerful for joins, aggregations, and subqueries.
    • SQL syntax is familiar to those with a background in database management.
    • Concise code for tasks like aggregation and filtering.
  • Weaknesses:
    • May be less efficient for row-wise operations compared to `DATA STEP`.
    • Requires a learning curve for those unfamiliar with SQL.
    • Less intuitive for tasks involving iterative processing or complex data transformations.

DATA STEP

  • Strengths:
    • Excellent for row-wise data processing and transformations.
    • Native to SAS, making it familiar to SAS users.
    • Flexible for custom data manipulation, including loops and conditional logic.
    • Efficient for large datasets requiring simple, straightforward processing.
  • Weaknesses:
    • Requires more code for tasks like joins and aggregation, which `PROC SQL` handles more succinctly.
    • Sorting is often required before merging datasets.
    • Less concise for complex querying and multiple dataset manipulations.

Choosing Between PROC SQL and DATA STEP

Deciding between `PROC SQL` and `DATA STEP` depends on the specific requirements of your task:

  • Choose `PROC SQL` when:
    • You need to perform complex joins, aggregations, or subqueries.
    • Your task involves querying and manipulating multiple datasets simultaneously.
    • You are familiar with SQL or prefer SQL syntax.
    • Efficiency is important for tasks like summarization and filtering.
  • Choose `DATA STEP` when:
    • You need to perform row-wise operations, data transformations, or complex conditional logic.
    • Your task involves data cleaning, sorting, or simple merges.
    • You are more comfortable with SAS-specific programming and need flexibility in data manipulation.
    • Efficiency is needed for handling large datasets with straightforward processing requirements.

Conclusion

Both `PROC SQL` and `DATA STEP` are powerful tools in SAS, each with its own advantages and ideal use cases. Understanding the differences in their syntax and capabilities allows you to choose the most appropriate tool for your specific data manipulation tasks. Whether you prefer the flexibility of SQL or the procedural control of `DATA STEP`, mastering both will enhance your ability to handle complex data processing in SAS efficiently.

Ten Great Reasons to Learn SQL Procedure/ HOW TO WORK WITH SUBQUERY IN THE SQL PROCEDURE

Ten Great Reasons to Learn SQL Procedure

Top 10 List
# 1. Using the Pass-Through Facility
# 2. Joining Tables of Data
# 3. Creating and Using Views
# 4. Producing Reports
# 5. Using Summary Functions
# 6. Creating and Modifying Tables
# 7. Grouping Data
# 8. Ordering Data
# 9. Subsetting Data
# 10. Retrieving (Querying) Data

source: ssc.utexas.edu



WORKING WITH SUBQUERY IN THE SQL PROCEDURE

Example 1: Find patients who have records in table DEMO but not in table VITAL.

Proc SQL;
Select patid from demog
Where patid not in (select patid from vital);QUIT;
Result:
PATID
-----
150

Example 2: Find patients who didn’t have their blood pressures measured twice at a visit.
Proc SQL;
select patid from vital T1
where 2>(select count(*)
from vital T2
where T1.patid=T2.patid
and
T1.visid=T2.visid);QUIT;
Result:
PATID
-----
110
140

Example 3: Find patients who have the maximum number of visits.
Proc SQL;
Select distinct patid
from vitmean as a
Where not exists
(select visid from vitmean
Except
Select visid from vitmean as b
Where a.patid=b.patid);QUIT;
Result:
PATID
-----
120

Example 4: Find patients whose age is in the average age +/- 5 years.
Proc SQL;
Select patid
from demog
where age between
(select Avg(age)
from demog)-5
and
(select avg(age)
from demog)+5; QUIT;
Result:
PATID
------
110

Example 4: Find patients who didn’t have maximum number of visits.
Proc SQL;
Select patid
From vitmean as a
Group by patid
Having count(visid)
<(select max(cnt) from (select count(visid) as cnt from vitmean group by patid) ); QUIT; Result:
PATID
------
100
110
130
140

Example 5: Calculate the interval between this visit and previous visit for each patient visit.

proc sql;
select a.patid, a.visid,
intck('day', b.visdate, a.visdate)
as interval
from
(select distinct patid,
visid, visdate
from vital
where visid>'1') as a,
(select distinct patid,
visid, visdate
from vital
where visid<'4') as b where(input(a.visid,2.)-1 =input(b.visid,2.)) and a.patid=b.patid order by a.patid, a.visid; QUIT; Result:
PATID VISID INTERVAL
----------------------------
100 2 7
100 3 7
120 2 8
120 3 6
120 4 34
130 2 35
130 3 38
140 2 9

Example 6: Find patients who have exact same visits.
proc sql;
create table vital as
select distinct patid, visid
from vital;
select T1.PATID as PATID,
T2.PATID as PATID1
FROM
(select patid, visid from VITAL)
AS T1
inner join
(select patid, visid from VITAL)
AS T2
on T1.visid=T2.visid
and T1.patid < patid="T1.PATID)" patid="T2.PATID);" color="#cc33cc">RESULT:
PATID PATID1
-----------------
100 130


APPENDIX: EXAMPLE SAS DATA SETS For The ABOVE PROGRAMS:

Data demog;
input patid $1-3 age 5-6 sex $ 8;
datalines;
100 34 M
110 45 F
120 67 M
130 55 M
140 40 F
150 38 M
;
run;

data vital;
input patid $
visid $
visdate mmddyy10.
diabp
sysbp;
format visdate mmddyy10.;
datalines;
100 1 01/12/97 66 110
100 1 01/12/97 68 110
100 2 01/19/97 66 .
100 2 01/19/97 66 110
100 3 01/26/97 68 120
100 3 01/26/97 68 120
110 1 02/03/97 68 110
110 3 02/12/97 64 115
110 3 02/12/97 68 115
120 1 04/05/97 66 105
120 1 04/05/97 64 105
120 2 04/13/97 110 64
120 2 04/13/97 105 82
120 3 04/19/97 63 105
120 3 04/19/97 105 90
120 4 05/23/97 90 106
120 4 05/23/97 92 108
130 1 02/12/97 66 111
130 1 02/12/97 67 110
130 2 03/19/97 66 109
130 2 03/19/97 66 110
130 3 04/26/97 68 121
130 3 04/26/97 68 118
140 1 02/03/97 68 110
140 2 02/12/97 64 115
140 2 02/12/97 68 115
;
run;

source: www.nesug.org/proceedings/nesug98/dbas/p005.pdf

Disclosure:

In the spirit of transparency and innovation, I want to share that some of the content on this blog is generated with the assistance of ChatGPT, an AI language model developed by OpenAI. While I use this tool to help brainstorm ideas and draft content, every post is carefully reviewed, edited, and personalized by me to ensure it aligns with my voice, values, and the needs of my readers. My goal is to provide you with accurate, valuable, and engaging content, and I believe that using AI as a creative aid helps achieve that. If you have any questions or feedback about this approach, feel free to reach out. Your trust and satisfaction are my top priorities.