Best Practices for Joining Additional Columns into an Existing Table Using PROC SQL

When working with large datasets, it's common to add new columns from another table to an existing table using SQL. However, many programmers encounter the challenge of recursive referencing in PROC SQL when attempting to create a new table that references itself. This blog post discusses the best practices for adding columns to an existing table using PROC SQL and provides alternative methods that avoid inefficiencies.

1. The Common Approach and Its Pitfall

Here's a simplified example of a common approach to adding columns via a LEFT JOIN:

PROC SQL;
CREATE TABLE WORK.main_table AS
SELECT main.*, a.newcol1, a.newcol2
FROM WORK.main_table main
LEFT JOIN WORK.addl_data a
ON main.id = a.id;
QUIT;

While this approach might seem straightforward, it leads to a warning: "CREATE TABLE statement recursively references the target table". This happens because you're trying to reference the main_table both as the source and the target table in the same query. Furthermore, if you're dealing with large datasets, creating a new table might take up too much server space.

2. Best Practice 1: Use a Temporary Table

A better approach is to use a temporary table to store the joined result and then replace the original table. Here’s how you can implement this:

PROC SQL;
   CREATE TABLE work.temp_table AS 
   SELECT main.*, a.newcol1, a.newcol2
   FROM WORK.main_table main
   LEFT JOIN WORK.addl_data a
   ON main.id = a.id;
QUIT;

PROC SQL;
   DROP TABLE work.main_table;
   CREATE TABLE work.main_table AS 
   SELECT * FROM work.temp_table;
QUIT;

PROC SQL;
   DROP TABLE work.temp_table;
QUIT;

This ensures that the original table is updated with the new columns without violating the recursive referencing rule. It also minimizes space usage since the temp_table will be dropped after the operation.

3. Best Practice 2: Use `ALTER TABLE` for Adding Columns

If you're simply adding new columns without updating existing ones, you can use the ALTER TABLE statement in combination with UPDATE to populate the new columns:

PROC SQL;
   ALTER TABLE WORK.main_table
   ADD newcol1 NUM, 
       newcol2 NUM;

   UPDATE WORK.main_table main
   SET newcol1 = (SELECT a.newcol1 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id),
       newcol2 = (SELECT a.newcol2 
                  FROM WORK.addl_data a 
                  WHERE main.id = a.id);
QUIT;

This approach avoids creating a new table altogether, and instead modifies the structure of the existing table.

4. Best Practice 3: Consider the `DATA Step MERGE` for Large Datasets

For very large datasets, a DATA step MERGE can sometimes be more efficient than PROC SQL. The MERGE statement allows you to combine datasets based on a common key variable, as shown below:

PROC SORT DATA=WORK.main_table; BY id; RUN;
PROC SORT DATA=WORK.addl_data; BY id; RUN;

DATA WORK.main_table;
   MERGE WORK.main_table (IN=in1)
         WORK.addl_data (IN=in2);
   BY id;
   IF in1; /* Keep only records from main_table */
RUN;

While some might find the MERGE approach less intuitive, it can be a powerful tool for handling large tables when combined with proper sorting of the datasets.

Conclusion

The best method for joining additional columns into an existing table depends on your specific needs, including dataset size and available server space. Using a temporary table or the ALTER TABLE method can be more efficient in certain situations, while the DATA step MERGE is a reliable fallback for large datasets.

By following these best practices, you can avoid common pitfalls and improve the performance of your SQL queries in SAS.

STUDYSAS BLOG