Best Practices for Joining Additional Columns into an Existing Table Using PROC SQL
Best Practices for Joining Additional Columns into an Existing Table Using PROC SQL
When working with large datasets, it's common to add new columns from another table to an existing table using SQL. However, many programmers encounter the challenge of recursive referencing in PROC SQL when attempting to create a new table that references itself. This blog post discusses the best practices for adding columns to an existing table using PROC SQL and provides alternative methods that avoid inefficiencies.
1. The Common Approach and Its Pitfall
Here's a simplified example of a common approach to adding columns via a LEFT JOIN
:
PROC SQL;
CREATE TABLE WORK.main_table AS
SELECT main.*, a.newcol1, a.newcol2
FROM WORK.main_table main
LEFT JOIN WORK.addl_data a
ON main.id = a.id;
QUIT;
While this approach might seem straightforward, it leads to a warning: "CREATE TABLE statement recursively references the target table". This happens because you're trying to reference the main_table
both as the source and the target table in the same query. Furthermore, if you're dealing with large datasets, creating a new table might take up too much server space.
2. Best Practice 1: Use a Temporary Table
A better approach is to use a temporary table to store the joined result and then replace the original table. Here’s how you can implement this:
PROC SQL;
CREATE TABLE work.temp_table AS
SELECT main.*, a.newcol1, a.newcol2
FROM WORK.main_table main
LEFT JOIN WORK.addl_data a
ON main.id = a.id;
QUIT;
PROC SQL;
DROP TABLE work.main_table;
CREATE TABLE work.main_table AS
SELECT * FROM work.temp_table;
QUIT;
PROC SQL;
DROP TABLE work.temp_table;
QUIT;
This ensures that the original table is updated with the new columns without violating the recursive referencing rule. It also minimizes space usage since the temp_table
will be dropped after the operation.
3. Best Practice 2: Use ALTER TABLE
for Adding Columns
If you're simply adding new columns without updating existing ones, you can use the ALTER TABLE
statement in combination with UPDATE
to populate the new columns:
PROC SQL;
ALTER TABLE WORK.main_table
ADD newcol1 NUM,
newcol2 NUM;
UPDATE WORK.main_table main
SET newcol1 = (SELECT a.newcol1
FROM WORK.addl_data a
WHERE main.id = a.id),
newcol2 = (SELECT a.newcol2
FROM WORK.addl_data a
WHERE main.id = a.id);
QUIT;
This approach avoids creating a new table altogether, and instead modifies the structure of the existing table.
4. Best Practice 3: Consider the DATA Step MERGE
for Large Datasets
For very large datasets, a DATA step MERGE
can sometimes be more efficient than PROC SQL
. The MERGE
statement allows you to combine datasets based on a common key variable, as shown below:
PROC SORT DATA=WORK.main_table; BY id; RUN;
PROC SORT DATA=WORK.addl_data; BY id; RUN;
DATA WORK.main_table;
MERGE WORK.main_table (IN=in1)
WORK.addl_data (IN=in2);
BY id;
IF in1; /* Keep only records from main_table */
RUN;
While some might find the MERGE
approach less intuitive, it can be a powerful tool for handling large tables when combined with proper sorting of the datasets.
Conclusion
The best method for joining additional columns into an existing table depends on your specific needs, including dataset size and available server space. Using a temporary table or the ALTER TABLE
method can be more efficient in certain situations, while the DATA step MERGE
is a reliable fallback for large datasets.
By following these best practices, you can avoid common pitfalls and improve the performance of your SQL queries in SAS.