Power Up Your Data Cleaning with the SAS COMPRESS Function

When handling large datasets in SAS, it's common to encounter unwanted characters, extra spaces, or other clutter that can hamper your data analysis. Fortunately, the COMPRESS function helps you clean up your text data efficiently. It can remove, or even keep, specific characters from your strings with minimal effort. Keep reading to learn how you can harness the full potential of the SAS COMPRESS function.

1. Quick Overview of the COMPRESS Function

The COMPRESS function in SAS removes (or optionally keeps) certain characters from a character string. Its basic syntax looks like this:

result_string = COMPRESS(source_string <, characters_to_remove> <, modifiers>);
  • source_string: The original string you want to modify.
  • characters_to_remove (optional): A list of specific characters to eliminate (or to keep, when the K modifier is used).
  • modifiers (optional): Special flags (e.g., remove digits, punctuation, etc.).

2. Removing Specific Characters

Suppose you have a string containing multiple symbols and you only want to remove a specific one, such as the ampersand (&).

data _null_;
  original = "Cats & Dogs 123";
  no_andsign = compress(original, '&');
  put no_andsign=;
  /* Result: "Cats  Dogs 123" (the blanks on either side of the & remain) */
run;

In this example, we explicitly provide '&' in the second argument, so only ampersands are removed. Spaces, digits, and other characters remain.
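
The second argument can also list several characters at once, and each one is removed wherever it appears. A minimal sketch, using a made-up phone number as the input (the variable names are illustrative only):

```sas
data _null_;
  /* Hypothetical sample value, not taken from a real dataset */
  phone = "(555) 123-4567";

  /* Remove parentheses, hyphens, and blanks in a single call */
  digits_only = compress(phone, '()- ');

  put digits_only=;   /* Result: "5551234567" */
run;
```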

3. Removing All Spaces by Default

If you leave out the second argument entirely, COMPRESS removes all blanks by default: leading, embedded, and trailing. Here's a simple demonstration:

data _null_;
  original = "Hello World ";
  remove_blanks = compress(original);
  put remove_blanks=;
  /* Result: "HelloWorld" */
run;

4. Unleashing the Power of Modifiers

Modifiers make COMPRESS extremely powerful, as they allow you to target entire categories of characters with minimal code. Here are some of the most commonly used modifiers:

Modifier   Action
A          Removes all letters (alphabetic characters).
D          Removes all digits (0-9).
P          Removes all punctuation marks.
S          Removes all space characters (blanks, tabs, line feeds, and similar whitespace).
U          Removes uppercase letters (A-Z).
L          Removes lowercase letters (a-z).
K          Keeps the listed characters instead of removing them.
I          Ignores case when matching the characters to remove or keep.
T          Trims trailing blanks from the source string and the character list before processing.

Modifiers are not case-sensitive, so 'd' and 'D' behave identically.
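
To illustrate the I modifier from the table above, here is a small sketch that compares a case-sensitive call with a case-insensitive one. The sample string and variable names are made up for the example:

```sas
data _null_;
  /* Hypothetical sample value */
  fruit = "Banana bread";

  /* Without 'i', only the lowercase b is removed */
  case_sensitive = compress(fruit, 'b');

  /* With 'i', both B and b are removed */
  ignore_case = compress(fruit, 'b', 'i');

  put case_sensitive=;   /* Result: "Banana read" */
  put ignore_case=;      /* Result: "anana read"  */
run;
```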

4.1 Removing Digits

For example, if you want to remove all digits from a string:

data _null_;
  original = "Sales in 2023 increased by 15%";
  remove_digits = compress(original, , 'D');
  put remove_digits=;
  /* Result: "Sales in  increased by %" (two blanks remain where "2023" was) */
run;

Notice that only the digits are removed; spaces and punctuation stay in place.

4.2 Removing Punctuation

Removing punctuation is equally straightforward:

data _null_;
  original = "Hello, World! 2025.";
  no_punct = compress(original, , 'P');
  put no_punct=;
  /* Result: "Hello World 2025" */
run;

4.3 Combining Modifiers

You can stack multiple modifiers together. For instance, to remove both digits and punctuation:

data _null_;
  original = "Item #123, Price: $45.67";
  remove_digits_punct = compress(original, , 'DP');
  put remove_digits_punct=;
  /* Result: "Item  Price " (the blanks are untouched, so two remain after "Item") */
run;

4.4 Using the "Keep" Modifier (K)

Instead of specifying which characters to remove, you can flip the logic and tell SAS which characters to keep using K. For example, to keep only digits:

data _null_;
  original = "Item #123, Price: $45.67";
  keep_digits = compress(original, '0123456789', 'K');
  put keep_digits=;
  /* Result: "1234567" */
run;

Alternatively, combine K with D to shorten your code:

data _null_;
  original = "Item #123, Price: $45.67";
  keep_digits = compress(original, , 'KD');
  put keep_digits=;
  /* Result: "1234567" */
run;

5. Practical Scenarios

  1. Email Cleaning: If you need to remove all punctuation (except “@” and “.”) from an email field, you could selectively keep only those symbols, letters, and digits.
  2. Financial Data: Stripping out currency symbols and punctuation from a price field so you can convert it into numeric form for calculations.
  3. Text Mining: Removing digits or punctuation from survey responses to focus on words alone.
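
The first two scenarios could be sketched roughly as follows. The sample values and variable names are invented for illustration, and the email rule shown here is deliberately simple: it would also drop characters such as hyphens or underscores, which real addresses may contain.

```sas
data _null_;
  /* Hypothetical sample values for illustration only */
  email_raw = "jane.doe!#@example.com,";
  price_raw = "$1,234.56";

  /* Email: keep (k) letters (a) and digits (d), plus '@' and '.' */
  email_clean = compress(email_raw, '@.', 'kad');

  /* Price: keep digits and the decimal point, then convert to numeric */
  price_num = input(compress(price_raw, '.', 'kd'), 32.);

  put email_clean=;   /* Result: "jane.doe@example.com" */
  put price_num=;     /* Result: 1234.56 */
run;
```

For the price field, keeping the decimal point in the second argument matters: with 'kd' alone the value would collapse to 123456. If negative amounts are possible, the minus sign would need to be added to the keep list as well.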

6. Performance Considerations

While COMPRESS is handy, be mindful of its usage on extremely large datasets or within tight loops, as repeated calls can be computationally expensive. It’s still typically faster than manually parsing strings, but always weigh whether you really need to remove these characters or if you can handle them with custom formats or other string functions.

7. Putting it All Together

Here’s a quick snippet that removes punctuation and digits, and trims trailing blanks from the input, all in a single call:

data _null_;
  original_str = " Hello, SAS 2025! ";
  /* 'P' removes punctuation
     'D' removes digits
     't' trims trailing blanks from the source before processing */
  cleaned_str = compress(original_str, , 'PDt');
  put cleaned_str=;
  /* Conceptually, step by step:
     1) Trim trailing blanks => " Hello, SAS 2025!"
     2) Remove punctuation   => " Hello SAS 2025"
     3) Remove digits        => " Hello SAS "
     The leading blank and the blank that preceded "2025" are untouched. */
run;

Notice how a simple combination of modifiers handles several clean-up tasks in a single function call. Leading and embedded blanks are left alone here; add the S modifier if you want those removed as well.

Final Thoughts

Whether you're massaging marketing data, cleaning up survey responses, or extracting numeric values from text-heavy fields, the SAS COMPRESS function has you covered. With its powerful modifiers and flexible syntax, it saves both time and effort, leaving you more space to focus on the analytical heavy lifting. Give it a try in your next data-cleaning project—you might be surprised at how much cleaner your logs (and your data) become!

Posted by StudySAS on studysas.blogpost.com
