Ensuring Data Quality with SAS: Checking for Non-ASCII Characters
Author: Sarath Annapareddy
Date: September 2, 2024
Introduction
In the world of data processing, ensuring the integrity and cleanliness of your datasets is paramount. One of the common issues that data engineers and analysts face is the presence of non-ASCII or non-printable characters within their datasets. These characters can cause a wide range of problems, from data corruption to failures in downstream processing and reporting. To address this, I developed a SAS program named check_non_ascii.sas
, which not only identifies these problematic characters but also streamlines the data by removing any completely blank variables.
The Problem with Non-ASCII Characters
ASCII (American Standard Code for Information Interchange) is a character encoding standard used to represent text in computers and other devices. While ASCII characters are universally understood and processed, non-ASCII characters can lead to significant issues, especially when data needs to be transferred between systems or when it’s used in different applications. Non-ASCII characters include special symbols, accented characters, or any characters with a code point above 127.
These characters might be introduced into your data through various means such as manual data entry errors, system incompatibilities, or even due to copy-paste operations from external sources. If left unchecked, they can lead to:
- Data processing errors and job failures.
- Incorrect data interpretation in reporting tools.
- Challenges in data migration between different systems.
- Complications in data analysis and interpretation.
The check_non_ascii.sas
Program
The check_non_ascii.sas
program is a powerful tool designed to identify and report instances of non-ASCII characters in any character variable across all datasets within a specified SAS library. Additionally, it enhances the quality of the output by automatically dropping any variables that are completely blank.
Program Code
Below is the complete SAS program code for check_non_ascii.sas
:
/*****************************************************************************************
Program Name : check_non_ascii.sas
Author : Sarath Annapareddy
Creation Date : 02SEP2024
SAS Version : 9.4
Platform : UNIX/Linux
Purpose : This program checks for the presence of non-ASCII or non-printable characters
in all character variables across all datasets in a specified library.
The program also drops variables that are completely blank in the resulting
output dataset.
Input : - A SAS library containing the datasets to be checked.
Output : - A consolidated dataset (`check_non_ascii`) that contains any instances
of non-ASCII or non-printable characters, along with the dataset name,
variable name, invalid character, and position.
- Variables that are completely blank are automatically dropped from
the output dataset.
Special Notes : - Ensure that the library name (`your_libname`) is updated to the correct library
reference before executing the program.
- This program assumes that all datasets in the specified library are to be checked.
Modification History:
---------------------------------------------------------------------------------------------------
| Date | Author | Description |
|------------|---------|--------------------------------------------------------------------------|
| 02SEP2024 | Sarath | Initial version |
| | | |
---------------------------------------------------------------------------------------------------
*****************************************************************************************/
%let libname = Yourlib; /* Replace with your library name */
%let output_dataset = check_non_ascii; /* Name of the output dataset */
/* Step 1: Get a list of all datasets in the library */
proc sql noprint;
select memname into :dataset_list separated by ' '
from sashelp.vtable
where libname = upcase("&libname") and memtype='DATA';
quit;
/* Step 2: Macro to check each dataset */
%macro check_non_ascii;
%do i = 1 %to %sysfunc(countw(&dataset_list));
%let dataset = %scan(&dataset_list, &i);
/* Step 3: Create a dataset to capture non-ASCII characters */
data &output_dataset._&dataset;
set &libname..&dataset;
array char_vars _character_;
length dataset $32 varname $32 invalid_char $1 position 8;
do i = 1 to dim(char_vars);
varname = vname(char_vars[i]);
do position = 1 to length(char_vars[i]);
invalid_char = substr(char_vars[i], position, 1);
if rank(invalid_char) > 127 or rank(invalid_char) = 0 then do;
dataset = "&dataset";
output;
end;
end;
end;
drop i;
run;
%end;
/* Step 4: Combine results into one dataset */
data &output_dataset;
set
%do i = 1 %to %sysfunc(countw(&dataset_list));
&output_dataset._%scan(&dataset_list, &i)
%end;
;
run;
/* Step 5: Drop completely blank variables */
proc transpose data=&output_dataset out=_transposed_data(drop=_name_);
var _all_;
run;
proc sql noprint;
select _name_ into :drop_list separated by ' '
from (
select _name_, sum(col1 ^= '') as not_blank_count
from _transposed_data
group by _name_
having not_blank_count = 0
);
quit;
%if &sqlobs > 0 %then %do;
data &output_dataset;
set &output_dataset;
drop &drop_list;
run;
%end;
/* Step 6: Clean up temporary datasets */
proc datasets library=work nolist;
delete _transposed_data
%do i = 1 %to %sysfunc(countw(&dataset_list));
&output_dataset._%scan(&dataset_list, &i)
%end;
;
quit;
%mend check_non_ascii;
/* Step 7: Execute the macro */
%check_non_ascii;
/* Step 8: Delete completely blank variables */
%macro delmissvars(in=, out=);
/* Step 1: Extract all variable names of the input data set */
proc contents data=&in out=_temp(keep=name) noprint;
run;
/* Step 2: Create a list of variable names */
data _null_;
set _temp;
call symputx(cats('var', _n_), name, 'L');
call symputx('n_var', _n_, 'L');
run;
/* Initialize drop variable list */
%let drop=;
/* Step 3: Loop over each variable to check for missing values */
%do i=1 %to &n_var;
/* Step 4: Use PROC SQL to count non-missing values for the variable */
proc sql noprint;
select count(*) into: num_&i from &in where &&var&i is not missing;
quit;
/* Step 5: If all values are missing, add the variable to the drop list */
%if &&num_&i = 0 %then %do;
%let drop = &drop &&var&i;
%end;
%end;
/* Step 6: Drop variables that are completely missing */
data &out;
set ∈
drop &drop;
run;
/* Step 7: Report the result */
%if %superq(drop) = %str() %then %do;
%put No variables were deleted from dataset ∈
%end;
%else %do;
%put Variables were dropped: &drop from dataset ∈
%end;
%mend delmissvars;
%delmissvars(in=check_non_ascii, out=check_non_ascii);
Key Features
- Comprehensive Search: The program scans all datasets within the specified library, ensuring no dataset is overlooked.
- Detailed Reporting: For each non-ASCII character found, the program records the dataset name, variable name, the invalid character, and its position within the string.
- Data Cleanup: The program automatically drops any variables that are completely blank in the resulting dataset, helping to streamline the data for further processing.
- Easy to Use: The program is straightforward to implement, requiring only the