Discover More Tips and Techniques on This Blog

Ensuring Data Quality with SAS: Checking for Non-ASCII Characters

Ensuring Data Quality with SAS: Checking for Non-ASCII Characters

Author: Sarath Annapareddy

Date: September 2, 2024

Introduction

In the world of data processing, ensuring the integrity and cleanliness of your datasets is paramount. One of the common issues that data engineers and analysts face is the presence of non-ASCII or non-printable characters within their datasets. These characters can cause a wide range of problems, from data corruption to failures in downstream processing and reporting. To address this, I developed a SAS program named check_non_ascii.sas, which not only identifies these problematic characters but also streamlines the data by removing any completely blank variables.

The Problem with Non-ASCII Characters

ASCII (American Standard Code for Information Interchange) is a character encoding standard used to represent text in computers and other devices. While ASCII characters are universally understood and processed, non-ASCII characters can lead to significant issues, especially when data needs to be transferred between systems or when it’s used in different applications. Non-ASCII characters include special symbols, accented characters, or any characters with a code point above 127.

These characters might be introduced into your data through various means such as manual data entry errors, system incompatibilities, or even due to copy-paste operations from external sources. If left unchecked, they can lead to:

  • Data processing errors and job failures.
  • Incorrect data interpretation in reporting tools.
  • Challenges in data migration between different systems.
  • Complications in data analysis and interpretation.

The check_non_ascii.sas Program

The check_non_ascii.sas program is a powerful tool designed to identify and report instances of non-ASCII characters in any character variable across all datasets within a specified SAS library. Additionally, it enhances the quality of the output by automatically dropping any variables that are completely blank.

Program Code

Below is the complete SAS program code for check_non_ascii.sas:


/*****************************************************************************************
Program Name  : check_non_ascii.sas
Author        : Sarath Annapareddy
Creation Date : 02SEP2024
SAS Version   : 9.4
Platform      : UNIX/Linux

Purpose       : This program checks for the presence of non-ASCII or non-printable characters 
                in all character variables across all datasets in a specified library. 
                The program also drops variables that are completely blank in the resulting 
                output dataset.

Input         : - A SAS library containing the datasets to be checked.

Output        : - A consolidated dataset (`check_non_ascii`) that contains any instances 
                  of non-ASCII or non-printable characters, along with the dataset name, 
                  variable name, invalid character, and position.
                - Variables that are completely blank are automatically dropped from 
                  the output dataset.

Special Notes : - Ensure that the library name (`your_libname`) is updated to the correct library 
                  reference before executing the program.
                - This program assumes that all datasets in the specified library are to be checked.

Modification History:
---------------------------------------------------------------------------------------------------
| Date       | Author  | Description                                                              |
|------------|---------|--------------------------------------------------------------------------|
| 02SEP2024  | Sarath  | Initial version                                                          |
|            |         |                                                                          |
---------------------------------------------------------------------------------------------------

*****************************************************************************************/

%let libname = Yourlib; /* Replace with your library name */
%let output_dataset = check_non_ascii; /* Name of the output dataset */

/* Step 1: Get a list of all datasets in the library */
proc sql noprint;
    select memname into :dataset_list separated by ' '
    from sashelp.vtable
    where libname = upcase("&libname") and memtype='DATA';
quit;

/* Step 2: Macro to check each dataset */
%macro check_non_ascii;
    %do i = 1 %to %sysfunc(countw(&dataset_list));
        %let dataset = %scan(&dataset_list, &i);

        /* Step 3: Create a dataset to capture non-ASCII characters */
        data &output_dataset._&dataset;
            set &libname..&dataset;
            array char_vars _character_;

            length dataset $32 varname $32 invalid_char $1 position 8;

            do i = 1 to dim(char_vars);
                varname = vname(char_vars[i]);
                do position = 1 to length(char_vars[i]);
                    invalid_char = substr(char_vars[i], position, 1);
                    if rank(invalid_char) > 127 or rank(invalid_char) = 0 then do;
                        dataset = "&dataset";
                        output;
                    end;
                end;
            end;
            drop i;
        run;

    %end;

    /* Step 4: Combine results into one dataset */
    data &output_dataset;
        set
        %do i = 1 %to %sysfunc(countw(&dataset_list));
            &output_dataset._%scan(&dataset_list, &i)
        %end;
        ;
    run;

    /* Step 5: Drop completely blank variables */
    proc transpose data=&output_dataset out=_transposed_data(drop=_name_);
        var _all_;
    run;

    proc sql noprint;
        select _name_ into :drop_list separated by ' '
        from (
            select _name_, sum(col1 ^= '') as not_blank_count
            from _transposed_data
            group by _name_
            having not_blank_count = 0
        );
    quit;

    %if &sqlobs > 0 %then %do;
        data &output_dataset;
            set &output_dataset;
            drop &drop_list;
        run;
    %end;

    /* Step 6: Clean up temporary datasets */
    proc datasets library=work nolist;
        delete _transposed_data 
        %do i = 1 %to %sysfunc(countw(&dataset_list));
            &output_dataset._%scan(&dataset_list, &i)
        %end;
        ;
    quit;

%mend check_non_ascii;

/* Step 7: Execute the macro */
%check_non_ascii;

/* Step 8: Delete completely blank variables */
%macro delmissvars(in=, out=);

    /* Step 1: Extract all variable names of the input data set */
    proc contents data=&in out=_temp(keep=name) noprint;
    run;

    /* Step 2: Create a list of variable names */
    data _null_;
        set _temp;
        call symputx(cats('var', _n_), name, 'L');
        call symputx('n_var', _n_, 'L');
    run;

    /* Initialize drop variable list */
    %let drop=;

    /* Step 3: Loop over each variable to check for missing values */
    %do i=1 %to &n_var;

        /* Step 4: Use PROC SQL to count non-missing values for the variable */
        proc sql noprint;
            select count(*) into: num_&i from &in where &&var&i is not missing;
        quit;

        /* Step 5: If all values are missing, add the variable to the drop list */
        %if &&num_&i = 0 %then %do;
            %let drop = &drop &&var&i;
        %end;

    %end;

    /* Step 6: Drop variables that are completely missing */
    data &out;
        set ∈
        drop &drop;
    run;

    /* Step 7: Report the result */
    %if %superq(drop) = %str() %then %do;
        %put No variables were deleted from dataset ∈
    %end;
    %else %do;
        %put Variables were dropped: &drop from dataset ∈
    %end;

%mend delmissvars;

%delmissvars(in=check_non_ascii, out=check_non_ascii);
    

Key Features

  • Comprehensive Search: The program scans all datasets within the specified library, ensuring no dataset is overlooked.
  • Detailed Reporting: For each non-ASCII character found, the program records the dataset name, variable name, the invalid character, and its position within the string.
  • Data Cleanup: The program automatically drops any variables that are completely blank in the resulting dataset, helping to streamline the data for further processing.
  • Easy to Use: The program is straightforward to implement, requiring only the

Disclosure:

In the spirit of transparency and innovation, I want to share that some of the content on this blog is generated with the assistance of ChatGPT, an AI language model developed by OpenAI. While I use this tool to help brainstorm ideas and draft content, every post is carefully reviewed, edited, and personalized by me to ensure it aligns with my voice, values, and the needs of my readers. My goal is to provide you with accurate, valuable, and engaging content, and I believe that using AI as a creative aid helps achieve that. If you have any questions or feedback about this approach, feel free to reach out. Your trust and satisfaction are my top priorities.