Detecting and Removing Non-Printable Characters in AETERM/MHTERM for SDTM Datasets

Managing text variables in SDTM domains such as AETERM (for Adverse Events) or MHTERM (for Medical History) can be challenging when non-printable (hidden) characters sneak in. These characters often arise from external data sources, copy-pasting from emails, encoding mismatches, or raw text that includes ASCII control characters. In this post, we’ll explore methods to detect and remove these problematic characters to ensure your SDTM datasets are submission-ready.

1. Identifying Non-Printable Characters

Non-printable characters generally fall within the ASCII “control” range:

Hex range: 00–1F and 7F
Decimal range: 0–31 and 127

In SAS, you can detect these characters by examining their ASCII values using RANK(), or by leveraging built-in functions like ANYCNTRL(). Below is an example snippet that loops through the first 100 observations of AETERM, logs the position of any non-printable character, and displays its ASCII rank:


data check_chars;
   set yourlib.ae (obs=100);

   /* For demonstration, adjust these lengths to fit your actual data. */
   length test_char $1 non_print_char_flag $200;

   do i = 1 to length(aeterm);
      test_char = substr(aeterm, i, 1);

      /* Check for non-printable ASCII control characters (0–31, 127) */
      if rank(test_char) < 32 or rank(test_char) = 127 then do;

         /* Build a single message string */
         non_print_char_flag = catx(' ',
            'Non-printable found in USUBJID=', usubjid,
            'at position=', put(i, best.),
            'character (hex)=', put(test_char, $hex2.),
            'rank=', put(rank(test_char), best.)
         );

         /* Write the message string to the SAS log */
         put non_print_char_flag;
      end;
   end;
run;
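
If you only need to know which records are affected, ANYCNTRL() can do the scan without a character-by-character loop. The step below is a minimal sketch under the same assumptions as the example above (a yourlib.ae dataset containing USUBJID and AETERM); the output dataset name and the first_ctrl_pos variable are made up for illustration.

```sas
data flagged;
   set yourlib.ae;

   /* ANYCNTRL returns the position of the first control character, or 0 if none is found */
   first_ctrl_pos = anycntrl(aeterm);

   /* Keep only the records that contain at least one control character */
   if first_ctrl_pos > 0;

   keep usubjid aeterm first_ctrl_pos;
run;
```

An empty FLAGGED dataset suggests AETERM is free of control characters; otherwise the listed records can be inspected with the loop shown earlier.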

2. Removing Non-Printable Characters

Once you confirm non-printable characters are present, you can remove them in various ways. Below are three common approaches:

A. Using COMPRESS with Character Classes

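The simplest way is to use the COMPRESS function with the 'c' modifier, which removes control characters (ASCII 0–31, 127). A stricter alternative is the 'kw' modifier combination, which keeps only printable characters and may therefore drop more than the control range, depending on the session encoding:


data clean;
  set yourlib.ae;
  /* 'c' removes control characters (ASCII 0–31, 127) */
  aeterm_clean = compress(aeterm, , 'c');
  /* Alternative: keep only printable characters */
  /* aeterm_clean = compress(aeterm, , 'kw'); */
run;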
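
Before running COMPRESS on real data, it can help to try the modifiers on a value you control. The sketch below builds a hypothetical test string with an embedded tab and carriage return (written as hex literals) and writes both results to the log; the variable names are illustrative only.

```sas
data _null_;
  length raw clean_c clean_kw $40;

  /* Hypothetical test value: 'HEAD', a tab, 'ACHE', then a carriage return */
  raw = 'HEAD' || '09'x || 'ACHE' || '0D'x;

  clean_c  = compress(raw, , 'c');    /* remove control characters */
  clean_kw = compress(raw, , 'kw');   /* keep only printable characters */

  /* Write both cleaned values to the SAS log for comparison */
  put clean_c= clean_kw=;
run;
```

In this case both calls should leave HEADACHE, which also shows that removal joins text that was separated only by a control character. If that is not what you want, replace the character with a blank instead of deleting it.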

B. Using a Perl Regular Expression (PRXCHANGE)

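A more targeted approach uses PRXPARSE and PRXCHANGE. For instance, the following regex removes control characters in the ranges 00–08, 0B, 0C, 0E–1F, and 7F, leaving tab (09), line feed (0A), and carriage return (0D) in place so they can be handled separately: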


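data clean;
  set yourlib.ae;
  retain re_removeControls;
  /* Compile the regex once, on the first iteration */
  if _n_ = 1 then
    re_removeControls = prxparse('s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+//o');
  /* Remove ASCII 00–08, 0B, 0C, 0E–1F, and 7F; -1 replaces every match */
  aeterm_clean = prxchange(re_removeControls, -1, aeterm);
  drop re_removeControls;
run;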

C. Using TRANWRD Iteratively

For legacy or very narrow use cases, you might remove characters with multiple TRANWRD() calls. However, this approach quickly becomes cumbersome if many different ASCII control characters need to be removed.
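
As a rough illustration, the step below handles just three characters (tab, line feed, carriage return), one TRANWRD() call each. It is a sketch that assumes the same yourlib.ae input as the earlier examples and a 200-character AETERM. Note that TRANWRD substitutes a single blank when the replacement is an empty string, so it suits replacing characters rather than deleting them.

```sas
data clean;
  set yourlib.ae;
  length aeterm_clean $200;   /* assumed length; adjust to your data */

  /* One call per target character: replace each with a blank */
  aeterm_clean = tranwrd(aeterm,       '09'x, ' ');   /* tab */
  aeterm_clean = tranwrd(aeterm_clean, '0A'x, ' ');   /* line feed */
  aeterm_clean = tranwrd(aeterm_clean, '0D'x, ' ');   /* carriage return */

  /* Trim the ends and collapse any runs of blanks left behind */
  aeterm_clean = compbl(strip(aeterm_clean));
run;
```

Covering the full control range this way would take more than thirty calls, which is why COMPRESS or PRXCHANGE is usually the better fit.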

3. Incorporating Into SDTM Mapping Programs

Typically, these solutions are applied during the data transformation from raw data to final SDTM domains. For instance, in creating your AE domain:


data sdtm.ae;
  set raw.ae;

  /* Remove non-printable characters from AETERM */
  AETERM = compress(AETERM, , 'c'); 

  /* Additional mappings and derivations here */

run;

You can do the same in other domains (e.g., MH, CM) for consistent data cleaning.
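
One way to keep the cleaning identical across domains is to wrap it in a small macro. The sketch below is only an illustration: the macro name %strip_ctrl is hypothetical, and it assumes a raw.mh dataset containing MHTERM alongside the raw and sdtm librefs used above.

```sas
/* Hypothetical helper: remove control characters from one variable,
   then trim the ends and collapse repeated blanks */
%macro strip_ctrl(var);
  &var = compbl(strip(compress(&var, , 'c')));
%mend strip_ctrl;

data sdtm.mh;
  set raw.mh;   /* assumed raw dataset name */

  /* Same cleaning logic as used for AETERM */
  %strip_ctrl(MHTERM)

  /* Additional mappings and derivations here */

run;
```

Calling the same macro in each mapping program means a later change to the cleaning rule only has to be made in one place.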

4. Additional Tips

Strip leading/trailing spaces: After removing hidden characters, consider using STRIP() so that no unintended leading or trailing blanks remain. LEFT() and RIGHT() only re-align the value; they do not remove blanks.
Compress multiple blanks: If control character removal results in extra spaces, COMPBL() can reduce multiple blanks to a single space.
Document your approach: Reviewers may expect justification that data cleaning preserves the meaning of reported terms. Keep clear records of any cleaning steps performed.
Use consistently: Apply the same cleaning methodology across all relevant domains to avoid inconsistencies.
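
To support the documentation tip, you could capture every record whose term changes during cleaning. The step below is a sketch that assumes the raw.ae dataset from the mapping example; the dataset name cleaning_log and the variable aeterm_clean are illustrative.

```sas
data cleaning_log;
  set raw.ae;
  length aeterm_clean $200;   /* assumed length; adjust to your data */

  /* Apply the cleaning steps: remove control characters, trim, collapse blanks */
  aeterm_clean = compbl(strip(compress(aeterm, , 'c')));

  /* Keep only records where cleaning changed the reported term */
  if aeterm_clean ne aeterm;

  keep usubjid aeterm aeterm_clean;
run;
```

The resulting dataset gives a before-and-after listing that can be reviewed and filed with the study documentation.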

By following these steps, you’ll ensure cleaner, more compliant SDTM datasets, minimize the risk of downstream submission issues, and maintain higher data quality for your clinical studies.

Posted by StudySAS on studysas.blogpost.com