Solving Non-Printable Characters in AETERM/MHTERM for SDTM Datasets
Managing text variables in SDTM domains such as AETERM (for Adverse Events) or MHTERM (for Medical History) can be challenging when non-printable (hidden) characters sneak in.
These characters often arise from external data sources, copy-pasting from emails, encoding mismatches, or raw text
that includes ASCII control characters. In this post, we’ll explore methods to detect and remove
these problematic characters to ensure your SDTM datasets are submission-ready.
1. Identifying Non-Printable Characters
Non-printable characters generally fall within the ASCII “control” range:
- Hex range: 00–1F and 7F
- Decimal range: 0–31 and 127
In SAS, you can detect these characters by examining their ASCII values using RANK(), or by leveraging
built-in functions like ANYCNTRL(). Below is an example snippet that loops through the first 100
observations of AETERM, logs the position of any non-printable character, and displays its ASCII rank:
data check_chars;
    set yourlib.ae (obs=100);
    /* For demonstration, adjust these lengths to fit your actual data. */
    length test_char $1 non_print_char_flag $200;
    do i = 1 to length(aeterm);
        test_char = substr(aeterm, i, 1);
        /* Check for non-printable ASCII control characters (0–31, 127) */
        if rank(test_char) < 32 or rank(test_char) = 127 then do;
            /* Build a single message string */
            non_print_char_flag = catx(' ',
                'Non-printable found in USUBJID=', usubjid,
                'at position=', put(i, best.),
                'character=', test_char,
                'rank=', put(rank(test_char), best.)
            );
            /* Write the message string to the SAS log */
            put non_print_char_flag;
        end;
    end;
run;
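The ANYCNTRL() function mentioned above can do the same check without a character-by-character loop. The following is a minimal sketch, not a definitive implementation: the output dataset name flagged and the variables ctrl_pos and ctrl_hex are hypothetical, and it reports only the first control character in each term.

```sas
data flagged;
    set yourlib.ae;
    /* ANYCNTRL returns the position of the first control character, or 0 if none */
    ctrl_pos = anycntrl(aeterm);
    /* Keep only the records that contain a control character */
    if ctrl_pos > 0;
    /* Show the offending byte as two hex digits so it is readable in output */
    ctrl_hex = put(substr(aeterm, ctrl_pos, 1), $hex2.);
    keep usubjid aeterm ctrl_pos ctrl_hex;
run;
```

Reviewing flagged before cleaning gives a quick count of affected records and shows which control characters actually occur in the data.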
2. Removing Non-Printable Characters
Once you confirm non-printable characters are present, you can remove them in various ways. Below are three common approaches:
A. Using COMPRESS with Character Classes
The simplest way is to use the COMPRESS function with the 'c' modifier, which removes
control characters (ASCII 0–31, 127):
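data clean;
    set yourlib.ae;
    /* 'c' adds control characters (ASCII 0–31, 127) to the list of characters to remove */
    aeterm_clean = compress(aeterm, , 'c');
    /* Alternative: 'kw' keeps only printable ("writable") characters */
    /* aeterm_clean = compress(aeterm, , 'kw'); */
run;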
B. Using a Perl Regular Expression (PRXCHANGE)
A more targeted approach uses PRXPARSE and PRXCHANGE. For instance, the following
regex removes control characters in the ranges 00–08, 0B, 0C, 0E–1F, and 7F,
leaving tab (09), line feed (0A), and carriage return (0D) untouched:
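data clean;
    set yourlib.ae;
    length aeterm_clean $200; /* adjust to fit your actual data */
    /* RETAIN cannot take a function call, so compile the pattern once on the first row */
    retain re_removeControls;
    if _n_ = 1 then
        re_removeControls = prxparse('s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+//o');
    /* Remove ASCII 00–08, 0B, 0C, 0E–1F, and 7F; -1 replaces every match */
    aeterm_clean = prxchange(re_removeControls, -1, aeterm);
    drop re_removeControls;
run;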
C. Using TRANWRD Iteratively
For legacy or very narrow use cases, you might remove characters with multiple TRANWRD() calls.
However, this approach quickly becomes cumbersome if many different ASCII control characters need to be removed.
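A minimal sketch of the TRANWRD approach, assuming only tab, line feed, and carriage return need handling (that choice of three characters and the $200 length are illustrative assumptions). Each call substitutes a blank rather than deleting the character, so COMPBL() is applied afterwards to collapse any runs of blanks.

```sas
data clean;
    set yourlib.ae;
    length aeterm_clean $200; /* assumed length; adjust to fit your data */
    /* One TRANWRD call per control character, each replaced with a blank */
    aeterm_clean = tranwrd(aeterm, '09'x, ' ');        /* tab */
    aeterm_clean = tranwrd(aeterm_clean, '0A'x, ' ');  /* line feed */
    aeterm_clean = tranwrd(aeterm_clean, '0D'x, ' ');  /* carriage return */
    /* Collapse consecutive blanks left behind by the replacements */
    aeterm_clean = compbl(aeterm_clean);
run;
```

Covering the full control range this way would take more than thirty calls, which is why the COMPRESS or PRXCHANGE approaches are usually preferable.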
3. Incorporating Into SDTM Mapping Programs
Typically, these solutions are applied during the data transformation from raw data to final SDTM domains.
For instance, in creating your AE domain:
data sdtm.ae;
    set raw.ae;
    /* Remove non-printable characters from AETERM */
    AETERM = compress(AETERM, , 'c');
    /* Additional mappings and derivations here */
run;
You can do the same in other domains (e.g., MH, CM) for consistent data cleaning.
4. Additional Tips
- Strip leading/trailing spaces: After removing hidden characters, consider using STRIP() (or LEFT(), which moves leading blanks to the end of the value) to ensure no unintended spaces remain.
- Compress multiple blanks: If control character removal results in extra spaces, COMPBL() can reduce multiple blanks to a single space.
- Document your approach: Regulatory bodies often require justification that data cleaning preserves the meaning of reported terms. Keep clear records of any cleaning steps performed.
- Use consistently: Apply the same cleaning methodology across all relevant domains to avoid inconsistencies.
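The first two tips can be combined with the COMPRESS step in a single assignment. This is a sketch under stated assumptions rather than a prescribed method: the $200 length is an assumption, and the same pattern would apply to MHTERM or other free-text variables.

```sas
data clean;
    set yourlib.ae;
    length aeterm_clean $200; /* assumed length; adjust to fit your data */
    /* 1. Remove control characters
       2. Collapse runs of blanks to a single blank
       3. Strip leading and trailing blanks */
    aeterm_clean = strip(compbl(compress(aeterm, , 'c')));
run;
```

Comparing aeterm with aeterm_clean for the changed records is one way to produce the documentation that the third tip calls for.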
By following these steps, you’ll ensure cleaner, more compliant SDTM datasets, minimize the risk of downstream submission issues, and maintain higher data quality for your clinical studies.
Posted by StudySAS on studysas.blogpost.com