Friday, February 24, 2012

Transcoding Problem: Option (correctencoding=wlatin1)

Have you ever tried to convert the default encoding to Wlatin1 (Windows SAS Session Encoding)?

Let me tell you the story behind writing this post….

Today I was asked to send SAS datasets to one of the client. I transferred the SAS datasets to the client and immediately after, I got an email from the so called client saying the encoding of SAS datasets is different this time when compared with the last transfer. He said It’s causing problems in Proc compare process.

Opps… bummer…. Client’s email got me little worried ...
I checked the Proc contents details and saw the change in the encoding. I investigated the issue and found out that Unicode SAS with UTF-8 encoding uses 1 to 4 bytes to handle Unicode data. It is possible for the 8-bit characters to be expanded by 2 to 3 bytes when transcoding occurs, which causes the truncation error.

Because of the truncation problem I was asked to change the unicoding back to WLATIN1 so that the character data present in the SAS datasets represents the US and Europe characters in windows.

Here is the code to do that.

proc datasets lib=SDTM;
modify supplb/correctencoding=wlatin1;

Unicode Basics:

Unicode is the universal character encoding that supports the interchange, processing, and display of
characters and symbols found in the world’s writing systems. Other character encodings are limited to
subsets of all languages. 

Transcoding problem:

Currently, SAS/ACCESS Interface to ODBC cannot support Unicode. DBCS data and single-byte
non-ASCII characters cannot be correctly processed.

In a SAS UTF-8 session, data imported appear as question marks. Although the encoding property of
imported data is UTF-8, the real encoding of the data is not changed to UTF-8.


/* Set the correct encoding to ‘WLATIN1’ */
proc datasets lib=stagings;
modify custtype_w1/correctencoding=wlatin1;

/* Transcode data from ‘Wlatin1’ to ‘UTF-8’ */
data stagings.custtype_u8;
set stagings.custtype_w1;