Character Encoding, Japanese Text, and Why Your SDTM Package Can Fail Even When the Data Logic Is Fine
Clean derivations and a valid define.xml are not enough if the transport layer, XML encoding, and metadata pipeline are not controlled end to end.
Your SDTM derivations are correct.
Your P21 run is clean.
Your define.xml opens and looks fine.
And yet, the package still trips up in review.
Not because of the data.
Because of encoding.
What PMDA expects, and why it trips teams
PMDA’s Technical Conformance Guide states that if languages other than English are used, including Japanese, the character set and encoding scheme must be documented in the reviewer’s guide.
Source: PMDA Technical Conformance Guide on Electronic Study Data Submissions, April 2024
This is not a footnote. It shows up in real submissions when XML, metadata, or reviewer tools fail to render text consistently.
The key misunderstanding
Many teams assume the requirement is simply "everything must be ASCII" or, at the other extreme, "UTF-8 everywhere, including dataset content."
The actual expectation is:
- Use Unicode, typically UTF-8, as the working encoding
- Restrict dataset content to ASCII-compatible characters where required
These are not the same thing.
The XPT format limitation that drives the whole problem
SDTM datasets are submitted in SAS Transport v5 (XPT) format.
That format:
- was designed for US-ASCII exchange
- has no encoding metadata in the file header
- does not tell the receiver what encoding was used
When SAS opens an XPT file created in a different encoding, it attempts transcoding.
That can result in silent character substitution, truncation, or a log message along the lines of "Some character data was lost during transcoding."
That message does not identify the variable or the observation.
Your data can be corrupted silently.
Byte limits, not character limits
XPT constraints are in bytes, not characters.
For UTF-8 encoded Japanese text, one character typically uses about 3 bytes.
- 40-byte label → about 13 Japanese characters
- 200-byte variable → about 66 Japanese characters
This affects:
- variable lengths
- labels
- define.xml metadata
- macro logic that assumes character counts
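The byte-versus-character distinction is easy to see directly in a UTF-8 SAS session, where LENGTH counts bytes and KLENGTH counts characters. A minimal sketch (the sample text is illustrative):

```sas
/* Run in a UTF-8 SAS session (ENCODING=UTF-8).             */
/* LENGTH counts bytes; KLENGTH counts characters, so the   */
/* two diverge as soon as multi-byte Japanese text appears. */
data _null_;
  text  = '有害事象';          /* 4 Japanese characters          */
  bytes = length(text);        /* 12 bytes in UTF-8 (3 per char) */
  chars = klength(text);       /* 4 characters                   */
  put bytes= chars=;
run;
```

Any length check or truncation logic written against LENGTH alone will therefore pass English text and fail Japanese text at roughly one third of the expected character count.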
Where encoding actually breaks
Encoding problems do not usually show up during mapping. They surface later:
- XML rendering
- stylesheet loading
- reviewer-side tools
- XPT read and write
Common failure patterns include:
- define.xml renders incorrectly
- XML parsing fails
- dataset comments become unreadable
- XPT import causes truncation
Where the cracks really come from
SAS session encoding
If SAS is not UTF-8:
- transcoding occurs
- data may be altered silently
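Confirming the session encoding before any dataset or XML work takes one line; a quick sketch:

```sas
/* Print the current session encoding to the log.            */
/* Expect UTF-8 for submission work; anything else (e.g.     */
/* WLATIN1 or SHIFT-JIS) means transcoding can occur on read. */
%put Session encoding: %sysfunc(getoption(ENCODING));
proc options option=encoding;
run;
```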
XML generation
XML requires strict consistency between the encoding declared in the file and the bytes actually on disk.
If the declared encoding does not match the actual encoding, parsing issues follow.
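Concretely, the declaration on the first line of define.xml is a promise about how the rest of the file is encoded:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- If this file is later re-saved as Shift-JIS or Windows-1252,
     the declaration above no longer matches the bytes on disk,
     and parsers or reviewer tools may fail or render garbage. -->
```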
External tools
Excel, Notepad, and XML editors often change encoding silently or introduce hidden characters.
Manual edits
Opening and saving XML manually can change encoding without warning.
Copy and paste risk
Copying from Word or email can introduce hidden non-ASCII characters.
CPORT is not XPT
Do not use PROC CPORT for submission datasets.
Even if the file extension is .xpt, CPORT does not create XPT v5 format.
Source: Pinnacle 21 Help Center, PMDA Engine Update 2211.0
How to generate XPT correctly
Use the LIBNAME XPORT engine:
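A minimal sketch, assuming a library `sdtm` that holds the final dataset (library names and the output path are illustrative):

```sas
/* Write SDTM DM to SAS Transport v5 via the XPORT engine. */
/* Unlike PROC CPORT, this produces the XPT v5 format that */
/* the agencies expect.                                    */
libname xptout xport 'dm.xpt';

data xptout.dm;
  set sdtm.dm;
run;

libname xptout clear;
```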
Why UTF-8 is the correct approach
UTF-8 is not just about consistency. It matches the rest of the submission pipeline:
- EDC systems are typically Unicode
- define.xml is XML with UTF-8 declaration
- Pinnacle 21 runs in a Unicode session
What PMDA expects when Japanese text is involved
PMDA allows two paths.
If translation does not lose meaning
Submit the English-translated dataset.
If translation would lose meaning
Submit both:
- the Japanese dataset
- the English-translated version
This is the correct alternative to simply saying “just use ASCII.”
Scan for non-ASCII characters before handoff
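One way to do this in SAS is a regex pass over every character variable; a sketch, assuming the dataset `sdtm.dm` (dataset and variable names are illustrative, so adjust to your study):

```sas
/* Flag every value containing a byte outside 7-bit ASCII. */
data nonascii;
  set sdtm.dm;
  array cvars {*} _character_;   /* all character variables in DM  */
  length varname $32;            /* declared after the array so it */
                                 /* is not scanned itself          */
  do i = 1 to dim(cvars);
    if prxmatch('/[^\x00-\x7F]/', cvars[i]) then do;
      varname = vname(cvars[i]); /* which variable tripped         */
      output;                    /* keep the offending row         */
    end;
  end;
  keep usubjid varname;
run;
```

Running this over every domain before transport conversion catches smart quotes, full-width spaces, and stray Japanese text long before P21 or a reviewer does.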
This gives you a repeatable way to surface non-ASCII content before handoff.
What PMDA expects in practice
PMDA expects:
- encoding clearly documented
- character set explained
- consistency across all files
The reviewer guide should include:
- SAS session encoding
- XML encoding, typically UTF-8
- dataset character constraints
- handling of non-English text
Example reviewer guide note:
"All SDTM datasets were created in a UTF-8 SAS session. define.xml and all other XML files are encoded in UTF-8. Dataset content is restricted to ASCII characters; where Japanese free text could not be translated without loss of meaning, both the Japanese dataset and an English-translated version are provided."
Why this matters
FDA workflows are mostly English, so encoding problems are less visible until something breaks downstream.
PMDA workflows often include Japanese and explicitly require encoding clarity, which makes encoding a submission risk, not just a technical detail.
Encoding is one of the few areas where your data can be completely correct and your submission can still fail.
References
- PMDA Technical Conformance Guide (April 2024)
- PMDA Electronic Study Data Review Page
- Pinnacle 21 Help Center — PMDA Engine Update 2211.0
- FDA Study Data Technical Conformance Guide