Character Encoding, Japanese Text, and Why Your SDTM Package Can Fail Even When the Data Logic Is Fine

StudySAS • SDTM • Define.xml • Regulatory Submissions

Clean derivations and a valid define.xml are not enough if the transport layer, XML encoding, and metadata pipeline are not controlled end to end.

Your SDTM derivations are correct.
Your P21 run is clean.
Your define.xml opens and looks fine.

And yet, the package still trips up in review.

Not because of the data.
Because of encoding.

What PMDA expects, and why it trips teams

PMDA’s Technical Conformance Guide states that if languages other than English are used, including Japanese, the character set and encoding scheme must be documented in the reviewer’s guide.

Source: PMDA Technical Conformance Guide on Electronic Study Data Submissions, April 2024

This is not a footnote. It shows up in real submissions when XML, metadata, or reviewer tools fail to render text consistently.

The key misunderstanding

Many teams assume:

“We’ll just use ASCII.”

The actual expectation is:

  • Use Unicode, typically UTF-8, as the working encoding
  • Restrict dataset content to ASCII-compatible characters where required

These are not the same thing.
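The difference is easy to see in a few lines of plain Python (illustrative only; the dataset values are hypothetical):

```python
# ASCII text is byte-identical under UTF-8, so "UTF-8 working encoding
# plus ASCII-restricted content" is coherent. UTF-8 alone, however,
# does not imply ASCII-only content.
ascii_text = "ADVERSE EVENT"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")  # identical bytes

japanese_text = "有害事象"  # "adverse event": valid UTF-8, not ASCII
try:
    japanese_text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
assert not ascii_ok  # Unicode alone does not guarantee ASCII content
```

In other words, UTF-8 is the envelope; ASCII restriction is a separate constraint on what goes inside it.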

The XPT format limitation that drives the whole problem

SDTM datasets are submitted in SAS Transport v5 (XPT) format.

That format:

  • was designed for US-ASCII exchange
  • has no encoding metadata in the file header
  • does not tell the receiver what encoding was used

The receiving system has to guess the encoding.

When SAS opens an XPT file created in a different encoding, it attempts transcoding.

That can result in:

WARNING: Some character data was lost during transcoding in the dataset.

That message does not identify the variable or the observation.

Your data can be corrupted silently.
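In plain-Python terms (a sketch of the mechanism, not SAS internals), a wrong encoding guess corrupts text without raising any error:

```python
# UTF-8 bytes read back under a single-byte encoding (here Latin-1)
# still "decode" successfully -- the text is mangled, not rejected.
original = "悪心"                 # hypothetical Japanese value, "nausea"
raw = original.encode("utf-8")    # the bytes that actually sit in the file

misread = raw.decode("latin-1")   # wrong guess: decodes without error
assert misread != original        # silently corrupted, no exception
```

This is why the SAS transcoding warning is so dangerous: the failure mode is quiet by nature.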

Byte limits, not character limits

XPT constraints are in bytes, not characters.

For UTF-8 encoded Japanese text, one character typically uses about 3 bytes.

Practical effect:

40-byte label → about 13 Japanese characters
200-byte variable → about 66 Japanese characters

This affects:

  • variable lengths
  • labels
  • define.xml metadata
  • macro logic that assumes character counts
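The arithmetic above is easy to verify; this Python sketch (with a hypothetical label) checks the character and byte counts directly:

```python
# len() counts characters; the XPT limits apply to the UTF-8 byte length.
label = "有害事象の発現日"              # 8 Japanese characters (illustrative)
assert len(label) == 8                   # character count
assert len(label.encode("utf-8")) == 24  # byte count: 3 bytes per character

# Hence a 40-byte label holds about 13 such characters,
# and a 200-byte variable about 66.
assert 40 // 3 == 13 and 200 // 3 == 66
```

Any macro that validates lengths with a character-count function rather than a byte-length function will pass text that the XPT file cannot hold.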

Where encoding actually breaks

Encoding problems do not usually show up during mapping. They surface later:

  • XML rendering
  • stylesheet loading
  • reviewer-side tools
  • XPT read and write

Common failure patterns include:

  • define.xml renders incorrectly
  • XML parsing fails
  • dataset comments become unreadable
  • XPT import causes truncation

Where the cracks really come from

SAS session encoding

If SAS is not UTF-8:

  • transcoding occurs
  • data may be altered silently

Recommended setup: SAS session encoding = UTF-8

XML generation

XML requires strict consistency:

<?xml version="1.0" encoding="UTF-8"?>

If the declared encoding does not match the actual encoding, parsing issues follow.
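One minimal sanity check, sketched here in Python with an assumed helper name (`declared_encoding` is not part of any standard tool), is to confirm that the file's bytes actually decode under the encoding the prolog declares:

```python
import re

def declared_encoding(raw: bytes) -> str:
    """Extract the encoding attribute from the XML declaration."""
    m = re.match(rb'<\?xml[^>]*encoding="([^"]+)"', raw)
    return m.group(1).decode("ascii") if m else "UTF-8"  # XML default

raw = '<?xml version="1.0" encoding="UTF-8"?>\n<ODM/>'.encode("utf-8")
enc = declared_encoding(raw)
raw.decode(enc)  # raises UnicodeDecodeError if declaration and bytes disagree
assert enc == "UTF-8"
```

A check like this catches the declared-versus-actual mismatch before a reviewer's parser does.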

External tools

Excel, Notepad, and XML editors often change encoding silently or introduce hidden characters.

Manual edits

Opening and saving XML manually can change encoding without warning.

Copy and paste risk

Copying from Word or email can introduce hidden non-ASCII characters.
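A few characters that commonly arrive this way, shown in a short Python sketch (the pasted string is hypothetical):

```python
import re

# Curly quotes, en dashes, and non-breaking spaces all sit outside
# 7-bit ASCII, yet render almost identically to their ASCII cousins.
pasted = "Dose\u00a0reduced \u2013 see \u201cnote\u201d"
hidden = re.findall(r"[^\x00-\x7F]", pasted)
assert [f"U+{ord(c):04X}" for c in hidden] == \
    ["U+00A0", "U+2013", "U+201C", "U+201D"]  # NBSP, en dash, curly quotes
```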

The non-ASCII scan macro below will surface these characters before handoff. That is why running it late in the build cycle, not just at the start, matters.

CPORT is not XPT

Do not use PROC CPORT for submission datasets.

Even if the file extension is .xpt, CPORT does not create XPT v5 format.

The PMDA gateway cannot process CPORT files.

Source: Pinnacle 21 Help Center, PMDA Engine Update 2211.0

How to generate XPT correctly

Use the LIBNAME XPORT engine:

libname out xport '/path/to/output/ae.xpt';

proc copy in=mylib out=out;
  select ae;
run;

libname out clear;

Why UTF-8 is the correct approach

UTF-8 is not just about consistency. It matches the rest of the submission pipeline:

  • EDC systems are typically Unicode
  • define.xml is XML with UTF-8 declaration
  • Pinnacle 21 runs in a Unicode session

Using UTF-8 end to end avoids transcoding and reduces the risk of silent data loss.

What PMDA expects when Japanese text is involved

PMDA allows two paths.

If translation does not lose meaning

Submit the English-translated dataset.

If translation would lose meaning

Submit both:

  • the Japanese dataset
  • the English-translated version

This is the correct alternative to simply saying “just use ASCII.”

Scan for non-ASCII characters before handoff

%macro check_nonascii(lib=, dsn=);
  data non_ascii_check;
    set &lib..&dsn;
    array _char {*} _character_;   /* every character variable in the input */
    length dataset variable $32 value $200;   /* avoid silent truncation */
    do i = 1 to dim(_char);
      /* flag any byte outside the 7-bit ASCII range */
      if prxmatch('/[^\x00-\x7F]/', _char{i}) then do;
        dataset  = "&dsn";
        variable = vname(_char{i});
        value    = _char{i};
        output;
      end;
    end;
    keep dataset variable value;
  run;
%mend check_nonascii;

This gives you a repeatable way to surface non-ASCII content before handoff.

What PMDA expects in practice

PMDA expects:

  • encoding clearly documented
  • character set explained
  • consistency across all files

The reviewer guide should include:

  • SAS session encoding
  • XML encoding, typically UTF-8
  • dataset character constraints
  • handling of non-English text

Example reviewer guide note

Character Encoding: All datasets and metadata files were generated using UTF-8 encoding. Dataset content is restricted to ASCII-compatible characters. Japanese text, where required, is handled per PMDA guidance and represented consistently across datasets and metadata. All XML files include explicit encoding declarations.

Why this matters

FDA workflows are mostly English, so encoding problems are less visible until something breaks downstream.

PMDA workflows often include Japanese and explicitly require encoding clarity, which makes encoding a submission risk, not just a technical detail.

Encoding is one of the few areas where your data can be completely correct and your submission can still fail.

References