Character Encoding, Japanese Text, and Why Your SDTM Package Can Fail Even When the Data Logic Is Fine

StudySAS • SDTM • Define.xml • Regulatory Submissions

Clean derivations and a valid define.xml are not enough if the transport layer, XML encoding, and metadata pipeline are not controlled end to end.

Your SDTM derivations are correct.
Your P21 run is clean.
Your define.xml opens and looks fine.

And yet, the package still trips up in review.

Not because of the data.
Because of encoding.

What PMDA expects, and why it trips teams

PMDA’s Technical Conformance Guide states that if languages other than English are used, including Japanese, the character set and encoding scheme must be documented in the reviewer’s guide.

Source: PMDA Technical Conformance Guide on Electronic Study Data Submissions, April 2024

This is not a footnote. It shows up in real submissions when XML, metadata, or reviewer tools fail to render text consistently.

The key misunderstanding

Many teams assume:

“We’ll just use ASCII.”

The actual expectation is:

  • Use Unicode, typically UTF-8, as the working encoding
  • Restrict dataset content to ASCII-compatible characters where required

These are not the same thing.
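The difference is easy to see in a few lines of plain Python (illustrative only; the dataset values are hypothetical):

```python
# ASCII text is byte-identical under UTF-8, so "UTF-8 working encoding
# plus ASCII-restricted content" is coherent. UTF-8 alone, however,
# does not imply ASCII-only content.
ascii_text = "ADVERSE EVENT"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")  # identical bytes

japanese_text = "有害事象"  # "adverse event": valid UTF-8, not ASCII
try:
    japanese_text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
assert not ascii_ok  # Unicode alone does not guarantee ASCII content
```

In other words, UTF-8 is the envelope; ASCII restriction is a separate constraint on what goes inside it.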

The XPT format limitation that drives the whole problem

SDTM datasets are submitted in SAS Transport v5 (XPT) format.

That format:

  • was designed for US-ASCII exchange
  • has no encoding metadata in the file header
  • does not tell the receiver what encoding was used

The receiving system has to guess the encoding.

When SAS opens an XPT file created in a different encoding, it attempts transcoding.

That can result in:

WARNING: Some character data was lost during transcoding in the dataset.

That message does not identify the variable or the observation.

Your data can be corrupted silently.
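In plain-Python terms (a sketch of the mechanism, not SAS internals), a wrong encoding guess corrupts text without raising any error:

```python
# UTF-8 bytes read back under a single-byte encoding (here Latin-1)
# still "decode" successfully -- the text is mangled, not rejected.
original = "悪心"                 # hypothetical Japanese value, "nausea"
raw = original.encode("utf-8")    # the bytes that actually sit in the file

misread = raw.decode("latin-1")   # wrong guess: decodes without error
assert misread != original        # silently corrupted, no exception
```

This is why the SAS transcoding warning is so dangerous: the failure mode is quiet by nature.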

Byte limits, not character limits

XPT constraints are in bytes, not characters.

For UTF-8 encoded Japanese text, one character typically uses about 3 bytes.

Practical effect:

40-byte label → about 13 Japanese characters
200-byte variable → about 66 Japanese characters

This affects:

  • variable lengths
  • labels
  • define.xml metadata
  • macro logic that assumes character counts
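The arithmetic above is easy to verify; this Python sketch (with a hypothetical label) checks the character and byte counts directly:

```python
# len() counts characters; the XPT limits apply to the UTF-8 byte length.
label = "有害事象の発現日"              # 8 Japanese characters (illustrative)
assert len(label) == 8                   # character count
assert len(label.encode("utf-8")) == 24  # byte count: 3 bytes per character

# Hence a 40-byte label holds about 13 such characters,
# and a 200-byte variable about 66.
assert 40 // 3 == 13 and 200 // 3 == 66
```

Any macro that validates lengths with a character-count function rather than a byte-length function will pass text that the XPT file cannot hold.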

Where encoding actually breaks

Encoding problems do not usually show up during mapping. They surface later:

  • XML rendering
  • stylesheet loading
  • reviewer-side tools
  • XPT read and write

Common failure patterns include:

  • define.xml renders incorrectly
  • XML parsing fails
  • dataset comments become unreadable
  • XPT import causes truncation

Where the cracks really come from

SAS session encoding

If SAS is not UTF-8:

  • transcoding occurs
  • data may be altered silently

Recommended setup: SAS session encoding = UTF-8

XML generation

XML requires strict consistency:

<?xml version="1.0" encoding="UTF-8"?>

If the declared encoding does not match the actual encoding, parsing issues follow.
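One minimal sanity check, sketched here in Python with an assumed helper name (`declared_encoding` is not part of any standard tool), is to confirm that the file's bytes actually decode under the encoding the prolog declares:

```python
import re

def declared_encoding(raw: bytes) -> str:
    """Extract the encoding attribute from the XML declaration."""
    m = re.match(rb'<\?xml[^>]*encoding="([^"]+)"', raw)
    return m.group(1).decode("ascii") if m else "UTF-8"  # XML default

raw = '<?xml version="1.0" encoding="UTF-8"?>\n<ODM/>'.encode("utf-8")
enc = declared_encoding(raw)
raw.decode(enc)  # raises UnicodeDecodeError if declaration and bytes disagree
assert enc == "UTF-8"
```

A check like this catches the declared-versus-actual mismatch before a reviewer's parser does.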

External tools

Excel, Notepad, and XML editors often change encoding silently or introduce hidden characters.

Manual edits

Opening and saving XML manually can change encoding without warning.

Copy and paste risk

Copying from Word or email can introduce hidden non-ASCII characters.
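A few characters that commonly arrive this way, shown in a short Python sketch (the pasted string is hypothetical):

```python
import re

# Curly quotes, en dashes, and non-breaking spaces all sit outside
# 7-bit ASCII, yet render almost identically to their ASCII cousins.
pasted = "Dose\u00a0reduced \u2013 see \u201cnote\u201d"
hidden = re.findall(r"[^\x00-\x7F]", pasted)
assert [f"U+{ord(c):04X}" for c in hidden] == \
    ["U+00A0", "U+2013", "U+201C", "U+201D"]  # NBSP, en dash, curly quotes
```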

The non-ASCII scan macro below will surface these characters before handoff. That is why running it late in the build cycle, not just at the start, matters.

CPORT is not XPT

Do not use PROC CPORT for submission datasets.

Even if the file extension is .xpt, CPORT does not create XPT v5 format.

The PMDA gateway cannot process CPORT files.

Source: Pinnacle 21 Help Center, PMDA Engine Update 2211.0

How to generate XPT correctly

Use the LIBNAME XPORT engine:

libname out xport '/path/to/output/ae.xpt';

proc copy in=mylib out=out;
  select ae;
run;

libname out clear;

Why UTF-8 is the correct approach

UTF-8 is not just about consistency. It matches the rest of the submission pipeline:

  • EDC systems are typically Unicode
  • define.xml is XML with UTF-8 declaration
  • Pinnacle 21 runs in a Unicode session

Using UTF-8 end to end avoids transcoding and reduces the risk of silent data loss.

What PMDA expects when Japanese text is involved

PMDA allows two paths.

If translation does not lose meaning

Submit the English-translated dataset.

If translation would lose meaning

Submit both:

  • the Japanese dataset
  • the English-translated version

This is the correct alternative to simply saying “just use ASCII.”

Scan for non-ASCII characters before handoff

%macro check_nonascii(lib=, dsn=);
  data non_ascii_check;
    set &lib..&dsn;
    array _char {*} _character_;   /* every character variable in the input */
    length dataset variable $32 value $200;   /* avoid silent truncation */
    do i = 1 to dim(_char);
      /* flag any byte outside the 7-bit ASCII range */
      if prxmatch('/[^\x00-\x7F]/', _char{i}) then do;
        dataset  = "&dsn";
        variable = vname(_char{i});
        value    = _char{i};
        output;
      end;
    end;
    keep dataset variable value;
  run;
%mend check_nonascii;

This gives you a repeatable way to surface non-ASCII content before handoff.

What PMDA expects in practice

PMDA expects:

  • encoding clearly documented
  • character set explained
  • consistency across all files

The reviewer guide should include:

  • SAS session encoding
  • XML encoding, typically UTF-8
  • dataset character constraints
  • handling of non-English text

Example reviewer guide note

Character Encoding: All datasets and metadata files were generated using UTF-8 encoding. Dataset content is restricted to ASCII-compatible characters. Japanese text, where required, is handled per PMDA guidance and represented consistently across datasets and metadata. All XML files include explicit encoding declarations.

Why this matters

FDA workflows are mostly English, so encoding problems are less visible until something breaks downstream.

PMDA workflows often include Japanese and explicitly require encoding clarity, which makes encoding a submission risk, not just a technical detail.

Encoding is one of the few areas where your data can be completely correct and your submission can still fail.

References