Your SDTM Passed Validation. That Doesn’t Mean You’re Safe.

StudySAS Blog

Why clean Pinnacle 21 results do not always mean your SDTM package is ready for review, and why define.xml still decides how quickly a reviewer can understand and trust your data.

Most teams celebrate when Pinnacle 21 is clean.

That makes sense. It feels like the hard part is over.

But regulators do not review submissions that way.

They start with define.xml.

Across repeated submission work, one pattern becomes obvious.

Clean datasets get you submitted.
Clear metadata gets you through review.

Figure 1. What teams think vs what reviewers actually do
A simple process view of the gap between validation completion and actual reviewer workflow.
Common team view: P21 validation clean → submission ready. Actual reviewer flow: open define.xml → read derivation and origin → check value-level metadata → raise questions if unclear. Between the two sits the interpretation gap.

What reviewers actually do first

Before they ever look at code, reviewers usually follow a simple path:

  1. Open define.xml
  2. Search for a variable or derivation rule
  3. Read origin, comments, method, and value-level metadata
  4. Decide whether the logic is clear enough to trust
  5. Go to SDRG, ADRG, or programs only if something is still unclear

If define.xml is vague, questions start early. Not because the programming is wrong, but because the reviewer cannot safely infer what you meant.

Practical point

A clean validation report tells you the package is technically acceptable. It does not tell you the metadata is reviewer-friendly.

A real example from SDTM LB

Here is the kind of define.xml statement many teams use for an SDTM Findings flag:

Last observation before exposure flag is assigned to the last non-missing result prior to treatment.

On paper, that looks fine.

In review, it often is not enough.

A reviewer can reasonably ask:

  • What defines “prior to treatment”: RFSTDTC or exposure datetime?
  • What happens for records collected on the same day as first dose?
  • What if collection time is missing?
  • Are unscheduled visits included?
  • If multiple qualifying values exist, how is “last” decided?
  • Is the same rule used across LB, VS, EG, and QS?

The data may be perfectly correct. The issue is that the metadata leaves room for more than one interpretation.

Figure 2. Weak metadata vs strong metadata
The difference is not style. It is whether the reviewer has to guess.
Weak metadata (“last non-missing prior to treatment”): no anchor, no same-day rule, no missing-time rule, no tie-break logic. Strong metadata: the operational rule is explicit; it defines the anchor, states same-day handling, handles missing time, and resolves ties consistently.

What strong metadata looks like in SDTM

A better define.xml statement does not just sound more formal. It removes doubt.

Example of stronger wording

LBLOBXFL is assigned as 'Y' to the chronologically latest non-missing result collected before first exposure. If only dates are available, collection date must be strictly earlier than DM.RFSTDTC. Records on the first-dose date are eligible only when both collection time and dosing time are available and the collection occurs before dosing. Records with missing time on the first-dose date are not eligible. If more than one qualifying record exists, the latest chronological record is selected.

Now the reviewer knows the anchor, the same-day rule, the missing-time rule, and the tie-break rule.
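The stronger rule is now precise enough to implement directly. A minimal Python sketch of the same logic, for illustration only: the record shape, key names, and the use of one first-dose datetime for both the date anchor and the same-day comparison are assumptions, not the document's actual programming environment.

```python
def has_time(iso: str) -> bool:
    """True when an ISO 8601 value carries a time component."""
    return "T" in iso

def eligible(lbdtc: str, first_dose_dtc: str) -> bool:
    """Apply the anchor, same-day, and missing-time rules from the method text."""
    coll_date = lbdtc[:10]
    dose_date = first_dose_dtc[:10]
    if coll_date < dose_date:
        return True                    # strictly before the first-dose date
    if coll_date > dose_date:
        return False
    # Same day as first dose: both times required, collection before dosing.
    if has_time(lbdtc) and has_time(first_dose_dtc):
        return lbdtc < first_dose_dtc  # ISO 8601 strings sort chronologically
    return False                       # missing time on the first-dose date

def assign_lbobxfl(records: list, first_dose_dtc: str) -> list:
    """Flag the latest eligible non-missing result with 'Y'.

    records: list of dicts with keys LBDTC and LBSTRESN (illustrative shape).
    """
    candidates = [r for r in records
                  if r.get("LBSTRESN") is not None
                  and eligible(r["LBDTC"], first_dose_dtc)]
    if candidates:
        # Tie-break rule: the chronologically latest qualifying record wins.
        latest = max(candidates, key=lambda r: r["LBDTC"])
        latest["LBLOBXFL"] = "Y"
    return records
```

Note that comparing ISO 8601 strings lexically is what makes the "latest" tie-break a one-liner here; mixed date and datetime precision still orders sensibly because the date is a prefix of the datetime.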

CDISC-style metadata flow

At a practical level, define.xml sits in the middle of a traceability chain. The reviewer should be able to move through that chain without guessing.

Figure 3. Traceability path from collection to reviewer interpretation
A CDISC-style flow showing how collected data becomes reviewer-facing metadata.
CRF (collected source) → SDTM (standardized data) → define.xml (origin, method, value-level rules) → SDRG (narrative context) → Reviewer (interpretation).

XML snippet, weak vs stronger version

One of the best ways to see the problem is in the XML itself. Here is the same SDTM concept shown two different ways.

Weak XML example
Listing 1. Minimal method description for SDTM LBLOBXFL
<ItemDef OID="IT.LB.LBLOBXFL" Name="LBLOBXFL" DataType="text" Length="1">
  <Description>
    <TranslatedText xml:lang="en">Last Observation Before Exposure Flag</TranslatedText>
  </Description>
  <Origin Type="Derived"/>
  <MethodRef MethodOID="MT.LB.LBLOBXFL"/>
</ItemDef>

<MethodDef OID="MT.LB.LBLOBXFL" Name="Last Observation Before Exposure Flag" Type="Computation">
  <Description>
    <TranslatedText xml:lang="en">
      Last non-missing result prior to treatment.
    </TranslatedText>
  </Description>
</MethodDef>
Stronger XML example
Listing 2. Reviewer-friendly method description for SDTM LBLOBXFL
<ItemDef OID="IT.LB.LBLOBXFL" Name="LBLOBXFL" DataType="text" Length="1">
  <Description>
    <TranslatedText xml:lang="en">Last Observation Before Exposure Flag</TranslatedText>
  </Description>
  <Origin Type="Derived"/>
  <MethodRef MethodOID="MT.LB.LBLOBXFL"/>
</ItemDef>

<MethodDef OID="MT.LB.LBLOBXFL" Name="Last Observation Before Exposure Flag Derivation" Type="Computation">
  <Description>
    <TranslatedText xml:lang="en">
      LBLOBXFL is assigned as 'Y' to the chronologically latest non-missing
      result collected before first exposure. If only dates are available,
      collection date must be strictly earlier than DM.RFSTDTC. Records on
      the first-dose date are eligible only when both collection time and
      dosing time are available and the collection occurs before dosing.
      Records with missing time on the first-dose date are not eligible.
      If multiple qualifying records exist, the latest chronological record
      is selected.
    </TranslatedText>
  </Description>
</MethodDef>

Where this usually breaks

From experience, these are the places where weak metadata triggers the most review friction:

Area | Common weak wording | What is missing
Study Day (--DY) | “Derived from reference start date” | Formula, sign convention, partial date handling
Partial dates | “Partial dates were imputed” | Method, scope, and where the imputed value is used
Lab standardization | “Standard unit” | Conversion rule, order of operations, flag impact
Cross-domain rules | Separate domain notes only | Whether the same concept behaves consistently across domains
Traceability | “Relationship to study drug” | Collected vs assigned vs sponsor-derived logic
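The Study Day row shows what “formula and sign convention” means in practice. The usual SDTMIG convention has no day 0, so the formula changes sign around the reference start date. A small Python sketch of that convention, for illustration:

```python
from datetime import date

def study_day(event_date: date, rfstdtc: date) -> int:
    """SDTMIG-style study day: there is no day 0.

    On or after the reference start date: (event - reference) + 1.
    Before the reference start date:      (event - reference).
    """
    delta = (event_date - rfstdtc).days
    return delta + 1 if delta >= 0 else delta
```

With this convention the reference start date itself is day 1 and the day before it is day -1; a method description that only says “derived from reference start date” leaves both facts unstated.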
Figure 4. Define.xml review checklist
A simple internal test before final package release.
define.xml review: Reproducibility (can it be rebuilt?) · Ambiguity (more than one meaning?) · Boundaries (edge cases defined?) · Consistency (across domains?) · Traceability (can the reviewer follow CRF → SDTM → derived rule?). If any answer is “no”, expect reviewer questions.

A simple review checklist before submission

  1. Reproducibility: can an experienced programmer recreate the variable using only define.xml?
  2. Ambiguity: does the description allow more than one reasonable interpretation?
  3. Boundary handling: are same-day, missing-time, partial-date, repeated-record, and tie cases clearly defined?
  4. Consistency: is the same concept handled the same way across domains unless an exception is explicitly stated?
  5. Traceability: can a reviewer move from CRF to SDTM to derived variable without guessing?

If any answer is no, the package may still validate cleanly, but it is not fully review-ready.
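Parts of this checklist can be screened automatically before release. A minimal sketch in Python, assuming define.xml is available as an ODM XML string; the cue words and categories below are illustrative assumptions, not any validation standard:

```python
import xml.etree.ElementTree as ET

# Cues a reviewer-friendly method description usually mentions.
# This word list is an illustrative assumption, not a rule set.
CUES = {
    "anchor":       ("RFSTDTC", "first exposure", "first dose"),
    "same-day":     ("first-dose date", "same day"),
    "missing time": ("missing time",),
    "tie-break":    ("latest", "earliest"),
}

def audit_methods(define_xml: str) -> dict:
    """Return, per MethodDef OID, the cue categories its text never mentions."""
    root = ET.fromstring(define_xml)
    gaps = {}
    for el in root.iter():
        # Match on the local tag name so any ODM namespace prefix is ignored.
        if el.tag.split("}")[-1] != "MethodDef":
            continue
        text = " ".join(t.strip() for t in el.itertext()).lower()
        missing = [name for name, words in CUES.items()
                   if not any(w.lower() in text for w in words)]
        if missing:
            gaps[el.get("OID")] = missing
    return gaps
```

Run against Listing 1, a screen like this flags the method on every category; against Listing 2 it stays silent. It cannot judge whether a rule is correct, but it cheaply catches descriptions that never mention an anchor, a same-day rule, or a tie-break at all.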

Final thought

Passing technical validation is necessary.

It is not sufficient.

Define.xml is not just a supporting file. For many reviewers, it is the first real interface to your SDTM data.

If they had only this file, would they understand your submission, or question it?

Suggested closing question for comments

Have you seen define.xml wording that looked fine internally, but triggered avoidable review questions later?