Quick reference

Data versus Model engineering

| Layer | Summary | Rule of thumb | Owner | Output |
| --- | --- | --- | --- | --- |
| Facts | Recorded or deterministically derived real-world facts at the prediction grain, time-safe "as-of" t_p, where p is the prediction time | If a domain expert can recognise it as something that happened, was measured, or was true → it belongs here | Data Engineering (with Domain Oversight sign-off) | features_at_tc (event-grain table) + short data dictionary |
| Representation | Mathematical representations of those facts used by an algorithm (encodings, scaling, interactions, learned transforms) | If it’s a trick to help the model, or it depends on training-data statistics → it belongs here | Model Engineering (with Domain Oversight sign-off) | preprocessor + model artefact (versioned) + input column list |
| Outcome | Labels/outcomes computed after "as-of" t_p, plus censoring/availability flags | If it answers “what happened after?” → it’s an outcome, not a feature | Data Engineering (with Domain Oversight sign-off) | labels_at_tc (keyed by event_id) |
| Contract | The small set of agreed fields that must be stable across training and inference | If changing it would break someone else’s code → it’s part of the contract | Joint: Data + Model Engineering (Domain approves meaning) | contract_version + schema file (very small) |
| Enforcement | Tests and checks that keep the contract true over time (uniqueness, not-null, ranges, leakage sentinels) | If it can silently corrupt results → it needs an automated check | Platform/Infrastructure (implemented by DE/ME as relevant) | CI checks + scheduled audits + “fail fast” schema validation |
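
The Contract and Enforcement layers can be illustrated with a minimal, standard-library-only sketch of "fail fast" validation. Everything here except event_id is an assumption made for illustration: the field names as_of, observed_at and age_years, the FieldRule helper, the CONTRACT mapping and the CONTRACT_VERSION value are hypothetical and are not taken from any real schema file. The checks mirror the ones listed above: uniqueness, not-null, ranges, and a simple leakage sentinel that flags any fact observed after the "as-of" time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional

CONTRACT_VERSION = "1.0.0"  # hypothetical version string


@dataclass(frozen=True)
class FieldRule:
    """One contract field: nullability plus an optional numeric range."""
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hypothetical contract: only event_id appears in the source table.
CONTRACT = {
    "event_id": FieldRule(),
    "as_of": FieldRule(),
    "observed_at": FieldRule(),
    "age_years": FieldRule(nullable=True, min_value=0, max_value=120),
}


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    errors: list[str] = []
    seen_ids: set = set()
    for i, row in enumerate(rows):
        # Schema check: every contract field must be present.
        missing = [name for name in CONTRACT if name not in row]
        if missing:
            errors.append(f"row {i}: missing fields {missing}")
            continue
        # Not-null and range checks, driven by the contract.
        for name, rule in CONTRACT.items():
            value = row[name]
            if value is None:
                if not rule.nullable:
                    errors.append(f"row {i}: {name} is null")
                continue
            if rule.min_value is not None and value < rule.min_value:
                errors.append(f"row {i}: {name}={value} below {rule.min_value}")
            if rule.max_value is not None and value > rule.max_value:
                errors.append(f"row {i}: {name}={value} above {rule.max_value}")
        # Uniqueness check on the key.
        event_id = row["event_id"]
        if event_id is not None:
            if event_id in seen_ids:
                errors.append(f"row {i}: duplicate event_id {event_id!r}")
            seen_ids.add(event_id)
        # Leakage sentinel: a fact must not be observed after the as-of time.
        observed_at, as_of = row["observed_at"], row["as_of"]
        if observed_at is not None and as_of is not None and observed_at > as_of:
            errors.append(f"row {i}: observed_at is after as_of (possible leakage)")
    return errors


def assert_contract(rows: list[dict[str, Any]]) -> None:
    """Fail fast: raise as soon as a batch violates the contract."""
    errors = validate_rows(rows)
    if errors:
        raise ValueError(f"contract {CONTRACT_VERSION} violated: " + "; ".join(errors))
```

A check like this could run both in CI against fixture data and at the start of training and inference jobs, so that the same contract is enforced on both paths. In practice a team might prefer a dedicated schema-validation library; the point of the sketch is only that each rule in the Enforcement row maps to a small, automatable test.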
