Who does what?
Mapping the modelling task onto the team
Problem statement
The key coordination problem is how to divide responsibility between the data engineer and the model engineer when defining the prediction task. The data engineer will extract and shape hospital EHR data into reliable, versioned tables of antibiotic prescribing and microbiology outcomes, while the model engineer will build a simple first model (for example, logistic regression). The hardest work sits between them: choosing the index event that anchors each prediction, defining the label window and prediction horizon, and specifying which features are allowed before the cut-off time so that leakage is avoided. We will use the language of Label–Segment–Featurize and implement it as tested, modular transformations, following the MLOps philosophy. The dataset, features and labels must be reproducible across training and serving environments, and the resulting model must be portable back into NHS systems after offline development in a Trusted Research Environment.
We have four roles in the RADIX team.
Design principles
MLOps is the discipline of turning a machine learning model into a reliable, auditable, repeatable service. In practice it means: versioning data, labels, features, code, and models; automating training/evaluation; enforcing quality gates (tests, audits, monitoring); and making deployment and rollback safe (Corbin 2023). For this project, MLOps implies an end-to-end pipeline that can be run deterministically in both NHS and research environments, with clear interfaces between data preparation, feature generation, model training, and serving, so that a “minimal working model” can be promoted, monitored for drift/performance, and iterated without breaking clinical workflow or information governance.
Label–Segment–Featurize (L-S-F): A useful design pattern for “prediction engineering” in time-stamped relational data (Kanter 2016). The key principle is to explicitly define (1) the label and its time window, (2) the cutoff time at which the prediction is made and the lead/lag that define what historical data is allowed, and (3) the feature extraction restricted to the allowable segment to prevent label leakage. Applied here, L-S-F forces agreement on what the “prescribing event” is (the anchor), how resistance/mismatch is determined relative to that anchor (label window), and what prior history is admissible as predictors (feature window).
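As an illustrative sketch (all names hypothetical, not our pipeline code), the three L-S-F windows reduce to simple timestamp filters around the cutoff:

```python
from datetime import datetime, timedelta

def segment(events, cutoff, lookback_days=365, horizon_days=30):
    """Split time-stamped events around a cutoff time t_c.

    Feature segment: events in (t_c - lookback, t_c] -- admissible history.
    Label segment:   events in (t_c, t_c + horizon] -- outcome window only.
    Events outside both windows are discarded.
    """
    feat_start = cutoff - timedelta(days=lookback_days)
    label_end = cutoff + timedelta(days=horizon_days)
    features = [e for e in events if feat_start < e["t"] <= cutoff]
    labels = [e for e in events if cutoff < e["t"] <= label_end]
    return features, labels

t_c = datetime(2024, 6, 1)  # e.g. an antibiotic start time (the anchor)
events = [
    {"t": datetime(2024, 3, 1), "what": "prior culture"},       # admissible history
    {"t": datetime(2024, 6, 10), "what": "resistant isolate"},  # outcome window
]
feats, labs = segment(events, t_c)
```

Because the same filter runs at training and serving time, the feature/label boundary is defined once rather than re-derived in each environment.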

High level requirements
Clear ownership: each deliverable has a single accountable role.
Reproducibility: datasets and models are versioned and re-runnable.
No leakage: features must only use information available at prediction time.
IG by design: identifiable work stays on NHS infrastructure; research work stays in the DSH.
Deployment path: the trained model can be applied in NHS or HSL infrastructure with the same feature contract.
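The no-leakage requirement can itself be enforced as an automated audit rather than left to review. A minimal sketch, assuming rows that carry their cutoff and feature timestamps (names hypothetical):

```python
from datetime import datetime

def audit_no_leakage(rows):
    """Audit: every feature timestamp must be at or before the row's
    prediction cutoff. Returns the offending rows (empty list = pass)."""
    violations = []
    for row in rows:
        cutoff = row["cutoff"]
        bad = [ts for ts in row["feature_timestamps"] if ts > cutoff]
        if bad:
            violations.append((row["event_id"], bad))
    return violations

rows = [
    {"event_id": 1, "cutoff": datetime(2024, 6, 1),
     "feature_timestamps": [datetime(2024, 5, 30)]},
    {"event_id": 2, "cutoff": datetime(2024, 6, 1),
     "feature_timestamps": [datetime(2024, 6, 2)]},  # leaks future data
]
bad = audit_no_leakage(rows)
```

Run as a quality gate in CI, this turns "features must only use information available at prediction time" into a failing test rather than a convention.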
Implementation
Overview
The data engineer owns the data product (event definitions, labels, features, dataset versioning).
The model engineer owns the modelling product (training/evaluation code, leakage control, packaging for deployment).
The clinical lead owns clinical validity & safety (definitions, exclusions, outcome validity, evaluation framing).
The infrastructure engineer owns platform & delivery mechanics (environments, CI/CD, orchestration, security boundaries).
Data versus Model engineering
The rule of thumb is that data features are facts, and model features are representations of those facts.
Data features are time-stamped, clinically interpretable, reproducible facts at the right time granularity ("as-of" \(t_{prescription}\)), with stable definitions and data-quality tests. If you can say “a clinician could recognise this field as a real-world thing” → it’s probably a data feature.
Model features decide how such facts are encoded and transformed for learning and inference (normalisation, interactions, binning, embeddings, missingness strategy, calibration-oriented transforms), and own the code that makes training/inference identical. If you can say “this is a transformation to help the model” → it’s probably a model feature.
This means that data engineering will build a data model with one row per prediction event (e.g. antibiotic start time). That model will
present columns of atomic facts with clear "as-of" \(t_{prescription}\) semantics, without leakage, with clear units, and with stable semantic meaning and interpretation
be built, as a clinical ML feature store with one row per prediction event, from either Camino (our modular pipeline that maintains tables based on their underlying concepts, derived in turn from the UCLH enterprise data warehouse) or the HSL Radix FHIR store.
Model engineering will build additional columns within the model pipeline but will not build fresh tables or reorganise the "as-of" \(t_{prescription}\) data structure.
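To make the division concrete, a minimal sketch in plain Python (hypothetical names, not our actual transformations): the data side emits a deterministic fact, while the model side owns any transform whose parameters are fitted on training data:

```python
import math
from datetime import datetime, timedelta

# Data feature: a deterministic, clinically interpretable fact as of t_prescription.
def num_courses_last_90d(course_starts, t_prescription):
    """Count antibiotic courses started in the 90 days before t_prescription."""
    window_start = t_prescription - timedelta(days=90)
    return sum(window_start <= t < t_prescription for t in course_starts)

# Model feature: an encoding of that fact. The mean/sd are training-set
# statistics, so this transformation belongs in the model pipeline,
# never in the fact table.
class LogStandardiser:
    def fit(self, values):
        logs = [math.log1p(v) for v in values]
        self.mean = sum(logs) / len(logs)
        self.sd = (sum((x - self.mean) ** 2 for x in logs) / len(logs)) ** 0.5 or 1.0
        return self

    def transform(self, values):
        return [(math.log1p(v) - self.mean) / self.sd for v in values]

t_p = datetime(2024, 6, 1)
n = num_courses_last_90d([datetime(2024, 5, 1), datetime(2024, 1, 1)], t_p)
scaler = LogStandardiser().fit([0, 1, 2, 3])
```

The count is recomputable from the fact table alone; the scaler must be serialised with the model so training and inference stay identical.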
Heuristics for data versus model engineering decisions
Would you want this displayed in a clinical dashboard? For example, who builds the "Number of antibiotic courses in last 90 days" feature (anchored "as-of" \(t_{prescription}\))? This would be of interest to a clinical consumer, so it is a data feature. Conversely, normalising or log-transforming the same number would be a model feature.
Does the feature require fitting parameters?
Model features: StandardScaler mean/SD, target encoding priors, PCA loadings, spline knots, calibration maps
Data features: Simple deterministic aggregations (“count”, “max”, “days since”)
Is the feature "semantic" or "algorithmic"? This is similar to (1) above.
Semantic feature (has meaning independent of the algorithm): comorbidity count, prior resistant isolate, latest creatinine, ward, age → data features
Algorithmic feature (meaning depends on the algorithm): one-hot encoding choices, missingness indicators, interaction terms, monotonic binning, learned embeddings → model features
A promotion rule (from model to data feature): move a transformation upstream from model to data only if:
it improves multiple models/use-cases,
it’s deterministic and stable,
it has clinical meaning or operational value,
it can be tested with audits,
it doesn’t embed training-set statistics.
Default to a data feature if you're not sure
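The last criterion (no embedded training-set statistics) can be checked mechanically: a promotable transformation must give the same answer for a row regardless of which other rows are present. A hypothetical sketch of such an audit:

```python
def embeds_dataset_statistics(transform, dataset):
    """Heuristic audit for the promotion rule: a transformation is only
    eligible to move upstream into the fact table if its output for a row
    does not change when other rows are added or removed (i.e. it is
    row-local and deterministic)."""
    full = transform(dataset)
    half = transform(dataset[: len(dataset) // 2])
    # A row-local transform agrees exactly on the shared prefix.
    return full[: len(half)] != half

# A days-since / doubling style transform: row-local, promotable.
row_local = lambda rows: [r * 2 for r in rows]

# Z-scoring embeds the dataset mean: not promotable.
def zscore(rows):
    m = sum(rows) / len(rows)
    return [r - m for r in rows]

data = [1.0, 2.0, 3.0, 4.0]
```

Simple deterministic aggregations pass this check; anything with fitted parameters fails it and stays in the model pipeline.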
Summary
Effectively we are building two products: a clinical feature store, and many models with their design matrices.
Clinical ML Feature Store (facts)
Relationships
Owned by the data engineer
Reviewed by the clinical lead
Consumed by the model engineer
Layers
Bronze: raw extracts from Epic Clarity/Caboodle mapped to standard fields
Silver: Camino - cleaned, deduplicated, time-aligned tables (meds, cultures, encounters)
Gold: the event-anchored fact table: features and labels "as-of" \(t_{prescription}\)
Never "saved" because it represents current truth
Model Design Matrix (representations)
Relationships
Owned by the model engineer
Reviewed by the clinical lead
Turns features "as-of" \(t_{prescription}\) into
encoded matrix \(X\) (categorical encoding, scaling)
derived terms (interactions, non-linear transforms)
missingness handling as implemented logic
Saved as versioned artefacts within MLflow or similar
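In practice MLflow would log the fitted pipeline itself; as a dependency-free illustration of what "versioned artefact" means here, the encoding spec and fitted statistics can be content-addressed (names hypothetical):

```python
import hashlib
import json

def artefact_version(spec, fitted_params):
    """Derive a content-addressed version for a design-matrix artefact:
    the encoding spec plus any fitted statistics are serialised
    canonically and hashed, so identical inputs always yield the same
    version string (the property registry tools rely on)."""
    payload = json.dumps(
        {"spec": spec, "fitted": fitted_params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

spec = {"one_hot": ["drug_name"], "scale": ["age_at_event"]}
params = {"age_at_event": {"mean": 61.4, "sd": 17.9}}
v1 = artefact_version(spec, params)
v2 = artefact_version(spec, params)
```

This is the inverse of the feature store above: the Gold table is never "saved" because it tracks current truth, whereas the design matrix logic is always saved because inference must reproduce training exactly.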
Examples
Data features
age_at_event, sex
drug_name, route, dose, indication_code (as recorded)
organism_last_12m: last cultured organism category (if known before \(t_c\))
prior_resistance_to_drug_2y: count of prior non-susceptible results
days_since_last_antibiotic
num_hospital_admissions_1y
latest_creatinine_value_before_tc + timestamp_of_creatinine
Model features
One-hot / target encoding of drug_name, organism_last_12m
missing_creatinine_indicator
log1p(num_hospital_admissions_1y)
Interaction terms: drug_class × prior_resistance
Binning: age_group chosen for performance/robustness
Calibration mapping and threshold selection
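As one concrete (hypothetical) sketch of the threshold-selection step: a deterministic grid search over Youden's J on a validation set, which can then be versioned alongside the model like any other fitted parameter:

```python
def choose_threshold(probs, labels, grid=None):
    """Pick the decision threshold that maximises Youden's J
    (sensitivity + specificity - 1) on a validation set."""
    grid = grid or [i / 100 for i in range(1, 100)]

    def j(t):
        tp = sum(p >= t and y for p, y in zip(probs, labels))
        fn = sum(p < t and y for p, y in zip(probs, labels))
        tn = sum(p < t and not y for p, y in zip(probs, labels))
        fp = sum(p >= t and not y for p, y in zip(probs, labels))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        return sens + spec - 1

    return max(grid, key=j)

probs = [0.1, 0.2, 0.8, 0.9]   # validation-set predicted probabilities
labels = [0, 0, 1, 1]          # validation-set outcomes
t = choose_threshold(probs, labels)
```

Youden's J is one choice among many; the point is that the threshold is an output of the model pipeline, not a fact about the patient, so it sits firmly on the model-engineering side.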
Borderline cases
age_group
If it’s for reporting/clinical interpretability and fixed upfront → a data feature
If it’s tuned/changed during modelling → a model feature