The gap between a data scientist's Jupyter notebook and a production AI platform is one of the most consequential jumps in enterprise technology. Get it right and you have a competitive capability that compounds. Get it wrong and you have an expensive collection of ML experiments that never reach users.
Most organisations start with the notebook. A data scientist demonstrates something impressive — a churn prediction model, a document classifier, an anomaly detector. Leadership gets excited. "Let's put this in production." And then the real work begins.
Production AI requires answers to questions the notebook never asked: How is the model served? How is it monitored? What happens when predictions degrade? How do you retrain without downtime? Who can deploy models, and how? How do you explain a prediction to a regulator?
These are platform problems, not model problems.
The Four Layers of an AI Platform
A production AI platform has four distinct layers, each with its own concerns and tooling.
- Data layer: feature store, data versioning, ETL pipelines, data quality monitoring, lineage tracking
- Experimentation layer: experiment tracking, model registry, code + data + config versioning, compute management, collaboration tools
- Serving layer: online inference API, batch inference pipeline, A/B testing framework, canary deployments, SLA monitoring
- Monitoring layer: data drift detection, model performance tracking, prediction logging, retraining triggers
Each layer must be designed and operated independently.
Layer 1: Data — The Foundation Everything Else Rests On
The most common reason AI projects fail in production is data: quality and consistency issues that only surface at production scale or under production data distributions. The data layer must be treated as first-class infrastructure, not as a preprocessing step.
Feature Store
A feature store solves the most pervasive problem in production ML: training-serving skew. Data scientists compute features one way during training; engineers recompute them differently for production serving. The resulting inconsistency silently degrades model performance in ways that are hard to debug.
A feature store provides:
- A central repository of computed features, shared between training and serving
- Online store for low-latency feature serving (Redis, Cassandra)
- Offline store for batch training (data warehouse, Delta Lake, Parquet)
- Point-in-time correct feature retrieval to prevent data leakage (using future information) during training
Recommended: Feast (open-source), Azure ML Feature Store, Tecton (managed), Hopsworks
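Point-in-time correctness is the subtle part, so it is worth seeing concretely. A feature store does this join for you; the sketch below reproduces the idea with pandas `merge_asof` on hypothetical churn data (table names and values are illustrative, not any product's schema):

```python
import pandas as pd

# Hypothetical feature snapshots and training labels.
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2026-01-01", "2026-02-01", "2026-01-15"]),
    "avg_spend_30d": [120.0, 95.0, 40.0],
})
labels = pd.DataFrame({
    "customer_id": [1, 2],
    "label_time": pd.to_datetime(["2026-01-20", "2026-02-10"]),
    "churned": [0, 1],
})

# For each label, take the latest feature value computed *before* the
# label timestamp -- never a future value, so no leakage into training.
training_set = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("event_time"),
    left_on="label_time",
    right_on="event_time",
    by="customer_id",
    direction="backward",
)
```

Customer 1 gets the January feature value (120.0), not the February one, because the February snapshot did not exist when the label event occurred.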
Data Versioning and Lineage
You must be able to reproduce any model from any point in time. This requires versioning not just your code, but your training data.
DVC (Data Version Control) works like Git for data: version large datasets stored in Azure Blob Storage, track exactly which data version produced which model, and reproduce any historical training run.
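The underlying idea is simple even though DVC implements it at scale: fingerprint the training data by content and record that fingerprint with every run. A toy sketch (the commit hash and field names are hypothetical):

```python
import hashlib
import json

def dataset_fingerprint(rows: list[dict]) -> str:
    """Content hash of an in-memory dataset; identical data always
    yields the same ID. DVC computes this kind of fingerprint per
    file for large datasets; this is a small-scale illustration."""
    canonical = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

train = [{"customer_id": 1, "churned": 0}, {"customer_id": 2, "churned": 1}]

# Recorded alongside the model so any run is reproducible later.
run_metadata = {
    "git_commit": "abc1234",                  # hypothetical commit hash
    "data_version": dataset_fingerprint(train),
}
```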
Data lineage tracks the transformation chain from raw data to training features. When you discover a data quality issue, lineage tells you which models are affected.
Layer 2: Experiment Tracking and Model Registry
Experiment Tracking
Every training run should automatically log:
- Code version (git commit hash)
- Data version
- Hyperparameters
- Training metrics (loss, accuracy, F1, AUC)
- Evaluation metrics on holdout set
- Model artifacts
Without experiment tracking, ML research is a folder of notebooks named model_v3_final_FINAL.ipynb with no way to reproduce results or understand what changed between experiments.
Recommended: MLflow (open-source, Azure ML native integration), Weights & Biases (managed, excellent UX), Comet ML
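In MLflow this is a handful of logging calls; what those calls produce boils down to a record like the following (a hand-rolled sketch of the concept, not MLflow's actual schema -- all values are illustrative):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ExperimentRun:
    """Minimal run record; tools like MLflow persist this automatically."""
    run_id: str
    git_commit: str
    data_version: str
    hyperparameters: dict
    metrics: dict = field(default_factory=dict)

    def log_metric(self, name: str, value: float) -> None:
        self.metrics[name] = value

    def to_json(self) -> str:
        # A serialised record per run is what makes results reproducible
        # and comparable across experiments.
        return json.dumps(asdict(self), sort_keys=True)

run = ExperimentRun(
    run_id="run-042",
    git_commit="abc1234",          # hypothetical values throughout
    data_version="9f2c1a7b3d44",
    hyperparameters={"learning_rate": 0.01, "max_depth": 6},
)
run.log_metric("auc", 0.87)
```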
Model Registry
The model registry is the handoff point between data science (experimentation) and engineering (production serving). It maintains:
- All trained model versions with their associated metrics and lineage
- Model stage (staging, production, archived)
- Approval workflow before promotion to production
- Deployment history
The critical governance rule: no model deploys to production without passing through the model registry with a documented approval. This creates auditability and prevents "just deploy the notebook" shortcuts.
Azure ML Model Registry is the native option on Azure; MLflow Model Registry works well in any environment.
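The governance rule above is easy to encode. A toy registry that refuses promotion without a documented approval might look like this (real registries implement the same gate with richer metadata and audit trails; names here are illustrative):

```python
class ModelRegistry:
    """Toy registry enforcing: no promotion to production without
    a recorded approval."""

    def __init__(self) -> None:
        self._stages: dict[str, str] = {}      # model version -> stage
        self._approvals: dict[str, str] = {}   # model version -> approver

    def register(self, version: str) -> None:
        self._stages[version] = "staging"

    def approve(self, version: str, approver: str) -> None:
        self._approvals[version] = approver

    def promote(self, version: str) -> None:
        # The gate: promotion fails unless someone signed off.
        if version not in self._approvals:
            raise PermissionError(f"{version}: no documented approval")
        self._stages[version] = "production"

    def stage(self, version: str) -> str:
        return self._stages[version]

registry = ModelRegistry()
registry.register("churn-v7")               # hypothetical model version
registry.approve("churn-v7", "ml-lead")     # documented sign-off
registry.promote("churn-v7")
```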
Layer 3: Model Serving
The serving architecture determines your inference latency, throughput, scalability, and cost. Choose based on your use case:
Real-time (online) inference — A synchronous HTTP/gRPC API that returns predictions within milliseconds. Required for customer-facing use cases where predictions are needed in the request path. Deploy on Azure ML Managed Endpoints, Azure Kubernetes Service, or Azure Container Apps.
Near-real-time inference — Asynchronous inference triggered by events (message queue, event stream). Latency in seconds rather than milliseconds, better for throughput-intensive workloads. Deploy on Azure ML Batch Endpoints triggered by Azure Service Bus messages.
Batch inference — Scheduled batch jobs that process large volumes of data offline. Predictions are stored and later used (recommendation precomputation, risk scoring, churn prediction). Cost-effective, as compute can use Spot instances.
Canary Deployments for Models
Model deployments are higher-risk than typical application deployments because a poorly performing model can silently deliver bad predictions at scale. Implement canary deployments:
- Route 5% of traffic to the new model version
- Monitor prediction quality (model metrics, business metrics) for 24–48 hours
- If metrics hold or improve, ramp to 25%, then 100%
- If metrics degrade, roll back to the previous version instantly
Azure ML Managed Endpoints support traffic splitting natively.
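The routing step of a canary can be sketched in a few lines. The point of hashing the request ID (rather than rolling a die per request) is that the same caller consistently hits the same model version, which keeps experiments clean. This is an illustration of the technique, not how any particular platform implements it:

```python
import hashlib

def route_request(request_id: str, canary_fraction: float) -> str:
    """Deterministic traffic split: map the request ID to a stable
    bucket in [0, 1) and send the bottom `canary_fraction` of the
    space to the canary version."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "canary" if bucket < canary_fraction else "stable"

# Ramp by changing one number: 0.05 -> 0.25 -> 1.0,
# or roll back instantly with 0.0.
version = route_request("req-81f3", 0.05)
```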
Layer 4: Monitoring and Feedback
Deploying a model is not the end of the work — it's the beginning of a monitoring obligation. Models degrade. The data they see in production drifts from the data they were trained on. The world changes and the model doesn't.
Data Drift
Data drift occurs when the statistical distribution of input features in production diverges from the training distribution. A fraud detection model trained on 2022 transaction patterns may perform poorly against 2026 transaction patterns.
Detection approach: Continuously monitor the statistical distribution of input features using tests like Population Stability Index (PSI) or KL Divergence. Alert when drift exceeds a threshold.
Tools: Azure ML Data Drift monitoring, Evidently AI (open-source), Arize AI, Whylogs.
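PSI itself is a short computation over binned feature distributions. A minimal sketch (the stability thresholds in the docstring are the commonly quoted rule of thumb, not a universal standard):

```python
import math

def population_stability_index(expected: list[float],
                               actual: list[float]) -> float:
    """PSI between two binned distributions, each given as bin
    proportions summing to 1. Commonly quoted rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.
    Empty bins get a small floor to keep the log defined."""
    eps = 1e-6
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi

# Identical distributions score 0; a shifted one scores higher.
baseline = [0.25, 0.25, 0.25, 0.25]   # training-time bin proportions
drifted  = [0.10, 0.20, 0.30, 0.40]   # production bin proportions
```

Alerting then reduces to comparing the score per feature against your threshold on a schedule.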
Concept Drift
Concept drift occurs when the relationship between inputs and the correct output changes — the model was correct but the world changed. A product recommendation model trained before a major market shift may recommend the wrong products, even with unchanged input data.
Detection approach: Monitor ground truth labels against model predictions over time. If you have delayed labels (e.g., churn prediction validated 30 days later), implement a pipeline to retrieve ground truth and compute model performance on historical predictions.
Retraining Triggers
Define explicit triggers for model retraining:
- Scheduled: Retrain weekly/monthly regardless of drift
- Drift-triggered: Retrain when data drift exceeds threshold
- Performance-triggered: Retrain when model performance metric drops below threshold
- Event-triggered: Retrain when a major real-world event occurs (market shift, product launch)
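The first three triggers above combine naturally into a single policy check run on a schedule; the event trigger stays a human decision. A sketch with illustrative thresholds (none of these numbers are recommendations):

```python
from datetime import datetime, timedelta

def should_retrain(
    last_trained: datetime,
    now: datetime,
    drift_score: float,
    current_auc: float,
    *,
    max_age: timedelta = timedelta(days=30),   # scheduled trigger
    drift_threshold: float = 0.25,             # drift trigger (e.g. PSI)
    min_auc: float = 0.80,                     # performance trigger
) -> list[str]:
    """Return the list of retraining triggers that fired."""
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("scheduled")
    if drift_score > drift_threshold:
        reasons.append("drift")
    if current_auc < min_auc:
        reasons.append("performance")
    return reasons
```

A non-empty result kicks off the retraining pipeline; the human approval gate then decides whether the retrained model actually ships.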
Automated retraining pipelines that trigger, train, validate, and deploy (with human approval gate) are the gold standard for mature AI platforms.
Make vs. Buy: Platform Tooling Decisions
| Capability | Build | Buy/Managed |
|---|---|---|
| Feature store | Feast + Redis + Delta Lake | Tecton, Hopsworks |
| Experiment tracking | MLflow self-hosted | W&B, Comet, Azure ML |
| Model serving | KServe on AKS | Azure ML Endpoints |
| Monitoring | Custom + Evidently | Arize, Aporia, Fiddler |
| Orchestration | Airflow on AKS | Azure Data Factory, Prefect Cloud |
The right answer depends on your team's capacity, cloud commitment, and compliance requirements. For most organisations, Azure ML provides a well-integrated managed platform that covers experiment tracking, model registry, and serving — reducing the operational overhead of assembling open-source components.
Start With the Minimum Viable Platform
Don't try to build the full platform before shipping your first model. Start with: (1) MLflow experiment tracking, (2) a model registry with a promotion process, and (3) a serving endpoint with basic monitoring. Ship your first model. Then iterate on the platform based on what's actually painful.
AI and ML platform development is one of my core service areas. If you're building your first production AI system or trying to industrialise an existing experimentation capability, let's talk.