
MLOps: Why Your AI Project Fails in Production (And How to Fix It)

87% of AI projects never reach production. Of those that do, most degrade silently within months. MLOps is the discipline that changes these odds. Here's the framework.

Mohamed Ghassen Brahim
February 18, 2026 · 11 min read

Gartner estimates that 87% of AI and ML projects never make it to production. Of the ones that do, the majority degrade in performance within 6–12 months as the world they model changes. The data team celebrates the model's launch. The engineering team struggles to maintain it. Users eventually notice it's giving worse results. Slowly, quietly, the AI feature becomes a liability.

This is not primarily a modelling problem. It's an operations problem.

MLOps — Machine Learning Operations — is the discipline of applying software engineering and operations principles to the machine learning lifecycle. It's the difference between an AI experiment and an AI product.

  • 87%: ML projects that never reach production (Gartner, 2023)
  • 6–12 months: average time to model degradation without drift monitoring
  • ~40%: teams with model monitoring, even in mature organisations
  • 4x: faster time to production with a mature MLOps platform

Why AI Fails in Production

The failure modes cluster into five patterns:

1. Training-Serving Skew

The model was trained on data processed one way. In production, the same data is processed slightly differently — different preprocessing library version, different handling of missing values, different categorical encoding. The model receives slightly different inputs than it trained on, and performance degrades in ways that are difficult to attribute.

This is the most common and most insidious failure mode. A 2% difference in feature computation compounds across thousands of features and millions of predictions.

The fix: A feature store that serves identical feature computations at training time and serving time. Same code, same library versions, same business logic.
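Even without a full feature store product, the core idea fits in a few lines: one feature function, imported by both the training pipeline and the serving endpoint, so preprocessing cannot silently diverge. A minimal sketch (the `Transaction` and `compute_features` names are illustrative, not from any real codebase):

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Transaction:
    amount: float
    merchant_category: Optional[str]

CATEGORY_CODES = {"grocery": 0, "travel": 1, "electronics": 2}
UNKNOWN_CODE = -1  # single agreed encoding for missing/unseen categories

def compute_features(tx: Transaction) -> dict:
    """Single source of truth for feature logic.

    Both the training batch job and the serving API import this function,
    versioned together with the model, so the same library versions and
    the same missing-value handling apply at train and serve time.
    """
    return {
        "log_amount": math.log(tx.amount) if tx.amount > 0 else 0.0,
        "category_code": CATEGORY_CODES.get(tx.merchant_category or "", UNKNOWN_CODE),
    }
```

The point is not the two features; it is that there is exactly one place where "how is a missing category encoded?" is answered.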

2. Data Drift

The statistical distribution of production data shifts away from the training distribution. A fraud detection model trained on 2023 transaction patterns encounters 2025 fraud patterns it's never seen. A product recommendation model trained in summer receives winter queries.

The fix: Continuous monitoring of input feature distributions against training baseline. Alert when drift exceeds thresholds. Automate retraining triggers.
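One widely used drift statistic is the Population Stability Index (PSI) over binned feature values. A pure-Python sketch, assuming both distributions use identical bins; the alert thresholds in the docstring are a common rule of thumb, not a standard:

```python
import math

def psi(baseline_counts, production_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Assumes both count lists use the same bin edges. Often-quoted rule
    of thumb: PSI < 0.1 stable, 0.1-0.25 moderate drift, > 0.25
    significant drift worth an alert (and possibly a retraining trigger).
    """
    b_total = sum(baseline_counts)
    p_total = sum(production_counts)
    score = 0.0
    for b, p in zip(baseline_counts, production_counts):
        # clamp fractions so empty bins don't produce log(0)
        b_frac = max(b / b_total, eps)
        p_frac = max(p / p_total, eps)
        score += (p_frac - b_frac) * math.log(p_frac / b_frac)
    return score
```

In practice a monitoring job would compute this per feature against the stored training-time histogram and page when any feature crosses the threshold.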

3. Concept Drift

The relationship between inputs and correct outputs changes — the real world shifts, but the model doesn't. A customer churn model might have learned that customers who reduce login frequency are high-churn risk. After a UX redesign that reduces login frequency across all users, this signal is meaningless.

The fix: Continuous monitoring of model performance against ground truth labels. Requires a pipeline to retrieve delayed ground truth and compute model accuracy/precision/recall against historical predictions.

4. No Reproducibility

When something goes wrong, you can't reproduce what happened. The training run that produced the production model was in a notebook, the data version wasn't recorded, the hyperparameters weren't logged, and the engineer who ran it has since left.

The fix: Experiment tracking (MLflow, Weights & Biases) that automatically logs code version, data version, hyperparameters, and all metrics. Reproducibility is a system property, not a discipline.
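To make "system property" concrete, here is roughly what a tracking tool records on every run. MLflow and W&B capture these fields automatically; the `log_run` function below is purely illustrative:

```python
import hashlib
import json
import subprocess
import time

def log_run(params: dict, metrics: dict, data_path: str, out_file: str) -> dict:
    """Append a minimal, reproducible experiment record to a JSONL file.

    Captures the four things you need to rerun a training job later:
    code version, data version (content hash), hyperparameters, metrics.
    """
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    try:
        code_version = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL).strip()
    except Exception:
        code_version = "unknown"  # e.g. not running inside a git checkout
    record = {
        "timestamp": time.time(),
        "code_version": code_version,
        "data_sha256": data_hash,
        "params": params,
        "metrics": metrics,
    }
    with open(out_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

If every training run goes through a wrapper like this (or a real tracking server), "which data produced the production model?" stops depending on who remembers.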

5. Manual, Undocumented Deployment

The model was deployed by a data scientist running a script from their laptop. The deployment process isn't documented. Nobody else can deploy it. When it needs to be updated, you need that specific person available.

The fix: CI/CD for ML models. Every model deployment goes through a pipeline: validate, test, register, deploy. The deployment process is code, reviewed like any other code.
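The "validate" stage of such a pipeline is itself just reviewed code. A hypothetical gate function might look like this; the metric names and thresholds are assumptions to be tuned per model:

```python
def validate_model(candidate_metrics: dict, production_metrics: dict,
                   primary: str = "auc",
                   guardrails: tuple = ("precision", "recall"),
                   min_gain: float = 0.001,
                   max_guardrail_drop: float = 0.02):
    """CI deployment gate: promote only if the candidate improves the
    primary holdout metric and no guardrail metric regresses too far.

    Returns (ok, reason) so the pipeline can log why a model was blocked.
    """
    if candidate_metrics[primary] < production_metrics[primary] + min_gain:
        return False, f"{primary} did not improve by at least {min_gain}"
    for g in guardrails:
        if candidate_metrics[g] < production_metrics[g] - max_guardrail_drop:
            return False, f"guardrail regression on {g}"
    return True, "ok"
```

Because the gate is code, changing the promotion criteria is a pull request with a reviewer, not a judgment call made at deploy time.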


The MLOps Maturity Model

MLOps capability develops in three stages. Understanding where you are helps you know what to invest in next.

Level 0 — Manual Process: ad hoc, no automation

Most teams start here. Models are developed in notebooks, trained manually, deployed by a person running a script. No experiment tracking, no model registry, no monitoring. Every deployment is a heroic effort.

  • Notebooks as the primary development environment
  • Manual data preprocessing with no versioning
  • Model deployment is a script or manual upload
  • No monitoring post-deployment
  • Retraining requires a data scientist to manually run a job
Level 1 — Pipeline Automation: automated training, manual deployment

Training is automated in reproducible pipelines. Models are registered and versioned. Data is versioned. Deployment is still semi-manual but goes through a registry. Basic performance monitoring is in place.

  • Automated training pipeline triggered on schedule or data refresh
  • Experiment tracking for all training runs
  • Model registry with promotion workflow
  • Automated model validation before registration
  • Basic serving infrastructure (no manual laptop deployments)
Level 2 — CI/CD for ML: fully automated end-to-end

The gold standard. A code change triggers automated retraining, evaluation, and deployment (with human approval gates for production). Drift monitoring triggers automated retraining. The system maintains itself.

  • Code change triggers full CI/CD pipeline: train, evaluate, deploy
  • Continuous monitoring with automated retraining triggers
  • A/B testing and canary deployments as standard practice
  • Shadow mode for new models before full rollout
  • Feature store shared between training and serving

Most organisations should target Level 1 within their first 6 months and Level 2 within 12–18 months of serious ML investment.


The MLOps Stack

MLOps Platform Components
📊
Data & Features
  • Feature store
  • Data versioning (DVC)
  • Data validation (Great Expectations)
  • ETL orchestration
🔬
Experimentation
  • MLflow / W&B
  • Model registry
  • Hyperparameter optimisation
  • GPU compute management
🚀
Deployment
  • Azure ML Endpoints
  • CI/CD pipeline
  • Container registry
  • Traffic splitting / canary
📈
Monitoring
  • Evidently / Arize AI
  • Data drift detection
  • Model performance tracking
  • Retraining triggers

Implementing CI/CD for ML on Azure

Azure ML provides native support for MLOps pipelines. Here's the reference architecture:

GitHub Actions workflow:

  1. Trigger: Push to main (code change) or scheduled (weekly retraining)
  2. Data validation: Run data quality checks using Great Expectations; fail if data quality is below threshold
  3. Training pipeline: Submit Azure ML Pipeline job (parameterised training run with DVC data version)
  4. Evaluation: Compare new model metrics against production model metrics from Model Registry
  5. Registration gate: If new model outperforms current production model on holdout metrics, register as Staging candidate
  6. Approval: Required human approval before promoting Staging model to Production
  7. Deployment: Update Azure ML Managed Endpoint to route traffic to new model version (canary or direct)
  8. Smoke test: Run automated smoke tests against the new endpoint before full traffic cutover
# .github/workflows/ml-pipeline.yml
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1'  # Weekly retraining trigger

jobs:
  train-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Validate data quality
        run: python src/validate_data.py
      - name: Submit training job
        run: az ml job create --file training_job.yml
      - name: Evaluate model
        run: python src/evaluate_model.py --compare-production
      - name: Register model if improved
        run: python src/register_model.py --stage staging
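The smoke-test step (step 8) reduces to a small script run before traffic cutover. In the sketch below, `predict` stands in for the REST call to the new endpoint, and the response schema (a dict with a `prediction` probability) is an assumption:

```python
import time

def smoke_test(predict, sample_inputs, max_latency_s: float = 1.0):
    """Pre-cutover smoke test: a handful of known-good inputs must return
    well-formed predictions within a latency budget.

    `predict` is whatever callable wraps the new endpoint; here it takes
    one input dict and returns {"prediction": <probability>}.
    """
    for x in sample_inputs:
        start = time.perf_counter()
        y = predict(x)
        latency = time.perf_counter() - start
        if latency > max_latency_s:
            return False, f"latency {latency:.3f}s over budget"
        if not isinstance(y, dict) or "prediction" not in y:
            return False, "malformed response"
        if not 0.0 <= y["prediction"] <= 1.0:
            return False, f"prediction out of range: {y['prediction']}"
    return True, "ok"
```

A failed smoke test should fail the workflow and leave traffic on the previous model version.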

Model Monitoring in Practice

What to Monitor

Infrastructure metrics (same as any service):

  • Request latency (P50, P95, P99)
  • Throughput (requests per second)
  • Error rate (failed predictions)
  • Resource utilisation (CPU, memory, GPU if applicable)

Model-specific metrics:

  • Prediction distribution (are output probabilities shifting?)
  • Input feature distributions (data drift vs. training baseline)
  • Model performance metrics — if ground truth is available, compute accuracy/precision/recall on recent predictions
  • Business metrics (the downstream KPI the model is supposed to move)

The Ground Truth Problem

The biggest challenge in model monitoring is that ground truth (the correct answer) is often delayed or unavailable. A churn prediction model knows which customers it predicted would churn, but won't know if it was right for 30–90 days.

Design for delayed ground truth retrieval from day one: when you make a prediction, log it with the input features, prediction output, and confidence score. Create a scheduled job that retrieves ground truth labels after the known delay and computes accuracy metrics against stored predictions.
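The two halves of that design can be sketched as follows. A plain list stands in for the real prediction table or event stream, and the 0.5 decision threshold is an assumption:

```python
import time

def log_prediction(store: list, request_id: str, features: dict,
                   prediction: float, confidence: float) -> None:
    """At serving time: persist every prediction with its inputs, so it
    can be scored once the delayed label arrives."""
    store.append({"request_id": request_id, "ts": time.time(),
                  "features": features, "prediction": prediction,
                  "confidence": confidence, "label": None})

def join_ground_truth(store: list, labels: dict):
    """Scheduled job: attach labels that arrived after the 30-90 day
    delay, then compute accuracy over the now-labelled predictions.
    Returns None if no predictions have been labelled yet."""
    for row in store:
        if row["label"] is None and row["request_id"] in labels:
            row["label"] = labels[row["request_id"]]
    scored = [r for r in store if r["label"] is not None]
    if not scored:
        return None
    correct = sum(1 for r in scored if (r["prediction"] >= 0.5) == r["label"])
    return correct / len(scored)
```

The same joined table also feeds precision/recall and the business-metric dashboards described above.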

⚠️

Don't monitor only infrastructure metrics

The most common MLOps monitoring mistake: teams instrument infrastructure metrics (latency, throughput, errors) but not model quality metrics. A model can be returning predictions with 100% uptime and zero errors while its accuracy is 40% worse than when it launched. Without monitoring model quality, you won't know until a user complains.


The Organisational Challenge

MLOps requires collaboration between roles that often have different priorities: data scientists who want to move fast and experiment freely, engineers who want reproducibility and operational stability, and compliance teams who want audit trails and explainability.

The most successful MLOps implementations I've seen have:

  1. Shared ownership — data scientists are responsible for the operational health of their models, not just their accuracy
  2. Platform team — a small team (1–3 engineers) that owns the MLOps platform and reduces friction for data scientists
  3. Defined handoff process — a model is "ready for production" when it meets defined criteria, not when a data scientist declares it done
  4. Documented incident process — model degradation is an incident, treated with the same severity as a service outage

AI platform development and MLOps architecture are core to my practice. If you're trying to bridge the gap between your data science experiments and production-grade AI systems, let's talk.

Ready to put this into practice?

I help companies implement the strategies discussed here. Book a free 30-minute discovery call.

Schedule a Free Call