⚡Model Drift & Monitoring — Catching AI Failures Before Users Do (EP6)
MLOps Series — How Real Companies Watch Their Models in Production
🎯 Why This Episode Matters
In software, when something breaks… it usually breaks loudly.
In machine learning, models often fail quietly:
No errors
No exceptions
No crashes
Just silently wrong predictions.
The model you deployed last month was great.
Today, users are typing new slang, new patterns, new behaviors…
and your model has no idea what they’re talking about.
That’s data drift and concept drift.
If you’re not monitoring it, your “AI system” slowly becomes useless while dashboards stay green.
Episode 6 is all about the missing layer:
👉 Monitoring & Drift Detection for ML models in production.
We’ll build a real setup that:
tracks requests and predictions
detects out-of-distribution inputs (slang / new patterns)
shows live metrics in Prometheus
visualizes drift in Grafana dashboards
This is how real companies keep ML systems trustworthy after deployment.
📌 What We Build in Episode 6
Our repo now has a proper monitoring stack:
mlops_ep6_monitoring/
  artifacts_prod/           # Production model
    pipeline.pkl
  model/
    train_good_model.py     # Train stable production model
    api_monitoring.py       # FastAPI with metrics + drift detection
  data/
    data.csv                # Sample training/serving data
  prometheus/
    prometheus.yml          # Scrape config for FastAPI metrics
  grafana/
    provisioning/           # Auto-configure Prometheus datasource
  Dockerfile                # Build monitored API image
  docker-compose.yml        # API + Prometheus + Grafana stack
  requirements.txt
In this episode, we:
train a GOOD production model
wrap it in a FastAPI microservice
instrument it with Prometheus metrics
visualize everything in Grafana
simulate drift using unseen slang and new patterns
By the end, you’ll have a production-style ML monitoring setup running on your machine.
🟢 Training the Stable Production Model
We start with a clean, reliable text-classification pipeline.
The script:
python model/train_good_model.py
trains a model and saves it to:
artifacts_prod/pipeline.pkl
This is our trusted production model.
The FastAPI service always loads this file
Prometheus + Grafana observe everything it does
Any future model must beat or at least match this one
Think of it as:
🧠 “the brain currently running in production.”
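Under the hood this can be as simple as a scikit-learn pipeline. Here's a minimal sketch of train_good_model.py, assuming a TF-IDF + Logistic Regression classifier and text/label columns in data/data.csv (the actual model and column names may differ):

# train_good_model.py — minimal sketch (assumed sklearn setup; adapt to your data)
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

df = pd.read_csv("data/data.csv")  # assumed columns: text, label

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(df["text"], df["label"])

# save the trusted production artifact the API will load
joblib.dump(pipeline, "artifacts_prod/pipeline.pkl")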
📊 Turning FastAPI into a Monitored ML Microservice
Next, we upgrade our API into a fully observable ML service.
api_monitoring.py exposes:
/predict — for real predictions
/metrics — for Prometheus scraping
Inside, we track:
Total requests (how much traffic your model receives)
Predictions per class (are we suddenly predicting one class 90% of the time?)
Input text length histograms (user behavior changing?)
Model version as a metric (which model is live)
Out-of-Distribution (OOD) inputs based on slang / unseen patterns
Example OOD idea:
def is_out_of_distribution(text: str) -> bool:
    slang = ["scene out", "wifi kaput", "5g gone", "rip net"]
    # flag any input containing a phrase the model never saw in training
    return any(phrase in text.lower() for phrase in slang)
It’s intentionally simple — but it mimics how real teams add signals for new behavior.
This is not just “serving a model”.
This is instrumenting a model.
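Here's a condensed sketch of that instrumentation, assuming the prometheus_client library and the is_out_of_distribution helper above (the real api_monitoring.py also tracks per-class counters and a model-version metric). Note that prometheus_client appends _total to Counter names automatically:

# api_monitoring.py — condensed sketch of the instrumented service
import joblib
from fastapi import FastAPI
from pydantic import BaseModel
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
pipeline = joblib.load("artifacts_prod/pipeline.pkl")

REQUESTS = Counter("ml_requests", "Total prediction requests")  # exposed as ml_requests_total
IN_DIST = Counter("ml_input_in_distribution", "In-distribution inputs")
OOD = Counter("ml_input_out_of_distribution", "Out-of-distribution inputs")
TEXT_LEN = Histogram("ml_text_length", "Input text length", buckets=(10, 25, 50, 100, 250))

app.mount("/metrics", make_asgi_app())  # the endpoint Prometheus scrapes

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    REQUESTS.inc()
    TEXT_LEN.observe(len(req.text))
    (OOD if is_out_of_distribution(req.text) else IN_DIST).inc()
    return {"prediction": pipeline.predict([req.text])[0]}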
🐳 Running the Full Stack: API + Prometheus + Grafana
We don’t run services manually one by one.
We run them like a real platform would:
docker compose up --build
This spins up:
FastAPI (monitored ML microservice)
Prometheus (metrics database + query engine)
Grafana (dashboards)
One command → complete MLOps monitoring environment.
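A minimal sketch of the two config files that wire this together — the service names, ports, and paths here are assumptions matching the repo layout above:

# docker-compose.yml — sketch: three services on one network
services:
  api:
    build: .
    ports:
      - "8000:8000"
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"

# prometheus/prometheus.yml — sketch: scrape the API's /metrics frequently
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: ml-api
    static_configs:
      - targets: ["api:8000"]

Because Compose puts all three services on the same network, Prometheus reaches the API by its service name (api) instead of localhost.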
📡 Prometheus: Watching Metrics in Real Time
Open:
http://localhost:9090
Query metrics like:
ml_requests_total — overall traffic
ml_pred_network_total, ml_pred_billing_total, etc. — predictions per class
ml_input_in_distribution_total
ml_input_out_of_distribution_total
ml_text_length_bucket
Then:
Send normal inputs to the API
Send slang / weird inputs
Refresh your queries
You’ll see the OOD counters jump.
That’s live drift detection.
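For example, here's a tiny traffic driver — assuming the /predict payload shape from the sketch above:

# send_traffic.py — sketch: one normal input, one slangy input
import requests

for text in ["my internet is not working since morning", "wifi kaput, rip net"]:
    r = requests.post("http://localhost:8000/predict", json={"text": text})
    print(text, "->", r.json())

After a few runs, a query like rate(ml_input_out_of_distribution_total[5m]) / rate(ml_requests_total[5m]) shows the OOD share of your traffic climbing.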
📈 Grafana: Visualizing Drift & Behavior
Next, open Grafana:
http://localhost:3000
login: admin
We auto-provision Prometheus as a datasource, so you can start creating dashboards immediately.
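That auto-provisioning is one small file. Here's a sketch using Grafana's standard datasource provisioning format, assuming the prometheus service name from the compose file:

# grafana/provisioning/datasources/prometheus.yml — sketch
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true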
Typical panels we build:
Out-of-Distribution Rate
Prediction Distribution
Request Traffic
Input Length Behavior
Model Version
With just a few panels, you can answer:
“Are users behaving differently than last week?”
“Did predictions shift heavily toward one class?”
“Are we getting more OOD traffic?”
“Which model version is currently live?”
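Each of those questions maps to a short PromQL query — sketches assuming the metric names from earlier:

# OOD rate: share of out-of-distribution traffic
rate(ml_input_out_of_distribution_total[5m]) / rate(ml_requests_total[5m])

# Prediction distribution: is one class taking over?
rate(ml_pred_network_total[5m])

# Request traffic
rate(ml_requests_total[5m])

# Input length behavior (95th percentile of text length)
histogram_quantile(0.95, rate(ml_text_length_bucket[5m]))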
This is real observability for ML, not just logging.
🧠 What EP6 Teaches You
Key idea:
CI/CD protects deployments.
Monitoring protects everything after deployment.
Episode 6 gives you:
the difference between “serving a model” and monitoring a model
how to expose ML metrics from FastAPI
how to design OOD / drift signals
how to connect FastAPI → Prometheus → Grafana
how to build a drift dashboard in under 20 minutes
how real teams notice model failures before customers do
If you want to call yourself an MLOps Engineer, this is a core part of the skill set.
🚀 Coming Up in Episode 7
Episode 7 connects all the pieces:
monitoring detects drift
drift triggers retraining
CI/CD evaluates & auto-rejects bad models
only better models get promoted
End goal:
👉 A self-updating ML system that:
watches itself
retrains when needed
tests new models
auto-promotes only when safe
This is what “real-world MLOps” looks like.
🔗 Full Video + Code Access
🎥 Watch Episode 6:
https://youtu.be/GQj0S2bHc68
📬 Code + Labs + Exercises:
https://learnwithdevopsengineer.beehiiv.com/subscribe
Subscribers get:
full FastAPI + Prometheus + Grafana code
monitoring & drift detection labs
CI/CD + governance examples from EP5
“real incident” simulation scripts
interview questions for MLOps & DevOps roles
all episode bundles in one place
💼 Need DevOps or MLOps Help?
If you’re building:
CI/CD pipelines for ML or microservices
Docker + Jenkins / GitHub Actions setups
MLflow / experiment tracking
FastAPI model deployments
monitoring + alerting (Prometheus / Grafana)
Kubernetes or scalable infra for ML
cost-optimized cloud environments
You can reach out and work with me directly.
Reply to this email or message me on YouTube / Instagram.
— Arbaz
📺 YouTube: Learn with DevOps Engineer
📬 Newsletter: learnwithdevopsengineer.beehiiv.com/subscribe
📸 Instagram: instagram.com/learnwithdevopsengineer