This is one of the most common transitions in ML engineering:
- a model works in Jupyter
- the notebook produces promising metrics
- the team now needs to make it real
That is where many projects slow down.
The notebook was useful for exploration. It was never designed to be a production artifact. Cells run out of order. Dependencies are implicit. Input data lives in local paths. Model logic, data cleanup, evaluation, and visualization all sit in one document.
If you want to move from notebook to production ML, you need a path that is narrower and more practical than “build a full ML platform first.”
This guide is that path.
We will walk through the most common sequence for productionizing Jupyter notebook workflows:
- separate notebook logic into reusable code
- lock dependencies and package the model
- add tests
- containerize the service
- create CI/CD
- deploy
- monitor in production
The examples are intentionally simple, because most teams do not fail at this stage due to missing sophistication. They fail because the gap between notebook work and operational work is wider than expected.
What Usually Lives Inside the Notebook
A typical notebook doing fraud scoring, churn prediction, or recommendations often contains all of this at once:
- data loading
- cleaning and feature engineering
- train/validation split
- model training
- ad hoc evaluation
- plots
- a quick `predict()` example
That is normal for exploration. It becomes a problem when people try to deploy the notebook logic directly.
The production goal is not to turn the notebook itself into a service. The goal is to extract the useful parts and put them into a structure the rest of the system can trust.
Step 1: Split Exploration From Production Logic
The first move is structural, not infrastructural.
Take the notebook and identify four kinds of code:
- exploratory analysis
- feature preparation
- training logic
- inference logic
Only the last three should survive into production code.
A minimal project layout might look like this:
```
project/
  notebooks/
    fraud_model_exploration.ipynb
  src/
    features.py
    train.py
    predict.py
    service.py
  models/
  tests/
    test_features.py
    test_predict.py
  requirements.txt
  Dockerfile
```
For example, notebook code like this:
```python
df["amount_log"] = np.log1p(df["amount"])
df["is_high_risk_country"] = df["country"].isin(HIGH_RISK).astype(int)

model = xgb.XGBClassifier(max_depth=6, n_estimators=200)
model.fit(X_train, y_train)
score = model.predict_proba(X_test)[:, 1]
```
should move into functions with explicit inputs and outputs:
```python
# src/features.py
import numpy as np

HIGH_RISK = {"RU", "KP", "IR"}

def build_features(df):
    frame = df.copy()
    frame["amount_log"] = np.log1p(frame["amount"])
    frame["is_high_risk_country"] = frame["country"].isin(HIGH_RISK).astype(int)
    return frame
```

```python
# src/train.py
import joblib
import xgboost as xgb

from src.features import build_features

def train_model(train_df, feature_columns, output_path):
    prepared = build_features(train_df)
    X = prepared[feature_columns]
    y = prepared["label"]
    model = xgb.XGBClassifier(max_depth=6, n_estimators=200)
    model.fit(X, y)
    joblib.dump(model, output_path)
```
This does two things immediately:
- you remove cell-order dependency
- you make training and inference code callable from tests, CI, and services
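As a quick check of that claim, the extracted function can be driven from any plain script with no notebook session at all. The sketch below inlines a copy of `build_features` from `src/features.py` above so it runs standalone; it also confirms the function does not mutate its input, thanks to the `.copy()`:

```python
# Self-contained sketch: build_features is copied from src/features.py above
# so this example runs without the project layout in place.
import numpy as np
import pandas as pd

HIGH_RISK = {"RU", "KP", "IR"}

def build_features(df):
    frame = df.copy()
    frame["amount_log"] = np.log1p(frame["amount"])
    frame["is_high_risk_country"] = frame["country"].isin(HIGH_RISK).astype(int)
    return frame

# Callable from a script, a test, or a service, in any order.
sample = pd.DataFrame([{"amount": 120.0, "country": "IR", "account_age_days": 45}])
result = build_features(sample)
print(int(result.loc[0, "is_high_risk_country"]))  # 1: IR is in HIGH_RISK
print("amount_log" in sample.columns)              # False: the input is untouched
```

The same call produces the same frame every time, which is exactly what cell-order-dependent notebook code cannot promise.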
That is the first real step in turning an ML notebook into a pipeline.
Step 2: Make Inference Reproducible
A lot of notebook projects assume inference is just:
- load model
- call `predict`
In reality, inference depends on exactly how features were built during training.
If feature engineering is implicit in the notebook, the service will drift from training behavior quickly.
Create a small inference module that owns:
- input validation
- feature building
- model loading
- prediction formatting
```python
# src/predict.py
import joblib
import pandas as pd

from src.features import build_features

MODEL_PATH = "models/fraud_model.joblib"
FEATURE_COLUMNS = ["amount_log", "is_high_risk_country", "account_age_days"]

model = joblib.load(MODEL_PATH)  # loaded once, at import time

def predict_one(payload: dict) -> dict:
    frame = pd.DataFrame([payload])
    prepared = build_features(frame)
    score = float(model.predict_proba(prepared[FEATURE_COLUMNS])[0, 1])
    label = "review" if score > 0.75 else "approve"
    return {"score": score, "decision": label}
```
Now the model behavior is available outside the notebook in a deterministic place.
That sounds basic. It is also the difference between “we have a notebook” and “we have an inference artifact.”
Step 3: Pin Dependencies and Package the Artifact
Notebook environments often survive on accidental reproducibility:
- a local Python version
- ad hoc `pip install` commands
- libraries upgraded sometime last week
That does not survive deployment.
At minimum, freeze the serving environment:
```
# requirements.txt
fastapi==0.116.0
uvicorn==0.35.0
pandas==2.3.0
numpy==2.2.5
scikit-learn==1.7.0
xgboost==3.0.1
joblib==1.5.0
pytest==8.4.0
```
Also treat the trained model file as a versioned artifact, not a random file on a laptop.
Good early-stage options include:
- object storage bucket with versioned paths
- model artifacts committed to a registry-like directory
- CI-generated artifact upload
The key is simple: the service should know exactly which model version it is loading.
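One lightweight way to make the loaded version explicit is to resolve the model path from an environment variable instead of hardcoding it. This is a sketch, not a prescribed convention; the `MODEL_VERSION` variable name and path layout are assumptions:

```python
# Sketch: resolve the model artifact path from a MODEL_VERSION env var,
# so deploy config decides the version and the service can log it.
# The env var name and filename pattern are illustrative assumptions.
import os

def resolve_model_path(base_dir: str = "models") -> str:
    version = os.environ.get("MODEL_VERSION", "latest")
    return f"{base_dir}/fraud_model-{version}.joblib"

os.environ.pop("MODEL_VERSION", None)
print(resolve_model_path())            # models/fraud_model-latest.joblib

os.environ["MODEL_VERSION"] = "2024-06-01"
print(resolve_model_path())            # models/fraud_model-2024-06-01.joblib
```

The same version string can then be attached to logs and metrics, which pays off in Step 9.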
Step 4: Add Tests Before You Add Kubernetes
Many teams skip straight from notebook cleanup to deployment work. That is backwards.
Before infrastructure, add tests around the behavior you are about to operationalize.
You do not need fifty tests on day one. You do need enough to stop obvious regressions.
Start with:
- feature-building tests
- shape and schema tests
- inference smoke tests
- edge case tests
Example:
```python
# tests/test_features.py
import pandas as pd

from src.features import build_features

def test_build_features_adds_expected_columns():
    df = pd.DataFrame([
        {"amount": 120.0, "country": "US", "account_age_days": 45}
    ])
    result = build_features(df)
    assert "amount_log" in result.columns
    assert "is_high_risk_country" in result.columns
    assert result.loc[0, "is_high_risk_country"] == 0
```

```python
# tests/test_predict.py
from src.predict import predict_one

def test_predict_one_returns_expected_shape():
    result = predict_one({
        "amount": 120.0,
        "country": "US",
        "account_age_days": 45,
    })
    assert "score" in result
    assert "decision" in result
    assert 0.0 <= result["score"] <= 1.0
```
These tests are not glamorous. They are what lets you refactor notebook code without breaking the only production path you have.
If the model depends on feature order, null handling, or type coercion, capture those assumptions in tests immediately.
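For instance, a null-handling test might pin down what happens when a country code is missing. The expected behavior below (a missing country counts as not high risk) is an assumption that you should confirm against how training data was actually handled; `build_features` is inlined from `src/features.py` above so the sketch runs standalone:

```python
# tests/test_features_nulls.py — sketch of an edge-case test.
# Assumption: a missing country is treated as not high risk; verify this
# matches the training-time behavior before relying on it.
import numpy as np
import pandas as pd

# Inline copy of build_features from src/features.py for a standalone example.
HIGH_RISK = {"RU", "KP", "IR"}

def build_features(df):
    frame = df.copy()
    frame["amount_log"] = np.log1p(frame["amount"])
    frame["is_high_risk_country"] = frame["country"].isin(HIGH_RISK).astype(int)
    return frame

def test_missing_country_is_not_high_risk():
    df = pd.DataFrame([{"amount": 50.0, "country": None, "account_age_days": 10}])
    result = build_features(df)
    assert result.loc[0, "is_high_risk_country"] == 0

test_missing_country_is_not_high_risk()
```

If the real answer is “a missing country should be rejected before prediction,” the test documents that decision just as well.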
Step 5: Wrap the Model in a Minimal Service
Once inference is reliable, expose it behind a narrow service interface.
For many teams, a small FastAPI service is enough:
```python
# src/service.py
from fastapi import FastAPI
from pydantic import BaseModel

from src.predict import predict_one

app = FastAPI()

class PredictionRequest(BaseModel):
    amount: float
    country: str
    account_age_days: int

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(request: PredictionRequest):
    return predict_one(request.model_dump())
```
This is where the notebook becomes an interface the rest of the company can use.
Notice what is not here:
- notebook-specific code
- plots
- training logic
- local file assumptions in the request path
That separation is what makes deployment and monitoring possible.
Step 6: Containerize It
Now you have something worth packaging.
A minimal Dockerfile:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src ./src
COPY models ./models

EXPOSE 8000
CMD ["uvicorn", "src.service:app", "--host", "0.0.0.0", "--port", "8000"]
```
This step matters because it makes the runtime portable and consistent between:
- developer laptop
- CI
- staging
- production
If the service only works on the notebook author’s machine, the project is not one step from production. It is still in exploration.
Step 7: Add CI Before Fancy CD
You do not need a giant deployment platform to start.
You do need automation that proves:
- the code installs
- tests pass
- the image builds
A lightweight GitHub Actions example:
```yaml
name: notebook-to-production

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test-and-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest tests -v
      - name: Build container
        run: docker build -t registry.example.com/fraud-model:${{ github.sha }} .
```
That alone moves the team from manual heroics to a repeatable build path.
Later you can add:
- model evaluation gates
- artifact publishing
- canary rollout logic
- deployment promotion
But early on, the biggest win is usually just removing “works on my notebook” from the release process.
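When you do get to model evaluation gates, they can be as small as one more pytest file that fails the build when a metric drops below a floor. The sketch below uses synthetic data so it runs anywhere; a real gate would load a held-out validation set, and the 0.9 AUC floor is an illustrative assumption, not a recommended threshold:

```python
# tests/test_model_quality.py — sketch of a CI evaluation gate.
# Synthetic data stands in for a real held-out validation set;
# the 0.9 AUC floor is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def test_auc_meets_floor():
    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 3))
    # Labels depend strongly on the first feature, so the model is learnable.
    y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

    model = LogisticRegression().fit(X[:400], y[:400])
    scores = model.predict_proba(X[400:])[:, 1]

    auc = roc_auc_score(y[400:], scores)
    assert auc >= 0.9, f"AUC regression: {auc:.3f} < 0.9"

test_auc_meets_floor()
```

Because it is just pytest, the existing `Run tests` step in the workflow above already enforces it.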
Step 8: Deploy the Smallest Thing That Can Be Operated
For many internal or early product workloads, the first production deployment does not need a complex stack.
A practical Kubernetes deployment might look like:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fraud-model-api
  template:
    metadata:
      labels:
        app: fraud-model-api
    spec:
      containers:
        - name: api
          image: registry.example.com/fraud-model:sha-12345
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "1"
              memory: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: fraud-model-api
spec:
  selector:
    app: fraud-model-api
  ports:
    - port: 80
      targetPort: 8000
```
A few important points:
- use readiness probes so broken pods do not receive traffic
- set memory limits so bad loads do not destabilize the node
- start with two replicas if the service matters
The lesson here is not “everyone needs Kubernetes.” The lesson is that deployment needs health checks, explicit resources, and reproducible images.
If a simpler platform-as-a-service is enough, use that. The production requirement is operational reliability, not architectural theater.
Before You Call It Production: Add a Release Checklist
One of the easiest ways to lose trust after moving off notebooks is to deploy a service that technically runs but is operationally unsafe.
Before each release, validate:
- the image starts cleanly
- the model loads successfully
- a sample prediction request works
- the model version is visible in logs or metrics
- rollback is possible without rebuilding by hand
A simple smoke check can already prevent a lot of bad deploys:
```bash
curl -X POST http://staging.example.com/predict \
  -H "Content-Type: application/json" \
  -d '{
    "amount": 120.0,
    "country": "US",
    "account_age_days": 45
  }'
```
You are looking for more than a 200 response.
You also want to confirm:
- output shape matches expectations
- latency is reasonable
- logs show the correct model version
- no unexpected warnings appear on startup
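Those checks can also be scripted so the smoke test does not depend on someone eyeballing curl output. Below is a sketch of a response validator you could call from a small script that posts the sample request; the field names follow the `/predict` example above, and the 500 ms latency budget is an illustrative assumption:

```python
# smoke_check.py — sketch of post-deploy response validation.
# Field names mirror the /predict example above; the 500 ms latency
# budget is an illustrative assumption, not a recommendation.
def check_prediction_response(body: dict, elapsed_seconds: float) -> list[str]:
    problems = []
    score = body.get("score")
    if not isinstance(score, float) or not (0.0 <= score <= 1.0):
        problems.append(f"score missing or out of range: {score!r}")
    if body.get("decision") not in {"approve", "review"}:
        problems.append(f"unexpected decision: {body.get('decision')!r}")
    if elapsed_seconds > 0.5:
        problems.append(f"latency too high: {elapsed_seconds:.3f}s")
    return problems

# A healthy response produces no problems.
print(check_prediction_response({"score": 0.12, "decision": "approve"}, 0.08))  # []
```

An empty problem list means the deploy passes this gate; anything else fails the release before real traffic sees it.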
At this stage, release discipline matters more than elegant deployment theory. A boring checklist is usually worth more than another architecture diagram.
Turn the Notebook Training Flow Into a Repeatable Job
Many teams successfully move inference into a service and still leave training as:
- open notebook
- rerun cells
- export artifact manually
That is a temporary compromise at best.
If retraining matters, make it callable from the command line so CI or a scheduler can run it later.
For example:
```python
# src/train.py
import argparse
from pathlib import Path

import joblib
import pandas as pd
import xgboost as xgb

from src.features import build_features

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    df = pd.read_parquet(args.input)
    prepared = build_features(df)
    feature_columns = ["amount_log", "is_high_risk_country", "account_age_days"]
    X = prepared[feature_columns]
    y = prepared["label"]

    # Same model family as the earlier training function, so the service
    # loads a consistent artifact.
    model = xgb.XGBClassifier(max_depth=6, n_estimators=200)
    model.fit(X, y)

    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, args.output)

if __name__ == "__main__":
    main()
```
Now your notebook can still exist for exploration, but the real training path becomes:
```bash
python -m src.train --input data/train.parquet --output models/fraud_model.joblib
```
That matters because production pipelines need a runnable training entrypoint, not a person remembering which notebook cells to execute and in what order.
Later, this same command can be triggered by:
- GitHub Actions
- Airflow
- Argo Workflows
- a scheduled job in your platform
This is how notebook experimentation starts becoming a pipeline instead of a ritual.
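All of those schedulers ultimately just run that same command. A thin wrapper like the sketch below is one way a cron job or an Airflow task could invoke it; the paths mirror the illustrative ones used above:

```python
# run_scheduled_training.py — sketch of a scheduler-friendly wrapper around
# the CLI entrypoint. The paths are the same illustrative ones used above.
import subprocess
import sys

def build_training_command(input_path: str, output_path: str) -> list[str]:
    return [
        sys.executable, "-m", "src.train",
        "--input", input_path,
        "--output", output_path,
    ]

if __name__ == "__main__":
    cmd = build_training_command("data/train.parquet", "models/fraud_model.joblib")
    # check=True turns a failed training run into a nonzero exit,
    # which is what schedulers use to mark the run as failed.
    subprocess.run(cmd, check=True)
```

The important property is that failure is visible to the scheduler, not swallowed inside a notebook session.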
Step 9: Add Model-Aware Monitoring
Many teams believe they are done once the service returns predictions. That is only the halfway point.
Once the model is live, you need visibility into:
- request volume
- error rate
- inference latency
- prediction distribution
- model version
Example Prometheus instrumentation:
from prometheus_client import Counter, Histogram
PREDICTION_REQUESTS = Counter(
"ml_prediction_requests_total",
"Total prediction requests",
["model_version", "status"]
)
PREDICTION_LATENCY = Histogram(
"ml_prediction_latency_seconds",
"Prediction latency in seconds",
["model_version"]
)
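Metrics like these only help once they are actually updated on the request path. One way to wire them in is sketched below; the metric definitions are repeated so the example is self-contained, and `MODEL_VERSION` plus the stand-in prediction are illustrative assumptions:

```python
# Sketch: updating Prometheus metrics on every prediction.
# MODEL_VERSION and the stand-in result are illustrative assumptions.
from prometheus_client import Counter, Histogram, generate_latest

PREDICTION_REQUESTS = Counter(
    "ml_prediction_requests_total",
    "Total prediction requests",
    ["model_version", "status"],
)
PREDICTION_LATENCY = Histogram(
    "ml_prediction_latency_seconds",
    "Prediction latency in seconds",
    ["model_version"],
)
MODEL_VERSION = "2024-06-01"

def observed_predict(payload: dict) -> dict:
    # .time() records the wall-clock duration of the block in the histogram.
    with PREDICTION_LATENCY.labels(model_version=MODEL_VERSION).time():
        try:
            result = {"score": 0.12, "decision": "approve"}  # stand-in for predict_one
            PREDICTION_REQUESTS.labels(model_version=MODEL_VERSION, status="ok").inc()
            return result
        except Exception:
            PREDICTION_REQUESTS.labels(model_version=MODEL_VERSION, status="error").inc()
            raise

observed_predict({"amount": 120.0})
print(b"ml_prediction_requests_total" in generate_latest())  # True
```

Exposing `generate_latest()` behind a `/metrics` endpoint (or using a FastAPI Prometheus middleware) makes these counters scrapeable.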
This is how you catch:
- silent regressions
- bad rollouts
- odd shifts in prediction output
- infrastructure problems that look like model problems
The first monitoring pass does not need to solve every quality problem. It does need to tell you whether the system is healthy and whether a deploy made things worse.
If you want one more practical starting point, log the model version and request ID on every prediction response path. During incidents, that single habit often saves more time than a more complicated observability stack that nobody wired up properly.
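That habit is cheap to implement. Here is a minimal sketch; the `model_version` response field, logger name, and version string are illustrative assumptions:

```python
# Sketch: stamp the serving model version onto every response and log line.
# In practice MODEL_VERSION would come from the artifact path or an env var;
# the names here are illustrative assumptions.
import logging

MODEL_VERSION = "fraud_model-2024-06-01"
logger = logging.getLogger("fraud-model-api")

def with_version(result: dict, request_id: str) -> dict:
    logger.info(
        "prediction request_id=%s model_version=%s", request_id, MODEL_VERSION
    )
    return {**result, "model_version": MODEL_VERSION}

print(with_version({"score": 0.12, "decision": "approve"}, "req-123")["model_version"])
```

During an incident, grepping logs for a request ID then tells you immediately which model produced the decision in question.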
Step 10: Keep the Notebook, But Demote It
This part is important organizationally.
You do not need to delete notebooks.
Keep them for:
- exploration
- analysis
- visual debugging
- new feature experiments
But demote them from “source of truth” to “research workspace.”
The production source of truth should now be:
- code in `src/`
- tests in `tests/`
- artifacts in versioned storage
- deployment config in the repo
- monitoring in the platform
That shift is what turns an ML effort from personal experimentation into team-operable software.
Common Mistakes in the Notebook-to-Production Transition
Shipping notebook code with hidden global state
Cells often depend on variables defined much earlier. Once refactored into modules, those assumptions break. Make every function explicit about its inputs.
Mixing training and inference code paths
Production services should not contain ad hoc training logic. Keep training separate from online inference.
Recreating feature logic by hand in the service
If the training notebook built features one way and the service rebuilds them differently, prediction quality will degrade even when everything “works.”
No tests around preprocessing
Most production failures are not exotic model failures. They are data-shape and preprocessing failures.
Deploying without model version visibility
If you cannot see which model version is serving traffic, rollback and incident analysis get harder immediately.
A Practical Maturity Path
If you are starting from a notebook today, the sequence should usually be:
- extract reusable code from the notebook
- create a deterministic inference module
- pin dependencies and version the artifact
- add tests
- wrap it in a small service
- containerize it
- add CI
- deploy with health checks
- monitor requests, latency, and outputs
Only after that should you worry about:
- advanced orchestration
- automated retraining
- canary release automation
- internal platform abstractions
Teams that skip the practical basics usually spend longer on platform talk than on getting the first model operating safely.
Final Takeaway
The path from a Jupyter notebook to a real production service is not mysterious. It is mostly about discipline.
A notebook is where you learn.
A production pipeline is where you make that learning repeatable, testable, deployable, and observable.
If you want to move from notebook to production ML, do not start with a platform rewrite. Start by extracting the real logic, packaging it properly, testing it, and shipping the smallest reliable service you can operate.
That is the practical path. It is also the path most teams should take first.