MLOps

Event-Driven ML Pipelines: When Batch Isn't Fast Enough and Real-Time Is Too Expensive

A deep guide to event-driven ML pipelines as the middle ground between batch and always-on real-time inference, using Kafka or Pulsar for near-real-time scoring without wasting GPU capacity.

12 min read · 2,387 words

Most ML architecture debates get framed as a false binary.

Either:

  • run batch scoring every few hours and accept staleness

or

  • build a fully online inference stack with strict latency targets and always-on compute

That framing leaves out the architecture many teams actually need.

There is a middle ground between real-time and batch ML: event-driven pipelines that react to business events, score close enough to the moment of change, and avoid paying for permanently hot inference capacity when the workload does not justify it.

This is where an event-driven ML pipeline makes sense.

Instead of serving every prediction as a synchronous request behind an API, you process event streams from systems such as Kafka or Pulsar, run inference asynchronously, and write the output back into operational systems, caches, feature stores, or downstream topics.

That pattern is not universally better. But for many workloads, it is operationally simpler and economically smarter than pretending every model must be a real-time microservice.

The Problem With the Binary

Batch is often too slow for:

  • fraud rules that should react within seconds or minutes
  • inventory and pricing signals that become stale quickly
  • user behavior updates that should influence recommendations in the same session
  • operational alerts that need fast but not instant classification

Real-time serving is often too expensive for:

  • bursty workloads with large idle windows
  • GPU-backed models that are expensive to keep warm
  • moderate-latency use cases that do not truly require sub-100ms responses
  • pipelines where the prediction is not directly blocking a user request

This is why a streaming ML inference architecture has become a practical option for many teams. It lets you get closer to real time without inheriting all the cost and complexity of a fully synchronous serving path.

What Event-Driven ML Pipelines Actually Are

In an event-driven pipeline, inference is triggered by domain events rather than direct API calls.

Examples:

  • a new transaction lands on a Kafka topic
  • a customer profile update lands on Pulsar
  • a sensor reading crosses a threshold and emits an event
  • a product-view stream updates user state and triggers re-ranking

The model consumer reads those events, enriches them if necessary, runs inference, and publishes the output somewhere useful.

A simple pattern looks like this:

Application Events
   |
   v
Kafka / Pulsar Topic
   |
   v
Feature Enrichment Consumer
   |
   v
Model Inference Worker Pool
   |
   +----> prediction-results topic
   +----> online cache / database
   +----> alerting / downstream service

The point is not to eliminate latency. The point is to move from:

  • "must respond before the HTTP request returns"

to

  • "must process within an acceptable near-real-time window"

That distinction changes the economics of the system.

When This Model Fits Best

Event-driven inference works best when three conditions are true.

1. The prediction matters soon, but not instantly

If the output is still useful when delivered in a few seconds or a few minutes, you likely do not need synchronous real-time serving.

Examples:

  • transaction risk scoring for post-authorization review
  • recommendation refresh after a user action
  • dynamic lead scoring
  • operations triage classification
  • personalization updates that influence the next page or next email

2. Workload arrival is bursty or uneven

This is where always-on real-time systems get wasteful.

If traffic arrives in spikes, keeping dedicated online model servers hot can mean:

  • idle GPU capacity most of the day
  • overprovisioning for peak
  • poor cost efficiency outside narrow traffic windows

Streaming consumers handle this better because you can scale processing against lag and event throughput rather than pretending every burst needs a full low-latency service footprint.

3. The model output is consumed asynchronously

If another system, queue, cache, or workflow is already involved, you often do not need a blocking prediction call.

That is common in:

  • fraud and abuse review systems
  • recommendation candidate generation
  • document classification
  • personalization state updates
  • demand or anomaly pipelines

Why This Can Be Cheaper Than "Real-Time"

The biggest economic advantage is straightforward: you stop paying for strict latency you do not actually need.

A synchronous online model service usually requires:

  • permanently warm replicas
  • aggressive autoscaling headroom
  • low queue depth tolerance
  • fast dependency paths
  • often, reserved GPU capacity

An event-driven pipeline lets you trade absolute latency for utilization.

That means you can:

  • batch events opportunistically
  • scale on topic lag instead of peak request-per-second guesses
  • let workers process short bursts without exposing every spike to end users
  • schedule GPU-backed consumers only when traffic exists

This is why the architecture is so attractive when "real time" was mostly an inherited assumption rather than a true product requirement.

Kafka and Pulsar as the Backbone

Kafka and Pulsar are both strong fits for this pattern because they give you:

  • durable event streams
  • consumer groups
  • replayability
  • backpressure tolerance
  • decoupling between event producers and model consumers

The choice between them is usually platform-specific.

Kafka is often the default when teams already run:

  • JVM-heavy platforms
  • Kafka Connect
  • stream processing around Flink, Kafka Streams, or ksqlDB

Pulsar is attractive when teams want:

  • stronger multi-tenant isolation
  • topic and subscription flexibility
  • built-in tiered storage patterns
  • geo-replication features

For the ML design itself, the core pattern is the same. Events go in, model workers consume, predictions flow out.

Reference Architecture

A practical reference architecture for near-real-time scoring looks like this:

Order Service / App / Sensor / CRM
              |
              v
       input-events topic
              |
     +--------+---------+
     |                  |
     v                  v
feature-enricher   stream-validator
     |                  |
     +--------+---------+
              |
              v
      model-input topic
              |
      inference-consumer group
              |
     +--------+--------+
     |                 |
     v                 v
predictions topic   serving cache / DB
     |
     v
downstream app / rules engine / analyst queue

This separation matters.

Do not cram:

  • validation
  • feature enrichment
  • inference
  • output writing

into one giant consumer unless the pipeline is truly tiny. Breaking these stages apart gives you better observability, replay control, and operational isolation.

Designing the Topics and Event Contracts

Event-driven ML systems get brittle when teams treat event payloads casually.

At minimum, define:

  • stable event schemas
  • model input versions
  • traceable event IDs
  • timestamps for event creation and processing
  • optional feature completeness markers

Example event schema:

{
  "event_id": "txn_8f3c12",
  "event_type": "card_transaction_created",
  "created_at": "2026-04-10T09:32:18Z",
  "customer_id": "cust_1049",
  "merchant_id": "merch_774",
  "amount": 138.42,
  "currency": "USD",
  "country": "US",
  "model_input_version": "fraud-v3",
  "features_complete": false
}

If you skip this discipline, the pipeline becomes hard to replay safely and nearly impossible to reason about when models change.

Schema management matters more in streaming ML than in notebook workflows because the operational cost of broken contracts is immediate.
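A lightweight guard at the consumer boundary can enforce that discipline. This sketch uses only the standard library; the required-field list mirrors the example schema above, and the helper name is illustrative rather than part of any framework:

```python
# Contract check for incoming events: field names follow the example
# schema above; extend with whatever your model input version requires.
REQUIRED_FIELDS = {
    "event_id": str,
    "event_type": str,
    "created_at": str,
    "model_input_version": str,
}

def validate_event(event: dict) -> list[str]:
    """Return a list of contract violations; an empty list means usable."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return errors

# A valid event passes cleanly; a malformed one names every problem at once.
good = {"event_id": "txn_8f3c12", "event_type": "card_transaction_created",
        "created_at": "2026-04-10T09:32:18Z", "model_input_version": "fraud-v3"}
bad = {"event_id": 42}

print(validate_event(good))       # → []
print(len(validate_event(bad)))   # → 4
```

Returning all violations at once, rather than failing on the first, makes dead-letter payloads far easier to triage later.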

A Minimal Kafka Consumer for Inference

Here is a simplified Python example using Kafka:

from confluent_kafka import Consumer, Producer
import json
import time

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "fraud-inference-workers",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,  # commit only after the result is produced
})

producer = Producer({"bootstrap.servers": "kafka:9092"})

consumer.subscribe(["model-input"])

def run_model(event: dict) -> dict:
    # Replace with your actual model call.
    score = 0.91
    return {
        "event_id": event["event_id"],
        "model_name": "fraud-risk",
        "model_version": "2026-04-10",
        "score": score,
        "processed_at": int(time.time()),
    }

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        # In production, log and alert here instead of silently skipping.
        continue

    try:
        event = json.loads(msg.value())
    except json.JSONDecodeError:
        # Malformed payload: route to a dead-letter topic in production.
        consumer.commit(msg)
        continue

    result = run_model(event)

    producer.produce("prediction-results", json.dumps(result).encode("utf-8"))
    producer.flush()  # per-message flush is simple but slow; batch it in production
    consumer.commit(msg)

This is intentionally minimal. In production you would also want:

  • dead-letter handling
  • retries with clear idempotency rules
  • metrics for lag, batch size, throughput, and failures
  • model version labeling in telemetry

But the core idea is already there: event consumption instead of synchronous request serving.
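The dead-letter and retry pieces can be factored into a small wrapper around the handler. This is a sketch, not a library API: `process_with_retries`, `publish`, and `dead_letter` are illustrative names, with the callbacks standing in for thin wrappers over producer calls to the results and DLQ topics:

```python
import json

def process_with_retries(msg_value: bytes, handler, publish, dead_letter,
                         max_attempts: int = 3) -> bool:
    """Decode an event, run `handler` on it, and retry transient failures.

    `publish` and `dead_letter` are caller-supplied callbacks. Returns
    True if the result was published, False if the event was dead-lettered.
    """
    try:
        event = json.loads(msg_value)
    except json.JSONDecodeError:
        dead_letter(msg_value, reason="malformed_json")
        return False

    for attempt in range(1, max_attempts + 1):
        try:
            publish(handler(event))
            return True
        except Exception as exc:  # in production, catch narrower error types
            if attempt == max_attempts:
                dead_letter(msg_value, reason=f"handler_failed: {exc}")
                return False
    return False

# With stub callbacks, a malformed payload routes straight to the DLQ:
published, dead = [], []
process_with_retries(b"not json", lambda e: e, published.append,
                     lambda value, reason: dead.append(reason))
print(dead)  # → ['malformed_json']
```

Keeping the retry policy out of the consumer loop also makes it testable without a broker.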

Where GPUs Actually Fit

The phrase "without wasting GPU capacity" matters because many teams attach GPUs to workflows that do not need permanently online GPU-backed endpoints.

If your model is:

  • large enough that CPU inference is too slow
  • valuable enough that acceleration matters
  • but not user-blocking enough to require instant response

then an event-driven worker pool can be a better fit than an always-on online inference service.

Instead of keeping GPUs hot behind a low-latency API, you can:

  • run a small inference worker deployment
  • scale it based on stream lag
  • batch work into short windows
  • absorb bursts through the message broker

That usually improves utilization because the queue smooths the workload.

With always-on real-time serving, the GPU often sits idle waiting for the next request while still costing money. With event-driven inference, the queue gives the hardware something closer to a continuous workload.

Scaling on Lag Instead of Guesswork

One of the strongest patterns here is autoscaling consumers on topic lag.

If the acceptable processing window is, for example, 30 seconds, then lag becomes the operational signal that matters.

A Kubernetes pattern with KEDA might look like:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fraud-inference-workers
spec:
  scaleTargetRef:
    name: fraud-inference-workers
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: fraud-inference-workers
        topic: model-input
        lagThreshold: "500"

This is a better scaling model for many near-real-time systems than CPU-based autoscaling because it aligns directly with business freshness.

The right question is not:

  • how busy is the pod?

It is:

  • are we keeping up with the event stream within the target window?
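That question can be made quantitative. A back-of-the-envelope sizing helper, assuming you know each worker's sustained throughput (all numbers below are illustrative):

```python
import math

def workers_needed(current_lag: int, inflow_per_s: float,
                   per_worker_rate: float, window_s: float) -> int:
    """Workers required to drain the backlog within `window_s` seconds
    while also keeping up with new arrivals."""
    required_rate = inflow_per_s + current_lag / window_s
    return max(1, math.ceil(required_rate / per_worker_rate))

# 10k events behind, 200 events/s arriving, each worker handles 150
# events/s, and the freshness target is a 30-second window:
print(workers_needed(10_000, 200, 150, 30))  # → 4
```

This is essentially the arithmetic a lag-based autoscaler performs implicitly; making it explicit helps you choose a sane `lagThreshold` and `maxReplicaCount`.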

Batch Inside the Stream

An event-driven ML pipeline does not mean processing one event at a time forever.

In practice, many of the best systems use micro-batching:

  • pull events for 250ms to 2s
  • bundle them into a single model execution batch
  • write results independently

That gives you a powerful middle ground:

  • fresher than hourly or nightly batch
  • cheaper than strict per-request real-time inference
  • better accelerator utilization than single-event processing

This matters especially for transformer models, ranking models, and embedding generation, where small batches can materially improve throughput without creating unacceptable freshness delays.
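The collection step above can be sketched as a small window-based batcher. Here `poll_fn` stands in for a non-blocking broker poll, and the timing thresholds are illustrative:

```python
import time

def collect_batch(poll_fn, max_batch: int = 32, max_wait_s: float = 0.5) -> list:
    """Pull events until the batch is full or the wait window closes.

    `poll_fn()` should return one event or None (like a non-blocking
    broker poll). Returns a possibly-empty list to feed the model as
    a single batched inference call.
    """
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch and time.monotonic() < deadline:
        event = poll_fn()
        if event is None:
            time.sleep(0.01)  # avoid a hot loop when the topic is quiet
            continue
        batch.append(event)
    return batch

# With a stub source, the batcher fills up and returns early:
events = iter(range(100))
print(collect_batch(lambda: next(events), max_batch=8, max_wait_s=1.0))
# → [0, 1, 2, 3, 4, 5, 6, 7]
```

The two knobs map directly to the trade-off in the list above: `max_wait_s` bounds the freshness cost, `max_batch` bounds the accelerator's batch size.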

The Hard Parts

This architecture is not free.

The complexity shifts from request latency to stream operations.

1. Ordering and idempotency

If events replay or get retried, your inference side effects must tolerate duplication.

That usually means:

  • event IDs
  • idempotent output writes
  • clear replay semantics
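Idempotent output writes usually reduce to a conditional write keyed on the event ID. A sketch with an in-memory dict standing in for any conditional-write backend (Redis `SET NX`, a unique-key insert, and so on):

```python
def write_if_new(store: dict, event_id: str, result: dict) -> bool:
    """Write the prediction only if this event has not been scored before.

    Returns True on first write, False on a duplicate, so replays and
    retries become harmless no-ops at the output layer.
    """
    if event_id in store:
        return False
    store[event_id] = result
    return True

cache = {}
print(write_if_new(cache, "txn_8f3c12", {"score": 0.91}))  # → True
print(write_if_new(cache, "txn_8f3c12", {"score": 0.91}))  # → False (replay)
```

With a real backend the check-and-set must be a single atomic operation, not two calls, or concurrent workers can both "win".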

2. Feature freshness

The stream may be near real time while the feature source is not.

If enrichment depends on stale lookup tables or delayed feature pipelines, the architecture can still produce low-quality predictions quickly.

3. Failure visibility

A synchronous API failure is obvious. A streaming failure can hide as:

  • increasing lag
  • dead-letter volume
  • silently dropped outputs
  • stale prediction consumers downstream

Observability has to be designed in deliberately.

4. Backpressure and cost spikes

Queues absorb spikes, but they also hide them until lag grows.

If your autoscaling, partitioning, or consumer throughput is poorly tuned, the system can quietly drift from "near real time" into "eventually maybe."

What to Monitor

If you build a streaming ML inference architecture, monitor the system by freshness and throughput, not just service uptime.

Track:

  • topic lag by consumer group
  • event age at prediction time
  • events processed per second
  • inference batch size
  • prediction failure rate
  • dead-letter rate
  • feature enrichment latency
  • per-model throughput and GPU utilization

These metrics tell you whether the pipeline is actually delivering value inside the freshness window the business needs.

Example Prometheus-style metrics:

metrics:
  - ml_stream_consumer_lag
  - ml_event_age_seconds
  - ml_inference_batch_size
  - ml_predictions_total
  - ml_prediction_failures_total
  - ml_dead_letter_total
  - gpu_utilization_percent

Notice what is missing: p95 request latency as the primary KPI. In this model, freshness and lag matter more.

When Not to Use This Pattern

Event-driven inference is the wrong answer when:

  • the user must wait for the result immediately
  • the action cannot proceed without synchronous scoring
  • compliance or audit requirements demand tight inline decisioning
  • the event flow is so small and steady that a normal online service is simpler

It is also a bad fit when teams reach for Kafka or Pulsar because it sounds modern, not because they have a real near-real-time requirement.

If a simple batch job every 15 minutes solves the business problem, do that.

The middle ground only helps when the middle ground is actually needed.

A Practical Decision Framework

Use these questions:

  1. Does the model output need to exist before the user request completes?
  2. Is the prediction still useful if delivered in 5 seconds, 30 seconds, or 2 minutes?
  3. Is workload arrival bursty enough that always-on serving wastes capacity?
  4. Can downstream systems consume the prediction asynchronously?
  5. Will topic lag be easier to operate than low-latency synchronous SLAs?

If the answers are mostly:

  • no
  • yes
  • yes
  • yes
  • yes

then the event-driven pattern is probably worth serious consideration.

A Concrete Example: Recommendations Without a Hot GPU API

Imagine an e-commerce team that wants session-aware recommendations.

Nightly batch features are too stale. A fully online LLM-style ranking service on GPU is too expensive for the traffic shape.

An event-driven design works like this:

  • page views and cart updates land on Kafka
  • a stream processor updates session state
  • model-input events are emitted every few user actions
  • GPU-backed ranking workers score micro-batches
  • top candidates are written into Redis
  • the storefront reads fresh-enough recommendations from cache

The user experiences updated recommendations within seconds, but the team avoids building a hard real-time synchronous ranking service for every interaction.
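The cache-write step in that flow reduces to picking the top candidates per session and attaching a TTL so stale recommendations expire on their own. The key format, function names, and scores below are all illustrative:

```python
def top_candidates(scores: dict[str, float], k: int = 5) -> list[str]:
    """Select the k highest-scoring product IDs for the session cache."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def cache_record(session_id: str, scores: dict[str, float],
                 ttl_s: int = 300) -> dict:
    """Build the payload a worker would write to Redis (SET with EX=ttl_s)."""
    return {
        "key": f"recs:{session_id}",
        "value": top_candidates(scores),
        "ttl_s": ttl_s,  # stale entries expire instead of lingering
    }

scores = {"sku_1": 0.42, "sku_2": 0.91, "sku_3": 0.77}
print(cache_record("sess_abc", scores, ttl_s=120))
# → {'key': 'recs:sess_abc', 'value': ['sku_2', 'sku_3', 'sku_1'], 'ttl_s': 120}
```

The TTL doubles as a safety net: if the ranking workers fall behind, the storefront degrades to a default experience instead of serving arbitrarily old recommendations.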

That is the middle ground in concrete terms.

Final Takeaway

The debate between batch and real time is often framed too rigidly.

For many teams, the better answer is neither.

An event-driven ML pipeline lets you react quickly enough for the business while avoiding the operational and financial burden of pretending every model needs a permanently hot online serving stack.

Kafka and Pulsar make this pattern viable because they absorb bursts, decouple producers from consumers, and let you scale inference around lag and freshness instead of raw request latency.

If your workload is too dynamic for batch but not truly strict enough for synchronous online inference, this architecture is often the right middle ground.

That is the real lesson of real-time vs. batch ML: the best design is often the one that matches the business freshness requirement without paying for unnecessary immediacy.


Resilio Tech Team

Building AI infrastructure tools and sharing knowledge to help companies deploy ML systems reliably.

Published 4/10/2026