Most ML architecture debates get framed as a false binary. You either run batch scoring every few hours and accept staleness, or you build a fully online inference stack with strict latency targets and always-on compute.
This framing ignores the architecture many teams actually need: event-driven ML pipelines. By reacting to business events delivered through a broker such as Kafka or Pulsar, you can score within seconds of an event without paying for always-on GPU clusters.
Why the Middle Ground Matters
Batch is too slow for real-time fraud detection or session-based recommendations. Conversely, synchronous real-time serving is overkill for workloads where a 2-second delay is acceptable.
Event-driven architectures decouple the trigger from the response, allowing you to smooth out traffic spikes and maximize hardware utilization.
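To make the smoothing effect concrete, here is a minimal simulation, assuming a worker pool with a fixed per-tick capacity and a made-up arrival pattern (the function name `simulate_backlog` and all numbers are illustrative, not taken from a real system). It shows the queue absorbing a burst that the workers then drain over the following ticks.

```python
def simulate_backlog(arrivals, capacity_per_tick):
    """Return the queue depth after each tick for a fixed-capacity consumer."""
    backlog = 0
    depths = []
    for arrived in arrivals:
        backlog += arrived                          # new events land in the queue
        backlog -= min(backlog, capacity_per_tick)  # workers drain what they can
        depths.append(backlog)
    return depths


# Hypothetical traffic: one burst of 500 events, otherwise 50 events per tick.
arrivals = [50, 50, 500, 50, 50, 50, 50, 50]
depths = simulate_backlog(arrivals, capacity_per_tick=150)
print(depths)  # [0, 0, 350, 250, 150, 50, 0, 0]
```

In this toy run, workers sized for 150 events per tick absorb a 500-event burst and clear the backlog four ticks later. The price of not provisioning for peak is a few ticks of extra delay for the events caught in the burst.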
Technical Implementation: Micro-Batching with Kafka
To get the most out of your accelerators, you shouldn't process events one by one. Micro-batching allows you to group events together for a single model execution.
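# Kafka consumer with micro-batching (a sketch: add retries, logging, and
# graceful shutdown before running it in production)
import time

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'ml-workers',
    'enable.auto.commit': False,  # commit only after a batch has been processed
})
consumer.subscribe(['input-events'])

def process_batch(events):
    # Perform inference on a batch for better GPU utilization
    print(f"Processing batch of {len(events)} events")
    # results = model.predict(events)

batch = []
batch_started = None
try:
    while True:
        msg = consumer.poll(0.1)  # returns None if nothing arrives within 0.1 s
        if msg is not None:
            if msg.error():
                print(f"Consumer error: {msg.error()}")
            else:
                if not batch:
                    batch_started = time.monotonic()  # timer starts at the first event
                batch.append(msg.value())
        # Trigger batch processing on size or time threshold. The check runs on
        # every iteration, so a partial batch is flushed even when the topic goes quiet.
        if batch and (len(batch) >= 100 or time.monotonic() - batch_started > 1.0):
            process_batch(batch)
            consumer.commit(asynchronous=False)  # commit offsets only after processing
            batch = []
finally:
    consumer.close()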
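The size-or-time flush rule is easier to test when it is separated from the Kafka client. The following is a minimal sketch of that idea, assuming a hypothetical `MicroBatcher` class (not part of any library) with an injectable clock. In this sketch the age timer starts when the first item of a batch arrives, so no event waits longer than `max_wait_s` before being handed to the model.

```python
import time


class MicroBatcher:
    """Collects items and flushes them on a size or age threshold."""

    def __init__(self, handler, max_size=100, max_wait_s=1.0, clock=time.monotonic):
        self.handler = handler        # called with a list of items on each flush
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.clock = clock            # injectable so tests can control time
        self.items = []
        self.first_item_at = None

    def add(self, item):
        if not self.items:
            self.first_item_at = self.clock()  # age is measured from the first item
        self.items.append(item)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if the batch is full or its oldest item has waited too long.

        Call this on every poll iteration, including ones that return no
        message, so a partial batch is not stranded when traffic stops.
        """
        if not self.items:
            return False
        too_old = self.clock() - self.first_item_at >= self.max_wait_s
        if len(self.items) >= self.max_size or too_old:
            batch, self.items = self.items, []
            self.handler(batch)
            return True
        return False


# Usage with a fake clock so the example is deterministic.
flushed = []
now = [0.0]
batcher = MicroBatcher(flushed.append, max_size=3, max_wait_s=1.0, clock=lambda: now[0])
for event in ["a", "b", "c", "d"]:
    batcher.add(event)   # "c" fills the batch and triggers a size flush
now[0] = 1.5             # time passes with no new events
batcher.maybe_flush()    # "d" is flushed by the age threshold
print(flushed)           # [['a', 'b', 'c'], ['d']]
```

In a consumer loop, `add` would be called for each polled message and `maybe_flush` once per iteration, with `handler` wrapping the model call and the offset commit.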
Scaling on Lag with KEDA
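Instead of scaling on CPU or memory, event-driven pipelines should scale on consumer lag: the number of messages in the topic that the consumer group has not yet processed. If your consumer group is falling behind, you need more workers.
Using KEDA (Kubernetes Event-driven Autoscaling), you can scale your inference pods to zero when there's no traffic. The trade-off is cold-start latency: the first events after an idle period wait for a pod to start and load the model.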
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-inference-scaler
spec:
  scaleTargetRef:
    name: ml-inference-worker
  minReplicaCount: 0  # Scale to zero!
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: ml-workers
        topic: input-events
        lagThreshold: "100"
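To reason about what these settings imply, it helps to do the arithmetic by hand. The sketch below is a simplified approximation, not KEDA's actual code: it assumes the replica count is roughly the total lag divided by `lagThreshold`, rounded up and capped by `maxReplicaCount` and by the partition count (by default the Kafka scaler does not add consumers beyond the number of partitions). The function name `estimate_replicas` and the partition count of 12 are hypothetical, and real behavior also depends on autoscaler tolerances, polling intervals, and cooldown periods.

```python
import math


def estimate_replicas(total_lag, lag_threshold=100, max_replicas=20, partitions=12):
    """Rough estimate of the replica count a lag-based scaler would target.

    Simplified model for illustration only; `partitions=12` is an assumed
    value for the `input-events` topic.
    """
    if total_lag <= 0:
        return 0  # no lag: the deployment can scale to zero
    wanted = math.ceil(total_lag / lag_threshold)
    return min(wanted, max_replicas, partitions)


for lag in (0, 40, 250, 5000):
    print(lag, estimate_replicas(lag))
# 0 0
# 40 1
# 250 3
# 5000 12
```

Under these assumptions the partition count, not `maxReplicaCount`, is the effective ceiling: a lag of 5,000 messages would ask for 50 workers but settle at 12. If you expect to need all 20 replicas, the topic needs at least 20 partitions.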
Data Consistency and Feature Stores
A major challenge in streaming ML is ensuring that the features used during inference are as fresh as the event itself. This is why feature freshness, and the reliability of the feature store that serves those features, matters as much as model latency. If your model reacts to an event in 100ms but uses features that are 1 hour old, the prediction quality will suffer.
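One way to make staleness visible is to compare the event timestamp with the time the features were last written, and to flag or reroute events whose features are too old. The following is a minimal sketch under assumed names: `FeatureRow`, `feature_age_s`, `is_fresh`, the `txn_count_1h` feature, and the 60-second budget are all hypothetical and do not correspond to any particular feature store API.

```python
from dataclasses import dataclass


@dataclass
class FeatureRow:
    values: dict
    updated_at: float  # epoch seconds when these features were last written


def feature_age_s(event_ts, row):
    """Age of the features relative to the event, in seconds (never negative)."""
    return max(0.0, event_ts - row.updated_at)


def is_fresh(event_ts, row, max_age_s=60.0):
    """True if the features are within the allowed staleness budget."""
    return feature_age_s(event_ts, row) <= max_age_s


row = FeatureRow(values={"txn_count_1h": 7}, updated_at=1_700_000_000.0)
print(is_fresh(1_700_000_030.0, row))  # features are 30 s old -> True
print(is_fresh(1_700_003_600.0, row))  # features are 1 h old  -> False
```

What to do with a stale row is a product decision: score anyway and log the age, fall back to a simpler model, or hold the event until the features catch up. Recording the age alongside each prediction at least lets you measure how often it happens.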
Final Takeaway
The choice between batch and real-time is not binary. Event-driven ML pipelines offer a cost-effective, scalable middle ground for the many production use cases that can tolerate a delay of a second or two. By leveraging Kafka, micro-batching, and lag-based autoscaling, you can build systems that are "fast enough" for the business without breaking the bank on idle GPU capacity.
Resilio Tech specializes in designing these high-throughput, event-driven AI architectures. We help companies bridge the gap between batch processing and real-time serving, ensuring your models are integrated seamlessly into your existing data streams while optimizing for both performance and cost.
Looking to move beyond batch processing? Talk to Resilio Tech about building an event-driven ML pipeline that scales with your business.