Designing Real-time Machine Learning Systems

Paul Deepakraj Retinraj
Jun 1, 2022


Online machine learning model learning and inferences

Source: https://res.infoq.com/articles/realtime-api-speed-observability/en/headerimage/croppted-PL96L9Ug-1596630547837.jpeg

Table of Contents:

· Why Realtime
  ∘ Criteria for real-time
  ∘ Types of real-time model processing
· Real-time Model Inference
· End-to-End Solution
· Conclusion

Why Realtime:

Most machine learning models work well in batch mode: the model is trained on a large amount of historical data, and predictions are made periodically on offline (previously loaded) data. But what if we need the model to learn or predict on live data as it arrives, producing training updates and predictions within a fraction of a second? There are many interesting use cases that demand real-time ML learning and prediction, showing immediate results and value to businesses. Real-time learning and scoring are increasingly important for immediate business impact because the customer journey on a website is time-sensitive: the window of opportunity is short-lived, and immediate action is needed to keep the customer engaged. Otherwise, we might simply lose the customer to a competitor.

Criteria for real-time:

In general, there are two evaluation criteria for real-time processing:

  • Faster data collection/processing: This heavily depends on the infrastructure that collects and processes incoming data so it can be fed to the model as input features. For example, data streaming from Apache Kafka/Kinesis should be handled and processed by stream-processing engines such as Apache Flink, Beam, Storm or Spark. The infrastructure and framework chosen for real-time processing must support maximum throughput.
  • Faster data consumption: This depends on achieving low latency and high throughput on the consumption side of the data produced by the model. For example, the model-serving infrastructure that produces/pulls model scores for real-time requests should minimize latency and maximize overall throughput, for instance by adding a caching layer for inferences (a Prediction Store).

For building a true real-time machine learning system, both criteria should be met; a minimal sketch illustrating both sides follows below.
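
As an illustration only (not the article's reference implementation), here is a minimal Python sketch touching both criteria: it consumes events from a Kafka topic, scores them with a pre-loaded model, and writes each inference into a Redis cache acting as the Prediction Store. The topic name, feature fields, model file and TTL are assumptions made for the example.

```python
import json

import joblib
import redis
from kafka import KafkaConsumer  # kafka-python

# Load the model once so every event is scored by an always-on model
model = joblib.load("model.joblib")                       # hypothetical artefact
cache = redis.Redis(host="localhost", port=6379)          # Prediction Store

consumer = KafkaConsumer(
    "customer-events",                                    # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Faster data collection: read each event as it arrives from the stream
for record in consumer:
    event = record.value
    features = [[event["items_viewed_5m"], event["cart_value"]]]
    score = float(model.predict_proba(features)[0][1])
    # Faster data consumption: cache the score with a short TTL so many
    # real-time consumers can read it with low latency
    cache.set(f"score:{event['customer_id']}", score, ex=300)
```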

Types of real-time model processing:

When it comes to real-time model processing, there are two types to consider:

  1. Online scoring: This produces in-session model inferences for real-time requests/features from the consumption side. For example, a customer adds items to an online cart and sees recommended items before checkout. We will discuss this in more detail in this blog.
  2. Online training/learning (continual learning): This trains the model online, changing/fine-tuning its weights and gaining new learnings from each piece of incoming data as necessary. There is a lot of research in this area by big companies currently, and it could be the topic of a later blog; a minimal sketch follows this list.
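
For the second type, here is a minimal sketch of online (continual) learning using the open-source river library, which updates model weights one example at a time. The feature names, label and pipeline choice are illustrative assumptions, not something the article prescribes.

```python
from river import linear_model, preprocessing

# Scale features, then fit a logistic regression one example at a time
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

def on_event(features: dict, label: bool) -> float:
    # Prequential style: score the event first, then learn from the outcome
    prob = model.predict_proba_one(features).get(True, 0.0)
    model.learn_one(features, label)
    return prob

# Hypothetical session events with an observed outcome (e.g. item purchased)
print(on_event({"cart_value": 42.5, "items_in_cart": 3}, label=True))
print(on_event({"cart_value": 5.0, "items_in_cart": 1}, label=False))
```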

Real-time Model Inference:

We will walk through a few real-time use cases and their requirements, then detail an end-to-end solution using real-time model inference.

Business Use cases:

Numerous use cases demand real-time model inference to provide quick customer engagement and make an immediate impact on the business. These include online fraud detection, real-time product recommendation, automatic vehicle recognition, dynamic pricing, online ad publishing, voice assistants, chatbots and many more.

Requirements:

Business Use case:

  • Provide the next-best personalized product recommendations to customers during online shopping.

Business Requirements:

  • Speed — Make an immediate impact by providing the value of machine learning in a fraction of a second.
  • Customer engagement — Continuous and richer customer journey for longer retention.

Non-Functional/Technical Requirements:

  • Scalability: Support many enterprise customers (each may have millions of customers)
  • Performance: Speed is “critical” — should be less than 100 ms (P99).
  • Availability: No interruptions in customer checkout journey — 99.99% (four nines)
  • Monitoring and Observability: Automate continuous monitoring and translate usage metrics to business objectives
  • Cost Optimization: Optimize and keep it iterative based on generated business revenue

Success Criteria — Evaluation Metrics

Online Metrics: (system’s performance through online evaluations on live/production data)

  • Product viewed (engagement rate): The recommended product gets viewed by the customer, measured as the number of clicks on the product link and product page views.
  • Product added to cart: Number of products added to cart out of total recommended products.
  • Product/Cart checked out: Number of products checked out from the cart by customers.
  • Product returned: Number of products returned after purchase. This serves as feedback for retraining the model; not currently used, but planned for the future.

Offline Metrics: (model’s performance through offline evaluations on test/historical data)

  • F1 Score: combines the precision and recall of a classifier into a single metric by taking their harmonic mean: 2 * (P * R) / (P + R). A small evaluation sketch follows this list.
  • Mean Average Precision/Recall: aggregate metrics computed over a large set of users to check how well the model performs at scale.
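
As a small illustration of the offline evaluation, the sketch below computes precision, recall and the F1 score with scikit-learn; the label arrays are placeholders standing in for real held-out data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Placeholder labels standing in for real held-out evaluation data
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(f"precision={p:.3f} recall={r:.3f}")
print(f"F1 (harmonic mean) = {2 * p * r / (p + r):.3f}")
print(f"F1 via sklearn     = {f1_score(y_true, y_pred):.3f}")
```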

Design Patterns:

When architecting real-time model inference, there are two popular design patterns that are widely adopted across companies.

End-to-End Solution:

Feature Store:

Features are individual measurable pieces of data and are crucial in machine learning for improving model performance and overall results. A feature store is the architectural component that simplifies converting raw data into features and hosts those features, providing a single-pane-of-glass view that any machine learning model can readily use. A feature store is not just another data layer but a whole set of transformation services providing an end-to-end solution for feature engineering.

To support real-time machine learning inference, these feature stores need to be kept up to date online as the stream of data comes in. Since the feature-generation process is largely the same for model training and serving, building a common feature store for both training and serving is increasingly encouraged.

Implementation Frameworks: AWS DynamoDB, Redis etc.
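
As one possible realization (using Redis, one of the frameworks listed above), here is a minimal sketch of an online feature store: the streaming pipeline writes features as a hash per entity, and the serving path reads them back at inference time. The key layout, feature names and TTL are assumptions made for illustration.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_features(customer_id: str, features: dict) -> None:
    # One hash per customer; the TTL keeps stale features from lingering
    key = f"features:customer:{customer_id}"          # hypothetical key layout
    r.hset(key, mapping=features)
    r.expire(key, 3600)

def read_features(customer_id: str) -> dict:
    # Read the latest features at inference time
    return r.hgetall(f"features:customer:{customer_id}")

write_features("c-123", {"items_viewed_5m": 7, "cart_value": 42.5})
print(read_features("c-123"))
```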

Model Store:

The model store is a central repository to manage ML models and experiments, including the model files, artefacts, and metadata.

A model store couples reproducibility with production-ready models and serves as the staging environment for the models you will serve in production. A model registry, on the other hand, is simply a repository in which only models (no other files) are pushed and pulled, much like a Docker registry.

For real-time model inference, the model store must be highly scalable and performant.

Implementation Frameworks: AWS S3, CouchDB, Cassandra etc.
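
As one possible realization (using AWS S3, listed above), here is a minimal sketch of pulling a versioned model artefact from an S3-backed model store at service start-up; the bucket name, key layout and model format are assumptions made for illustration.

```python
import boto3
import joblib

s3 = boto3.client("s3")

def load_model(model_name: str, version: str):
    # Hypothetical bucket and key layout for the model store
    key = f"models/{model_name}/{version}/model.joblib"
    local_path = "/tmp/model.joblib"
    s3.download_file("my-model-store-bucket", key, local_path)
    return joblib.load(local_path)

# Load the chosen model version once at service start-up
model = load_model("product-recommender", "v12")
```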

Realtime Pipelines:

Setting up pipelines for real-time machine learning takes a significant investment and may require iterations to get right. Each step in the ML pipeline, such as data ingestion, data validation and preprocessing, needs to be reconsidered for real-time data processing. There may also be opportunities to unify batch data processing with real-time data processing to avoid duplicate resource costs, and some companies have established such unified infrastructure successfully.

During Flink Forward Virtual 2020, Weibo (the social media platform) shared the design of WML, their real-time ML architecture and pipeline. Essentially, they integrated previously separate offline and online model training into a unified pipeline with Apache Flink. The talk focused on how Flink is used to generate sample data for training by joining offline data (social posts, user profiles, etc.) with multiple streams of real-time interaction events (clickstream, read stream, etc.) and extracted multimedia content features.
Courtesy: https://medium.com/mlearning-ai/building-ml-pipelines-for-learning-from-data-in-real-time-b6dbbe9b07ce

Implementation Frameworks: Apache Storm, Apache Flink, Apache Spark etc
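
As an illustration (using Spark Structured Streaming, one of the frameworks listed above), here is a minimal sketch of a real-time pipeline step: it reads raw events from Kafka, parses them, and derives a simple per-customer feature that could then be written to the online feature store. The topic, schema and sink are assumptions made for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("realtime-features").getOrCreate()

# Schema of the incoming JSON events (illustrative assumption)
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("cart_value", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "customer-events")          # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Derived feature: running average cart value per customer
features = events.groupBy("customer_id").agg(
    F.avg("cart_value").alias("avg_cart_value")
)

# Console sink for the sketch; a real pipeline would write to the feature store
query = features.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```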

Realtime Inference:

The approaches below address the need for real-time inference.

  • Faster models: achieved by running the model on distributed compute or with optimized hardware (CPU and memory) utilization.
  • Smaller models: achieved by writing the model to make predictions on a single feature vector at a time as it streams in, rather than on large batches.
  • Always-on models: achieved by deploying the model as a service with the model file bundled within and ready to score at any time, instead of fetching the model file dynamically per request. The infrastructure should also support scaling out individual model services when requests for a particular model spike. A minimal serving sketch follows below.
Implementation Frameworks: Docker, Kubernetes or any API ecosystem.
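
As an illustration of an always-on model service (using FastAPI, which is an assumption here rather than something the article prescribes), the sketch below loads the model once at start-up and serves scores over HTTP; such a service can be containerised with Docker and scaled out with Kubernetes. The payload fields and model file are placeholders.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Model file is bundled with the container image and loaded once at start-up
model = joblib.load("model.joblib")

class Features(BaseModel):
    items_viewed_5m: int
    cart_value: float

@app.post("/predict")
def predict(features: Features) -> dict:
    score = model.predict_proba([[features.items_viewed_5m, features.cart_value]])[0][1]
    return {"score": float(score)}

# Run with, e.g.: uvicorn service:app --host 0.0.0.0 --port 8080
```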

Architecture and Design:

The diagram below covers the entire architecture for real-time model inference, including all components for the streaming data.

Online Machine Learning Model Inferences for Streaming data
Modelling Flow

Challenges

As we build a real-time machine learning inference system, there are a few potential challenges along the way that need a detailed look and a resolution path.

  • Cold-start problem: This refers to inferences produced for new users who have too few available features (no historical features) to get quality predictions. It commonly occurs in recommendation algorithms (collaborative filtering methods) where there are few or no interactions for recently added items. There may not be any concrete remediation; the issue fades away as more data is collected for the new users/items.
  • Overall speed: Latency may be introduced by feature engineering, model inference or any step in between. Making model inferences at the same rate (or at least close to it) as the incoming streaming data is always a challenge. Scalable infrastructure for the inference services and optimized resources for moving data into the feature store can help to some extent.
  • Hot data problem: This issue occurs on the consumption side of the produced inferences, where the same inferences (recommended items or users) are read by many real-time consumers. There are plenty of options to resolve this, but streaming the inferences into a distributed cache eliminates the issue to a large extent.
  • Data drift problem: This is a variation of the production data from the data that was used to train, test and validate the model before deploying it to production. Data drift can be detected with a proper monitoring mechanism, after which the model needs to be retuned accordingly; a minimal drift-check sketch follows this list.
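
As a minimal illustration of drift detection (not the article's prescribed approach), the sketch below compares the live distribution of a numeric feature against the training distribution using a two-sample Kolmogorov-Smirnov test from SciPy; the threshold and data are placeholders, and production systems typically use richer drift monitors.

```python
import numpy as np
from scipy.stats import ks_2samp

# Reference distribution captured at training time vs. a recent production window
training_values = np.random.normal(loc=50.0, scale=10.0, size=10_000)
live_values = np.random.normal(loc=58.0, scale=12.0, size=2_000)

stat, p_value = ks_2samp(training_values, live_values)
if p_value < 0.01:                       # illustrative threshold
    print(f"Possible data drift (KS statistic={stat:.3f}); consider retraining.")
else:
    print("No significant drift detected.")
```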

Handling NFRs

Scalability:

  • Dedicated scalable network for each customer in our multi-tenant product
  • Enabled auto-scaling in autonomous databases

Performance:

  • Rewrite batch models for real-time processing — smaller, faster

Security:

  • Open source vulnerabilities scanner (Model code, pipeline code etc)
  • Network isolated (not just processing or storage isolation)

Cost:

  • Open-source libraries
  • Reusable components (MLOps frameworks, pipeline code)

Monitoring:

  • Track the business metrics continuously
  • Detection of data drift, quality and outliers
  • Detection of model performance degradation, bias and fairness

Observability:

  • Diagnosis of model performance for patterns and trends
  • Provide model lineage and explainability

Retraining models:

  • Appropriately retrain the models to retain the performance

Conclusion:

There are many benefits to real-time machine learning inference, which lets us make an immediate impact on customers across many use cases. Building it takes time and requires many iterations of experimentation, but it is definitely achievable.


Paul Deepakraj Retinraj

Software Architect at Salesforce - Machine Learning, Deep Learning and Artificial Intelligence. https://www.linkedin.com/in/pauldeepakraj/