Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine learning. It is a framework for building applications, but also includes packaged, end-to-end applications for collaborative filtering, classification, regression and clustering.
Lambda Architecture is a useful framework to think about designing big data applications. Nathan Marz designed this generic architecture addressing common requirements for big data based on his experience working on distributed data processing systems at Twitter.
Some of the key requirements in building this architecture include: Fault-tolerance against hardware failures and human errors Support for a variety of use cases that include low latency querying as well as updates Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done Extensibility so that the system is manageable and can accommodate newer features easily.
- All data entering the system is dispatched to both the batch layer and the speed layer for processing.
- The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views.
- The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
- The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
- Any incoming query can be answered by merging results from batch views and real-time views.
Developers can consume Oryx 2 as a framework for building custom applications as well. Following the architecture overview below,If you’re looking to deploy a ready-made, end-to-end application for collaborative filtering, clustering or classification,here are the steps to follow :
- Prepare your Hadoop cluster with Cluster Setup
- Get a Release
- Prepare a config file from the Configuration Reference
- Run the binaries with Running Oryx
Learn about the REST API endpoints here API Endpoint Reference
Oryx2 consists of three tiers
- Lambda Tier – Providing base implementation which is not specific to machine learning. It internally contains side-by-side cooperating layers of the lambda architecture:
- A Batch Layer, which computes a new “result” (think model, but, could be anything) as a function of all historical data, and the previous result. This may be a long-running operation which takes hours, and runs a few times a day for example.
- A Speed Layer, which produces and publishes incremental model updates from a stream of new data. These updates are intended to happen on the order of seconds.
- A Serving Layer, which receives models and updates and implements a synchronous API exposing query operations on the result.
- A data transport layer, which moves data between layers and receives input from external sources
- ML Tier Implementation – The ML tier is simply an implementation and specialization of the generic interfaces mentioned above, which implement common ML needs and then expose a different ML-specific interface for applications to fill in.
- End-to-end Application Implementation Tier – In addition to being a framework, Oryx 2 contains complete implementations of the batch, speed and serving layer for three machine learning use cases. These are ready to deploy out-of-the-box, or to be used as the basis for a custom application:
- Collaborative filtering / recommendation based on Alternating Least Squares
- Clustering based on k-means
- Classification and regression based on random decision forests
Download link : https://github.com/OryxProject/oryx/releases