Introduction
Today we are pleased to announce the availability of Tribuo, an open source Java Machine Learning (ML) library. We’re releasing it under an Apache 2.0 license on GitHub for the wider ML community to use.
In the Oracle Labs Machine Learning Research Group, we've been working on deploying ML models into large production systems for years. During this time we've noticed a crucial gap between the expectations of an enterprise system and the features provided by most ML libraries. Large software systems want to use building blocks which describe themselves and know when their inputs or outputs are invalid.
In contrast, most ML libraries expect a pile of float arrays to train a model. Then at deployment time, they expect the input to be a float array, and they produce yet another float array as the predicted output. The description of what any of these arrays means, or what the input and output floats should look like, is left to another system: a wiki, a bug tracker, or a code comment. We don’t think developers want to add yet another database table per ML model just to explain what that array of output floats means.
Tracking models in production is also tricky because it requires external systems to keep the link between a deployed model and the training procedure and data. Usually the burden of these extra requirements falls on the teams who incorporate ML libraries into their products or systems, but in our group we believe it's far better to embed this into the ML library itself.
Finally, most popular ML libraries are written in dynamically-typed languages like Python and R, whereas most enterprise systems are written in a statically-typed language like Java. As a result, even implementing simple ML components requires significant code maintenance and system overhead, since code has to be written in multiple languages and operate in multiple runtimes.
Introducing Tribuo
Our group has spent the past few years building an ML library to meet these needs. The library is called *Tribuo*, a name derived from the Latin word meaning to assign or apportion. Tribuo is written in Java, and runs on Java 8 or later. All the relevant information and documentation, along with tutorials and getting-started guides, are available on Tribuo's website, tribuo.org. We've been using Tribuo in production inside Oracle for several years now, and we're excited to share it with you.
Tribuo provides the standard ML functionality that you'd expect from an ML library: classification, clustering, anomaly detection, and regression algorithms. Tribuo has data loading pipelines, text processing pipelines, and feature level transformations for operating on data once it's been loaded in.
It also has a full suite of evaluations for each of the supported prediction tasks. Unlike other systems, Tribuo knows what its inputs are, and can describe the range and type of each input. Each feature is named, so you can't confuse it with another feature just because the input processing system gave it the same id number (in fact, in Tribuo you never need to see its id number). This means a Tribuo model knows when you've given it features that it has never seen before, which is particularly useful when working with natural language processing.
Tribuo's models also know what their outputs are, and those outputs are strongly typed. No more staring at a float wondering if it's a probability, a regressed value, or a cluster id; in Tribuo each of these is a separate type, and the model can describe the types and ranges it knows about.
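To make the idea of named features and typed outputs concrete, here is a minimal sketch in plain Java. It is not Tribuo's API: the class names `NamedFeatureSketch`, `NamedLogisticModel` and `Label`, the feature names, and the weights are all hypothetical, chosen only to illustrate how a model keyed by feature name can report inputs it has never seen and return an output whose type says what it is.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: these classes are NOT Tribuo's API.
final class NamedFeatureSketch {

    /** A classification output; a separate type from a regressed value or a cluster id. */
    static final class Label {
        final String name;
        final double probability; // probability assigned to this label

        Label(String name, double probability) {
            this.name = name;
            this.probability = probability;
        }
    }

    /** A toy logistic model whose weights are keyed by feature name, not array index. */
    static final class NamedLogisticModel {
        private final Map<String, Double> weights;
        private final double bias;

        NamedLogisticModel(Map<String, Double> weights, double bias) {
            this.weights = new HashMap<>(weights);
            this.bias = bias;
        }

        /** The feature names this model was built with, in sorted order. */
        Set<String> knownFeatures() {
            return new TreeSet<>(weights.keySet());
        }

        /** Names present in the example that the model has never seen. */
        Set<String> unknownFeatures(Map<String, Double> example) {
            Set<String> unknown = new TreeSet<>(example.keySet());
            unknown.removeAll(weights.keySet());
            return unknown;
        }

        /** Scores only the known features and returns a typed Label, not a bare double. */
        Label predict(Map<String, Double> example) {
            double score = bias;
            for (Map.Entry<String, Double> e : example.entrySet()) {
                Double w = weights.get(e.getKey());
                if (w != null) {
                    score += w * e.getValue();
                }
            }
            double p = 1.0 / (1.0 + Math.exp(-score));
            return p >= 0.5 ? new Label("positive", p) : new Label("negative", 1.0 - p);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("word=great", 2.0);
        weights.put("word=awful", -2.0);
        NamedLogisticModel model = new NamedLogisticModel(weights, 0.0);

        Map<String, Double> example = new HashMap<>();
        example.put("word=great", 1.0);
        example.put("word=meh", 1.0); // a feature the model was never given a weight for

        System.out.println("Known features: " + model.knownFeatures());
        System.out.println("Unknown features: " + model.unknownFeatures(example));
        Label label = model.predict(example);
        System.out.println(label.name + " with probability " + label.probability);
    }
}
```

Because the example is a map from names to values, a caller cannot silently shift a column by one, and the `Label` return type removes any doubt about whether the number is a probability or something else.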
Tracking and reproducing models with provenance
Keeping track of how any given production model was generated is tricky using other ML libraries, as their models don't store the training data source, transformations, or the training algorithm hyperparameters. There are libraries which layer tracking code on top of an existing model training script, but we feel that this information should be embedded into the model (or evaluation) itself. This training time information, coupled with the information about model inputs and outputs stored in every Tribuo model, means that Tribuo models are *self-describing*.
Tribuo's use of strongly typed inputs and outputs means it can track the model construction process, from the point data is loaded into Tribuo, through any train/test splits or data set transformations, through model training (recording all the hyperparameters), and finally to evaluation on a test set. This tracking (or *provenance*) information is baked into all the models and evaluations.
Tribuo's provenance system is for more than just tracking models in production. Each provenance can generate a configuration which precisely rebuilds the training pipeline to reproduce the model or evaluation (assuming you've still got the original data), or to build a tweaked model on new data or with new hyperparameters. This means you always know what a Tribuo model is, where it came from, and how to recreate it if required. It even records all the PRNG seeds, so a model training run is perfectly reproducible.
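The reproducibility argument can be illustrated with a small, self-contained sketch. This is not Tribuo's provenance API: `ProvenanceSketch`, `Provenance`, `Model`, `train` and `retrain` are hypothetical names, and the "training" is a toy stochastic gradient descent for least-squares regression. The point it tries to show is only that if a model stores its data description, hyperparameters and PRNG seed, the same training run can be replayed from the model alone (given the original data).

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only: these classes are NOT Tribuo's provenance API.
final class ProvenanceSketch {

    /** What a rerun needs: a data description, the hyperparameters, and the PRNG seed. */
    static final class Provenance {
        final String dataSource;
        final int epochs;
        final double learningRate;
        final long seed;

        Provenance(String dataSource, int epochs, double learningRate, long seed) {
            this.dataSource = dataSource;
            this.epochs = epochs;
            this.learningRate = learningRate;
            this.seed = seed;
        }

        @Override
        public String toString() {
            return "Provenance(data=" + dataSource + ", epochs=" + epochs
                    + ", learningRate=" + learningRate + ", seed=" + seed + ")";
        }
    }

    /** A trained model that carries the record of how it was built. */
    static final class Model {
        final double[] weights;
        final Provenance provenance;

        Model(double[] weights, Provenance provenance) {
            this.weights = weights;
            this.provenance = provenance;
        }
    }

    /** Toy SGD for least-squares linear regression; all randomness comes from the seed. */
    static Model train(double[][] x, double[] y, Provenance p) {
        Random rng = new Random(p.seed);
        int dim = x[0].length;
        double[] w = new double[dim];
        for (int j = 0; j < dim; j++) {
            w[j] = rng.nextGaussian() * 0.01; // seeded random initialisation
        }
        for (int epoch = 0; epoch < p.epochs; epoch++) {
            for (int step = 0; step < x.length; step++) {
                int i = rng.nextInt(x.length); // seeded example sampling
                double prediction = 0.0;
                for (int j = 0; j < dim; j++) {
                    prediction += w[j] * x[i][j];
                }
                double error = prediction - y[i];
                for (int j = 0; j < dim; j++) {
                    w[j] -= p.learningRate * error * x[i][j];
                }
            }
        }
        return new Model(w, p);
    }

    /** Replays training using only the provenance stored in an existing model. */
    static Model retrain(Model original, double[][] x, double[] y) {
        return train(x, y, original.provenance);
    }

    public static void main(String[] args) {
        double[][] x = {{1, 0}, {0, 1}, {1, 1}};
        double[] y = {1, 2, 3};
        Model first = train(x, y, new Provenance("toy-data", 50, 0.1, 42L));
        Model second = retrain(first, x, y);
        System.out.println(first.provenance);
        System.out.println("Identical weights: " + Arrays.equals(first.weights, second.weights));
    }
}
```

In this sketch the seed is the only source of randomness, which is why replaying the stored provenance yields bit-identical weights. A real system also has to record data transformations and train/test splits, as described above.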
Deploying models from other systems & languages
Tribuo provides interfaces to ONNX Runtime, TensorFlow and XGBoost. This allows models stored in ONNX format, or trained in TensorFlow or XGBoost, to be deployed alongside Tribuo's native models. Our group contributes to all three projects: we wrote ONNX Runtime's Java support, we contributed patches to ensure XGBoost works across platforms and Java versions, and we contributed training support to the upcoming TensorFlow JVM releases.
The ONNX model support is particularly exciting as it allows the deployment in Java of models trained using popular Python packages like scikit-learn and PyTorch.
Our TensorFlow and XGBoost interfaces also allow the training of Tribuo models using these systems. When trained through Tribuo, these models provide all the type safety and provenance benefits that every Tribuo model has. The XGBoost support is fully functional and we've been using it in production internally for years.
TensorFlow support is still experimental as we're awaiting the first release from the TensorFlow JVM SIG before Tribuo's TF API can be finalized. That first TF JVM release will also enable training TF models in Java without defining anything in Python first.
Source: oracle.com