Real World Machine Learning in Java 8 at Fumankaitori.com Mathieu Dumoulin, Chief Data Scientist fumankaitori.com, Data Science Team manager at en-japan
Today’s menu ● About me and 不満買取センター (Fuman Kaitori Center) ● The business problem: Post pricing ● Project Overview ○ Why use ML ○ How to use ML in projects ○ How we used ML in this project ● Results ● Live code (depends on time) ● Conclusion
Presentation goals ● Any Java engineer can do machine learning ● Java is a great programming language for real-world machine learning systems ● New ML APIs make it easy to focus on the problem and the data, and get a well-performing model “for free” ● You don’t need a PhD to use machine learning, just some self-study, good tools and libraries, and experience built one project at a time
Google map for Quebec City here!
My Work: Java SE, Hadoop Engineer, Data Scientist
● Launched in March 2015; provides web, Android, and iOS applications. ● An application that collects data about people's dissatisfaction. ● Features: ○ Users can post any dissatisfaction with any product or service. ○ Users get points as a reward for their posts, and the points can be exchanged for coupon codes on e-commerce sites. ● 250,000 users and 1,500,000 posts in total (as of the end of November 2015)
Problem statement: post point value prediction ● Fuman user posts have a monetary value ● We want to give more points for “good” posts ● At first, operations staff checked all posts, but they can’t check 10,000 posts each day... We made rules, but the point values were worse: ● Rules can’t check the content of the posts ● Rules always miss something ● Making hundreds or thousands of rules by hand is ridiculous
Real world ML project overview ● Machine Learning Workflow ● Data Scientist and Java Engineer roles ● Java for production ML ● Java 8 benefits ● Our point prediction system details ● Results
Machine Learning Workflow (diagram) ● Training: Load data → (data, labels = known result) → Extract Features → (feature vectors, labels) → Train Model → (predictions, labels) → Evaluate vs. business goal, then iterate ● Prediction: Load new data → (data) → Extract Features (the same) → (feature vectors) → Predict using the best model → (predictions) → Act on prediction
Workflow for machine learning system 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy best model
Data Scientist’s role 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy best model ● Callouts on the slide: Choose features; Build many models
Software Engineer’s role 1. Set a goal with business value 2. Get data (fuman user posts) with a price already set 3. Transform data for input into machine learning algorithm 4. Train and evaluate machine learning models until the goal is reached 5. Deploy best model ● Callouts on the slide: Get data from data source; Implement production code; Implement and integrate into production system
But we don’t have a data scientist...
You can outsource!
Java for production ML ● Easy integration with Java applications ● Fast (vs. Python or R) ● Easy to program (vs. C++) ● Most common enterprise programming language, with IDE support and excellent supporting libraries ● Many state-of-the-art machine learning libraries have a Java API
Machine Learning libraries
Benefits of Java 8 ● Java 8’s functional style is a very good match for ML operations a. Feature extraction: data in → transform → data out ● Java 8’s streams and lambdas a. Code is easier to understand and less verbose ● Easy parallel code a. Faster “for free”
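The "data in → transform → data out" idea can be sketched with a stream as below. This is only an illustration, not the talk's production code: the class name, the three features, and the whitespace tokenization are assumptions (real Japanese text would need a morphological analyzer instead of `split`).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class StreamFeatureSketch {

    // Hypothetical feature vector: [character count, word count, average word length].
    static double[] extract(String post) {
        String trimmed = post.trim();
        // Whitespace tokenization is an assumption made to keep the sketch small.
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        double avgLength = Arrays.stream(words)
                .mapToInt(String::length)
                .average()
                .orElse(0.0);
        return new double[] { post.length(), words.length, avgLength };
    }

    // data in -> transform -> data out; parallelStream() is the "for free" parallel part.
    // Collecting to a list keeps the posts in their original order.
    static List<double[]> extractAll(List<String> posts) {
        return posts.parallelStream()
                .map(StreamFeatureSketch::extract)
                .collect(Collectors.toList());
    }
}
```

Switching between `stream()` and `parallelStream()` is a one-word change, which is what makes the parallel version cheap to try.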
Post point prediction system: step by step ● Diagram: Fuman DB → (posts, label) → Feature Extraction → (CSV format) → modeling; the Prediction Service calls the DR Prediction API ● Modeling: iterate until results meet business goals ○ Train/Test split ○ Categorical features transformation ○ Select best features ○ Try many algorithms ○ Tune algorithms ○ Evaluate models ○ REST Prediction API
Feature Extraction details ● We added character and word statistics for each fuman user post ○ Number of hiragana, katakana, kanji, and alphabet characters and words ○ Number of words, length of words ○ Ratio of hiragana, katakana, kanji, and alphabet words to the number of tokens in a post ● User profile information ○ age, gender, job category, etc. ● Bag-of-words models: ○ Words using Tf-Idf, removing stopwords (これ "this", あれ "that over there", それ "that", です "is", など "etc.", …) ○ Part-of-speech (名詞 noun, 動詞 verb, 形容詞 adjective, …) ○ Word type features (hiragana word, katakana word, kanji word, …)
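As a rough illustration of the Tf-Idf bullet, here is a minimal sketch over posts that are already tokenized. It uses one common variant (term frequency = count / post length, idf = ln(N / document frequency)); the talk does not say which variant or library was used, and the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

final class TfIdfSketch {

    // idf(w) = ln(N / number of posts containing w), one common definition.
    static Map<String, Double> idf(List<List<String>> posts) {
        int n = posts.size();
        Map<String, Long> documentFrequency = posts.stream()
                .flatMap(post -> post.stream().distinct())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        Map<String, Double> idf = new HashMap<>();
        documentFrequency.forEach((word, df) -> idf.put(word, Math.log((double) n / df)));
        return idf;
    }

    // Tf-Idf weights for one post, after dropping stopwords.
    static Map<String, Double> tfIdf(List<String> post, Map<String, Double> idf, Set<String> stopwords) {
        List<String> kept = post.stream()
                .filter(word -> !stopwords.contains(word))
                .collect(Collectors.toList());
        Map<String, Long> counts = kept.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        Map<String, Double> weights = new HashMap<>();
        counts.forEach((word, count) ->
                weights.put(word, (double) count / kept.size() * idf.getOrDefault(word, 0.0)));
        return weights;
    }
}
```

A word that appears in every post gets weight 0 under this definition, which is the intended effect: it carries no information about which posts are "good".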
Feature Extraction: Example マックのポテト揚げたてでお願いしたのに、揚げたてじゃなかった。 ("I asked for freshly fried McDonald's fries, but they weren't freshly fried.")
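The character statistics for a post like this one can be sketched with the JDK's `Character.UnicodeScript`, which classifies each code point as HIRAGANA, KATAKANA, HAN (kanji), LATIN, and so on. The class and method names below are hypothetical, and this is not necessarily how the original system counted characters.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

final class ScriptCountSketch {

    // Count how many code points of the text belong to each Unicode script.
    static Map<UnicodeScript, Integer> count(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    // Share of the text written in the given script (0.0 for empty text).
    static double ratio(String text, UnicodeScript script) {
        long total = text.codePoints().count();
        return total == 0 ? 0.0 : count(text).getOrDefault(script, 0) / (double) total;
    }
}
```

For the example post above, `count` gives 20 hiragana, 6 katakana, and 3 kanji (HAN) characters; the punctuation marks fall under COMMON. Note that this works on characters only; counting hiragana or kanji words needs a tokenizer first.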
We reached the project goal! Our result: about 3.5 points of error (RMSE) relative to human judgement ● DataRobot’s best model ○ eXtreme Gradient Boosted Trees ○ RMSE: 3.54 ○ MSE: 12.53 Business result: ● Higher quality evaluation than rules ● Operations staff no longer need to manually check posts ● We can validate points every day
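For reference, the two metrics on this slide are related by MSE = RMSE², which matches the reported numbers (3.54² ≈ 12.53). A minimal sketch of how they are computed from human-assigned points and model predictions follows; the class name is hypothetical and the arrays are assumed to have the same length.

```java
final class RegressionMetricsSketch {

    // Mean squared error between human-assigned values and predictions.
    static double mse(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double error = actual[i] - predicted[i];
            sum += error * error;
        }
        return sum / actual.length;
    }

    // Root mean squared error, expressed in the same unit as the points themselves.
    static double rmse(double[] actual, double[] predicted) {
        return Math.sqrt(mse(actual, predicted));
    }
}
```

RMSE is the easier number to explain to the business side because it is in points rather than squared points.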
Deployment issues ● Problem: The Prediction API was very slow (>1s / post), so we had to run it as a batch process each night. ● Goal: Make predictions locally with low latency, without losing the good prediction performance we already have. We solved this problem using the excellent open source, distributed machine learning library H2O by H2O.ai. Co-founder: Cliff Click, who wrote the Java HotSpot Server Compiler
Post point prediction system: Current system ● Training (diagram): Fuman DB → (posts, label) → Feature Extraction → (CSV format) → model training ○ Train/Test split ○ Categorical features transformation ○ Distributed, fast and state-of-the-art algorithms ○ POJO prediction class generation ● Prediction (diagram): Fuman Webapp → (get new post values) → Prediction Service → (make feature vectors) → Prediction POJO
Train Production Model: H2O
Overview: Making Predictions ● Use the prediction POJO generated by H2O ● For each new post, query the Prediction Service ○ Convert the post to a feature vector (doubles for H2O) ○ Get the prediction from the prediction POJO (a double value, rounded to an integer) ○ Update the database with the predicted price
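The steps above can be sketched as follows. To keep the example self-contained, the H2O-generated POJO is replaced by a plain `ToDoubleFunction<double[]>`; that stand-in, the class name, and the method name are assumptions for illustration and not H2O's actual API. In a real deployment the function would delegate to the generated class.

```java
import java.util.function.ToDoubleFunction;

final class PredictionServiceSketch {

    // Stand-in for the generated prediction POJO: feature vector in, raw score out.
    private final ToDoubleFunction<double[]> model;

    PredictionServiceSketch(ToDoubleFunction<double[]> model) {
        this.model = model;
    }

    // Score one post's feature vector and round the result to whole points.
    int predictPoints(double[] featureVector) {
        double raw = model.applyAsDouble(featureVector);
        return (int) Math.round(raw);
    }
}
```

The caller would build the feature vector with the same extraction code used for training, then write the returned integer back to the database. Because the model runs in-process, there is no network round trip per post.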
We reached the business goal! Project goal: Get similar performance from H2O as from DataRobot ● DataRobot’s best model ○ eXtreme Gradient Boosted Trees ○ RMSE: 3.54 ○ MSE: 12.53 ● H2O: Train a new model for production ○ GBM (Gradient Boosting Machine) ○ MSE: 12.8 H2O is not ideal for exploring different models and features, but for production it is FAST, with similar predictive performance. It is implemented in pure Java (GitHub).
Real world ML loves Java! ● Java is a top choice for building production machine learning systems ● Java 8’s features make Java fun and relevant again ● Integration in a Java web application was not hard ● Java is not a good choice for experimentation ○ Start with a Python prototype with Scikit-learn ○ Use a Machine Learning service like DataRobot.com
You can use ML in your projects! ● Web API services are like a personal data scientist ○ No need for a Data Scientist for simple uses of ML ○ But harder datasets will need expertise ● Real world ML projects need Engineers: ○ Get data to train a good model (log files, sales results, mail campaign results,…) ○ Transform data into input for an ML library or web service ○ Deploy and integrate into production ● Most steps are just normal programming ○ Get data from DB ○ Transform data into a CSV ○ Call a REST API or Java POJO to make predictions ○ Integrate with the system that needs predictions
Feature engineering with streams and lambdas The goal is to take raw data from the DB and create arrays of numerical or categorical features. 1. Get Fuman user post data from DB -> UserPost 2. Learn the vocabulary of word types across all user posts 3. Create the dataset: a. For each post, i. Add the statistics features ii. Add the word type features 4. Transform to CSV output (for DataRobot) Instances are Weka SparseInstance (sparse vectors for memory efficiency), but in retrospect, a specialized vector library would have been better, I think. Weka is a terrible production library.
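Steps 2 to 4 can be sketched with streams as below. This is a simplified illustration, not the talk's code: the `UserPost` class here is a hypothetical stand-in holding only pre-computed word types and the labeled points, dense CSV rows replace Weka's `SparseInstance`, and no CSV escaping is done because the assumed column names contain no commas.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class DatasetCsvSketch {

    // Hypothetical stand-in for the UserPost type loaded from the DB.
    static final class UserPost {
        final List<String> wordTypes; // e.g. "hiragana", "kanji", one entry per word
        final int points;             // label: the price already set for this post

        UserPost(List<String> wordTypes, int points) {
            this.wordTypes = wordTypes;
            this.points = points;
        }
    }

    // Step 2: learn the vocabulary of word types across all posts (sorted for stable columns).
    static List<String> learnVocabulary(List<UserPost> posts) {
        return posts.stream()
                .flatMap(post -> post.wordTypes.stream())
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    // Step 3: one row = a statistics feature (word count) + one count per word type + label.
    static String toCsvRow(UserPost post, List<String> vocabulary) {
        Map<String, Long> counts = post.wordTypes.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        String typeColumns = vocabulary.stream()
                .map(type -> String.valueOf(counts.getOrDefault(type, 0L)))
                .collect(Collectors.joining(","));
        return post.wordTypes.size() + "," + typeColumns + "," + post.points;
    }

    // Step 4: header line followed by one line per post.
    static String toCsv(List<UserPost> posts) {
        List<String> vocabulary = learnVocabulary(posts);
        String header = "word_count," + String.join(",", vocabulary) + ",points";
        return Stream.concat(Stream.of(header), posts.stream().map(p -> toCsvRow(p, vocabulary)))
                .collect(Collectors.joining("\n"));
    }
}
```

Learning the vocabulary once and reusing it for every row is what keeps the columns aligned between posts, and the same vocabulary has to be reused at prediction time.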