Journey to #1
It's not the destination… it's the journey!
What is Kaggle?
• The world's biggest predictive-modelling competition platform
• Half a million members
• Companies host data challenges
• Usual tasks include:
– Predict topic or sentiment from text
– Predict species/type from image
– Predict store/product/area sales
– Predict marketing response
Inspired by horse races…!
• At the University of Southampton, an entrepreneur talked to us about how he was able to predict horse races with regression!
Was curious, wanted to learn more
• Learned statistical tools (like SAS, SPSS, R)
• Became more passionate!
• Picked up programming skills
Built KazAnova
• Generated a couple of algorithms and data techniques and decided to make them public so that others could gain from them.
• Released it at http://www.kazanovaforanalytics.com/
• Named after ANOVA (statistics) and KAZANI, my mom's last name.
Joined dunnhumby… and Kaggle!
• Joined dunnhumby's science team.
• They had already hosted 2 Kaggle contests!
• Was curious about Kaggle.
• Joined a few contests and learned lots.
• The community was very open to sharing and collaboration.
Learned to do image classification!
CAT OR DOG
…and sound classification!
…and text classification and sentiment!
Identify the writer… Who wrote this? 'To be, or not to be': Shakespeare or Molière?
Detect sentiment… 'The Burger is Not Bad': the negation bigram 'not bad' makes this a positive comment (see the sketch below).
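A minimal sketch of why that bigram matters, assuming scikit-learn; the four-line corpus and labels are purely illustrative. With ngram_range=(1, 2) the vectorizer keeps 'not bad' and 'not good' as features of their own, so the model can learn that negation flips polarity:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["the burger is not bad", "the burger is bad",
         "the fries are not good", "the fries are good"]
labels = [1, 0, 0, 1]   # 1 = positive, 0 = negative

# ngram_range=(1, 2) keeps unigrams AND bigrams, so 'not bad' and
# 'not good' become features of their own instead of just 'not' + 'bad'
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# should come out positive thanks to the 'not bad' bigram feature
print(clf.predict(vec.transform(["not bad at all"])))
```

A unigram-only model sees 'not', 'bad' and 'good' equally often in the positive and negative examples here, so it has nothing to separate the classes with.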
3 years of modelling competitions
• Over 75 competitions
• Participated with 35 different teams
• 21 top-10 finishes
• 8 times prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists
What's next
• Data science within dunnhumby
• PhD (UCL) on recommender systems
• Kaggling for fun
Amazon.com Employee Access Challenge
• Link: https://www.kaggle.com/c/amazon-employee-access-challenge
• Objective: Predict if an employee will require special accesses (like manual access transactions).
• Lessons learned:
1. Logistic regression can be great when combined with regularization to deal with high dimensionality (e.g. many variables/features), as sketched below.
2. Keeping the data in sparse format speeds things up a lot.
3. Sharing is caring! Great participation and a positive attitude towards helping others; lots of help from the forum. Kaggle is the way to learn and improve!
4. Scikit-learn + Python is great!
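A minimal sketch of lessons 1 and 2 together, assuming scikit-learn; the two categorical columns are hypothetical stand-ins for the access attributes in the real data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# hypothetical categorical columns standing in for the real attributes
df = pd.DataFrame({"resource": ["r1", "r2", "r1", "r3"],
                   "role":     ["mgr", "eng", "eng", "mgr"]})
y = [1, 0, 1, 0]

# one-hot encoding high-cardinality categories yields a sparse matrix,
# which keeps memory small and training fast (lesson 2)
X = OneHotEncoder(handle_unknown="ignore").fit_transform(df)

# L2-regularized logistic regression copes with the resulting
# high-dimensional feature space; C controls regularization (lesson 1)
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
print(clf.predict_proba(X)[:, 1])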
RecSys Challenge 2013: Yelp business rating prediction
• Link: https://www.kaggle.com/c/yelp-recsys-2013
• Objective: Predict what rating a customer will give to a business.
• Lessons learned:
1. Factorization machines, and specifically libFM (http://www.libfm.org/), are great for summarizing the relationship between a customer and a business, as well as for combining many other factors (see the model equation below).
2. Basic data manipulation (like joins, merges, aggregations) as well as feature engineering is important.
3. Simpler/linear models did well for this task.
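For context, the model libFM fits is the standard second-order factorization machine (Rendle). Each feature i (e.g. a one-hot customer or business ID) gets a latent vector v_i, and every pairwise interaction is scored by an inner product of latent vectors:

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```

For the Yelp task the dominant interaction is between the customer's and the business's latent vectors, which is exactly the "relationship" lesson 1 refers to, while extra columns (business category, location, etc.) enter the same sum as additional factors.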
Cause-effect pairs
• Link: https://www.kaggle.com/c/cause-effect-pairs
• Objective: "Correlation does not mean causation." Out of 2 series of numbers, find which one is causing the other!
• Lessons learned:
1. In general, the series causing the other has a higher chance of predicting it well with a nonlinear model, given some noise (see the sketch below).
2. Gradient boosting machines can be great for this task.
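One way to turn lesson 1 into a concrete test (an assumption on my part, not necessarily the exact method used in the competition): fit a nonlinear model in both directions and compare cross-validated errors. A sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
a = rng.normal(size=1000)                       # toy data where a causes b
b = np.sin(2 * a) + 0.1 * rng.normal(size=1000)

def direction_error(x, y):
    """Cross-validated MSE when predicting y from x with a nonlinear model."""
    model = GradientBoostingRegressor(n_estimators=100)
    return -cross_val_score(model, x.reshape(-1, 1), y,
                            scoring="neg_mean_squared_error", cv=3).mean()

# the hypothesised cause should predict its effect with the lower error;
# b -> a is harder because sin(2a) is not invertible
print("a -> b error:", direction_error(a, b))
print("b -> a error:", direction_error(b, a))
```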
StumbleUpon Evergreen Classification Challenge
• Link: https://www.kaggle.com/c/stumbleupon
• Objective: Build a classifier to categorize webpages as evergreen (timeless content) or non-evergreen.
• Lessons learned:
1. Some overfitting again (CV process not right yet). Better safe than sorry from now on!
2. Impressive how tf-idf gives such good classification from the contents of the webpage as text.
3. Dimensionality reduction with Singular Value Decomposition on sparse data (in a way that 'topics' are created) is very powerful too (see the sketch after this list).
4. Meta-modelling gave a good boost, i.e. using some models' predictions as features for new models.
5. I can make predictions in a field I literally know nothing about!
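A minimal sketch of lessons 2-4 together, assuming scikit-learn; the six-document corpus and the choice of logistic regression as the base model are purely illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

docs = ["timeless bread recipe everyone loves",
        "classic tips for growing tomatoes at home",
        "how to tie a tie step by step",
        "breaking news election results tonight",
        "stock market plunges in morning trading",
        "celebrity spotted at airport this weekend"]
labels = np.array([1, 1, 1, 0, 0, 0])          # 1 = evergreen

# lesson 2: tf-idf turns raw page text into a sparse term matrix
X = TfidfVectorizer().fit_transform(docs)

# lesson 3: truncated SVD (LSA) compresses it into dense 'topic' components
topics = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# lesson 4: out-of-fold predictions of a first-level model become a
# feature for a second-level model (out-of-fold avoids label leakage)
oof = cross_val_predict(LogisticRegression(), X, labels,
                        cv=3, method="predict_proba")[:, 1]
X_meta = np.column_stack([topics, oof])
second_level = LogisticRegression().fit(X_meta, labels)
```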
March Machine Learning Mania
• Link: https://www.kaggle.com/c/march-machine-learning-mania
• Objective: Predict the 2014 NCAA Tournament.
• Lessons learned:
1. Combining pleasure with data = double pleasure (I am a huge NBA fan)! Was also my first top-10 finish!
2. Trust the rating agencies: they do a great job and they have more data than you!
3. Simple models worked well.
Driver Telematics Analysis
• Link: https://www.kaggle.com/c/axa-driver-telematics-analysis
• Objective: Use telematic data to identify a driver signature.
• Lessons learned:
1. Geospatial stats were useful.
2. Extracting features like average speed or acceleration was critical (sketched below).
3. Treating this as a supervised problem seemed to help.
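A minimal sketch of lesson 2's feature extraction, assuming a trip is an (x, y) trace in metres sampled once per second (a random walk stands in for a real trip here):

```python
import numpy as np

# placeholder trip: 60 seconds of (x, y) positions in metres
trip = np.cumsum(np.random.default_rng(0).normal(size=(60, 2)), axis=0)

# speed = distance covered between consecutive 1-second samples
speed = np.linalg.norm(np.diff(trip, axis=0), axis=1)   # m/s
accel = np.diff(speed)                                   # m/s^2

# per-trip summary features that help characterise a driver's 'signature'
features = {
    "mean_speed": speed.mean(),
    "max_speed": speed.max(),
    "mean_abs_accel": np.abs(accel).mean(),
    "speed_p95": np.percentile(speed, 95),
}
print(features)
```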
Click-Through Rate Prediction
• Link: https://www.kaggle.com/c/avazu-ctr-prediction
• Objective: Predict whether a mobile ad will be clicked.
• Lessons learned:
1. Follow The Regularized Leader (FTRL), which uses the hashing trick, was extremely efficient: good predictions using less than 1 MB of RAM on 40+ million data rows with thousands of distinct categories (see the sketch below).
2. Same old tricks (weight of evidence, algorithms on sparse data, meta-stacking).
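A compact FTRL-Proximal sketch with the hashing trick, in the spirit of the well-known public Kaggle scripts for this competition; the hyperparameters and the 2^20 feature space are illustrative, not the exact values used:

```python
import math

D = 2 ** 20              # hashed feature space size (illustrative)
alpha, beta = 0.1, 1.0   # per-coordinate learning-rate parameters
L1, L2 = 1.0, 1.0        # regularization strengths

z = [0.0] * D            # FTRL accumulators
n = [0.0] * D            # per-coordinate sums of squared gradients

def hash_features(row):
    """Map 'field=value' strings into the hashed space.
    Python's built-in hash is fine for a sketch; production code
    would use a stable hash such as murmurhash."""
    return [hash(f) % D for f in row]

def predict(idx):
    """Lazily derive weights from z/n, return sigmoid(w.x) and the weights."""
    wTx, w = 0.0, {}
    for i in idx:
        sign = -1.0 if z[i] < 0 else 1.0
        if abs(z[i]) <= L1:
            w[i] = 0.0                                # L1 sparsity: weight is 0
        else:
            w[i] = (sign * L1 - z[i]) / ((beta + math.sqrt(n[i])) / alpha + L2)
        wTx += w[i]
    return 1.0 / (1.0 + math.exp(-max(min(wTx, 35.0), -35.0))), w

def update(idx, w, p, y):
    """Log-loss gradient for each active binary feature is (p - y)."""
    g = p - y
    for i in idx:
        sigma = (math.sqrt(n[i] + g * g) - math.sqrt(n[i])) / alpha
        z[i] += g - sigma * w[i]
        n[i] += g * g

# usage: stream rows once, predicting before updating each one
row, y = ["site=abc", "device=xyz"], 1
idx = hash_features(row)
p, w = predict(idx)
update(idx, w, p, y)
```

Memory stays tiny because only the z and n arrays are kept (no raw data in RAM), and the hashing trick caps dimensionality at D no matter how many distinct category values stream past.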
Homesite Quote Conversion
• Link: https://www.kaggle.com/c/homesite-quote-conversion
• Objective: Which customers will purchase a quoted insurance plan?
• Lessons learned:
1. Generating a large pool of (500) models was really useful in exploiting AUC to the maximum (one common way to blend such a pool is sketched below).
2. Feature engineering with XGBfi and with noise imputation.
3. Exploring about 4-way interactions.
4. Retraining already-trained models.
5. Dynamic collaboration is best!
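The slide doesn't say how the 500-model pool was blended; one common approach is greedy forward selection on validation AUC (Caruana-style hill climbing). A minimal sketch where the synthetic `preds` list stands in for each model's validation predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)             # placeholder validation labels
# placeholder pool: 20 noisy-but-informative prediction vectors
preds = [np.clip(y_val * 0.5 + rng.random(500) * 0.7, 0, 1) for _ in range(20)]

ensemble, best_auc, chosen = np.zeros(500), 0.0, []
for _ in range(50):                              # selection with replacement
    # try adding each candidate to the running average, score by AUC
    scores = [roc_auc_score(y_val,
                            (ensemble * len(chosen) + p) / (len(chosen) + 1))
              for p in preds]
    best = int(np.argmax(scores))
    if scores[best] <= best_auc:
        break                                    # no candidate improves: stop
    chosen.append(best)
    ensemble = (ensemble * (len(chosen) - 1) + preds[best]) / len(chosen)
    best_auc = scores[best]
print("models used:", chosen, "AUC:", round(best_auc, 4))
```

Selection with replacement lets strong models enter the blend several times, which acts as an implicit weighting.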
So… what wins competitions? In short:
• Understand the problem
• Discipline
• Try problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration
• Experience
• Ensembling
• Luck
A Data Science Hero
• Me: "Don't get stressed."
• Lucas: "I want to. I want to win." (20/04/2016, about the Santander competition)
• He passed away 4 days later (24/04/2016) after battling cancer for 2.5 years.
• Find Lucas' winning solutions (and post-competition threads) and learn from the best!