Common Items
• Time-separated events require a join.
• The aggregation phase (.group.sum) just adds up some (record of) values.
• This aggregation is associative, so we don’t need to look at history to produce today’s results.
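The associativity point can be sketched in a few lines. This is plain Python rather than Scalding code, and the event data and the helper names `aggregate` and `merge` are made up for illustration. It only shows why yesterday's running total plus today's partial sum equals a full recomputation.

```python
from collections import Counter

def aggregate(events):
    """Sum the values of (key, value) events per key, like a group-then-sum."""
    totals = Counter()
    for key, value in events:
        totals[key] += value
    return totals

def merge(left, right):
    """Combine two partial aggregates. Addition is associative, so the
    grouping of the merges does not change the result."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds values key by key
    return merged

# Hypothetical daily batches of (item, count) events.
day1 = [("a", 1), ("b", 2)]
day2 = [("a", 3)]
day3 = [("b", 4), ("c", 5)]

# Running total kept from previous days; raw history is not needed again.
running = merge(aggregate(day1), aggregate(day2))
# Today's result: previous total plus today's partial sum only.
incremental = merge(running, aggregate(day3))
# Same answer as recomputing over the whole history.
from_scratch = aggregate(day1 + day2 + day3)
```

Because only the previous aggregate is carried forward, each day's job reads one day of events plus one small summary.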
Folding (training/updating)
Common Items
• This aggregation is NOT associative, so we need to look at history to produce today’s results.
• Models are joined with events with a custom cogroup.
• The update logic lives outside of the job (in the model class?)
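A minimal sketch of the folding pattern, again in plain Python rather than Scalding. The update rule (an exponential moving average), the helper names `update`, `cogroup` and `fold_day`, and the sample data are all assumptions chosen for illustration; the real update logic would live in the model class.

```python
from collections import defaultdict

def update(model, event, rate=0.5):
    """Hypothetical update rule: exponential moving average.
    The result depends on event order, so this is a fold, not a sum."""
    return (1 - rate) * model + rate * event

def cogroup(models, events):
    """Pair each key's previous model (or None) with that key's new events."""
    grouped = defaultdict(list)
    for key, value in events:
        grouped[key].append(value)
    keys = set(models) | set(grouped)
    return {k: (models.get(k), grouped.get(k, [])) for k in keys}

def fold_day(models, events, initial=0.0):
    """Fold today's events, in order, into yesterday's models."""
    new_models = {}
    for key, (model, values) in cogroup(models, events).items():
        state = initial if model is None else model
        for value in values:
            state = update(state, value)
        new_models[key] = state
    return new_models

# Hypothetical state and events; events are assumed to be time-ordered.
yesterday = {"user1": 10.0}
today = [("user1", 20.0), ("user1", 0.0), ("user2", 4.0)]
models = fold_day(yesterday, today)

# Feeding the same events in reverse order gives a different model.
reordered = fold_day(yesterday, list(reversed(today)))
```

Unlike the sum, the previous model must be read back in and events must be applied in a fixed order, which is why the job needs the join against history.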
• Algebird (github.com/twitter/algebird) includes many approximation algorithms.
• MinHash gives approximate set similarity, useful for LSH.
• HyperLogLog / CountMinSketch for scalable approximate set size, event counts.
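To make the MinHash idea concrete, here is a small standalone sketch in Python. It is not Algebird's implementation or API: the function names, the choice of salted BLAKE2b as the hash family, and the 256-hash signature size are assumptions made for illustration.

```python
import hashlib

def _hash(seed, item):
    """One member of a hypothetical hash family: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(
        repr(item).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

def signature(items, num_hashes=256):
    """MinHash signature: the minimum hash of a non-empty set under each hash."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def merge_signatures(sig_a, sig_b):
    """Signature of the union: element-wise minimum (an associative merge)."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Hypothetical sets with true Jaccard similarity 100 / 300 = 1/3.
set_a = set(range(0, 200))
set_b = set(range(100, 300))
sig_a = signature(set_a)
sig_b = signature(set_b)
estimate = estimate_jaccard(sig_a, sig_b)  # should land near 1/3
```

The element-wise minimum in `merge_signatures` is associative, so signatures can be built from partial results in the same way as the sums in the aggregation phase.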
• Release 0.9.0 (~2 weeks):
  • REPL contributed (thanks Wibi!)
  • Typed-API improvements (joining, implementation, combinators)
  • optimizing Matrix API
  • improved function serialization
  • some API warts removed
• Explore Spark support:
  • Preferred option: Cascading backend for Spark.
  • Does this speed up ETL (extract, transform, load) jobs significantly?
  • Can Spark OOM issues be handled for large multi-tenant use cases?
• Easier integration into larger tools/libs:
  • Summingbird uses Scalding as a library: we learned a lot about what is easy and what is not. Some patterns can be added to Scalding.
  • Would love to make it easier to build and distribute ML/Linear Algebra libraries. How to compose?