This page reproduces the content of https://speakerdeck.com/ifesdjeen/clojure-a-sweetspot-for-analytics.


by αλεx π

Uploaded over a year ago (2015/06/25) in Technology

EuroClojure 2015 Talk Slides:

Clojure is getting more and more traction, and more people use it for all kinds of backend processing. Over the last 3 years, we at ClojureWerkz have concentrated on making the lives of backend developers simpler. Today, Clojure is one of the best choices for Analytics and Data-Driven Backends.

I'll tell you about our motivation, experiences and our success story, how we made a data processing backend, currently pushing millions of messages per second, how Clojure made our development cycles and time-to-production shorter, lives of our devs better, and made our customers happier.

About the speaker: Alex works on backends for analytics and data processing. He has been involved with Clojure since 2011, co-created ClojureWerkz, and is actively involved in the development and maintenance of many Clojure libraries. He spends most of his free time reading Math and Probability Theory textbooks, figuring out how things work.


- Sweetspot for Analytics

- ClojureWerkz

35+ high-quality Clojure libraries

User reports from all over the world

20+ active contributors

We value documentation

- Why is Clojure so awesome?

- Overtone isn’t just for music

- Minority Report with Clojure REPL

- Great for Math™
- Talk Expectations
- What is even Analytics

- Getting something out of nothing

- Problems to solve

understand the shape of data at a pace that suits your business

- Problems to solve

reliably classify and predict the outcomes

- Problems to solve

model and understand your chances

- Challenges
- Clojure Shine
- Anglican

Is what you open when they say “what are the odds”

- Anglican

Probabilistic Programming DSL/Language

Clojure Macros at their very best

- Anglican

Coin flips

(let [data (->> #(sample (flip 0.5))
                repeatedly
                (take 50)
                (map #(if % "heads" "tails"))
                frequencies)]
  (plot/bar-chart
   (keys data)
   (vals data)))

- Coin Flips
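The coin-flip experiment can be reproduced without Anglican. A minimal sketch, assuming `flip-once` and `flip-frequencies` as hypothetical helpers built on `rand` (they are not part of Anglican's API):

```clojure
;; Hypothetical stand-in for (sample (flip p)): true with probability p.
(defn flip-once [p]
  (< (rand) p))

(defn flip-frequencies
  "Flips a coin with heads-probability p n times and counts the outcomes."
  [p n]
  (->> #(flip-once p)
       repeatedly
       (take n)
       (map #(if % "heads" "tails"))
       frequencies))

;; Fair coin, few trials; rigged coin, many trials.
(println (flip-frequencies 0.5 50))
(println (flip-frequencies 0.8 5000))
```

With few trials the two counts can differ noticeably even for a fair coin; with many trials they settle near the configured probability.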

Coin flips: fair coin, few trials

- Coin Flips

Coin flips: rigged coin, many trials

- Coin Flips

Coin flips: fair coin, many trials

- Anglican

Simulating Cassandra Cluster Sizes

1 node can handle 10K requests

Latency is normally distributed with a mean of 20ms and a standard deviation of 5ms

“Extra” requests add overhead exponentially

disclaimer: model is simplified and numbers are made up!

- Simulating Cassandra Cluster Sizes

(def base-requests (* 10 1000))

(defquery cluster-latency [n write-rate]
  (let [per-node (/ write-rate n)
        overhead (/ 1000
                    (if (> per-node base-requests)
                      (- per-node base-requests)
                      1))]
    (predict :latency
             (+ (sample (exponential overhead))
                (sample (normal 20 5))))))
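The same made-up model can be simulated with nothing but `java.util.Random`, which makes the effect of cluster size easy to check. A minimal sketch; `overhead-rate`, `sample-exponential`, `sample-normal`, `simulate-latency`, and `mean-latency` are hypothetical helpers, not Anglican functions, and the fixed seed is only there for repeatability:

```clojure
(import 'java.util.Random)

(def base-requests (* 10 1000))

(defn overhead-rate
  "Rate of the exponential overhead term: 1000 when a node is within
  capacity, smaller (so the mean overhead grows) as extra requests pile up."
  [n write-rate]
  (let [per-node (/ write-rate n)]
    (/ 1000.0
       (if (> per-node base-requests)
         (- per-node base-requests)
         1))))

(defn sample-exponential
  "Inverse-CDF sampling: -ln(1 - U) / rate."
  [^Random rng rate]
  (/ (- (Math/log (- 1.0 (.nextDouble rng)))) rate))

(defn sample-normal [^Random rng mean sd]
  (+ mean (* sd (.nextGaussian rng))))

(defn simulate-latency [^Random rng n write-rate]
  (+ (sample-exponential rng (overhead-rate n write-rate))
     (sample-normal rng 20 5)))

(defn mean-latency
  "Average simulated latency in ms over the given number of trials."
  [n write-rate trials]
  (let [rng (Random. 42)]
    (/ (reduce + (repeatedly trials #(simulate-latency rng n write-rate)))
       trials)))

(doseq [[n rate] [[5 50000] [5 500000] [30 500000]]]
  (println n "nodes," rate "requests:" (mean-latency n rate 2000) "ms"))
```

Under these assumptions the overloaded 5-node cluster shows by far the highest mean latency, and adding nodes brings it back toward the 20ms baseline.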

disclaimer: model is simplified and numbers are made up!

- Simulating Cassandra Cluster Sizes

5 Nodes / 50K requests per second

- Simulating Cassandra Cluster Sizes

5 Nodes / 500K requests per second

- Simulating Cassandra Cluster Sizes

30 Nodes / 500K Requests

- Anglican

Subset of Clojure, compiled into CPS-style fns

Stackless language

Built-in memoisation

DSL for building sampling fns for distributions

- Statistiker

En statistiker er en person som jobber innen faget statistikk. (Norwegian: "A statistician is a person who works in the field of statistics.")

- Implementing Gaussian Naïve Bayes Algorithm

- Implementing Naïve Bayes Algorithm
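- $P(\text{blue}) = \frac{\text{Number of Blue}}{\text{Total number of objects}}$

$P(\text{red}) = \frac{\text{Number of Red}}{\text{Total number of objects}}$

- $P(X \mid \text{blue}) = \frac{\text{Number of Blue near } X}{\text{Total number of Blue}}$

$P(X \mid \text{red}) = \frac{\text{Number of Red near } X}{\text{Total number of Red}}$

- Model (prior)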

$P(\text{blue}) = \frac{\text{Number of Blue}}{\text{Total number of objects}}$

$P(\text{red}) = \frac{\text{Number of Red}}{\text{Total number of objects}}$

(defn make-model
  [train-data]
  (let [total (->> train-data
                   vals
                   (map count)
                   (reduce +))]
    (for [[k v] train-data]
      [k {:p (/ (count v) total)
          :evidence (->> v
                         transpose
                         (map (fn [v]
                                {:mean (mean v)
                                 :variance (variance v)})))}])))

- Classifier (posterior)

$P(X \mid \text{blue}) = \frac{\text{Number of Blue near } X}{\text{Total number of Blue}}$

$P(X \mid \text{red}) = \frac{\text{Number of Red near } X}{\text{Total number of Red}}$

(defn posterior-prob
  [point variance mean]
  (* (/ 1 (sqrt (* 2 pi variance)))
     (exp (/ (* -1 (pow (- point mean) 2))
             (* 2 variance)))))
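Putting the prior model and the Gaussian density together gives a complete classifier. The following is a minimal sketch in plain Clojure rather than the library's code: `mean`, `variance` (population variance), `transpose`, `gaussian-density`, and `classify` are assumptions written for this example, and the model is returned as a map so that it can be looked up by label.

```clojure
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn variance
  "Population variance (divides by n)."
  [xs]
  (let [m (mean xs)]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (count xs))))

(defn transpose [rows] (apply map vector rows))

(defn make-model
  "train-data maps a label to a seq of feature vectors."
  [train-data]
  (let [total (reduce + (map count (vals train-data)))]
    (into {}
          (for [[label rows] train-data]
            [label {:p (/ (count rows) (double total))
                    :evidence (map (fn [col]
                                     {:mean (mean col)
                                      :variance (variance col)})
                                   (transpose rows))}]))))

(defn gaussian-density [x variance mean]
  (* (/ 1 (Math/sqrt (* 2 Math/PI variance)))
     (Math/exp (/ (* -1 (Math/pow (- x mean) 2))
                  (* 2 variance)))))

(defn classify
  "Returns the label whose prior times per-feature densities is largest."
  [model point]
  (key (apply max-key
              (fn [[_ {:keys [p evidence]}]]
                (reduce * p (map (fn [x {:keys [mean variance]}]
                                   (gaussian-density x variance mean))
                                 point evidence)))
              model)))

(def model
  (make-model {:blue [[1.0 1.2] [0.8 1.0] [1.2 0.9]]
               :red  [[5.0 5.1] [4.8 5.3] [5.2 4.9]]}))

(println (classify model [1.1 1.0]))
```

A feature with zero variance would break the density calculation, so a real implementation would need smoothing or log-probabilities; both are left out to keep the sketch short.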

(map
 (fn [point {:keys [mean variance]}]
   (posterior-prob point variance mean))
 model)

- $P(X \mid \text{blue}) = \frac{\text{Number of Blue near } X}{\text{Total number of Blue}}$

$P(X \mid \text{red}) = \frac{\text{Number of Red near } X}{\text{Total number of Red}}$

- Implementing Linear Regression with Gradient Descent

- Linear Regression with Gradient Descent

(s/defrecord GradientProblem
  [^{:s ObjectiveFunction}
   objective-fn
   ^{:s ObjectiveFunctionGradient}
   objective-fn-gradient])

- Linear Regression: Objective Function

Basically, the sum of squared distances between predicted and actual Y:

(objective-function
 (fn [intercept slope]
   (let [f (line intercept slope)
         res (->> points
                  (map (fn [[x y]]
                         (sqr (- y (f x)))))
                  (reduce +))]
     res)))

- Linear Regression: Objective Function
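For a straight line, the objective is the sum of squared residuals, and its partial derivatives give the descent direction. A minimal sketch in plain Clojure; `sum-of-squares`, `gradient`, `gradient-descent`, the sample `points`, and the learning rate are illustrative assumptions, not the API of the library shown on the slides:

```clojure
;; Sample data lying exactly on y = 1 + 2x.
(def points [[0.0 1.0] [1.0 3.0] [2.0 5.0] [3.0 7.0]])

(defn line [intercept slope]
  (fn [x] (+ intercept (* slope x))))

(defn sum-of-squares
  "Objective: sum of squared differences between actual and predicted y."
  [points intercept slope]
  (let [f (line intercept slope)]
    (reduce + (map (fn [[x y]] (let [d (- y (f x))] (* d d))) points))))

(defn gradient
  "Partial derivatives of the objective w.r.t. intercept and slope."
  [points intercept slope]
  (let [f (line intercept slope)
        residuals (map (fn [[x y]] (- y (f x))) points)]
    [(* -2 (reduce + residuals))
     (* -2 (reduce + (map (fn [[x _] r] (* x r)) points residuals)))]))

(defn gradient-descent
  "Takes a fixed number of steps against the gradient, starting at [0 0]."
  [points learning-rate steps]
  (loop [[b m] [0.0 0.0]
         i 0]
    (if (= i steps)
      [b m]
      (let [[gb gm] (gradient points b m)]
        (recur [(- b (* learning-rate gb))
                (- m (* learning-rate gm))]
               (inc i))))))

(println (gradient-descent points 0.01 5000))
```

The learning rate has to be small enough for this data set; larger values make the iteration diverge, which is one reason real implementations use adaptive step sizes.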

Basically, the distance between predicted and actual Y:

(objective-function-gradient
 (let [factors (->> points
                    (map butlast)
                    (map #(cons 1 %)))
       y (map last points)]
   (fn [& point]
     (let [xT (matrix/transpose factors)
           m! (matrix/inverse (matrix/dot xT factors))
           b (matrix/dot xT y)]
       (ops/- (matrix/mmul m! b)
              point)))))

- Linear Regression: Objective Function
- Experience Report

Bunch of JVM libraries available

core.matrix is great

core.match greatly helps with algos

Clojure fns are easy to test

With immutable DSs nothing goes wrong

- Balagan

When `update-in` and `get-in` are not enough

- Balagan

Nested data structures

Map-inside-vector-inside-map

Straightforward query language

- Balagan

[:* :* even? :c]

- Balagan

:* wildcard matches

Custom key matchers (essentially any fn)

`get-in`-like key matches

- Balagan

(b/select {:a {:b [{:c 1}
                   {:c 2}
                   {:c 3}]}}
          [:* :* even? :c])

;; => [1 3]

- Balagan

Walking-transformations

Match paths and apply updates

- Balagan

(b/update {:a {:b [1 2 3]}}
          [:a :b :*] inc)

;; => {:a {:b [2 3 4]}}

- Balagan
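How path matching of this kind can work is easiest to see in a small re-implementation. The sketch below is not Balagan's code: `children`, `matches?`, `select-paths`, and `update-paths` are hypothetical names, and only maps and vectors are handled. Vector elements are matched by their index, which is why `even?` picks the first and third element.

```clojure
(defn children
  "Key/value pairs of a map, index/value pairs of a vector, else nil."
  [node]
  (cond
    (map? node) (seq node)
    (vector? node) (map-indexed vector node)
    :else nil))

(defn matches?
  "A matcher is the :* wildcard, a predicate fn, or a literal key."
  [matcher k]
  (cond
    (= :* matcher) true
    (fn? matcher) (boolean (matcher k))
    :else (= matcher k)))

(defn select-paths
  "Collects every value reachable through a path of matchers."
  [data path]
  (if (empty? path)
    [data]
    (vec (for [[k v] (children data)
               :when (matches? (first path) k)
               result (select-paths v (rest path))]
           result))))

(defn update-paths
  "Applies f to every value reachable through a path of matchers."
  [data path f]
  (cond
    (empty? path) (f data)
    (map? data) (into {}
                      (for [[k v] data]
                        [k (if (matches? (first path) k)
                             (update-paths v (rest path) f)
                             v)]))
    (vector? data) (vec (map-indexed
                         (fn [i v]
                           (if (matches? (first path) i)
                             (update-paths v (rest path) f)
                             v))
                         data))
    :else data))

(select-paths {:a {:b [{:c 1} {:c 2} {:c 3}]}} [:* :* even? :c])
;; => [1 3]

(update-paths {:a {:b [1 2 3]}} [:a :b :*] inc)
;; => {:a {:b [2 3 4]}}
```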

Operations on complex data structures

DSL for querying and transforming

- Balagan

Inspired by Enlive

Extensible walkers for custom data structs

Clojure/ClojureScript enabled via cljx

- Meltdown

Stream Processing with a long half-life

- Meltdown

Data processing pipeline

Performant

Tuneable

Multiple backends (Disruptor, Queues, sync…)

CEP/EEP for everyone

- Base API

Enqueue

(reactor/notify <key> <event>)

Handle

(reactor/on <selector> <event handler>)

- Base API
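The enqueue/handle pair boils down to a registry of handlers keyed by selector. A minimal synchronous sketch, assuming exact-match keys only; `make-reactor`, `on`, and `notify` here are hypothetical fns for illustration and say nothing about how Meltdown dispatches or which backends it uses:

```clojure
;; Hypothetical in-memory registry: key -> vector of handler fns.
(defn make-reactor []
  (atom {}))

(defn on
  "Registers handler for events published under k."
  [reactor k handler]
  (swap! reactor update k (fnil conj []) handler)
  reactor)

(defn notify
  "Calls every handler registered under k with event, synchronously."
  [reactor k event]
  (doseq [handler (get @reactor k)]
    (handler event))
  reactor)

;; Usage: collect everything published under "a".
(def seen (atom []))
(def reactor (make-reactor))
(on reactor "a" #(swap! seen conj %))
(notify reactor "a" 1)
(notify reactor "b" 2)
(notify reactor "a" 3)
(println @seen)
```

Events published under a key with no handlers are simply dropped in this sketch.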

filter*

map*

batch*

group*

reduce*

consume*

- DSLs

“Named” streams:

(reactor/on ($ "a") filter*)

(reactor/on ($ "b") map*)

(reactor/on ($ "c") batch*)

(reactor/on ($ "a") reduce*)

- DSLs

Reduce boilerplate for processing topologies

Implicit wiring between occurring parts

No changes to the base API

Attach parts of the stream for better composition

- DSLs

“Anonymous” streams:

(streams/stream ($ "a")
  filter*
  map*
  batch*
  reduce*)

- DSLs

On-premise streams

Per-entity streams

Decouple data processing pipelines

Avoid hash lookups within sync operations

Parallelize

Maintain streams independently

- DSLs

“Lazy” streams:

(streams/lazy-stream ($ #"a")
  filter*
  map*
  batch*
  reduce*)

- DSLs: Macros

Powerful way to hide the “wiring”

No changes to API

Completely different handling logic

Eager, delayed, wired, etc… streams

- DSLs

Max-out single-box performance

Pluggable back-end

Anonymous, lazy, named streams

- Buffy

the byte buffer slayer

- byte buffer

- Buffy

Composable binary protocols

Partial deserialisation

Named access to serialised parts

- Buffy

Create a spec out of parts

(spec :my-field-1 (int32-type)
      :my-field-2 (string-type 10))

Memory Layout

0            4                         14
+------------+-------------------------+
| my-field-1 | my-field-2              |
| (int)      | (10 string)             |
+------------+-------------------------+

- Buffy

Use spec to access fields like in a map

(let [s (spec :int-field (int32-type)
              :string-field (string-type 10))
      buf (compose-buffer s)]
  (set-field buf :int-field 101)
  (get-field buf :int-field))

;; => 101

- Buffy
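The fixed-offset idea behind such a spec can be sketched directly on top of `java.nio.ByteBuffer`. This is not Buffy's implementation: the `layout` map and the helper fns are hypothetical, and truncating or zero-padding the string to its slot is an assumption made for the example.

```clojure
(import 'java.nio.ByteBuffer 'java.nio.charset.StandardCharsets)

;; Hypothetical layout: field name -> byte offset and size.
(def layout {:int-field    {:offset 0 :size 4}
             :string-field {:offset 4 :size 10}})

(defn set-int! [^ByteBuffer buf field value]
  (.putInt buf (int (:offset (layout field))) (int value)))

(defn get-int [^ByteBuffer buf field]
  (.getInt buf (int (:offset (layout field)))))

(defn set-string! [^ByteBuffer buf field ^String s]
  (let [{:keys [offset size]} (layout field)
        src (.getBytes s StandardCharsets/UTF_8)]
    ;; Assumption: longer strings are truncated, shorter ones zero-padded.
    (dotimes [i size]
      (.put buf (int (+ offset i))
            (byte (if (< i (alength src)) (aget src i) 0))))))

(defn get-string [^ByteBuffer buf field]
  (let [{:keys [offset size]} (layout field)
        dst (byte-array size)]
    (dotimes [i size]
      (aset-byte dst i (.get buf (int (+ offset i)))))
    ;; String.trim also strips the trailing zero bytes used as padding.
    (.trim (String. dst StandardCharsets/UTF_8))))

;; Usage: a 14-byte buffer holding one int and one 10-byte string.
(def buf (ByteBuffer/allocate 14))
(set-int! buf :int-field 101)
(set-string! buf :string-field "hello")
(println (get-int buf :int-field) (get-string buf :string-field))
```

Because every field lives at a known offset, a single field can be read without decoding the rest of the buffer, which is the partial deserialisation the slides mention.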

Or decode them all together

(let [s (spec :first-field (int32-type)
              :second-field (string-type 10)
              :third-field (boolean-type))
      buf (compose-buffer s)]
  (set-fields buf {:first-field 101
                   :second-field "string"
                   :third-field true})
  (decompose buf))

;; => {:third-field true
;;     :second-field "string"
;;     :first-field 101}

- Buffy DSLs

Composite (tuple) types

(composite-type (int32-type)
                (string-type 10))

- Buffy DSLs

Array/vector types

(repeated-type (string-type 10) 5)

- Buffy DSLs

And recursion!

(repeated-type
 (composite-type (int32-type)
                 (string-type 10))
 5)

- Buffy DSLs

Dynamic types: netstrings

(def dynamic-string
  (frame-type
   (frame-encoder [value]
                  length (short-type) (count value)
                  string (string-type (count value)) value)
   (frame-decoder [buffer offset]
                  length (short-type)
                  string (string-type (read length buffer offset)))
   second))

- Lessons Learned

Protocols help to abstract a notion of Data Type

Data Types are extendable!

Macros for creating the custom decoders

- Conclusions