This page reproduces the content of http://www.slideshare.net/larsga/introduction-to-big-datamachine-learning.


Uploaded 2013/05/15, in Technology


A short (137 slides) overview of the fields of Big Data and machine learning, diving into a couple of algorithms in detail.

- Introduction to Machine Learning

2012-05-15

Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga

1 - Agenda

• Introduction

• Theory

• Top 10 algorithms

• Recommendations

• Classification with naïve Bayes

• Linear regression

• Clustering

• Principal Component Analysis

• MapReduce

• Conclusion

2 - The code

• I’ve put the Python source code for the examples on Github

• Can be found at

– https://github.com/larsga/py-snippets/tree/master/machine-learning/

3 - Introduction

4 - 5
- 6
- What is big data?

“Big Data is any thing which is crash Excel.”

“Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM.”

Or, in other words, Big Data is data in volumes too great to process by traditional methods.

https://twitter.com/devops_borat

7 - Data accumulation

• Today, data is accumulating at

tremendous rates

– click streams from web visitors

– supermarket transactions

– sensor readings

– video camera footage

– GPS trails

– social media interactions

– ...

• It really is becoming a challenge to store and process it all in a meaningful way

8 - From WWW to VVV

• Volume

– data volumes are becoming unmanageable

• Variety

– data complexity is growing

– more types of data captured than previously

• Velocity

– some data is arriving so rapidly that it must either be processed instantly, or lost

– this is a whole subfield called “stream processing”

9 - The promise of Big Data

• Data contains information of great

business value

• If you can extract those insights you

can make far better decisions

• ...but is data really that valuable? - 11
- 12
- “quadrupling the average cow's milk production since your parents were born”

"When Freddie [as he is known] had no daughter records our equations predicted from his DNA that he would be the best bull," USDA research geneticist Paul VanRaden emailed me with a detectable hint of pride. "Now he is the best progeny tested bull (as predicted)."

13 - Some more examples

• Sports

– basketball increasingly driven by data analytics

– soccer beginning to follow

• Entertainment

– House of Cards designed based on data analysis

– increasing use of similar tools in Hollywood

• “Visa Says Big Data Identifies Billions of

Dollars in Fraud”

– new Big Data analytics platform on Hadoop

• “Facebook is about to launch Big Data

play”

– starting to connect Facebook with real life

14 https://delicious.com/larsbot/big-data - Ok, ok, but ... does it apply to our customers?

• Norwegian Food Safety Authority

– accumulates data on all farm animals

– birth, death, movements, medication, samples, ...

• Hafslund

– time series from hydroelectric dams, power prices,

meters of individual customers, ...

• Social Security Administration

– data on individual cases, actions taken, outcomes...

• Statoil

– massive amounts of data from oil exploration,

operations, logistics, engineering, ...

• Retailers

– see Target example above

– also, connection between what people buy, weather

forecast, logistics, ...

15 - How to extract insight from

data?

Monthly Retail Sales in New South

Wales (NSW) Retail Department Stores

16 - Types of algorithms

• Clustering

• Association learning

• Parameter estimation

• Recommendation engines

• Classification

• Similarity matching

• Neural networks

• Bayesian networks

• Genetic algorithms

17 - Basically, it’s all maths...

• Linear algebra

• Calculus

• Probability theory

• Graph theory

• ...

“Only 10% in devops are know how of work with Big Data. Only 1% are realize they are need 2 Big Data for fault tolerance.”

18

https://twitter.com/devops_borat - Big data skills gap

• Hardly anyone knows this stuff

• It’s a big field, with lots and lots of theory

• And it’s all maths, so it’s tricky to learn

http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap

19

http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap - Two orthogonal aspects

• Analytics / machine learning

– learning insights from data

• Big data

– handling massive data volumes

• Can be combined, or used separately

20 - How to process Big Data?

• If relational databases are not

enough, what is?

“Mining of Big Data is problem solve in 2013 with zgrep”

22

https://twitter.com/devops_borat - MapReduce

• A framework for writing massively

parallel code

• Simple, straightforward model

• Based on “map” and “reduce”

functions from functional

programming (LISP)
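The model can be sketched as a toy, in-memory version in Python; a real framework distributes the map tasks, the shuffle, and the reduce tasks across machines. The function names here are made up for illustration, and the example is the classic word count.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # apply the user's map function to every input record,
    # collecting all the (key, value) pairs it emits
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    # group values by key; the framework does this between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # apply the user's reduce function to each key and its values
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# word count: map emits (word, 1), reduce sums the ones
def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return sum(counts)

lines = ["big data is big", "data is data"]
counts = reduce_phase(shuffle(map_phase(lines, word_map)), word_reduce)
```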

23 - NoSQL and Big Data

• Not really that relevant

• Traditional databases handle big data sets, too

• NoSQL databases have poor analytics

• MapReduce often works from text files

– can obviously work from SQL and NoSQL, too

• NoSQL is more for high throughput

– basically, AP from the CAP theorem, instead of CP

• In practice, really Big Data is likely to be a mix

– text files, NoSQL, and SQL

24 - The 4th V: Veracity

“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.”

Daniel Boorstin, in The Discoverers (1983)

“95% of time, when is clean Big Data is get Little Data”

25

https://twitter.com/devops_borat - Data quality

• A huge problem in practice

– any manually entered data is suspect

– most data sets are in practice deeply problematic

• Even automatically gathered data can be a problem

– systematic problems with sensors

– errors causing data loss

– incorrect metadata about the sensor

• Never, never, never trust the data without checking it!

– garbage in, garbage out, etc

26 - Conclusion

• Vast potential

– to both big data and machine learning

• Very difficult to realize that potential

– requires mathematics, which nobody knows

• We need to wake up!

28 - Theory

29 - Two kinds of learning

• Supervised

– we have training data with correct answers

– use training data to prepare the algorithm

– then apply it to data without a correct

answer

• Unsupervised

– no training data

– throw data into the algorithm, hope it makes

some kind of sense out of the data

30 - Some types of algorithms

• Prediction

– predicting a variable from data

• Classification

– assigning records to predefined groups

• Clustering

– splitting records into groups based on

similarity

• Association learning

– seeing what often appears together with

what

31 - Issues

• Data is usually noisy in some way

– imprecise input values

– hidden/latent input values

• Inductive bias

– basically, the shape of the algorithm we

choose

– may not fit the data at all

– may induce underfitting or overfitting

• Machine learning without inductive

bias is not possible

32 - Underfitting

• Using an algorithm that cannot

capture the full complexity of the data

33 - Overfitting

• Tuning the algorithm so carefully it

starts matching the noise in the

training data

34 - “What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.”

35

http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf - Testing

• When doing this for real, testing is

crucial

• Testing means splitting your data set

– training data (used as input to algorithm)

– test data (used for evaluation only)

• Need to compute some measure of

performance

– precision/recall

– root mean square error

• A huge field of theory here

– will not go into it in this course

– very important in practice

36 - Missing values

• Usually, there are missing values in

the data set

– that is, some records have some NULL

values

• These cause problems for many

machine learning algorithms

• Need to solve somehow

– remove all records with NULLs

– use a default value

– estimate a replacement value

– ...

37 - Terminology

• Vector

– one-dimensional array

• Matrix

– two-dimensional array

• Linear algebra

– algebra with vectors and matrices

– addition, multiplication, transposition, ...

38 - Top 10 algorithms

39 - Top 10 machine learning algs

1. C4.5

No

2. k-means clustering

Yes

3. Support vector machines

No

4. the Apriori algorithm

No

5. the EM algorithm

No

6. PageRank

No

7. AdaBoost

No

8. k-nearest neighbours class. Kind of

9. Naïve Bayes

Yes

10.CART

No

From a survey at IEEE International Conference on Data Mining (ICDM) in December 2006.

40

“Top 10 algorithms in data mining”, by X. Wu et al - C4.5

• Algorithm for building decision trees

– basically trees of boolean expressions

– each node splits the data set in two

– leaves assign items to classes

• Decision trees are useful not just for

classification

– they can also teach you something about the

classes

• C4.5 is a bit involved to learn

– the ID3 algorithm is much simpler

• CART (#10) is another algorithm for

learning decision trees

41 - Support Vector Machines

• A way to do binary classification on

matrices

• Support vectors are the data points

nearest to the hyperplane that divides

the classes

• SVMs maximize the distance between

SVs and the boundary

• Particularly valuable because of “the

kernel trick”

– using a transformation to a higher dimension to

handle more complex class boundaries

• A bit of work to learn, but manageable

42 - Apriori

• An algorithm for “frequent itemsets”

– basically, working out which items frequently

appear together

– for example, what goods are often bought

together in the supermarket?

– used for Amazon’s “customers who bought

this...”

• Can also be used to find association

rules

– that is, “people who buy X often buy Y” or

similar

• Apriori is slow

– a faster, further development is FP-growth
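A toy illustration of the frequent-itemset idea, assuming only item pairs are of interest; real Apriori prunes candidate itemsets level by level, using the fact that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(baskets, min_support):
    # count how often each pair of items is bought together;
    # this sketch only does the pair level of Apriori
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    # keep only the pairs that appear at least min_support times
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["beer", "diapers", "bread"],
    ["beer", "diapers"],
    ["bread", "milk"],
    ["beer", "diapers", "milk"],
]
pairs = frequent_pairs(baskets, 3)
```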

43

http://www.dssresources.com/newsletters/66.php - Expectation Maximization

• A deeply interesting algorithm I’ve

seen used in a number of contexts

– very hard to understand what it does

– very heavy on the maths

• Essentially an iterative algorithm

– skips between “expectation” step and

“maximization” step

– tries to optimize the output of a function

• Can be used for

– clustering

– a number of more specialized examples, too

44 - PageRank

• Basically a graph analysis algorithm

– identifies the most prominent nodes

– used for weighting search results on Google

• Can be applied to any graph

– for example an RDF data set

• Basically works by simulating random walk

– estimating the likelihood that a walker would be

on a given node at a given time

– actual implementation is linear algebra

• The basic algorithm has some issues

– “spider traps”

– graph must be connected

– straightforward solutions to these exist
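The random-walk simulation can be sketched as power iteration in plain Python; this is a minimal version, not Google's implementation. The damping factor and the dangling-node handling are the kind of straightforward fixes mentioned above for spider traps and disconnected graphs.

```python
def pagerank(graph, damping=0.85, iterations=50):
    # graph: node -> list of nodes it links to
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # walker starts anywhere
    for _ in range(iterations):
        # with probability 1-damping the walker teleports to a random node
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # the walker follows one of the outgoing links at random
                share = rank[node] / len(targets)
                for target in targets:
                    new[target] += damping * share
            else:
                # dangling node: spread its rank evenly over all nodes
                for target in nodes:
                    new[target] += damping * rank[node] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```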

45 - AdaBoost

• Algorithm for “ensemble learning”

• That is, for combining several

algorithms

– and training them on the same data

• Combining more algorithms can be

very effective

– usually better than a single algorithm

• AdaBoost basically weights training

samples

– giving the most weight to those which are

classified the worst

46 - Recommendations

47 - Collaborative filtering

• Basically, you’ve got some set of items

– these can be movies, books, beers, whatever

• You’ve also got ratings from users

– on a scale of 1-5, 1-10, whatever

• Can you use this to recommend items

to a user, based on their ratings?

– if you use the connection between their

ratings and other people’s ratings, it’s called

collaborative filtering

– other approaches are possible

48 - Feature-based

recommendation

• Use user’s ratings of items

– run an algorithm to learn what features of

items the user likes

• Can be difficult to apply because

– requires detailed information about items

– key features may not be present in data

• Recommending music may be

difficult, for example

49 - A simple idea

• If we can find ratings from people similar to you, we can see what they liked

– the assumption is that you should also like it, since your other ratings agreed so well

• You can take the average ratings of the k people most similar to you

– then display the items with the highest averages

• This approach is called k-nearest neighbours

– it’s simple, computationally inexpensive, and works pretty well

– there are, however, some tricks involved

50 - MovieLens data

• Three sets of movie rating data

– real, anonymized data, from the MovieLens site

– ratings on a 1-5 scale

• Increasing sizes

– 100,000 ratings

– 1,000,000 ratings

– 10,000,000 ratings

• Includes a bit of information about the

movies

• The two smallest data sets also contain

demographic information about users

51

http://www.grouplens.org/node/73 - Basic algorithm

• Load data into rating sets

– a rating set is a list of (movie id, rating) tuples

– one rating set per user

• Compare rating sets against the user’s

rating set with a similarity function

– pick the k most similar rating sets

• Compute average movie rating within

these k rating sets

• Show movies with highest averages
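The steps above can be sketched like this; a minimal version with a made-up mean-absolute-difference distance (the actual script in the Github repo differs, and any similarity function from the next slide would do).

```python
def distance(ratings1, ratings2):
    # mean absolute difference over the movies both users rated
    common = set(ratings1) & set(ratings2)
    if not common:
        return 1000000  # no overlap: treat as maximally distant
    return sum(abs(ratings1[m] - ratings2[m]) for m in common) / len(common)

def recommend(user, others, k=3):
    # pick the k rating sets most similar to the user's rating set
    nearest = sorted(others.values(), key=lambda r: distance(user, r))[:k]
    # compute average movie rating within these k rating sets
    sums, counts = {}, {}
    for ratings in nearest:
        for movie, rating in ratings.items():
            sums[movie] = sums.get(movie, 0) + rating
            counts[movie] = counts.get(movie, 0) + 1
    averages = {m: sums[m] / float(counts[m]) for m in sums if m not in user}
    # show movies with the highest averages first
    return sorted(averages, key=averages.get, reverse=True)

me = {"Trainspotting": 5, "Titanic": 1}
others = {
    "u1": {"Trainspotting": 5, "Titanic": 1, "Fargo": 5},
    "u2": {"Trainspotting": 4, "Titanic": 2, "Fargo": 4},
    "u3": {"Trainspotting": 1, "Titanic": 5, "Hook": 5},
}
recs = recommend(me, others, k=2)
```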

52 - Similarity functions

• Minkowski distance

– basically geometric distance, generalized to

any number of dimensions

• Pearson correlation coefficient

• Vector cosine

– measures angle between vectors

• Root mean square error (RMSE)

– square root of the mean of square

differences between data values
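Minimal sketches of some of these similarity functions, assuming equal-length numeric vectors (RMSE itself appears in code on a later slide):

```python
from math import sqrt

def minkowski(a, b, p=2):
    # geometric distance generalized to any number of dimensions;
    # p=2 gives ordinary Euclidean distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def cosine(a, b):
    # measures the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def pearson(a, b):
    # correlation coefficient: insensitive to users who rate
    # systematically higher or lower than others
    n = len(a)
    mean_a = sum(a) / float(n)
    mean_b = sum(b) / float(n)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)
```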

53 - Data I added

User ID  Movie ID  Rating  Title
6041     347       4       Bitter Moon
6041     1680      3       Sliding Doors
6041     229       5       Death and the Maiden
6041     1732      3       The Big Lebowski
6041     597       2       Pretty Woman
6041     991       4       Michael Collins
6041     1693      3       Amistad
6041     1484      4       The Daytrippers
6041     427       1       Boxing Helena
6041     509       4       The Piano
6041     778       5       Trainspotting
6041     1204      4       Lawrence of Arabia
6041     1263      5       The Deer Hunter
6041     1183      5       The English Patient
6041     1343      1       Cape Fear
6041     260       1       Star Wars
6041     405       1       Highlander III
6041     745       5       A Close Shave
6041     1148      5       The Wrong Trousers
6041     1721      1       Titanic

Note A Close Shave and The Wrong Trousers. Later we’ll see Wallace & Gromit popping up in recommendations.

This is the 1M data set

54 https://github.com/larsga/py-snippets/tree/master/machine-learning/movielens - Root Mean Square Error

• This is a measure that’s often used to judge

the quality of prediction

– predicted value: x

– actual value: y

• For each pair of values, do

– (y - x)2

• Procedure

– sum over all pairs,

– divide by the number of values (to get average),

– take the square root of that (to undo squaring)

• We use the square because

– that always gives us a positive number,

– it emphasizes bigger deviations

55 - RMSE in Python

from math import sqrt

def rmse(rating1, rating2):
    sum = 0
    count = 0
    for (key, rating) in rating1.items():
        if key in rating2:
            sum += (rating2[key] - rating) ** 2
            count += 1
    if not count:
        return 1000000 # no common ratings, so distance is huge
    return sqrt(sum / float(count))

56 - Output, k=3

===== User 0 ==================================================

User # 14 , distance: 0.0

Deer Hunter, The (1978) 5 YOUR: 5

===== User 1 ==================================================

User # 68 , distance: 0.0

Close Shave, A (1995) 5 YOUR: 5

===== User 2 ==================================================

User # 95 , distance: 0.0

Big Lebowski, The (1998) 3 YOUR: 3

===== RECOMMENDATIONS =============================================

Chicken Run (2000) 5.0

Auntie Mame (1958) 5.0

Muppet Movie, The (1979) 5.0

'Night Mother (1986) 5.0

Goldfinger (1964) 5.0

Children of Paradise (Les enfants du paradis) (1945) 5.0

Total Recall (1990) 5.0

Boys Don't Cry (1999) 5.0

Radio Days (1987) 5.0

Ideal Husband, An (1999) 5.0
Red Violin, The (Le Violon rouge) (1998) 5.0

Distance measure: RMSE

Obvious problem: ratings agree perfectly, but there are too few common ratings. More ratings mean greater chance of disagreement.

57 - RMSE 2.0

from math import sqrt

def lmg_rmse(rating1, rating2):
    max_rating = 5.0
    sum = 0
    count = 0
    for (key, rating) in rating1.items():
        if key in rating2:
            sum += (rating2[key] - rating) ** 2
            count += 1
    if not count:
        return 1000000 # no common ratings, so distance is huge
    return sqrt(sum / float(count)) + (max_rating / count)

58 - Output, k=3, RMSE 2.0

===== 0 ==================================================
User # 3320 , distance: 1.09225018729
Highlander III: The Sorcerer (1994) 1 YOUR: 1
Boxing Helena (1993) 1 YOUR: 1
Pretty Woman (1990) 2 YOUR: 2
Close Shave, A (1995) 5 YOUR: 5
Michael Collins (1996) 4 YOUR: 4
Wrong Trousers, The (1993) 5 YOUR: 5
Amistad (1997) 4 YOUR: 3
===== 1 ==================================================
User # 2825 , distance: 1.24880819811
Amistad (1997) 3 YOUR: 3
English Patient, The (1996) 4 YOUR: 5
Wrong Trousers, The (1993) 5 YOUR: 5
Death and the Maiden (1994) 5 YOUR: 5
Lawrence of Arabia (1962) 4 YOUR: 4
Close Shave, A (1995) 5 YOUR: 5
Piano, The (1993) 5 YOUR: 4

Much better choice of users. But all recommended movies are 5.0. Basically, if one user gave it 5.0, that’s going to beat 5.0, 5.0, and 4.0. Clearly, we need to reward movies that have more ratings somehow.

===== 2 ==================================================

User # 1205 , distance: 1.41068360252

Sliding Doors (1998) 4 YOUR: 3

English Patient, The (1996) 4 YOUR: 5

Michael Collins (1996) 4 YOUR: 4

Close Shave, A (1995) 5 YOUR: 5

Wrong Trousers, The (1993) 5 YOUR: 5

Piano, The (1993) 4 YOUR: 4

===== RECOMMENDATIONS ==================================================

Patriot, The (2000) 5.0

Badlands (1973) 5.0

Blood Simple (1984) 5.0

Gold Rush, The (1925) 5.0

Mission: Impossible 2 (2000) 5.0

Gladiator (2000) 5.0

Hook (1991) 5.0

Funny Bones (1995) 5.0

Creature Comforts (1990) 5.0

Do the Right Thing (1989) 5.0

59

Thelma & Louise (1991) 5.0 - Bayesian average

• A simple weighted average that accounts for how many ratings there are

• Basically, you take the set of ratings and add n extra “fake” ratings of the average value

• So for movies, we use the average of 3.0

(sum(numbers) + (3.0 * n)) / float(len(numbers) + n)

>>> avg([5.0], 2)
3.6666666666666665
>>> avg([5.0, 5.0], 2)
4.0
>>> avg([5.0, 5.0, 5.0], 2)
4.2
>>> avg([5.0, 5.0, 5.0, 5.0], 2)
4.333333333333333

60 - With k=3

===== RECOMMENDATIONS ===============

Truman Show, The (1998) 4.2

Say Anything... (1989) 4.0

Jerry Maguire (1996) 4.0

Groundhog Day (1993) 4.0

Monty Python and the Holy Grail (1974) 4.0

Big Night (1996) 4.0

Babe (1995) 4.0

What About Bob? (1991) 3.75
Howards End (1992) 3.75
Winslow Boy, The (1998) 3.75
Shakespeare in Love (1998) 3.75

Not very good, but k=3 makes us very dependent on those specific 3 users.

61 - With k=10

Definitely better.

===== RECOMMENDATIONS ===============

Groundhog Day (1993) 4.55555555556

Annie Hall (1977) 4.4

One Flew Over the Cuckoo's Nest (1975) 4.375

Fargo (1996) 4.36363636364

Wallace & Gromit: The Best of Aardman Animation

(1996) 4.33333333333

Do the Right Thing (1989) 4.28571428571

Princess Bride, The (1987) 4.28571428571

Welcome to the Dollhouse (1995) 4.28571428571

Wizard of Oz, The (1939) 4.25

Blood Simple (1984) 4.22222222222

Rushmore (1998) 4.2

62 - With k=50

===== RECOMMENDATIONS ===============

Wallace & Gromit: The Best of Aardman Animation

(1996) 4.55

Roger & Me (1989) 4.5

Waiting for Guffman (1996) 4.5

Grand Day Out, A (1992) 4.5

Creature Comforts (1990) 4.46666666667

Fargo (1996) 4.46511627907

Godfather, The (1972) 4.45161290323

Raising Arizona (1987) 4.4347826087

City Lights (1931) 4.42857142857

Usual Suspects, The (1995) 4.41666666667

Manchurian Candidate, The (1962) 4.41176470588

63 - With k = 2,000,000

• If we did that, what results would we

get?

64 - Normalization

• People use the scale differently

– some give only 4s and 5s

– others give only 1s

– some give only 1s and 5s

– etc

• Should have normalized user ratings

before using them

– before comparison

– and before averaging ratings from

neighbours

65 - Naïve Bayes

66 - Bayes’s Theorem

• Basically a theorem for combining

probabilities

– I’ve observed A, which indicates H is true

with probability 70%

– I’ve also observed B, which indicates H is

true with probability 85%

– what should I conclude?

• Naïve Bayes is basically using this

theorem

– with the assumption that A and B are independent

– this assumption is nearly always false, hence

“naïve”

67 - Simple example

• Is the coin fair or not?

– we throw it 10 times, get 9 heads and one tail

– we try again, get 8 heads and two tails

• What do we know now?

– can combine data and recompute

– or just use Bayes’s Theorem directly

>>> compute_bayes([0.92, 0.84])

0.9837067209775967

68

http://www.bbc.co.uk/news/magazine-

22310186 - Ways I’ve used Bayes

• Duke

– record deduplication engine

– estimate probability of duplicate for each property

– combine probabilities with Bayes

• Whazzup

– news aggregator that finds relevant news

– works essentially like spam classifier on next slide

• Tine recommendation prototype

– recommends recipes based on previous choices

– also like spam classifier

• Classifying expenses

– using export from my bank

– also like spam classifier

69 - Bayes against spam

• Take a set of emails, divide it into spam

and non-spam (ham)

– count the number of times a feature appears in

each of the two sets

– a feature can be a word or anything you please

• To classify an email, for each feature in it

– consider the probability of email being spam given

that feature to be (spam count) / (spam count +

ham count)

– ie: if “viagra” appears 99 times in spam and 1 in

ham, the probability is 0.99

• Then combine the probabilities with Bayes

70 http://www.paulgraham.com/spam.html - Running the script

• I pass it

– 1000 emails from my Bouvet folder

– 1000 emails from my Spam folder

• Then I feed it

– 1 email from another Bouvet folder

– 1 email from another Spam folder

71 - Code

# scan spam

for spam in glob.glob(spamdir + '/' + PATTERN)[ : SAMPLES]:

for token in featurize(spam):

corpus.spam(token)

# scan ham

for ham in glob.glob(hamdir + '/' + PATTERN)[ : SAMPLES]:

for token in featurize(ham):

corpus.ham(token)

# compute probability

for email in sys.argv[3 : ]:

print email

p = classify(email)

if p < 0.2:

print ' Spam', p

else:

print ' Ham', p

72

https://github.com/larsga/py-snippets/tree/master/machine-

learning/spam - Classify

class Feature:

def __init__(self, token):

self._token = token

self._spam = 0

self._ham = 0

def spam(self):

self._spam += 1

def ham(self):

self._ham += 1

def spam_probability(self):

return (self._spam + PADDING) / float(self._spam + self._ham + (PADDING * 2))

def compute_bayes(probs):

product = reduce(operator.mul, probs)

lastpart = reduce(operator.mul, map(lambda x: 1-x, probs))

if product + lastpart == 0:

return 0 # happens rarely, but happens

else:

return product / (product + lastpart)

def classify(email):

return compute_bayes([corpus.spam_probability(f) for f in featurize(email)])

73 - Ham output

Ham 1.0

So, clearly most of the spam is from March 2013...

Received:2013

0.00342935528121

Date:2013

0.00624219725343

<br

0.0291715285881

background-color: 0.03125

background-color: 0.03125

background-color: 0.03125

background-color: 0.03125

background-color: 0.03125

Received:Mar

0.0332667997339

Date:Mar

0.0362756952842

...

Postboks

0.998107494322

Postboks

0.998107494322

Postboks

0.998107494322

+47

0.99787414966

+47

0.99787414966

+47

0.99787414966

+47

0.99787414966

Lars

0.996863237139

Lars

0.996863237139

23

0.995381062356

74 - Spam output

Spam 2.92798502037e-16

...and the ham from October 2012

Received:-0400

0.0115646258503

Received:-0400

0.0115646258503

Received-SPF:(ontopia.virtual.vps-host.net:

0.0135823429542

Received-SPF:receiver=ontopia.virtual.vps-host.net; 0.0135823429542

Received:<larsga@ontopia.net>;

0.0139318885449

Received:<larsga@ontopia.net>;

0.0139318885449

Received:ontopia.virtual.vps-host.net

0.0170863309353

Received:(8.13.1/8.13.1)

0.0170863309353

Received:ontopia.virtual.vps-host.net

0.0170863309353

Received:(8.13.1/8.13.1)

0.0170863309353

...

Received:2012

0.986111111111

Received:2012

0.986111111111

$

0.983193277311

Received:Oct

0.968152866242

Received:Oct

0.968152866242

Date:2012

0.959459459459

20

0.938864628821

+

0.936526946108

+

0.936526946108

+

0.936526946108

75 - More solid testing

• Using the SpamAssassin public

corpus

• Training with 500 emails from

– spam

– easy_ham (2002)

• Test results

– spam_2: 1128 spam, 269 misclassified as ham

– easy_ham 2003: 2283 ham, 217 misclassified as spam

• Results are pretty good for 30

minutes of effort...

76

http://spamassassin.apache.org/publiccorpus/ - Linear regression

77 - Linear regression

• Let’s say we have a number of

numerical parameters for an object

• We want to use these to predict some

other value

• Examples

– estimating real estate prices

– predicting the rating of a beer

– ...

78 - Estimating real estate prices

• Take parameters

– x1 square meters

– x2 number of rooms

– x3 number of floors

– x4 energy cost per year

– x5 meters to nearest subway station

– x6 years since built

– x7 years since last refurbished

– ...

• a x1 + b x2 + c x3 + ... = price

– strip out the x-es and you have a vector

– collect N samples of real flats with prices = matrix

– welcome to the world of linear algebra

79 - Our data set: beer ratings

• Ratebeer.com

– a web site for rating beer

– scale of 0.5 to 5.0

• For each beer we know

– alcohol %

– country of origin

– brewery

– beer style (IPA, pilsener, stout, ...)

• But ... only one attribute is numeric!

– how to solve?

80 - Example

ABV   .se  .nl  .us  .uk  IIPA  Black IPA  Pale ale  Bitter  Rating
8.5   1.0  0.0  0.0  0.0  1.0   0.0        0.0       0.0     3.5
8.0   0.0  1.0  0.0  0.0  0.0   1.0        0.0       0.0     3.7
6.2   0.0  0.0  1.0  0.0  0.0   0.0        1.0       0.0     3.2
4.4   0.0  0.0  0.0  1.0  0.0   0.0        0.0       1.0     3.2
...   ...  ...  ...  ...  ...   ...        ...       ...     ...

Basically, we turn each category into a column of 0.0 or 1.0 values.
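That category-to-columns step can be sketched like this; a minimal version for a single categorical column, where the column order is alphabetical as an arbitrary choice.

```python
def one_hot(values):
    # turn one categorical column into one 0.0/1.0 column per category
    categories = sorted(set(values))
    return [[1.0 if value == category else 0.0 for category in categories]
            for value in values]

# country of origin becomes four numeric columns (.nl, .se, .uk, .us)
countries = [".se", ".nl", ".us", ".uk"]
encoded = one_hot(countries)
```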

81 - Normalization

• If some columns have much bigger values than the others they will automatically dominate predictions

• We solve this by normalization

• Basically, all values get resized into the 0.0-1.0 range

• For ABV we set a ceiling of 15%

– compute with min(15.0, abv) / 15.0

82 - Adding more data

• To get a bit more data, I added manually

a description of each beer style

• Each beer style got a 0.0-1.0 rating on

– colour (pale/dark)

– sweetness

– hoppiness

– sourness

• These ratings are kind of coarse because

all beers of the same style get the same

value

83 - Making predictions

• We’re looking for a formula

– a * abv + b * .se + c * .nl + d * .us + ... = rating

• We have n examples

– a * 8.5 + b * 1.0 + c * 0.0 + d * 0.0 + ... = 3.5

• We have one unknown per column

– as long as we have more rows than columns we

can solve the equation

• Interestingly, matrix operations can be

used to solve this easily

84 - Matrix formulation

• Let’s say

– x is our data matrix

– y is a vector with the ratings and

– w is a vector with the a, b, c, ... values

• That is: x * w = y

– this is the same as the original equation

– a x1 + b x2 + c x3 + ... = rating

• If we solve this, we get w = (xᵀx)⁻¹ xᵀy

85 - Enter Numpy

• Numpy is a Python library for matrix

operations

• It has built-in types for vectors and

matrices

• Means you can very easily work with

matrices in Python

• Why matrices?

– much easier to express what we want to do

– library written in C and very fast

– takes care of rounding errors, etc

86 - Quick Numpy example

>>> from numpy import *

>>> range(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> [range(10)] * 10

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]

>>> m = mat([range(10)] * 10)

>>> m

matrix([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

>>> m.T

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],

[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],

[4, 4, 4, 4, 4, 4, 4, 4, 4, 4],

[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],

[6, 6, 6, 6, 6, 6, 6, 6, 6, 6],

[7, 7, 7, 7, 7, 7, 7, 7, 7, 7],

[8, 8, 8, 8, 8, 8, 8, 8, 8, 8],

[9, 9, 9, 9, 9, 9, 9, 9, 9, 9]])

87 - Numpy solution

• We load the data into

– a list: scores

– a list of lists: parameters

• Then:

x_mat = mat(parameters)

y_mat = mat(scores).T

x_tx = x_mat.T * x_mat

assert linalg.det(x_tx)

ws = x_tx.I * (x_mat.T * y_mat)

88 - Does it work?

• We only have very rough information

about each beer (abv, country, style)

– so very detailed prediction isn’t possible

– but we should get some indication

• Here are the results based on my ratings

– 10% imperial stout from US: 3.9

– 4.5% pale lager from Ukraine: 2.8

– 5.2% German schwarzbier: 3.1

– 7.0% German doppelbock: 3.5

89

http://www.ratebeer.com/user/15206/ratings/ - Beyond prediction

• We can use this for more than just prediction

• We can also use it to see which columns

contribute the most to the rating

– that is, which aspects of a beer best predict the rating

• If we look at the w vector we see the following

– Aspect      LMG   grove
– ABV         0.56  1.1
– colour      0.46  0.42
– sweetness   0.25  0.51
– hoppiness   0.45  0.41
– sourness    0.29  0.87

• Could also use correlation

90 - Did we underfit?

• Who says the relationship between

ABV and the rating is linear?

– perhaps very low and very high ABV are both

negative?

– we cannot capture that with linear regression

• Solution

– add computed columns for parameters

raised to higher powers

– abv2, abv3, abv4, ...

– beware of overfitting...

91 - Scatter plot

[Scatter plot: Rating vs ABV in %; the outliers are freeze-distilled Brewdog beers]

92

Code in Github, requires matplotlib - Trying again

93 - Matrix factorization

• Another way to do recommendations

is matrix factorization

– basically, make a user/item matrix with

ratings

– try to find two smaller matrices that, when

multiplied together, give you the original

matrix

– that is, original with missing values filled in

• Why that works?

– I don’t know

– I tried it, couldn’t get it to work

– therefore we’re not covering it

– known to be a very good method, however

94 - Clustering

95 - Clustering

• Basically, take a set of objects and sort them into groups
  – objects that are similar go into the same group
• The groups are not defined beforehand
• Sometimes the number of groups to create is input to the algorithm
• Many, many different algorithms for this

96 - Sample data

• Our sample data set is data about aircraft from DBpedia
• For each aircraft model we have
  – name
  – length (m)
  – height (m)
  – wingspan (m)
  – number of crew members
  – operational ceiling, or max height (m)
  – max speed (km/h)
  – empty weight (kg)
• We use a subset of the data
  – 149 aircraft models which all have values for all of these properties
• Also, all values normalized to the 0.0-1.0 range

97 - Distance

• All clustering algorithms require a distance function
  – that is, a measure of similarity between two objects
• Any kind of distance function can be used
  – generally, lower values mean more similar
• Examples of distance functions
  – metric distance
  – vector cosine
  – RMSE
  – ...
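As one concrete option, an RMSE-style distance over two equal-length property vectors might look like this (a sketch, not the Github code):

```python
import math

def rmse_distance(a, b):
    # Root-mean-square difference between two equal-length property
    # vectors; 0.0 for identical objects, larger means less similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

print(rmse_distance([0.2, 0.4, 0.6], [0.2, 0.4, 0.6]))  # 0.0
```

Because the aircraft values are normalized to 0.0-1.0, no single property dominates the distance.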

98 - k-means clustering

• Input: the number of clusters to create (k)
• Pick k objects
  – these are your initial clusters
• For all objects, find nearest cluster
  – assign the object to that cluster
• For each cluster, compute mean of all properties
  – use these mean values to compute distance to clusters
  – the mean is often referred to as a “centroid”
  – go back to previous step
• Continue until no objects change cluster
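The steps above can be sketched in plain Python (a toy version, not the Github code; real implementations need smarter initialization and tie-breaking):

```python
import random

def kmeans(objects, k, distance, iterations=100):
    # Pick k objects as the initial centroids
    centroids = random.sample(objects, k)
    for _ in range(iterations):
        # Assignment step: each object goes to the nearest centroid
        clusters = [[] for _ in range(k)]
        for obj in objects:
            nearest = min(range(k), key=lambda i: distance(obj, centroids[i]))
            clusters[nearest].append(obj)
        # Update step: each centroid becomes the mean of its cluster,
        # property by property (an empty cluster keeps its old centroid)
        new_centroids = [
            [sum(vals) / len(vals) for vals in zip(*cluster)] if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return clusters

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two obvious groups of 2-D points
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
print([len(c) for c in kmeans(data, 2, euclidean)])
```

Stopping when the centroids stop moving is equivalent, at the fixed point, to stopping when no object changes cluster.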

99 - First attempt at aircraft

• We leave out name and number built when doing comparison
• We use RMSE as the distance measure
• We set k = 5
• What happens?
  – first iteration: all 149 assigned to a cluster
  – second: 11 models change cluster
  – third: 7 change
  – fourth: 5 change
  – fifth: 5 change
  – sixth: 2
  – seventh: 1
  – eighth: 0

100 - cluster5, 4 models

Cluster 5
  ceiling : 13400.0
  maxspeed : 1149.7
  crew : 7.5
  length : 47.275
  height : 11.65
  emptyweight : 69357.5
  wingspan : 47.18

3 jet bombers, one propeller bomber. Not too bad.

• The Myasishchev M-50 was a Soviet prototype four-engine supersonic bomber which never attained service
• The Myasishchev M-4 Molot is a four-engined strategic bomber
• The Convair B-36 "Peacemaker" was a strategic bomber built by Convair and operated solely by the United States Air Force (USAF) from 1949 to 1959
• The Tupolev Tu-16 was a twin-engine jet bomber used by the Soviet Union.

101 - cluster4, 56 models

Cluster 4
  ceiling : 5898.2
  maxspeed : 259.8
  crew : 2.2
  length : 10.0
  height : 3.3
  emptyweight : 2202.5
  wingspan : 13.8

Small, slow propeller aircraft. Not too bad.

• The Avia B.135 was a Czechoslovak cantilever monoplane fighter aircraft
• The Yakovlev UT-1 was a single-seater trainer aircraft
• The Siebel Fh 104 Hallore was a small German twin-engined transport, communications and liaison aircraft
• The Yakovlev UT-2 was a single-seater trainer aircraft
• The North American B-25 Mitchell was an American twin-engined medium bomber
• The Airco DH.2 was a single-seat biplane "pusher" aircraft
• The Messerschmitt Bf 108 Taifun was a German single-engine sports and touring aircraft

102 - cluster3, 12 models

Cluster 3
  ceiling : 16921.1
  maxspeed : 2456.9
  crew : 2.67
  length : 17.2
  height : 4.92
  emptyweight : 9941
  wingspan : 10.1

Small, very fast jet planes. Pretty good.

• The Mikoyan MiG-29 is a fourth-generation jet fighter aircraft
• The English Electric Lightning is a supersonic jet fighter aircraft of the Cold War era, noted for its great speed.
• The Northrop T-38 Talon is a two-seat, twin-engine supersonic jet trainer
• The Vought F-8 Crusader was a single-engine, supersonic [fighter] aircraft
• The Dassault Mirage 5 is a supersonic attack aircraft
• The Mikoyan MiG-35 is a further development of the MiG-29

103 - cluster2, 27 models

Cluster 2
  ceiling : 6447.5
  maxspeed : 435
  crew : 5.4
  length : 24.4
  height : 6.7
  emptyweight : 16894
  wingspan : 32.8

Biggish, kind of slow planes. Some oddballs in this group.

• The Bartini Beriev VVA-14 (vertical take-off amphibious aircraft)
• The Fokker 50 is a turboprop-powered airliner
• The Junkers Ju 89 was a heavy bomber
• The Aviation Traders ATL-98 Carvair was a large piston-engine transport aircraft.
• The PB2Y Coronado was a large flying boat patrol bomber
• The Beriev Be-200 Altair is a multipurpose amphibious aircraft
• The Junkers Ju 290 was a long-range transport, maritime patrol aircraft and heavy bomber

104 - cluster1, 50 models

Cluster 1
  ceiling : 11612
  maxspeed : 726.4
  crew : 1.6
  length : 11.9
  height : 3.8
  emptyweight : 5303
  wingspan : 13

Small, fast planes. Mostly good, though the Canberra is a poor fit.

• The Adam A700 AdamJet was a proposed six-seat civil utility aircraft
• The Curtiss P-36 Hawk was an American-designed and built fighter aircraft
• The English Electric Canberra is a first-generation jet-powered light bomber
• The Heinkel He 100 was a German pre-World War II fighter aircraft
• The Learjet 23 is a ... twin-engine, high-speed business jet
• The Kawasaki Ki-61 Hien was a Japanese World War II fighter aircraft
• The Learjet 24 is a ... twin-engine, high-speed business jet
• The Grumman F3F was the last American biplane fighter aircraft

105 - Clusters, summarizing

• Cluster 1: small, fast aircraft (750 km/h)
• Cluster 2: big, slow aircraft (450 km/h)
• Cluster 3: small, very fast jets (2500 km/h)
• Cluster 4: small, very slow planes (250 km/h)
• Cluster 5: big, fast jet planes (1150 km/h)

For a first attempt to sort through the data, this is not bad at all.

106 https://github.com/larsga/py-snippets/tree/master/machine-learning/aircraft - Agglomerative clustering

• Put all objects in a pile

• Make a cluster of the two objects closest to one another
  – from here on, treat clusters like objects

• Repeat second step until satisfied
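A naive sketch of this procedure (assuming single-linkage distance between clusters, which is one of several common choices; this is not the Github code):

```python
def agglomerate(objects, distance, target):
    # Start with every object in its own cluster
    clusters = [[obj] for obj in objects]
    # Merge the two closest clusters until only `target` remain;
    # cluster-to-cluster distance is the smallest pairwise distance
    # ("single linkage")
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# 1-D toy data: two tight pairs
print(agglomerate([0.0, 0.1, 1.0, 1.1], lambda a, b: abs(a - b), 2))
# [[0.0, 0.1], [1.0, 1.1]]
```

Unlike k-means, this also yields a merge hierarchy, so "until satisfied" can mean any target number of clusters.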

107 There is code for this, too, in the Github sample - Principal component analysis

108 - PCA

• Basically, using eigenvalue analysis to find out which variables contain the most information
  – the maths are pretty involved
  – and I’ve forgotten how it works
  – and I’ve thrown out my linear algebra book
  – and ordering a new one from Amazon takes too long
  – ...so we’re going to do this intuitively

109 - An example data set

• Two variables
• Three classes
• What’s the longest line we could draw through the data?
• That line is a vector in two dimensions
• What dimension dominates?
  – that’s right: the horizontal
  – this implies the horizontal contains most of the information in the data set
• PCA identifies the most significant variables

110 - Dimensionality reduction

• After PCA we know which dimensions matter
  – based on that information we can decide to throw out less important dimensions
• Result
  – smaller data set
  – faster computations
  – easier to understand

111 - Trying out PCA

• Let’s try it on the Ratebeer data
• We know ABV has the most information
  – because it’s the only value specified for each individual beer
• We also include a new column: alcohol
  – this is the amount of alcohol in a pint glass of the beer, measured in centiliters
  – this column basically contains no information at all; it’s computed from the abv column

112 - Complete code

import rblib
from numpy import *

def eigenvalues(data, columns):
    covariance = cov(data - mean(data, axis = 0), rowvar = 0)
    eigvals = linalg.eig(mat(covariance))[0]
    indices = list(argsort(eigvals))
    indices.reverse() # so we get most significant first
    return [(columns[ix], float(eigvals[ix])) for ix in indices]

(scores, parameters, columns) = rblib.load_as_matrix('ratings.txt')
for (col, ev) in eigenvalues(parameters, columns):
    print "%40s %s" % (col, float(ev))

113 - Output

abv 0.184770392185

colour 0.13154093951

sweet 0.121781685354

hoppy 0.102241100597

sour 0.0961537687655

alcohol 0.0893502031589

United States 0.0677552513387

....

Eisbock -3.73028421245e-18

Belarus -3.73028421245e-18

Vietnam -1.68514561515e-17

114 - MapReduce

115 - University pre-lecture, 1991

• My first meeting with university was Open University Day, in 1991
• Professor Bjørn Kirkerud gave the computer science talk
• His subject
  – some day processors will stop becoming faster
  – we’re already building machines with many processors
  – what we need is a way to parallelize software
  – preferably automatically, by feeding in normal source code and getting it parallelized back
• MapReduce is basically the state of the art on that today

116 - MapReduce

• A framework for writing massively parallel code
• Simple, straightforward model
• Based on “map” and “reduce” functions from functional programming (LISP)

117 - http://research.google.com/archive/mapreduce.html

Appeared in: OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.

118 - map and reduce

>>> "1 2 3 4 5 6 7 8".split()

['1', '2', '3', '4', '5', '6', '7', '8']

>>> l = map(int, "1 2 3 4 5 6 7 8".split())

>>> l

[1, 2, 3, 4, 5, 6, 7, 8]

>>> import operator

>>> reduce(operator.add, l)

36

119 - MapReduce

1. Split data into fragments
2. Create a Map task for each fragment
   – the task outputs a set of (key, value) pairs
3. Group the pairs by key
4. Call Reduce once for each key
   – all pairs with same key passed in together
   – reduce outputs new (key, value) pairs

Tasks get spread out over worker nodes
Master node keeps track of completed/failed tasks
Failed tasks are restarted
Failed nodes are detected and avoided
Also scheduling tricks to deal with slow nodes
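The data flow in steps 1-4 can be imitated on a single machine in a few lines of Python (a toy model only; it ignores distribution, fault tolerance, and scheduling):

```python
from collections import defaultdict

def map_reduce(fragments, mapper, reducer):
    # Toy, single-machine model of the MapReduce data flow
    groups = defaultdict(list)
    for fragment in fragments:          # step 2: one Map task per fragment
        for key, value in mapper(fragment):
            groups[key].append(value)   # step 3: group pairs by key
    result = {}
    for key, values in groups.items():  # step 4: one Reduce call per key
        for k, v in reducer(key, values):
            result[k] = v
    return result

# Word count expressed in this model
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return [(word, sum(counts))]

print(map_reduce(["a b a", "b c"], wc_mapper, wc_reducer))
# {'a': 2, 'b': 2, 'c': 1}
```

The same mapper/reducer pair reappears in Java form on the WordCount slides below.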

120 - Communications

• HDFS
  – Hadoop Distributed File System
  – input data, temporary results, and results are stored as files here
  – Hadoop takes care of making files available to nodes
• Hadoop RPC
  – how Hadoop communicates between nodes
  – used for scheduling tasks, heartbeat etc
• Most of this is in practice hidden from the developer

121 - Does anyone need MapReduce?

• I tried to do book recommendations with linear algebra
• Basically, doing matrix multiplication to produce the full user/item matrix with blanks filled in
• My Mac wound up freezing
• 185,973 books x 77,805 users = 14,469,629,265 cells
  – assuming 2 bytes per float = 28 GB of RAM
• So it doesn’t necessarily take that much to have some use for MapReduce

122 - The word count example

• Classic example of using MapReduce
• Takes an input directory of text files
• Processes them to produce word frequency counts
• To start up, copy data into HDFS
  – bin/hadoop dfs -mkdir <hdfs-dir>
  – bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>

123 - WordCount – the mapper

public static class Map
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context) {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

By default, Hadoop will scan all text files in the input directory
Each input split becomes a mapper task
Each line then becomes a “Text value” input to a map() call

124 - WordCount – the reducer

public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context) {
    int sum = 0;
    for (IntWritable val : values)
      sum += val.get();
    context.write(key, new IntWritable(sum));
  }
}

125 - The Hadoop ecosystem

• Pig
  – dataflow language for setting up MR jobs
• HBase
  – NoSQL database to store MR input in
• Hive
  – SQL-like query language on top of Hadoop
• Mahout
  – machine learning library on top of Hadoop
• Hadoop Streaming
  – utility for writing mappers and reducers as command-line tools in other languages

126 - Word count in HiveQL

CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words...
CREATE TABLE words (word STRING);

add file splitter.py;

INSERT OVERWRITE TABLE words
SELECT TRANSFORM(line)
USING 'python splitter.py'
AS word
FROM input;

SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable as word
GROUP BY word;

127 - Word count in Pig

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

128 - Applications of MapReduce

• Linear algebra operations
  – easily mapreducible
• SQL queries over heterogeneous data
  – basically requires only a mapping to tables
  – relational algebra easy to do in MapReduce
• PageRank
  – basically one big set of matrix multiplications
  – the original application of MapReduce
• Recommendation engines
  – the SON algorithm
• ...

129 - Apache Mahout

• Has three main application areas
  – others are welcome, but this is mainly what’s there now
• Recommendation engines
  – several different similarity measures
  – collaborative filtering
  – Slope-one algorithm
• Clustering
  – k-means and fuzzy k-means
  – Latent Dirichlet Allocation
• Classification
  – stochastic gradient descent
  – Support Vector Machines
  – Naïve Bayes

130 - SQL to relational algebra

select lives.person_name, city
from works, lives
where company_name = 'FBC' and
      works.person_name = lives.person_name

131 - Translation to MapReduce

• σ(company_name='FBC', works)
  – map: for each record r in works, verify the condition, and pass (r, r) if it matches
  – reduce: receive (r, r) and pass it on unchanged
• π(person_name, σ(...))
  – map: for each record r in input, produce a new record r' with only the wanted columns, pass (r', r')
  – reduce: receive (r', [r', r', r' ...]), output (r', r')
• ⋈(π(...), lives)
  – map:
    • for each record r in π(...), output (person_name, r)
    • for each record r in lives, output (person_name, r)
  – reduce: receive (key, [record, record, ...]), and perform the actual join

• ...
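The join step can be sketched in plain Python with hypothetical works/lives tables (the grouping dict plays the role of MapReduce's shuffle phase; names and cities are made up):

```python
from collections import defaultdict

# Hypothetical tables for the query on the previous slide
works = [("joe", "FBC"), ("ann", "FBC"), ("bob", "Acme")]
lives = [("joe", "Oslo"), ("ann", "Bergen"), ("bob", "Paris")]

# Map phase: apply the selection, tag each record with its source
# table, and key everything by person_name
groups = defaultdict(list)
for name, company in works:
    if company == "FBC":  # the selection sigma(company_name='FBC')
        groups[name].append(("works", company))
for name, city in lives:
    groups[name].append(("lives", city))

# Reduce phase: for each key, join the records from the two tables
result = []
for name, records in groups.items():
    if any(tag == "works" for tag, _ in records):
        for tag, city in records:
            if tag == "lives":
                result.append((name, city))

print(sorted(result))  # [('ann', 'Bergen'), ('joe', 'Oslo')]
```

In real Hadoop the grouping by key happens between the map and reduce phases, across machines.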

132 - Lots of SQL-on-MapReduce tools

• Tenzing     Google
• Hive        Apache Hadoop
• YSmart      Ohio State
• SQL-MR      AsterData
• HadoopDB    Hadapt
• Polybase    Microsoft
• RainStor    RainStor Inc.
• ParAccel    ParAccel Inc.
• Impala      Cloudera
• ...

133 - Conclusion

134 - Big data & machine learning

• This is a huge field, growing very fast
• Many algorithms and techniques
  – can be seen as a giant toolbox with wide-ranging applications
• Ranging from the very simple to the extremely sophisticated
• Difficult to see the big picture
• Huge range of applications
• Math skills are crucial

135