Overview
●Motivation
●Spark Implementation
  ○Collaborative Filtering
  ○Data Frames
  ○BLAS-3
●Results and lessons learnt
Motivation
●App discovery is a challenging problem due to the exponential growth in the number of apps
●Over 1.5 million apps available through the two major marketplaces (i.e. the iTunes and Google Play stores)
●Develop an app recommendation engine using various user behavior signals
  ○Explicit signal (App rating)
  ○Implicit signal (frequency/duration of app usage)
Flurry Data and Summary
●Data available through the Flurry SDK is rich in both coverage and depth
●Collected session lengths for Apps used on the iOS platform between September 1-15, 2015
●Restricted the analysis to Apps used by 100 or more users
  ○~496 million users
  ○~53,793 Apps
Data Summary
●User count: 496,508,312
●App count: 153,773
●Apps with 100+ users: 53,793
●Train time: 52 minutes
●Predict time: 8 minutes
Our Approach
●Utilize collaborative-filtering-based App recommendation
●Run collaborative filtering that works at scale to generate:
  ○Low-dimensional user features
  ○Low-dimensional App features
  ○Compute the user x App rating for all possible combinations (~496.5 million users x 53,793 Apps ≈ 26.7 trillion)
●Used the Spark framework to efficiently train and recommend
Collaborative Filtering Model ●Projects the users and Apps (in our case) into a lower dimensional space
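●The deck does not show the training call itself; below is a minimal sketch of fitting such a model with MLlib's implicit-feedback ALS (the ratings RDD, the iteration count, and the regularization/alpha values are illustrative assumptions; rank=60 comes from the parameter search on the next slide). The predicted rating for user i and App j is then the dot product of their two latent feature vectors.

  from pyspark.mllib.recommendation import ALS, Rating

  # ratings: RDD of (userId, appId, value) triples built from the session data
  ratingsRDD = ratings.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))

  # implicit-feedback ALS with 60 latent factors
  model = ALS.trainImplicit(ratingsRDD, rank=60, iterations=10, lambda_=0.01, alpha=0.01)

  # the two low-dimensional factor matrices
  userFeatures = model.userFeatures()      # RDD of (userId, array of 60 floats)
  appFeatures = model.productFeatures()    # RDD of (appId, array of 60 floats)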
Model Fitting and Parameter Optimization
●Used out-of-sample prediction accuracy on users with 20+ Apps
●The MSE was lowest with the number of latent factors fixed at 60
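●The slide only reports the winning value; a hedged sketch of how such a rank search could look with MLlib is below (the train/test RDDs of Rating objects, the candidate ranks other than 60, and the regularization value are assumptions).

  from pyspark.mllib.recommendation import ALS

  # train/test: RDDs of Rating built from a holdout split of users with 20+ Apps
  bestRank, bestMSE = None, float("inf")
  for rank in [20, 40, 60, 80]:
      model = ALS.trainImplicit(train, rank=rank, iterations=10, lambda_=0.01)
      preds = model.predictAll(test.map(lambda r: (r.user, r.product))) \
                   .map(lambda r: ((r.user, r.product), r.rating))
      truth = test.map(lambda r: ((r.user, r.product), r.rating))
      mse = truth.join(preds).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()
      if mse < bestMSE:
          bestRank, bestMSE = rank, mse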
Data Frames
●Join operation can greatly benefit from caching
●Filter out Apps that have fewer than 100 users
  cleandata = allapps.join(cleanapps)
●Do a replicated join in Spark
  MAXAPPS = 100
  # only keep the Apps that had 100 or more users
  cleanapps = myapps.filter(lambda x: x >= MAXAPPS).map(lambda x: int(x))
  # persist the Apps data on every executor
  apps = sc.broadcast(set(cleanapps.collect()))
  # filter the full data set against it: this simulates a replicated join
  cleandata = allapps.filter(lambda x: x in apps.value)
Data Frames
●In Spark you can use a DataFrame directly
  from pyspark.sql import Row

  Record = Row("userId", "iuserId", "appId", "value")
  MAXAPPS = 100
  # transform allapps to a DataFrame
  allappsdf = allapps.map(lambda x: Record(*x)).toDF()
  # register the DataFrame and issue SQL queries against it
  sqlContext.registerDataFrameAsTable(allappsdf, "table1")
  # group by appId to get each App's average value and user count
  df2 = sqlContext.sql("SELECT appId as appId2, avg(value), count(*) as cnt from table1 group by appId")
  topappsdf = df2.filter(df2.cnt >= MAXAPPS)
  # DataFrame join
  cleandata = allappsdf.join(topappsdf, allappsdf.appId == topappsdf.appId2)
BLAS 3
●The number of possible user x App combinations is very large
●Default prediction: predictAll
  predictions = model.predictAll(testdata).map(lambda r: ((r.user, r.product), r.rating))
●Prediction is simply the matrix multiplication of the user “i” and App “j” feature vectors
●Never completes, and most of the time is spent on reshuffling
●The users are not partitioned, so they can be on all nodes
●The Apps are not partitioned, so they can be on all nodes
●Reshuffle is extremely slow
BLAS 3 ●The key is that the Number of Apps << Number of users ●Exploit the low number of Apps to optimize the prediction time
BLAS 3
●The App features, being smaller in size, can be stored in primary memory (BLAS 3)
●We broadcast the App features to all executors, which reduces the overall reshuffling of data
●Use the BLAS-3 matrix multiplication available within numpy, which is highly optimized
BLAS 3
●Basic Linear Algebra Subprograms, level 3: routines for problems of the form C = αA·B + βC
●Highly optimized for matrix-matrix multiplication
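●As a point of reference, the same level-3 operation can be exercised directly from numpy, which dispatches the multiplication to whatever optimized BLAS it was built against (the matrix sizes below are made up to mirror a 60-factor model).

  import numpy as np

  # toy GEMM-style update  C = alpha*A.B + beta*C  on a 60-factor sized problem
  A = np.random.rand(1000, 60)    # e.g. a block of user feature rows
  B = np.random.rand(60, 5000)    # e.g. a block of App feature columns
  C = np.zeros((1000, 5000))
  alpha, beta = 1.0, 0.0
  C = alpha * np.dot(A, B) + beta * C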
BLAS 3
  import numpy
  from numpy import *
  from pyspark.mllib.recommendation import MatrixFactorizationModel

  # load the trained matrix factorization model
  myModel = MatrixFactorizationModel.load(sc, "BingBong")
  # collect the App (product) feature vectors and lay them out as one factors x Apps matrix
  m1 = myModel.productFeatures()
  m2 = m1.map(lambda (product, feature): feature).collect()
  m3 = matrix(m2).transpose()
  # broadcast the App feature matrix to every executor
  pf = sc.broadcast(m3)
  uf = myModel.userFeatures().coalesce(100)
  # get predictions for all users: one BLAS-3 multiply per user feature vector
  f1 = uf.map(lambda (userID, features): (userID, squeeze(asarray(matrix(array(features)) * pf.value))))
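●Not shown on the slide: once f1 holds a dense score vector per user, a top-N App list can be read off with a plain argsort. A sketch under the assumption that the App ids collect in the same order as the columns of pf.value (appIds, topN and recommendations are illustrative names, not part of the original code).

  # App ids, collected in the same order as the feature matrix columns
  appIds = sc.broadcast(m1.map(lambda x: x[0]).collect())

  def topN(scores, n=10):
      # indices of the n largest predicted scores, best first
      best = scores.argsort()[-n:][::-1]
      return [appIds.value[i] for i in best]

  recommendations = f1.map(lambda x: (x[0], topN(x[1])))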
Evaluation: Predicted Score
Predicted Score: Positive
Predicted Score: Negative
Evaluation of Recommendation
●Identify users with high (low) scores
●Design of experiment:
  ○High score x Recommendation
  ○High score x Placebo
  ○Low score x Recommendation
  ○Low score x Placebo
Future Work
●Spark econometrics library (std. errors, robust std. errors, ...)
●Online experiments to measure the value of the recommendations
●Experiments with various implicit ratings:
  ○Number of sessions
  ○Days used
  ○Log of days used