Performant data processing with PySpark, SparkR and the DataFrame API
Ryuji Tamagawa, from Osaka
Many thanks to Holden Karau for the discussion we had about this talk.
Agenda
Who am I?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I?
Software engineer working for Sky, from architecture design to troubleshooting in the field
Translator working with O'Reilly Japan; 'Learning Spark' is my 27th book
Awarded the Rakuten Tech Award Silver 2010 for translating 'Hadoop: The Definitive Guide'
A bed for 6 cats
Works of 2015 [book covers; available Jan 2016?]
Works of the past [book covers]
Motivation for today's talk
I want to deal with my 'big' data, WITH PYTHON!!
Apache Spark
You may already have heard a lot: fast, distributed data processing
An in-memory engine with high-level APIs (Spark Streaming, MLlib, GraphX, Spark SQL)
Part of the ecosystem alongside Hive, Impala, MapReduce, HBase etc., running on YARN and HDFS
Written in Scala; runs in the JVM
Why it’s fast Do not need to write temporary data to storage every time Do not need to invoke JVM process every time MapReduce Spark JVM Invocation Executor（JVM）Invocation I/O map I/0 f1（read data to RDD） access Memory (RDDs) JVM Invocation f2 access reduce I/0 HDFS access HDFS f3 JVM Invocation f4（persist to storage） access I/O map I/0 f5（does shuffle） I/O f6 access JVM Invocation f7 access reduce I/0
Apache Spark and non-JVM languages
Spark supports non-JVM languages
Shells: PySpark for Python users, SparkR for R users
GUI environments: Jupyter, RStudio
You can write application code in these languages
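For instance, a standalone PySpark application is ordinary Python (a minimal sketch; the file name, input path and app name are illustrative):

    # my_app.py — run with: spark-submit my_app.py
    from pyspark import SparkContext

    sc = SparkContext(appName="MyApp")
    rdd = sc.textFile("/tmp/sample.txt")             # hypothetical input file
    print(rdd.filter(lambda line: "error" in line).count())
    sc.stop()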
The Web UI tells us a lot http://<address>:4040
Performance problems with those languages
Data processing with those languages may be several times slower than with JVM languages
The reason lies in the architecture: data has to be serialized and shipped between the JVM and the Python (or R) worker processes
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
The choices you have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more CPU cores to make up the performance gap
DataFrame APIs come to the rescue !
DataFrame
Tabular data with a schema, based on RDDs
Successor of SchemaRDD (since 1.4)
Has a rich set of APIs for data operations
Or, you can simply use SQL!
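A minimal sketch of both styles (Spark 1.4-era API; the JSON file, table name and columns are illustrative):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)                       # assumes a running SparkContext `sc`
    df = sqlContext.read.json("people.json")          # tabular data with a schema

    df.select("name", "age").show()                   # the rich DataFrame API...
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 20").show()   # ...or simply SQL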
Do it within the JVM
When you call DataFrame APIs from non-JVM languages, only code goes through to the JVM; data is not transferred between the JVM and the language runtime
As a result, performance is almost the same as with JVM languages
DataFrame APIs compared to RDD APIs, by example
[Diagram: with the RDD API, a Python lambda such as lambda items: items == 'abc' runs in the Python runtime, so the cached DataFrame's rows are transferred from the JVM executor to Python, and the result is transferred back to the driver]
[Diagram: with the DataFrame API, filter(df["_1"] == "abc") is evaluated inside the JVM executor; only the code crosses over from the Python driver, and only the result comes back]
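In code, the two variants from the diagrams look roughly like this (a sketch; df is a cached DataFrame and "_1" is the column name used on the slides):

    # RDD-style: the Python lambda runs in Python worker processes,
    # so every row is serialized from the JVM to Python and back.
    df.rdd.filter(lambda row: row._1 == "abc").count()

    # DataFrame-style: the predicate is an expression the JVM evaluates,
    # so only the code crosses the language boundary, not the data.
    df.filter(df["_1"] == "abc").count()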
Watch out for UDFs
You can write UDFs in Python; you can use lambdas in Python, too:

    from pyspark.sql.functions import udf      # imports added for completeness
    from pyspark.sql.types import IntegerType

    slen = udf(lambda s: len(s), IntegerType())
    df.select(slen(df.name)).collect()

Once you use them, data flows between the two worlds
Make it small first, then use UDFs
Filter or sample your 'BIG' data with DataFrame-native APIs first; then use UDFs on the resulting 'small' DataFrame for whatever operation you need
The SQL optimizer does not take UDFs into account when making plans (so far)
[Diagram: 'BIG' data in a DataFrame → filtering with 'native' APIs → 'small' data in a DataFrame → operations with UDFs]

    slen = udf(lambda s: len(s), IntegerType())
    sqc.sql(
        'select … from df '
        'where fname like "tama%" '   # processed first!
        'and slen(name)'
    ).collect()
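The same pattern with the DataFrame API instead of SQL (a sketch, reusing the slen UDF defined above; the fname and name columns follow the slide):

    # Native filter first: it runs in the JVM and the optimizer can use it,
    # so only the surviving rows are ever shipped to the Python UDF.
    small = df.filter(df.fname.like("tama%"))
    small.select(slen(small.name)).collect()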
Ingesting Data
It's slow to deal with files like CSVs in the non-JVM driver
So, convert raw data to 'DataFrame-native' formats like Parquet first
You can then process such files directly from the JVM processes (executors), even when using non-JVM languages
[Diagram: on the driver machine, the Python driver talks to the JVM driver via Py4J and only code goes through; the executors read Parquet from HDFS directly, instead of local data being funneled through the non-JVM driver]
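A conversion sketch. Note that reading CSV in Spark 1.x needs the external spark-csv package (an assumption here, not something the slides specify); paths are illustrative.

    # One-time conversion: CSV -> Parquet (requires the com.databricks:spark-csv package)
    raw = sqlContext.read.format("com.databricks.spark.csv") \
                    .options(header="true", inferSchema="true") \
                    .load("hdfs:///raw/data.csv")
    raw.write.parquet("hdfs:///warehouse/data.parquet")

    # From now on, executors read the Parquet files directly from HDFS;
    # no raw data flows through the non-JVM driver.
    df = sqlContext.read.parquet("hdfs:///warehouse/data.parquet")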
Appendix : Parquet
Parquet: a general-purpose file format for analytic workloads
Columnar storage: reduces I/O significantly (projection pushdown)
High compression rate
Today's workloads are becoming CPU-intensive: Parquet offers very fast reads and is designed with modern CPU internals in mind
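Projection pushdown in practice (a sketch; path and columns are illustrative): because Parquet is columnar, selecting a few columns means only those column chunks are read from disk.

    df = sqlContext.read.parquet("hdfs:///warehouse/data.parquet")
    df.select("name", "age").where(df.age > 20).show()   # reads only the needed columns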