Having a healthy relationship with the interpreter
• The Python interpreter itself is “slow” compared with hand-coded C or Java
• Each line of Python code may feature multiple internal C API calls, temporary data structures, etc.
• Python built-in data structures (numbers, strings, tuples, lists, dicts, etc.) carry significant memory and performance overhead (both effects are sketched below)
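To make both claims concrete, here is a small sketch (not from the slides) using only standard-library modules; the exact numbers vary by machine and CPython version.

```python
import array
import dis
import sys
import timeit

def add_all(values):
    total = 0
    for v in values:
        total += v  # one source line -> several bytecode ops, boxed ints, refcounts
    return total

# Each "line" of Python expands into multiple interpreter operations
dis.dis(add_all)

data = list(range(10_000))
# The C-implemented builtin avoids per-element bytecode dispatch
print(timeit.timeit(lambda: add_all(data), number=100))
print(timeit.timeit(lambda: sum(data), number=100))

# Memory overhead of built-in structures (sizes typical of 64-bit CPython)
print(sys.getsizeof(1))                     # ~28 bytes for one boxed int
nums = list(range(1_000_000))
print(sys.getsizeof(nums))                  # the pointer array alone is ~8 MB
packed = array.array("q", range(1_000_000))
print(sys.getsizeof(packed))                # raw 8-byte values, ~8 MB total
```

The `dis` output shows the single `total += v` line turning into several bytecode instructions, each of which is itself a C-level dispatch inside the interpreter.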
But it’s much easier to write 100% Python!
• Building hybrid C/C++ and Python systems adds a lot of complexity to the engineering process
• (but it’s often worth it)
• See: Cython, SWIG, Boost.Python, Pybind11, and other “hybrid” software creation tools (the C boundary they manage is sketched below)
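As a minimal illustration of the Python/C boundary those tools manage, here is a sketch using the standard-library ctypes module (not one of the tools named above; it is the same idea with the glue written by hand, and library lookup is platform-dependent):

```python
import ctypes
import ctypes.util

# Load the C math library (POSIX; path resolution varies by platform)
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes marshals arguments correctly
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0 -- computed in C, called from Python
```

Tools like Cython or Pybind11 generate and type-check this kind of glue for you, which is exactly the engineering complexity the slide refers to.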
It’s All About the Benjamins (Data Structures)
• The hard currency of data software is: in-memory data structures
• How costly are they to send and receive?
• How costly to manipulate and munge in-memory? (see the sketch below)
• How difficult is it to add new proprietary computation logic?
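A rough way to put numbers on the first two questions, sketched with pickle as a stand-in for whatever serialization a real system uses (the record layout here is invented for illustration):

```python
import pickle
import timeit

# A batch of plain-Python records (layout invented for illustration)
rows = [{"id": i, "price": float(i), "tag": "x" * 8} for i in range(100_000)]

# Cost to send/receive: round-trip the whole structure through a wire format
blob = pickle.dumps(rows)
print(len(blob), "bytes serialized")
print(timeit.timeit(lambda: pickle.loads(pickle.dumps(rows)), number=10))

# Cost to manipulate/munge in-memory: one full pass touches every boxed object
print(timeit.timeit(lambda: sum(r["price"] for r in rows), number=10))
```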
What’s this have to do with Spark?
• Some known performance issues in PySpark
• IO throughput
  • Python to Spark
  • Spark to Python (or Python extension code)
• Running interpreted Python code on RDDs / Spark DataFrames (both paths sketched below)
  • Lambda mappers / reducers (rdd.map(...))
  • Spark SQL UDFs (registerFunction(...))
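For reference, the two interpreted-code paths look like this in user code (a sketch assuming a local SparkSession; spark.udf.register is the modern spelling of the registerFunction call named on the slide):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("pyspark-slow-paths").getOrCreate()

# Path 1: RDD lambda -- each element is pickled to a Python worker process,
# run through the interpreted lambda, and pickled back to the JVM
rdd = spark.sparkContext.parallelize(range(1000))
print(rdd.map(lambda x: x * 2).sum())

# Path 2: Spark SQL UDF -- rows cross the JVM/Python boundary one at a time
spark.udf.register("plus_one", lambda x: x + 1, LongType())
spark.range(1000).createOrReplaceTempView("t")
spark.sql("SELECT plus_one(id) AS v FROM t").show(5)
```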
Apache Arrow in a Slide
• New Top-level Apache Software Foundation project
• http://arrow.apache.org
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best of breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
(Slide sidebar: logos for Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas)
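For a taste of the model from Python, a minimal pyarrow sketch (these are standard Arrow APIs, though not shown on the slide; the column values are invented):

```python
import pyarrow as pa

# An Arrow table: each column stored as a contiguous, typed buffer
table = pa.table({
    "symbol": ["AAPL", "GOOG", "MSFT"],
    "price": [182.5, 140.1, 410.3],
})
print(table.schema)

# The same columnar data viewed as a pandas.DataFrame
# (zero-copy for many numeric types)
print(table.to_pandas())
```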
Arrow and PySpark
• Build a C API level data protocol to move data between Spark and Python
• Either
  • (Fast) Convert Arrow to/from pandas.DataFrame
  • (Faster) Perform native analytics on Arrow data in-memory
• Use Arrow
  • For efficiently handling nested Spark SQL data in-memory
  • IO: pandas/NumPy data push/pull
  • Lambda/UDF evaluation
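The user-facing payoff of this work, sketched with APIs that shipped in later Spark releases (the config key below is the Spark 3.x name; Spark 2.3/2.4 used spark.sql.execution.arrow.enabled):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("arrow-pyspark").getOrCreate()

# Arrow-accelerated Spark <-> pandas transfer
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000)
pdf = df.toPandas()  # columns travel as Arrow record batches, not pickled rows

# Vectorized UDF: the function receives whole pandas Series backed by Arrow,
# replacing the row-at-a-time lambda/UDF evaluation shown earlier
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df.select(plus_one(df.id).alias("v")).show(5)
```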