A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how...
A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.