Apache Spark is a fast, general-purpose engine for distributed computing and big data processing, with APIs in Scala, Java, Python, and R. This tutorial briefly introduces PySpark (the Python API for Spark) through hands-on exercises combined with a quick introduction to Spark’s core concepts. We will cover the obligatory word count example that comes with every big-data tutorial and discuss how Spark handles node failure, along with other relevant internals. Then we will briefly look at how to access some of Spark’s libraries (such as Spark SQL and Spark ML) from Python. Although Spark is available in a variety of languages, this workshop focuses on using Spark and Python together.