Why Spark is fast(er) • Whom do we compare to? • What do we mean by fast? • fast to write • fast to run
Why Spark is fast(er) cont. • But the figure in previous page is some how misleading. • The key is the flexible programming mode. • Which lead to more reasonable data flow. • Which lead to less IO operation. • Especial y for iterative heavy workloads like ML. • Which potential y cut off a lot of shuffle operations needed. • But, you won’t always be lucky. • Many app logic did need to exchange a lot of data. • In the end, you wil stil need to deal with shuffle • And which usual y impact performance a lot.
How does shuffle come into the picture • Spark run job stage by stage. • Stages are build up by DAGScheduler according to RDD’s ShuffleDependency • e.g. ShuffleRDD / CoGroupedRDD wil have a ShuffleDependency • Many operator wil create ShuffleRDD / CoGroupedRDD under the hook. • Repartition/CombineByKey/GroupBy/ReduceByKey/cogroup • many other operator wil further cal into the above operators • e.g. various join operator wil cal cogroup. • Each ShuffleDependency maps to one stage in Spark Job and then wil lead to a shuffle.
So everyone should have seen this before A: B: G: Stage 1 groupBy C: D: F: map E: join Stage 2 union Stage 3
why shuffle is expensive • When doing shuffle, data no longer stay in memory only • For spark, shuffle process might involve • data partition: which might involve very expensive data sorting works etc. • data ser/deser: to enable data been transfer through network or across processes. • data compression: to reduce IO bandwidth etc. • DISK IO: probably multiple times on one single data block • E.g. Shuffle Spil , Merge combine
History • Spark 0.6-0.7, same code path with RDD’s persistent method, can choose MEMORY_ONLY and DISK_ONLY (default). • Spark 0.8-0.9: • separate shuffle code path from BM and create ShuffleBlockManager and BlockObjectWriter only for shuffle, now shuffle data can only be written to disk. • Shuffle optimization: Consolidate shuffle write. • Spark 1.0, pluggable shuffle framework. • Spark 1.1, sort-based shuffle implementation. • Spark 1.2 netty transfer service reimplementation. sort- based shuffle by default • Spark 1.2+ on the go: external shuffle service etc.
13 Pluggable Shuffle Framework • ShuffleManager • Manage shuffle related components, registered in SparkEnv, configured through SparkConf, default is sort (pre 1.2 is hash), • ShuffleWriter • Handle shuffle data output logics. Wil return MapStatus to be tracked by MapOutputTracker. • ShuffleReader • Fetch shuffle data to be used by e.g. ShuffleRDD • ShuffleBlockManager • Manage the mapping relation between abstract bucket and materialized data block.
High level data flow GetBlockData BlockManager BlockTransferService GetBlockData HashShuffleManager SortShuffleManager FileShuffleBlockManager IndexShuffleBlockManager Direct mapping or Map to One Data File and mapping by File Groups One Index File per mapId DiskBlockManager Just do one-one File mapping Local File System
Hash Based Shuffle - Shuffle Writer • Basic shuffle writer Map Task Map Task Map Task Map Task Aggregator Aggregator Aggregator Aggregator File File File File File File File File File File File File File File File File Each bucket is mapping to a single file 15
16 Hash Based Shuffle - Shuffle Writer • Consolidate Shuffle Writer Aggregator Aggregator Aggregator Aggregator Each bucket is mapping to a segment of file
17 Hash Based Shuffle - Shuffle Writer • Basic Shuffle Writer • M * R shuffle spil files • Concurrent C * R opened shuffle files. • If shuffle spil enabled, could generate more tmp spil files say N.
• Consolidate Shuffle Writer • Reduce the total spil ed files into C * R if (M >> C) • Concurrent opened is the same as the basic shuffle writer.
• Memory consumption • Thus Concurrent C * R + N file handlers. • Each file handler could take up to 32~100KB+ Memory for various buffers across the writer stream chain.
19 Sort Based Shuffle - Shuffle Writer • Each map task generates 1 shuffle data file + 1 index file • Utilize ExternalSorter to do the sort works. • If map-side combine is required, data wil be sorted by key and partition for aggregation. Otherwise data wil only be sorted by partition. • If reducer number <= 200 and no need to do aggregation or ordering, data wil not be sorted at al . • Wil go with hash way and spil to separate files for each reduce partition, then merge them into one per map for final output.
20 Hash Based Shuffle - Shuffle Reader Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Bucket Aggregator Aggregator Aggregator Aggregator Re R du d c u e Tas a k s Reduce Task Reduce Task Reduce Task • Actual y, at present, Sort Based Shuffle also go with HashShuffleReader
BLOCK TRANSFER SERVICE
Related conceptions • BlockTransferService • Provide a general interface for ShuffleFetcher and working with BlockDataManager to get local data. • ShuffleClient • Wrap up the fetching data process for the client side, say setup TransportContext, new TransportClient etc. • TransportContext • Context to setup the transport layer • TransportServer • low-level streaming service server • TransportClient • Client for fetching consecutive chunks TransportServer
Local Remote Data Flow ShuffleManager ShuffleManager
fetch GetBlockData BlockStoreShuffleFetcher BlockManager Local Blocks ShuffleBlockFetcherIterator BlockDataManager GetBlockData Remote Blocks NioBlockTransferService NioBlockTransferService GetBlock Block Manager ConnectionManager ConnectionManager GotBlock Can Switch to different BlockTransferService 23
Local Remote Data Flow ShuffleManager ShuffleManager
External Shuffle Service • Design goal • al ow for the service to be long-running • possibly much longer-running than Spark • support multiple version of Spark simultaneously etc. • can be integrated into YARN NodeManager, Standalone Worker, or on its own • The entire service been ported to Java • do not include Spark's dependencies • ful control over the binary compatibility of the components • not depend on the Scala runtime or version.
External Shuffle Service • Current Status • Basic framework seems ready. • A Network module extracted from the core module • BlockManager could be configured with executor built-in shuffle service or external standalone shuffle service • A standaloneWorkerShuffleService could be launched by worker • Disabled by default. • How it works • Shuffle data is stil written by the shuffleWriter to local disks. • The external shuffle service knows how to read these files on disks (executor wil registered related info to it, e.g. shuffle manager type, file dir layout etc.), it fol ow the same rules applied for written these file, so it could serve the data correctly.
28 Sort Merge Shuffle Reader • Background: • Current HashShuffleReader does not utilize the sort result within partition in map-side. • The actual by key sort work is always done at reduce side. • While the map side wil do by-partition sort anyway （ sort shuffle ) • Change it to a by-key-and-partition sort does not bring many extra overhead. • Current Status • [WIP] https://github.com/apache/spark/pul /3438
What’s next? • Other custom shuffle logic? • Alternative way to save shuffle data blocks • E.g. in memory (again) • Other transport mechanism? • Break stage barrier? • To fetch shuffle data when part of the map tasks are done. • Push mode instead of pul mode?
Thanks to Jerry Shao • Some of this ppt’s material came from Jerry Shao@Intel weibo: @saisai_shao • Jerry also contributes a lot of essential patches for spark core / spark streaming etc.