Performance Comparison of Streaming Big Data Platforms
Reza Farivar, Capital One Inc. · Kyle Knusbaum, Yahoo Inc.
Streaming Computation Engines • Designed to process a continuous stream of data. • Designed to process data with low latency – data (ideally) doesn’t buffer up before being processed. This contrasts with batch processing, e.g. MapReduce. • Designed to handle big data: the systems are distributed by design.
• Apache Storm has the TopologyBuilder API to create a directed graph (topology) through which streams of data flow. • “Spouts” are the entry point to the graph, and “bolts” perform the processing. • Data flows through the system as individual tuples. • Graphs are not necessarily acyclic (although they often are). [Figure: tuples flowing from a Kafka spout through bolts into a database]
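The spout/bolt model can be sketched with two tiny interfaces; this is an illustrative simulation of how tuples flow from a spout through a bolt to a sink, not the real Storm `TopologyBuilder` API (all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of the spout/bolt idea; illustrative names, not the Storm API.
public class TopologySketch {
    // A "spout" emits tuples into the graph (e.g. reading from Kafka).
    interface Spout { void nextTuple(Consumer<String> collector); }
    // A "bolt" receives a tuple, processes it, and may emit downstream.
    interface Bolt { void execute(String tuple, Consumer<String> collector); }

    // Drive a one-spout, one-bolt "topology"; the sink stands in for the database.
    public static List<String> run(Spout spout, Bolt bolt, int tuples) {
        List<String> sink = new ArrayList<>();
        for (int i = 0; i < tuples; i++) {
            spout.nextTuple(t -> bolt.execute(t, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        int[] counter = {0};
        Spout kafkaLike = c -> c.accept("event-" + counter[0]++);
        Bolt upper = (t, c) -> c.accept(t.toUpperCase());
        System.out.println(run(kafkaLike, upper, 3)); // [EVENT-0, EVENT-1, EVENT-2]
    }
}
```

In real Storm, the spout and bolts run as parallel tasks on different workers and tuples are routed between them by stream groupings; the one-to-one wiring above only illustrates the data flow.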
• Apache Flink has the DataStream API to perform operations on streams of data (map, filter, reduce, join, etc.). • These operations are turned into a graph by Flink at job-submission time. • The underlying graph works similarly to Storm’s model. • Flink also supports a Storm-compatible API. [Figure: a Flink DataStream pipeline writing to a database]
• Apache Spark has the DStream API to perform operations on streams of data (map, filter, reduce, join, etc.), based on Spark’s RDD (Resilient Distributed Dataset) abstraction. • Similar to Flink’s API. • Streaming is accomplished through micro-batches. • A Spark Streaming job consists of one small batch after another. [Figure: Spark Streaming as a sequence of RDD micro-batches feeding a database]
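Micro-batching boils down to bucketing an unbounded stream by time and processing each bucket as a unit. A minimal framework-free sketch (batch durations and timestamps are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of micro-batching: events are grouped into fixed-duration
// batches keyed by their timestamp, and each batch is then processed as a unit,
// the way Spark Streaming turns a stream into a sequence of small RDDs.
public class MicroBatchSketch {
    // Assign an event timestamp (ms) to a batch index for the given duration (ms).
    static long batchOf(long eventTimeMs, long batchDurationMs) {
        return eventTimeMs / batchDurationMs;
    }

    // Group a list of event timestamps into micro-batches.
    static Map<Long, List<Long>> batch(List<Long> eventTimesMs, long batchDurationMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : eventTimesMs) {
            batches.computeIfAbsent(batchOf(t, batchDurationMs), k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Three events in the first second, one in the next: two micro-batches.
        System.out.println(batch(List.of(100L, 450L, 999L, 1200L), 1000L).keySet()); // [0, 1]
    }
}
```

The batch duration here is the same knob discussed later for Spark: a longer duration means fewer, larger batches (less overhead, more latency).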
Benchmark • We would like to compare the platforms, but which benchmark? – How to compare the relative effectiveness of these systems? • Throughput (events per second) • End-to-end latency (how long an event takes to get through the system) • Completeness (is the computation correct?) – Existing benchmarks did not test with workloads similar to a real-world use case • Speed-of-light tests only reveal so much information • So we created a new benchmark (on GitHub) – A simple advertisement-counting application – Mimics some common ETL operations on data streams
Our Streaming Benchmark • The goal is to correlate latency with throughput. • Simulation of an advertisement analytics pipeline. • Must be implemented and run in all three engines. • Initial data: – Some number of advertising campaigns. – Some number of ads per campaign. • Initial data stored in Redis. • Our producers read the initial data and start generating various events (view, click, purchase). • Events are then sent to a Kafka cluster.
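An event producer along these lines can be sketched as follows; the JSON field names are illustrative stand-ins for the benchmark’s actual event schema, and the stdout print stands in for the Kafka send:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the benchmark's event producer: given campaigns and their ads
// (which the benchmark stores in Redis), emit view/click/purchase events.
public class EventProducerSketch {
    static final String[] EVENT_TYPES = {"view", "click", "purchase"};

    // Build one event as a JSON string; field names are illustrative.
    static String makeEvent(String adId, Random rng, long nowMs) {
        String type = EVENT_TYPES[rng.nextInt(EVENT_TYPES.length)];
        return String.format("{\"ad_id\":\"%s\",\"event_type\":\"%s\",\"event_time\":%d}",
                adId, type, nowMs);
    }

    public static void main(String[] args) {
        // 100 campaigns x 10 ads, matching the benchmark's setup.
        List<String> adIds = new ArrayList<>();
        for (int c = 0; c < 100; c++)
            for (int a = 0; a < 10; a++)
                adIds.add("campaign-" + c + "-ad-" + a);

        Random rng = new Random(42);
        String event = makeEvent(adIds.get(rng.nextInt(adIds.size())), rng,
                System.currentTimeMillis());
        System.out.println(event); // a real producer would send this to Kafka
    }
}
```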
Flow of an event
Measuring Latency • Windows are periodically stored into Redis along with a timestamp of when the window was written into Redis. • The application is given an SLA (Service-Level Agreement) as part of the simulation, demanding that tuples be processed in under 1 second. • The period of writes was chosen to meet the SLA: writes to Redis were performed once per second. Spark is an exception; it wrote windows out once per batch.
Measuring Latency • Ten-second windows. • The first event is generated at the start of the window; the last event is generated near the end of the window. • 10 seconds of events – tens of thousands of events per second. • At some point later, the window is written into Redis. • We know the time of the end of the window and the time the window was written. • The difference gives us a data point of latency: the length of time between event generation and being written into the database (ideally less than the SLA). • Events processed late will cause their windows to be written at a later time, and this will be reflected in the data. [Figure: timeline of a 10 s window, from its first event to its last write into Redis]
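The window assignment and the latency data point described above reduce to two small calculations; a minimal sketch (method names are ours, not the benchmark tool’s):

```java
// Sketch of the latency measurement: an event falls into a 10 s window, and the
// latency data point is the gap between the window's end and the time the
// window was written into Redis.
public class LatencySketch {
    static final long WINDOW_MS = 10_000;

    // Start of the tumbling window containing the event.
    static long windowStart(long eventTimeMs) {
        return (eventTimeMs / WINDOW_MS) * WINDOW_MS;
    }

    // Latency data point: window write time minus window end time.
    static long latencyMs(long eventTimeMs, long windowWriteTimeMs) {
        long windowEnd = windowStart(eventTimeMs) + WINDOW_MS;
        return windowWriteTimeMs - windowEnd;
    }

    public static void main(String[] args) {
        // Window [20000, 30000) written at 30 350 ms -> 350 ms, within a 1 s SLA.
        System.out.println(latencyMs(23_456, 30_350)); // 350
    }
}
```

Note this charges every event in the window with the window’s last write time, which is why late-processed events push the whole window’s data point out.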
Our methodology • Generate a particular throughput of events, then measure the latency. – Throughputs measured varied between 50,000 events/s and 170,000 events/s • 100 advertising campaigns • 10 ads per campaign • SLA set at 1 second • 10-second windows • 5 Kafka nodes with 5 topic partitions • 1 Redis node • 3 ZooKeeper nodes (cluster-coordination software) • 10 worker nodes (doing computation) • A handful of nodes were used by the systems as masters and other non-compute servers.
Our methodology 1. Completely clear Kafka of data 2. Populate Redis with initial data 3. Launch the advertising analytics application on Spark, Flink, or Storm 4. Wait a bit for all workers to finish launching 5. Start the producers with instructions to produce tuples at a given rate – this rate determines the throughput. – Ex: 5 producers writing 10,000 events per second each generate a throughput of 50,000 events/s. 6. Let the system run for 30 minutes after starting the producers, then shut the producers down. 7. Run the data-gathering tool on the Redis database to generate latency points from the windows.
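The rate control in step 5 amounts to pacing each producer and multiplying by the producer count; a minimal sketch (the burst size and sleep interval are illustrative choices, not the benchmark’s actual pacing):

```java
// Sketch of producer rate pacing: aggregate throughput = producers x per-producer
// rate, and each producer emits in small fixed-size bursts separated by sleeps.
public class PacedProducerSketch {
    // Total events generated by `producers` producers at `perProducerRate` events/s.
    static long eventsIn(int producers, int perProducerRate, int seconds) {
        return (long) producers * perProducerRate * seconds;
    }

    public static void main(String[] args) throws InterruptedException {
        // 5 producers at 10,000 events/s each -> 50,000 events/s aggregate.
        System.out.println(eventsIn(5, 10_000, 1)); // 50000

        // Pace one producer: ~100 bursts per second of rate/100 events each.
        int rate = 10_000, burst = rate / 100;
        for (int b = 0; b < 3; b++) {             // 3 bursts for illustration
            for (int i = 0; i < burst; i++) { /* send one event to Kafka here */ }
            Thread.sleep(10);                      // ~10 ms between bursts
        }
    }
}
```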
Hardware Setup • Homogeneous nodes, each with two Intel E5530 processors @ 2.4 GHz – 16 hyperthreaded cores per node • 24 GiB of memory • Machines on the same rack • Gigabit Ethernet switch • The cluster has 40 nodes; 20–25 were used in the benchmark • Multiple instances of Kafka producers were used to create load – an individual producer falls behind at around 17,000 events per second • The use of 10 workers for a topology is near the average number we see used by topologies internal to Yahoo – The Storm clusters are larger, but multi-tenant and run many topologies
About the implementations • Apache Flink – Tested 0.10.1-SNAPSHOT (commit hash 7364ce1). – Application written in Java using the DataStream API. – Checkpointing – a feature that guarantees at-least-once processing – was disabled. • Apache Spark – Tested version 1.5 – Application written in Scala using the DStreams API. – At-least-once processing not implemented. • Apache Storm – Tested both versions 0.10 and 0.11-SNAPSHOT (commit hash a8d253a). – Application written using the Java API. – Acking provides at-least-once processing – it was turned off for high throughputs in 0.11-SNAPSHOT
Flink • Most tuples finished within the 1-second SLA. • The sharp curve indicates there was a very small number of straggling tuples that were written into Redis late. • Red dots mark the 1st, 10th, 25th, 50th, 75th, 90th, 99th, and 100th percentiles.
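Those percentile markers can be read off a sorted list of latency samples; a minimal nearest-rank sketch (illustrative, not the benchmark’s actual data-gathering tool):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of reading percentile markers (1st, 10th, ..., 100th) from sorted
// latency samples, using the nearest-rank method.
public class PercentileSketch {
    // p in (0, 100]; the list must be sorted ascending.
    static long percentile(List<Long> sortedLatenciesMs, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedLatenciesMs.size());
        return sortedLatenciesMs.get(Math.max(rank, 1) - 1);
    }

    public static void main(String[] args) {
        List<Long> samples = new ArrayList<>();
        for (long i = 1; i <= 100; i++) samples.add(i * 10); // 10 ms .. 1000 ms
        System.out.println(percentile(samples, 99));  // 990
        System.out.println(percentile(samples, 100)); // 1000
    }
}
```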
Flink Late Tuples • Of the late tuples, most were written within a few milliseconds of the SLA’s deadline. • This emphasizes that only a very small number were significantly late. • Beyond about 170,000 events/s, Flink was unable to handle the throughput, and tuples backed up.
Spark Streaming • Benchmark written in Scala, using DStreams (a.k.a. streaming RDDs) and the direct Kafka consumer • Micro-batching – different from the pure streaming nature of Storm and Flink – To meet the 1 s SLA, the batch duration was set to 1 second • We were forced to increase the batch duration for larger throughputs • Transformations (e.g. maps and filters) are applied on the DStreams • Joining data with Redis is a special case – We should not create a separate connection to Redis for each record; instead, use a mapPartitions operation that gives our code control of a whole RDD partition • Create one connection to Redis and use this single connection to query information from Redis for all the events in that RDD partition.
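The connection-per-partition pattern can be shown without Spark at all; in this sketch, `MockRedis` is a hypothetical stand-in for a Redis client, and each list plays the role of one RDD partition handed to `mapPartitions`:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the mapPartitions pattern: open one (mock) Redis connection per
// partition and reuse it for every record, instead of one connection per record.
public class PartitionJoinSketch {
    static int connectionsOpened = 0; // counts how many connections were created

    // Stand-in for a Redis client; the lookup is illustrative.
    static class MockRedis {
        MockRedis() { connectionsOpened++; }
        String campaignForAd(String adId) { return "campaign-of-" + adId; }
    }

    // What the function passed to mapPartitions would do with one partition.
    static List<String> mapPartition(List<String> partition) {
        MockRedis redis = new MockRedis();     // one connection per partition...
        List<String> out = new ArrayList<>();
        for (String adId : partition)          // ...reused for every record in it
            out.add(redis.campaignForAd(adId));
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(List.of("a", "b"), List.of("c"));
        partitions.forEach(PartitionJoinSketch::mapPartition);
        System.out.println(connectionsOpened); // 2 (one per partition, not per record)
    }
}
```

With a map over individual records, the connection count would instead grow with the number of events, which is exactly the overhead the benchmark avoids.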
Spark 2-dimensional Parameter Adjustment • Micro-batch duration – This is a control dimension that is not present in a pure streaming system like Storm – Increasing the duration increases latency while reducing overhead, and therefore increases maximum throughput – Finding the optimal batch duration that minimizes latency while allowing Spark to handle the throughput is a time-consuming process • Set a batch duration, run the benchmark for 30 minutes, check the results, then decrease/increase the duration • Parallelism – increasing parallelism is easier said than done in Spark – In a true streaming system like Storm, one bolt instance can send its results to any number of subsequent bolt instances – In a micro-batch system like Spark, we must perform a reshuffle operation • similar to how intermediate data in a Hadoop MapReduce program are shuffled and merged across the cluster. • But the reshuffling itself introduces considerable overhead.
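The batch-duration trade-off can be captured in a toy model: each batch pays a fixed scheduling overhead, and a duration is only feasible if the batch finishes before the next one arrives. All the numbers below are assumed for illustration; they are not measurements from the benchmark:

```java
// Toy model of the Spark batch-duration trade-off (all constants are assumed):
// each batch carries a fixed scheduling overhead plus per-event work, and the
// batch must complete within its own duration or the system falls behind.
public class BatchTuningSketch {
    // Can a batch of this duration keep up at the given event rate?
    static boolean feasible(double batchSec, double overheadSec,
                            double perEventSec, double eventsPerSec) {
        double work = overheadSec + perEventSec * eventsPerSec * batchSec;
        return work <= batchSec; // must finish before the next batch arrives
    }

    // Smallest duration (searched in 0.1 s steps) that keeps up; latency grows
    // with this duration, which is why we want the smallest feasible one.
    static double minFeasible(double overheadSec, double perEventSec, double eventsPerSec) {
        for (double d = 0.1; d <= 10.0; d += 0.1)
            if (feasible(d, overheadSec, perEventSec, eventsPerSec)) return d;
        return Double.NaN;
    }

    public static void main(String[] args) {
        // Assumed: 0.3 s overhead per batch, 5 us per event, 100,000 events/s.
        System.out.println(minFeasible(0.3, 5e-6, 100_000));
    }
}
```

This mirrors the manual search described above: too short a duration is infeasible (tuples back up), too long a duration wastes latency, and the benchmark had to find the boundary by repeated 30-minute runs.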
Spark • Spark had more interesting results than Flink. • Due to the micro-batch design, it was unable to process events at low latencies • The overhead of scheduling and launching a task per batch is very high • Batch size had to be increased – this overcame the launch overhead.
Spark • If we reduce the batch duration sufficiently, we get into a region where the incoming events are processed within 3 or 4 subsequent batches. • The system is on the verge of falling behind, but is still manageable, and this results in better latency.
Spark Falling Behind • Without increasing the batch size, Spark was unable to keep up with the throughput; tuples backed up, and latencies continuously increased until the job was shut down. • After increasing the batch size, Spark handled larger throughputs than either Storm or Flink.
Spark • Tuning the batch size was time-consuming, since it had to be done manually – this was one of the largest problems we faced in testing Spark’s Streaming capabilities. • If the batch size was set too high, latency numbers would be bad. If it was set too low, Spark would fall behind, tuples would back up, and latency numbers would be worse. • Spark had a new feature at the time called ‘backpressure’ which was supposed to help address this, but we were unable to make it work properly. In fact, enabling backpressure hindered our numbers in all cases.
Storm Results • The benchmark uses the Java API, with one worker process per host; each worker has 16 tasks running in 16 executors – one for each core. • In 0.11.0, Storm added a simple back-pressure controller, allowing us to avoid the overhead of acking – In the 0.10.0 benchmark topology, acking was used for flow control but not for processing guarantees. • With acking disabled, Storm even beat Flink for latency at high throughput. – But with no tuple-failure handling. [Figure: Storm 0.10.0 vs. Storm 0.11.0 results]
Storm • Storm behaved very similarly to Flink. • However, Storm was unable to handle more than 130,000 events/s with its acking system enabled. • Acking keeps track of successfully processed events within Storm. • With acking disabled, Storm achieved numbers similar to Flink at throughputs up to 170,000 events/s.
Storm Late Tuples • Similar to Flink’s late tuple graph. • Tuples that were late were slightly less late than Flink’s.
Three-way Comparison • Flink and Storm have similar linear performance profiles – These two systems process an incoming event as it becomes available • Spark Streaming has much higher latency, but is expected to handle higher throughputs – The system behaves as a stepwise function, a direct result of its micro-batching nature
• Comparisons of 99th-percentile latencies are revealing. • Storm 0.11 had consistently lower latency than Flink and Spark. • Flink’s latency was comparable to Storm 0.10, but Flink handled higher throughput with at-least-once guarantees. • Spark had the highest latency, but was able to handle higher throughput than either Storm or Flink. [Figure: 99th-percentile latency vs. throughput for Flink, Spark, and Storm]
Future work • Many variables were involved – many we didn’t adjust. • The applications were not optimized – all were written in a fairly plain manner, and configuration settings were not tweaked. • The SLA deadline of 1 second is very low. We did this to test the limits of the low-latency streaming systems. Higher SLA deadlines are reasonable, and testing those would be worthwhile – likely showing Spark being highly competitive with the others. • The throughputs we tested at were incredibly high. – 170,000 events/s comes to 14,688,000,000 events per day – about 1.4 × 10^10 events per day. • We didn’t test with exactly-once semantics. • We ran small tests and checked for correctness of computations, but didn’t check correctness at large scale. • There are many more tests that can be run. • Other streaming engines can be added.
Conclusions • The competition between near-real-time streaming systems is heating up, and there is no clear winner at this point • Each of the platforms studied here has its advantages and disadvantages • Other important factors: – Security, or integration with tools and libraries • Active communities for these and other big-data processing projects continue to innovate and benefit from each other’s advancements