Multiple Consumers

[Diagram] One Kinesis Stream feeds multiple consumers:
• "I wanna consume streaming data by Spark" — Data Scientist, Spark on EMR
• "I wanna add a streaming monitor by Lambda" — Application Engineer, AWS Lambda
Empowers Engineers to Do Trial and Error
News Delivery Pipeline

[Diagram] Internet → Crawler → Kinesis Stream → Analyzer → Kinesis Stream → Indexer → CloudSearch → Search API; Mobile App → API Gateway → Tracker → Kinesis Stream → DynamoDB
Data & Its Numbers
• User activities
  • ~100 GBs per day (compressed)
  • 60+ record types
• User demographics, configurations, etc.
  • 15M+ records
• Article metadata
  • 100K+ records per day
How Do We Produce and Consume Kinesis Streams?
Index System

[Diagram] Crawler —(KPL)→ Kinesis Stream —(KCL)→ Analyzer —(KPL)→ Kinesis Stream —(KCL)→ Indexer → CloudSearch
Collect, Analyze and Index Articles with the Kinesis Libraries (KPL & KCL)
Kinesis Libraries
• Kinesis Producer Library (KPL)
  • puts records into a stream
  • asynchronous architecture (buffers records)
• Kinesis Client Library (KCL)
  • consumes and processes data from a stream
  • handles the complex tasks associated with distributed computing
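The KPL itself is a Java/C++ library, but its key idea — buffer records and ship them to the stream in batches instead of one PUT per record — can be sketched in a few lines of Python. This is a toy illustration only; all names here (`BufferedProducer`, `put_record`, `flush`) are hypothetical and merely mimic the KPL's asynchronous buffering behavior:

```python
class BufferedProducer:
    """Toy sketch of KPL-style buffering: records are queued locally
    and flushed to the stream in batches, not one PUT per record."""

    def __init__(self, flush_size=3):
        self.flush_size = flush_size
        self.buffer = []
        self.batches_sent = []  # stands in for real Kinesis PutRecords calls

    def put_record(self, partition_key, data):
        self.buffer.append((partition_key, data))
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.batches_sent.append(list(self.buffer))
            self.buffer.clear()

producer = BufferedProducer(flush_size=3)
for i in range(7):
    producer.put_record("user-1", f"event-{i}")
producer.flush()  # drain the remainder
print([len(b) for b in producer.batches_sent])  # → [3, 3, 1]
```

The real KPL also aggregates multiple user records into a single Kinesis record and retries on failure; the batching above is only the core intuition.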
Feedback System

[Diagram] User → API Gateway → Feedback API / Metrics API → Tracker → Kinesis Stream → DynamoDB; article metadata is searched by cluster in Amazon CloudSearch; events are pushed to Amazon S3 for offline ETL / machine learning (Hive / Spark), which produces the user clusters
Generate Metrics by User Cluster for Ranking Articles
Why Metrics by Cluster?

Example user — userId: 1000, gender: Male, age: 36, location: San Francisco, US, interests: Baseball — requests GET /news/sports.

Article metrics by inventory (user cluster):

  Article                     raw score   weight   score
  San Francisco Giants …         3.5    ×   1    =  3.5
  New York Yankees …             6.2    ×   0.6  =  3.72
  FIFA World Cup …              20.4    ×   0.2  =  4.08
  U.S. Open Championships …      8.4    ×   0.2  =  1.68

Consider Each User's Interests
Ensure Diversity to Avoid the Filter Bubble
https://en.wikipedia.org/wiki/Filter_bubble
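The table's arithmetic — an article's score for a cluster is its raw score multiplied by the cluster's interest weight — can be re-computed as a small sketch (function and variable names are illustrative, not from the talk):

```python
# Score articles for a user cluster: score = raw_score * cluster weight,
# then rank by the weighted score (reproducing the table above).

def weighted_score(raw_score, weight):
    return round(raw_score * weight, 2)

articles = [
    ("San Francisco Giants",     3.5, 1.0),
    ("FIFA World Cup",          20.4, 0.2),
    ("U.S. Open Championships",  8.4, 0.2),
]
ranked = sorted(
    ((name, weighted_score(raw, w)) for name, raw, w in articles),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
# → [('FIFA World Cup', 4.08), ('San Francisco Giants', 3.5),
#    ('U.S. Open Championships', 1.68)]
```

Note how the weighting reorders results: FIFA World Cup outranks the Giants article for this user despite the much lower interest weight, which is exactly the diversity effect the slide is after.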
Input Data by Fluentd
• Forwarder (running on each instance)
  • archives events to S3
  • forwards events to aggregators
• Aggregator (HA configuration※)
  • puts events into the Kinesis Stream
  • alerting and reporting (not covered here)
※ http://docs.fluentd.org/articles/high-availability
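A rough sketch of what such a forwarder/aggregator pair might look like in Fluentd configuration. This is an assumption, not the actual SmartNews config: the hostnames and tags are made up, and the aggregator side assumes the `fluent-plugin-kinesis` output plugin (`@type kinesis_streams`):

```
# Forwarder (on each instance): forward events to the aggregators,
# with a standby for HA per the Fluentd high-availability guide.
<match app.**>
  @type forward
  <server>
    host aggregator-1.internal
  </server>
  <server>
    host aggregator-2.internal
    standby
  </server>
</match>

# Aggregator: put events into the Kinesis stream
# (requires fluent-plugin-kinesis; names below are hypothetical).
<match app.**>
  @type kinesis_streams
  stream_name activity-stream
  region us-east-1
</match>
```

The forwarders in the talk also archive events to S3, which would be an additional `@type copy` store omitted here for brevity.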
Spark Streaming

[Diagram] Each Kinesis shard (Shard 1–3) becomes a DStream; the DStreams are split into minutely RDDs, joined against a pre-computed RDD of user clusters (Male / Female / Teen …), and the resulting minutely metrics by user cluster are written to DynamoDB
Split Streams into Minutely RDDs
Join Minutely RDDs on the Pre-Computed RDD
Monitor Spark Streaming Spark UI is Useful for Monitoring
Summary
• Fast & stable stream processing is crucial for SmartNews
  • the lifetime of news is very short
  • process events as fast as possible
• Kinesis Stream plays an important role
  • one-click provisioning & scaling
  • empowers engineers to do trial & error
Discuss More? Join Our Free Lunch in Tokyo Office!!
Realtime Monitoring

[Diagram] API Gateway → AWS Lambda → PipelineDB stream → continuous views; raw records are discarded soon after being incrementally consumed by the continuous views, which are updated in realtime; Chartio (or any PostgreSQL client) accesses the continuous views; alerts go to Slack
Continuous View

-- Calculate unique users seen per media each day
-- using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT day(arrival_timestamp),
       substring(url from '.*://([^/]*)') AS hostname,
       COUNT(DISTINCT user_id::integer)
FROM activity_stream
GROUP BY day, hostname;

-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;

-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT percentile_cont(array[90, 95, 99])
       WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
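As a sanity check on what the last view computes, `percentile_cont` is the continuous (linearly interpolated) percentile: it interpolates between the two nearest ranks rather than picking an existing value. A minimal Python equivalent (illustrative only, with made-up latency data):

```python
# Rough Python equivalent of SQL percentile_cont: sort the values,
# locate the fractional rank k = (n-1) * pct/100, and linearly
# interpolate between the two neighboring values.

def percentile_cont(values, pct):
    xs = sorted(values)
    k = (len(xs) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
p90, p95, p99 = (percentile_cont(latencies, p) for p in (90, 95, 99))
print(p90, p95, p99)
```

For these ten samples the 90th percentile falls between 90 and 100, landing at 91.0 by interpolation; the continuous view maintains the same statistic incrementally as records arrive.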
Dashboard in Chartio
1. Build the query (drag & drop / SQL)
2. Add steps (filter, sort, modify)
3. Select the visualization (table, graph)