How to analyze big data?
Agenda: BigQuery | BQ Stream + Fluentd | Norikra CEP | Lambda Arch
At Google, we have “big” big data everywhere
What if a Googler is asked: “Can you give me the list of top 20 Android apps installed in 2012?”
At Google, we run SQLs on Dremel (= Google BigQuery)

    SELECT top(appId, 20) AS app, count(*) AS count
    FROM installlog.2012
    ORDER BY count DESC;

It scans 68B rows in ~30 sec. No index used.
Column Oriented Storage
Less bandwidth, more compression
(Figure: Record Oriented Storage vs. Column Oriented Storage)
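To make the "less bandwidth" point concrete, here is a minimal sketch that compares how many field values a scan has to touch under each layout. The toy table, its field names, and the helper functions are made up for illustration; they do not describe Dremel's actual storage format.

```python
# Toy table: three records with three fields each (values are invented).
records = [
    {"app_id": "a", "country": "US", "installs": 3},
    {"app_id": "b", "country": "JP", "installs": 5},
    {"app_id": "a", "country": "JP", "installs": 2},
]


def scan_records(rows, field):
    """Record-oriented scan: every field of every record is read."""
    values, fields_read = [], 0
    for row in rows:
        fields_read += len(row)  # the whole record comes off disk
        values.append(row[field])
    return values, fields_read


# Column-oriented layout: each field is stored as its own array.
columns = {name: [row[name] for row in records] for name in records[0]}


def scan_column(cols, field):
    """Column-oriented scan: only the requested column is read."""
    values = cols[field]
    return list(values), len(values)


row_values, row_cost = scan_records(records, "installs")
col_values, col_cost = scan_column(columns, "installs")
print(row_cost, col_cost)  # field values touched by each scan
```

Both scans return the same values, but the column scan touches a third of the data in this three-field table. Values within one column also share a type and tend to repeat, which is why a columnar layout usually compresses better.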
Massively Parallel Processing
Each query runs on thousands of servers

    select top(title), count(*) from publicdata:samples.wikipedia

Scanning 1 TB in 1 sec takes 5,000 disks
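The "5,000 disks" figure can be sanity-checked with a line of arithmetic. The sketch below assumes a decimal terabyte (10^12 bytes) and an even split of the scan across disks; both are assumptions, not statements from the slide.

```python
TB = 10**12   # bytes in one terabyte (decimal, assumed)
DISKS = 5_000  # figure from the slide

# Throughput each disk must sustain to finish the scan in one second.
per_disk_mb_per_sec = TB / DISKS / 10**6
print(per_disk_mb_per_sec)  # 200.0 MB/s per disk

# At that same rate, a single disk would need this many seconds for 1 TB.
single_disk_seconds = TB / (per_disk_mb_per_sec * 10**6)
print(single_disk_seconds)  # 5000.0 seconds, roughly 83 minutes
```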
Fast aggregation by tree structure
Mixer 0: ORDER BY count_babies DESC LIMIT 10
Mixer 1 (x2): COUNT(*) GROUP BY state
Leaf (x4): COUNT(*) GROUP BY state WHERE year >= 1980 and year < 1990
Distributed Storage: SELECT state, year
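The tree on this slide can be sketched in a few lines: leaves filter and pre-aggregate their own shard, intermediate mixers merge partial counts, and the root mixer sorts and truncates. The shard contents and function names below are hypothetical, and the real system runs these stages on separate servers rather than in one process.

```python
from collections import Counter


def leaf(rows):
    """Leaf: apply the WHERE filter locally and count rows per state."""
    return Counter(r["state"] for r in rows if 1980 <= r["year"] < 1990)


def mixer1(partials):
    """Mixer 1: merge the partial per-state counts from its leaves."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def mixer0(partials, limit=10):
    """Mixer 0: merge, then ORDER BY count DESC (ties by state) and LIMIT."""
    merged = mixer1(partials)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Four invented shards, one per leaf; each row stands for one record.
shards = [
    [{"state": "CA", "year": 1985}, {"state": "TX", "year": 1979},
     {"state": "CA", "year": 1989}],
    [{"state": "TX", "year": 1980}, {"state": "NY", "year": 1990}],
    [{"state": "CA", "year": 1981}, {"state": "TX", "year": 1982}],
    [{"state": "NY", "year": 1983}],
]

left = mixer1([leaf(shards[0]), leaf(shards[1])])
right = mixer1([leaf(shards[2]), leaf(shards[3])])
result = mixer0([left, right])
print(result)
```

Because counts are summed at every level, each server only ever ships a small per-state table upward instead of raw rows.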
How to collect big data?
BigQuery Streaming
- 100,000 rows per second per table
- Real-time availability of data
- Low cost: $0.01 per 100,000 rows
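As a rough worked example of what these numbers imply, the sketch below computes the cost of streaming at the full quoted rate for one day. It uses only the figures on the slide; the helper name is hypothetical, and current pricing may differ.

```python
from decimal import Decimal

PRICE_PER_BLOCK = Decimal("0.01")  # dollars per 100,000 rows (from the slide)
ROWS_PER_BLOCK = 100_000


def streaming_cost(rows):
    """Cost in dollars of streaming `rows` rows at the slide's price."""
    return PRICE_PER_BLOCK * rows / ROWS_PER_BLOCK


rows_per_day = 100_000 * 60 * 60 * 24  # one table at 100,000 rows per second
daily_cost = streaming_cost(rows_per_day)
print(rows_per_day, daily_cost)  # 8.64 billion rows, 864 dollars per day
```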
Slideshare uses Fluentd for collecting logs from >500 servers. "We take full advantage of its extendable plugin architecture and use it as a message bus that collects data from hundreds of servers into multiple backend systems." Sylvain Kalache, Operations Engineer
Why Fluentd? Because it’s super easy to use, and has an extensive set of plugins written by an active community.
Now Fluentd logs can be imported into BigQuery really easily
How to analyze in real-time?
Norikra: an open source Complex Event Processing (CEP) server
In production use at LINE, the largest Asian SNS with 400M users, for massive log analysis
Real-time analysis on streaming data with in-memory continuous query
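An in-memory continuous query keeps its answer up to date as each event arrives, instead of rescanning stored data. Norikra expresses such queries in an SQL-like language; the Python class below is only an illustration of the idea (a count over a sliding time window), not Norikra's API. The class name is hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import deque


class WindowedCount:
    """Continuously maintained count of events in the last `window_sec` seconds."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self.timestamps = deque()  # only events inside the window are kept

    def on_event(self, ts):
        """Register one event and return the updated count."""
        self.timestamps.append(ts)
        return self.current(ts)

    def current(self, now):
        """Evict events that have left the window, then return the count."""
        while self.timestamps and self.timestamps[0] <= now - self.window_sec:
            self.timestamps.popleft()
        return len(self.timestamps)


query = WindowedCount(window_sec=10)
counts = [query.on_event(ts) for ts in (0, 3, 9, 12)]
print(counts)             # the event at ts=0 has expired by ts=12
print(query.current(25))  # nothing left in the window
```

Memory use is bounded by the number of events inside the window, which is what makes this approach fast but small and volatile.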
How to analyze big data in real-time?
Lambda Architecture is:
A complementary pair of:
- in-memory real-time processing: fast, but small and volatile.
- large HDD/SSD batch processing: slow, but large and persistent.
Proposed by Nathan Marz
ex. Twitter Summingbird
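A minimal sketch of how the two halves complement each other: a query merges a slowly recomputed batch view with a small in-memory view of the events that arrived since the last batch run. The class, its method names, and the sample events are hypothetical and stand in for the batch and real-time systems.

```python
from collections import Counter


class LambdaView:
    """Per-key event counts served from a batch view plus a real-time view."""

    def __init__(self):
        self.batch_view = Counter()  # large, persistent, recomputed slowly
        self.batch_watermark = 0     # events with ts < watermark are in batch_view
        self.recent = []             # small, volatile, kept in memory

    def on_event(self, ts, key):
        """Real-time path: remember the event until a batch run covers it."""
        self.recent.append((ts, key))

    def run_batch(self, all_events, up_to_ts):
        """Batch path: recompute from the full log, then drop covered events."""
        self.batch_view = Counter(k for ts, k in all_events if ts < up_to_ts)
        self.batch_watermark = up_to_ts
        self.recent = [(ts, k) for ts, k in self.recent if ts >= up_to_ts]

    def query(self):
        """Merge both views so results are complete and up to date."""
        realtime_view = Counter(k for _, k in self.recent)
        return self.batch_view + realtime_view


log, view = [], LambdaView()
for event in [(1, "a"), (2, "b"), (3, "a")]:
    log.append(event)
    view.on_event(*event)
before_batch = view.query()

view.run_batch(log, up_to_ts=3)  # batch now covers ts 1 and 2
log.append((4, "b"))
view.on_event(4, "b")
after_batch = view.query()
print(before_batch, after_batch)
```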
A Recipe for a Lambda Architecture in 10 minutes
1. Fluentd: event log collection from various event sources
2. Norikra: scalable real-time Complex Event Processing (CEP)
3. BigQuery: scalable query engine for large datasets
4. Google Spreadsheet: flexible dashboard with a variety of charts
5. Docker: repeatable deployment in 10 minutes
Lambda Arch by BQ+Norikra
Google Spreadsheet as a dashboard for real-time and big data views
Applications
Real-time KPI Dashboard
● Gaming: How many new users have purchased their first item in the last 10 minutes?
● Media: How many people hit the vote button during the live TV program?
● Retail: What is the current total revenue of all stores nationwide?
● Ads: What is the conversion rate of impressions/clicks to purchases?
Real-time Monitoring and Alerting
● Correlate system resource usage with access/application logs
● Real-time DoS or cheating detection
● Send e-mail notifications from Apps Script triggered by a CEP query
Solution Benefits
- Real-time analytics by Norikra CEP with 10 sec latency
- Big data collection and analytics by BigQuery + Fluentd at ~1M rows/s
- Real-time dashboard with Google Spreadsheet
- Deployable within 10 min with Docker
Available on GitHub: GoogleCloudPlatform/lambda-dashboard