A real-time Lambda Architecture using Hadoop & Storm NoSQL Matters Cologne 2014 by Nathan Bijnens #nosql14
Speaker Nathan Bijnens Big Data Engineer @ Virdata @nathan_gs
Computing Trends (Past → Current)
● Computation: Expensive (CPUs) → Cheap (Many Core Computers)
● Disk Storage: Expensive → Cheap (Cheap Commodity Disks)
● DRAM: Expensive → DRAM / SSD Getting Cheap
● Coordination: Easy (Latches Don’t Often Hit) → Hard (Latches Stall a Lot, etc)
Source: Immutability Changes Everything - Pat Helland, RICON2012
Credits Nathan Marz ● Ex-BackType & Twitter ● Startup in stealth mode Creator of ● Storm ● Cascalog ● ElephantDB Coined the term Lambda Architecture. manning.com/marz
a Data System
Data is more than Information Not all information is equal. Some information is derived from other pieces of information.
Data is more than Information Eventually you will reach the most ‘raw’ form of information. This is the information you hold to be true, simply because it exists. Let’s call this ‘data’; it is very similar to an ‘event’.
Events: Before Events used to manipulate the master data.
Events: After Today, events are the master data.
Data System Let’s store everything.
Data System Data is Immutable.
Data System Data is Time Based.
Capturing change Traditionally
INSERT INTO contact (name, city) VALUES ('Nathan', 'Antwerp');
UPDATE contact SET city = 'Cologne' WHERE name = 'Nathan';
Capturing change in a Data System
INSERT INTO contact (name, city, timestamp) VALUES ('Nathan', 'Antwerp', '2008-10-11 20:00Z');
INSERT INTO contact (name, city, timestamp) VALUES ('Nathan', 'Cologne', '2014-04-29 10:00Z');
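The same idea can be sketched outside SQL. This is a minimal illustration, not code from the talk: the `facts` list and the `record` and `current_city` helpers are hypothetical names. Facts are only ever appended, and the "current" value is derived by looking at the latest timestamp.

```python
from datetime import datetime, timezone

# Append-only log of timestamped facts; rows are never updated or deleted.
facts = []

def record(name, city, timestamp):
    """Append a new fact instead of overwriting the previous one."""
    facts.append({"name": name, "city": city, "timestamp": timestamp})

def current_city(name):
    """Derive the current city as the most recent fact for this person."""
    matching = [f for f in facts if f["name"] == name]
    if not matching:
        return None
    return max(matching, key=lambda f: f["timestamp"])["city"]

record("Nathan", "Antwerp", datetime(2008, 10, 11, 20, 0, tzinfo=timezone.utc))
record("Nathan", "Cologne", datetime(2014, 4, 29, 10, 0, tzinfo=timezone.utc))

print(current_city("Nathan"))  # Cologne
print(len(facts))              # 2: the Antwerp fact is still there
```

Because nothing is overwritten, the history (Nathan lived in Antwerp from 2008) stays available to any later query.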
Query The data you query is often transformed, aggregated, ... Rarely used in its original form.
Query Query = function ( all data )
Query: Number of people living in each city
Input (all data):
Person   City     Timestamp
Nathan   Antwerp  2008-10-11
John     Cologne  2010-01-23
Dirk     Antwerp  2012-09-12
Nathan   Cologne  2014-04-29
Result:
City     Count
Antwerp  1
Cologne  2
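As a rough sketch of "query = function(all data)", the example on this slide could be written as a pure function over the full record set. The function names and the tuple layout are assumptions made for illustration.

```python
def latest_city_per_person(records):
    """Keep only the most recent city for each person."""
    latest = {}
    for person, city, timestamp in records:
        # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        if person not in latest or timestamp > latest[person][1]:
            latest[person] = (city, timestamp)
    return {person: city for person, (city, _) in latest.items()}

def people_per_city(records):
    """The query: a function of all data, with no hidden state."""
    counts = {}
    for city in latest_city_per_person(records).values():
        counts[city] = counts.get(city, 0) + 1
    return counts

ALL_DATA = [
    ("Nathan", "Antwerp", "2008-10-11"),
    ("John", "Cologne", "2010-01-23"),
    ("Dirk", "Antwerp", "2012-09-12"),
    ("Nathan", "Cologne", "2014-04-29"),
]

print(people_per_city(ALL_DATA))  # {'Cologne': 2, 'Antwerp': 1}
```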
MapReduce
1. Take a large data set and divide it into subsets.
2. Perform the same function on all subsets (MAP): DoWork() DoWork() DoWork() …
3. Combine the output from all subsets (REDUCE) into the Output.
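The three steps can be sketched in a single process. This is only an illustration of the pattern, not Hadoop code; `split`, `do_work` and `combine` are hypothetical names, and a real framework would run the map step on many machines.

```python
from functools import reduce

def split(data, n_subsets):
    """Step 1: divide the data set into subsets."""
    return [data[i::n_subsets] for i in range(n_subsets)]

def do_work(subset):
    """Step 2 (map): run the same function on every subset."""
    counts = {}
    for city in subset:
        counts[city] = counts.get(city, 0) + 1
    return counts

def combine(left, right):
    """Step 3 (reduce): merge two partial results into one."""
    merged = dict(left)
    for city, n in right.items():
        merged[city] = merged.get(city, 0) + n
    return merged

cities = ["Antwerp", "Cologne", "Antwerp", "Cologne", "Cologne"]
partials = [do_work(subset) for subset in split(cities, 3)]
output = reduce(combine, partials, {})

print(output)  # {'Antwerp': 2, 'Cologne': 3}
```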
Serialization & Schema Catch errors as soon as they happen. Validate on write vs. on read.
Serialization & Schema CSV is actually a serialization language that is just poorly defined.
Serialization & Schema Use a format with a schema ● Thrift ● Avro ● Protocol Buffers Could be combined with Parquet. Added bonus: it’s faster and uses less space.
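To illustrate validate-on-write without pulling in Thrift, Avro or Protocol Buffers, here is a stand-in using only the Python standard library. The `ContactFact` schema and the `write` helper are assumptions for this sketch; a real system would declare the schema in one of the formats above.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # frozen: a stored fact cannot be mutated afterwards
class ContactFact:
    name: str
    city: str
    timestamp: datetime

    def __post_init__(self):
        # Reject bad records at write time, before they reach the master data.
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")
        if not isinstance(self.city, str) or not self.city:
            raise ValueError("city must be a non-empty string")
        if not isinstance(self.timestamp, datetime):
            raise ValueError("timestamp must be a datetime")

def write(store, **fields):
    """Validate first, append second; invalid input never gets stored."""
    fact = ContactFact(**fields)
    store.append(fact)
    return fact

master_data = []
write(master_data, name="Nathan", city="Cologne", timestamp=datetime(2014, 4, 29, 10, 0))

try:
    write(master_data, name="Nathan", city="Cologne", timestamp="yesterday")
except ValueError as error:
    print("rejected on write:", error)

print(len(master_data))  # 1
```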
Batch View Database Read Only database No random writes required.
Batch View Database Every iteration produces the views from scratch.
Batch View Databases Pure Lambda databases ● ElephantDB ● SploutSQL Databases with a batch load & read only views ● Voldemort Other databases that could be used ● ElasticSearch/Solr: generate the Lucene indexes using MapReduce ● Cassandra: generate SSTables ● ...
Batch Layer Eventually consistent Without the associated complexities.
Batch Layer We are not done yet… The batch views are missing just the last few hours of data. [Timeline: data absorbed into Batch Views | not yet absorbed | Now]
Speed Layer [Diagram: Incoming Data, Hadoop, ElephantDB, Cassandra]
Speed Layer Stream processing.
Speed Layer Continuous computation.
Speed Layer Storing a limited window of data. Compensating for the last few hours of data.
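A minimal sketch of such a limited window, assuming events are bucketed by hour and that a bucket can be dropped once the batch views have absorbed it. `SpeedLayerView` and its methods are hypothetical names for illustration.

```python
from collections import defaultdict

class SpeedLayerView:
    """Incrementally updated counts per city, kept only for recent hours."""

    def __init__(self):
        # hour -> city -> count
        self.buckets = defaultdict(lambda: defaultdict(int))

    def on_event(self, hour, city):
        """Update the view as each event arrives (no recomputation)."""
        self.buckets[hour][city] += 1

    def expire_before(self, hour):
        """Drop buckets that the batch views now cover."""
        for old_hour in [h for h in self.buckets if h < hour]:
            del self.buckets[old_hour]

    def counts(self):
        """Totals over the window that is still held."""
        totals = defaultdict(int)
        for bucket in self.buckets.values():
            for city, n in bucket.items():
                totals[city] += n
        return dict(totals)

view = SpeedLayerView()
view.on_event(9, "Cologne")
view.on_event(10, "Cologne")
view.on_event(10, "Antwerp")
print(view.counts())   # {'Cologne': 2, 'Antwerp': 1}

view.expire_before(10)  # the batch layer has caught up to hour 10
print(view.counts())   # {'Cologne': 1, 'Antwerp': 1}
```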
Speed Layer All the complexity is isolated in the Speed Layer. If anything goes wrong, it’s auto-corrected.
CAP You have a choice between: ● Availability ○ Queries are eventually consistent ● Consistency ○ Queries are consistent
Eventual accuracy Some algorithms are hard to implement in real-time. For those cases we could estimate the results.
Speed Layer Storm
Storm Message passing
Storm Distributed processing
Storm Horizontally scalable.
Storm Incremental algorithms
Storm Tuple Stream
Storm Spout Bolt
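The spout and bolt roles can be mimicked with plain Python generators. This is a single-process toy, not the Storm API: the function names are made up, and Storm itself would distribute spouts and bolts across a cluster.

```python
def contact_spout():
    """Spout: the source of a stream of tuples."""
    events = [("John", "Cologne"), ("Dirk", "Antwerp"), ("Nathan", "Cologne")]
    for person, city in events:
        yield (person, city)

def city_bolt(stream):
    """Bolt: consumes tuples and emits new tuples."""
    for _person, city in stream:
        yield (city, 1)

def count_bolt(stream):
    """Bolt with an incremental algorithm: a running count per city."""
    counts = {}
    for city, n in stream:
        counts[city] = counts.get(city, 0) + n
        yield (city, counts[city])

# Wiring spout -> bolt -> bolt plays the role of a topology.
for city, running_count in count_bolt(city_bolt(contact_spout())):
    print(city, running_count)
```

Each tuple flows through the whole chain as it arrives, so the counts are updated continuously instead of being recomputed from scratch.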
Data Ingestion Queues & Pub/Sub models are a natural fit.
Data Ingestion ● Kafka ● Flume ● Scribe ● *MQ ● …
Speed Layer Views The views need to be stored in a random writable database.
Speed Layer Views The logic behind a R/W database is much more complex than a read-only view.
Speed Layer Views The views are stored in a Read & Write database. ● Cassandra ● HBase ● Redis ● SQL ● ElasticSearch ● ...
Serving Layer [Diagram: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]
Serving Layer Random reads.
Serving Layer This layer queries the batch & real-time views and merges them.
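A sketch of that merge, assuming the views hold additive counts and that the real-time view only covers events the batch view has not absorbed yet. `merge_views` and the sample values are hypothetical.

```python
def merge_views(batch_view, realtime_view):
    """Answer a query by combining the batch view with the real-time delta."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Events per city: precomputed by the batch layer, plus the last few hours.
batch_view = {"Antwerp": 120, "Cologne": 80}
realtime_view = {"Cologne": 3, "Brussels": 1}

print(merge_views(batch_view, realtime_view))
# {'Antwerp': 120, 'Cologne': 83, 'Brussels': 1}
```

Views that are not simple sums (for example distinct counts) need a merge function suited to the aggregate.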
CQRS & Event Sourcing Event Sourcing ● Every command is a new event. ● The event store keeps all events; new events are appended. ● Any query loops through all related events, even to produce an aggregate. Source: CQRS Journey - Microsoft Patterns & Practices
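The three points above can be sketched as an append-only store plus a replay function. The `EventStore` class, the event type names and `load_contact` are assumptions for illustration, not taken from the CQRS Journey text.

```python
class EventStore:
    """Keeps all events; new events are only ever appended."""

    def __init__(self):
        self._events = []

    def append(self, aggregate_id, event_type, payload):
        self._events.append((aggregate_id, event_type, payload))

    def events_for(self, aggregate_id):
        return [e for e in self._events if e[0] == aggregate_id]

def load_contact(store, contact_id):
    """Query: loop through all related events to rebuild the aggregate."""
    state = {}
    for _, event_type, payload in store.events_for(contact_id):
        if event_type == "ContactCreated":
            state = dict(payload)
        elif event_type == "ContactMoved":
            state["city"] = payload["city"]
    return state

store = EventStore()
store.append("contact-1", "ContactCreated", {"name": "Nathan", "city": "Antwerp"})
store.append("contact-2", "ContactCreated", {"name": "John", "city": "Cologne"})
store.append("contact-1", "ContactMoved", {"city": "Cologne"})

print(load_contact(store, "contact-1"))  # {'name': 'Nathan', 'city': 'Cologne'}
```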
Lambda Architecture The Lambda Architecture can discard any view, batch and real-time, and just recreate everything from the master data.
Lambda Architecture Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute. Bug in view generation? Just recompute the view.
Lambda Architecture Data storage is highly optimized.
Virdata is the cross-industry cloud service/platform for the Internet of Things. Designed to elastically scale to monitor and manage an unprecedented amount of devices and applications using concurrent persistent connections, Virdata opens the door to numerous new business opportunities. Virdata combines Publish-Subscribe based Distributed Messaging, Complex Event Processing and state-of-the-art Big Data paradigms to enable both historical & real-time monitoring and near real-time analytics with a scale required for the Internet of Things.