Introduction to Apache Geode (incubating) Anthony Baker (@metatype) William Markito (@william_markito) London, September 2015
Agenda • Introduction to Geode • Geode concepts and usage • The Geode open source project • StockPrediction demo
History • 1000+ customers in production • Cutting edge use cases
2004 • Massive increase in data volumes • Falling margins per transaction • Increasing cost of IT maintenance • Need for elasticity in systems • Customers: Financial Services Providers (every major Wall Street bank), Department of Defense
2008 • Real Time response needs • Time to market constraints • Need for flexible data models across enterprise • Distributed development • Persistence + In-memory • Customers: Largest travel Portal, Airlines, Trade clearing, Online gambling
2014 • Global data visibility needs • Fast Ingest needs for data • Need to allow devices to hook into enterprise data • Always on • Customers: Largest Telcos, Large manufacturers, Largest Payroll processor, Auto insurance giants, Largest rail systems on earth
Interesting use cases
China Railway Corporation • 5,700 train stations • 4.5 million tickets per day • 20 million daily users • 1.4 billion page views per day • 40,000 visits per second
Indian Railways • 7,000 stations • 72,000 miles of track • 23 million passengers daily • 120,000 concurrent users • 10,000 transactions per minute
*http://pivotal.io/big-data/pivotal-gemfire
Interesting use cases • China Railway Corporation • Population: 1,401,586,609 • Indian Railways • Population: 1,251,695,616 • World: ~7,349,000,000 • Combined: ~36% of the world population
Roadmap • Off-heap memory storage • HDFS persistence • Lucene indexes • Spark connector • Cloud Foundry service • Distributed transactions …and other ideas from the Geode community!
Geode Concepts and Usage • Cache • Region • Member • Client Cache • Functions • Listeners
Concepts • Cache • In-memory storage and management for your data • Configurable through XML, Spring, Java API or CLI • Collection of Regions (Diagram: a Cache inside a JVM holding several Regions)
Concepts • Region • Distributed java.util.Map on steroids (Key/Value) • Consistent API regardless of where or how data is stored • Observable (reactive) • Highly available, redundant on cache member(s) (Diagram: a Region inside a Cache in a JVM, shown as a java.util.Map with entries K01 -> May and K02 -> Tim)
Concepts • Region • Local, Replicated or Partitioned • In-memory or persistent • Redundant • LRU • Overflow (Diagram: the same Region, with entries K01 -> May and K02 -> Tim, held in the Caches of two JVMs)
Region types: LOCAL, LOCAL_HEAP_LRU, LOCAL_OVERFLOW, LOCAL_PERSISTENT, LOCAL_PERSISTENT_OVERFLOW, PARTITION, PARTITION_HEAP_LRU, PARTITION_OVERFLOW, PARTITION_PERSISTENT, PARTITION_PERSISTENT_OVERFLOW, PARTITION_PROXY, PARTITION_PROXY_REDUNDANT, PARTITION_REDUNDANT, PARTITION_REDUNDANT_HEAP_LRU, PARTITION_REDUNDANT_OVERFLOW, PARTITION_REDUNDANT_PERSISTENT, PARTITION_REDUNDANT_PERSISTENT_OVERFLOW, REPLICATE, REPLICATE_HEAP_LRU, REPLICATE_OVERFLOW, REPLICATE_PERSISTENT, REPLICATE_PERSISTENT_OVERFLOW, REPLICATE_PROXY
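The LRU and Overflow bullets can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not Geode code: the class name OverflowRegion, the entry-count limit, the value "Ann" and the dict standing in for disk are all hypothetical. Geode's HEAP_LRU types react to heap usage rather than to an entry count, which this toy version does not model.

```python
from collections import OrderedDict


class OverflowRegion:
    """Toy region: keeps at most `max_entries` values in memory and
    overflows the least recently used ones to a stand-in 'disk' dict."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.memory = OrderedDict()   # most recently used entries last
        self.disk = {}                # hypothetical overflow store

    def put(self, key, value):
        self.disk.pop(key, None)      # a fresh value supersedes an overflowed copy
        self.memory[key] = value
        self.memory.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.disk:
            # Fault the value back into memory; this may evict another entry.
            value = self.disk.pop(key)
            self.memory[key] = value
            self._evict()
            return value
        return None

    def _evict(self):
        # Move least recently used entries to 'disk' until we fit again.
        while len(self.memory) > self.max_entries:
            old_key, old_value = self.memory.popitem(last=False)
            self.disk[old_key] = old_value


region = OverflowRegion(max_entries=2)
region.put("K01", "May")
region.put("K02", "Tim")
region.put("K03", "Ann")   # K01 is least recently used and overflows
```

After the third put, K01 lives only in the overflow store, and reading it back pulls it into memory at the expense of K02. The caller sees the same get/put interface either way, which is the point of the "consistent API" claim for regions.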
Concepts • Persistent Regions • Durability • WAL for efficient writing • Consistent recovery • Compaction (Diagram: a put of k4->v7 on Member 1 appends "Modify k4->v7" to Oplog3.crf; the older Oplog2.crf holds Modify k1->v5, Create k6->v6, Create k2->v2, Create k4->v4. Server 1 through Server N each hold the Region, with entries K01 -> May and K02 -> Tim, in their Cache.)
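The write-ahead-log idea behind persistent regions can be sketched as an append-only list of records that is replayed on recovery and periodically compacted. The record layout, the function names, the "destroy" operation and the order of records inside Oplog2 are assumptions made for illustration; the real .crf file format is not described on the slide.

```python
def apply(region, record):
    """Apply one oplog record (op, key, value) to an in-memory dict."""
    op, key, value = record
    if op in ("create", "modify"):
        region[key] = value
    elif op == "destroy":             # hypothetical op, not shown on the slide
        region.pop(key, None)


def recover(oplogs):
    """Rebuild region contents by replaying oplogs from oldest to newest."""
    region = {}
    for oplog in oplogs:
        for record in oplog:
            apply(region, record)
    return region


def compact(oplogs):
    """Return a single oplog holding only the live value of each key."""
    return [("create", key, value) for key, value in recover(oplogs).items()]


# Records taken from the slide's diagram; their order within Oplog2 is assumed.
oplog2 = [
    ("create", "k4", "v4"),
    ("create", "k2", "v2"),
    ("create", "k6", "v6"),
    ("modify", "k1", "v5"),
]
oplog3 = [("modify", "k4", "v7")]     # the put is appended, never rewritten in place

state = recover([oplog2, oplog3])
compacted = compact([oplog2, oplog3])
```

Replaying both logs yields k4 = v7 because the later record wins, and the compacted log no longer carries the stale k4->v4 entry while still recovering to the same state.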
Concepts • Member • A process that has a connection to the system • A process that has created a cache • Embeddable within your application (Diagram: Client, Locator and Server members)
Concepts • Client cache • A process connected to the Geode server(s) • Can have a local copy of the data • Can be notified about events on the servers (Diagram: an Application whose Client Cache holds a Region, connected to Regions on a GemFire Server)
Concepts • Functions • Used for distributed concurrent processing (Map/Reduce, stored procedure) • Highly available • Data oriented • Member oriented (Diagram: Submit (f1); Execute Functions f1, f2, … fn)
Concepts • Functions (Diagram: function execution on a partitioned region in a distributed system. Three servers host the partitioned region as Data Store - X, Data Store - Y and Data Store - Z; two further servers host it as Data Accessors; a Client holds the Region.)
1 - The client calls FunctionService.onRegion.withFilter.execute with filter = Keys X, Y
2 - execute is sent to a server
3 - execute is forwarded to the data stores
4 - Each data store returns a result
5 - The result is returned to the client
6 - The client calls ResultCollector.getResult
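The six steps of data-oriented function execution can be modelled in a few lines of plain Python. This is a minimal sketch and not the Geode API: the CRC-based owner function, the helper names and the sample keys and values are all hypothetical, and the real FunctionService also handles redundancy and failover, which are omitted here.

```python
import zlib

MEMBERS = ["X", "Y", "Z"]   # data stores, named after the diagram


def owner(key):
    """Pick the data store for a key with a stable hash (assumed scheme)."""
    return MEMBERS[zlib.crc32(key.encode("utf-8")) % len(MEMBERS)]


def put(stores, key, value):
    stores[owner(key)][key] = value


def execute_on_region(stores, function, filter_keys):
    """Run `function` only on the stores that own the filtered keys,
    handing each store just its local share of the data."""
    keys_by_member = {}
    for key in filter_keys:
        keys_by_member.setdefault(owner(key), []).append(key)

    results = []
    for member, keys in keys_by_member.items():
        local_data = {k: stores[member][k] for k in keys if k in stores[member]}
        results.append(function(local_data))   # one partial result per store
    return results


stores = {member: {} for member in MEMBERS}
for key, value in [("K01", 10), ("K02", 20), ("K03", 30), ("K04", 40)]:
    put(stores, key, value)

# The caller combines the partial results, as a result collector would.
partial_sums = execute_on_region(stores, lambda data: sum(data.values()), ["K01", "K03"])
total = sum(partial_sums)
```

The function travels to the data instead of the data travelling to the caller, and stores that own none of the filtered keys are never invoked.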
Concepts • Listeners • CacheWriter / CacheListener • AsyncEventListener (queue / batch) • Parallel or Serial • Conflation
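Conflation in a queued, batched listener can be sketched as a queue that keeps only the newest pending event per key. The class and function names below are hypothetical and the sketch is not the AsyncEventListener interface. It also assumes that a conflated event moves to the tail of the queue, which may differ from how a real event queue orders conflated entries.

```python
from collections import OrderedDict


class ConflatingQueue:
    """Toy event queue: a newer event for a key replaces the queued one."""

    def __init__(self):
        self.events = OrderedDict()

    def enqueue(self, key, value):
        # Drop any pending event for this key, then append the new one
        # at the tail of the queue (assumed ordering).
        self.events.pop(key, None)
        self.events[key] = value

    def drain(self, batch_size):
        """Remove and return up to `batch_size` events, oldest first."""
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popitem(last=False))
        return batch


def listener(batch, sink):
    """Hypothetical batch listener: writes each event to `sink`."""
    for key, value in batch:
        sink[key] = value


queue = ConflatingQueue()
for key, value in [("K01", "May"), ("K02", "Tim"), ("K01", "June")]:
    queue.enqueue(key, value)

sink = {}
batch = queue.drain(batch_size=10)
listener(batch, sink)
```

Three updates produce a batch of two events because the first K01 value was superseded before delivery. That trade-off suits a slow downstream system that only needs the latest value per key.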
The Geode Project
Why Open Source? Why ASF? • Open source is fundamentally changing software buying patterns • Customers get transparency and co-development of features • It’s the community that matters • ASF provides a framework for open source
Geode Will Be a Significant Apache Project • 1M+ LOC, over 1,000 person-years invested in cutting-edge R&D • Thousands of production customers in very demanding verticals • Cutting-edge use cases that have shaped product thinking • A core technology team that has stayed together since founding • Performance differentiators that are baked into every aspect of the product
Geode versus GemFire • Geode is a project supported by the OSS community • GemFire is a product from Pivotal, based on Geode source • We donated everything but the kitchen sink* • Development process follows “The Apache Way” * Multi-site WAN replication, continuous queries, and native (C/C++) client
"Talk is cheap, show me the code"
• Clone & Build
    git clone https://github.com/apache/incubator-geode
    cd incubator-geode
    ./gradlew build -Dskip.tests=true
• Start a server
    cd gemfire-assembly/build/install/apache-geode
    ./bin/gfsh
    gfsh> start locator --name=locator
    gfsh> start server --name=server
    gfsh> create region --name=myRegion --type=REPLICATE
Stock Predictions with Apache Geode, Spark, and SpringXD
Quick intro to Apache Spark • RDD • Dataframe • Driver • Worker "An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes."
Quick intro to Apache Spark • RDD • Dataframe • Driver • Worker "A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (pandas), previously known as SchemaRDD."
Data Temperature
1 - Live data is ingested into the grid
2 - Trained ML model compares new data to historical patterns
3 - Results are pushed immediately to deployed applications
4 - Re-training is triggered, updating the model with the latest historical data
(Diagram labels: Apache Geode / GemFire, Spring XD, Live Data, Machine Learning model; temperature tiers: Hot, Warm)
Machine Learning Concepts • Inputs: price(x), medium avg(x), relative strength(x) • Machine Learning Model (e.g. Linear Regression) • Output: medium avg(x+1)
Features: price(x), medium avg(x), relative strength(x) • Machine Learning Model (e.g. Linear Regression) • Label: medium avg(x+1)
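The feature/label pairing can be sketched in plain Python. Several assumptions are made here: "medium avg" is treated as a trailing moving average, the relative strength feature is left out, the price series is made up, and the fit uses only the moving average as a single input so that ordinary least squares has a closed form. The demo's actual model and feature definitions are not given on the slides.

```python
def moving_average(prices, window):
    """Trailing moving average; None until a full window is available."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def build_rows(prices, window):
    """Pair the features at step x with the label at step x+1."""
    avg = moving_average(prices, window)
    rows = []
    for x in range(window - 1, len(prices) - 1):
        features = (prices[x], avg[x])    # price(x), avg(x)
        label = avg[x + 1]                # avg(x+1)
        rows.append((features, label))
    return rows


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # made-up series
rows = build_rows(prices, window=2)
slope, intercept = fit_line([f[1] for f, _ in rows], [label for _, label in rows])

# Predict the next average from the most recent one.
latest_avg = moving_average(prices, window=2)[-1]
predicted_next_avg = slope * latest_avg + intercept
```

Each training row looks one step ahead: the inputs describe step x and the target is the average at step x+1, matching the Features and Label split on the slide.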