HBase committer. Previously: systems programming, operations, large-scale data analysis. I love data and data systems. Hello!
Outline
• Why should you care? (Intro)
• What is Hadoop?
• How does it work? Hadoop MapReduce
• The Hadoop Ecosystem
• Questions
Data is everywhere.
"I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding."
Hal Varian (Google's chief economist)
Are you throwing away data?
Data comes in many shapes and sizes: relational tuples, log files, semi-structured textual data (e.g., e-mail), … . Are you throwing it away because it doesn't 'fit'?
So, what's Hadoop?
Apache Hadoop is an open-source
system to reliably store and process A LOT of information across many commodity computers.
Two Core Components
• HDFS (Store): self-healing, high-bandwidth clustered storage.
• MapReduce (Process): fault-tolerant distributed processing.
What makes Hadoop special?
Hadoop separates distributed-system fault-tolerance code from application logic. Systems Programmers / Statisticians / Unicorns
Hadoop lets you interact with a cluster, not a bunch of machines.
Image: Yahoo! Hadoop cluster [OSCON '07]
Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example:
• Extensive machine learning on <100GB of image data
• Simple SQL-style queries on >100TB of clickstream data
Hadoop works for both applications!
HDFS split the file into 64MB blocks and stored them on the DataNodes.
• Now, we want to process that data.
The MapReduce Programming Model
You specify map() and reduce()
functions. The framework does the rest.
map()
map: (K₁, V₁) → list(K₂, V₂)
Input Key: byte offset 193284
Input Value: "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326"
Output Key: userimage
Output Value: 2326 bytes
The map function runs on the same node where the data is stored!
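The map step above can be sketched in plain Java. This is a minimal illustration, not the Hadoop `Mapper` API: the class name `LogMap`, the regex, and the "category of the request path" interpretation are all assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogMap {
    // Matches a request path like "/userimage/123" and the response size
    // at the end of a common-log-format line. (Hypothetical pattern.)
    private static final Pattern REQUEST =
        Pattern.compile("\"GET /(\\w+)/\\S* HTTP/[\\d.]+\" \\d+ (\\d+)");

    // A simplified map(): the byte-offset key is ignored; the log line is
    // parsed into a (path-category, bytes-served) output pair.
    public static Map.Entry<String, Long> map(long offset, String line) {
        Matcher m = REQUEST.matcher(line);
        if (!m.find()) return null;          // skip lines that don't match
        return new SimpleEntry<>(m.group(1), Long.parseLong(m.group(2)));
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /userimage/123 HTTP/1.0\" 200 2326";
        Map.Entry<String, Long> out = map(193284L, line);
        System.out.println(out.getKey() + " -> " + out.getValue());
        // prints "userimage -> 2326"
    }
}
```

In real Hadoop the same logic would live in a `Mapper` subclass and the framework would feed it one (offset, line) pair per record.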
Input Format
• Wait! HDFS is not a key-value store!
• An InputFormat interprets the stored bytes as a Key and a Value:
"127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326"
Key: byte offset 193284
Value: "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326"
The Shuffle
• Each map output is assigned to a "reducer" based on its key
• Map output is grouped and sorted by key
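The two shuffle steps can be sketched in a few lines of plain Java. This is a single-process simulation, not Hadoop's actual shuffle machinery; only the partitioning formula mirrors Hadoop's default `HashPartitioner`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Shuffle {
    // Step 1: assign a map-output key to one of numReducers partitions,
    // the same modulo-hash scheme as Hadoop's default HashPartitioner.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Step 2: group and sort map outputs by key, as seen by one reducer.
    // A TreeMap keeps keys in sorted order, like the merge-sorted shuffle.
    static TreeMap<String, List<Long>> group(List<Map.Entry<String, Long>> mapOutputs) {
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapOutputs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }
}
```

A reduce() function then runs once per key, receiving the key and the grouped list of values.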
Hive
• The Hive project adds SQL support to Hadoop
• HiveQL (a SQL dialect) compiles to a query plan
• The query plan executes as MapReduce jobs
Hive Example

CREATE TABLE movie_rating_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/datasets/movielens'
INTO TABLE movie_rating_data;

CREATE TABLE average_ratings AS
SELECT movieid, AVG(rating)
FROM movie_rating_data
GROUP BY movieid;
The Hadoop Ecosystem (Column DB)
Hadoop in the Wild (yes, it's used in production)
• Yahoo! Hadoop clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC '09)
• Facebook: 15TB new data per day; 1200 machines, 21PB in one cluster
• Twitter: ~1TB per day, ~80 nodes
• Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research)
What about real time access?
• MapReduce is a batch system
• The fastest MR job takes 24 seconds
• HDFS just stores bytes, and is append-only
• Not about to serve data for your next web site
Apache HBase
HBase is an open-source, distributed, sorted map modeled after Google's BigTable
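The "sorted map" part of that definition can be illustrated with Java's `TreeMap`. This is only an in-process analogy, not the HBase client API: HBase keeps rows sorted by row key across many servers, which is what makes prefix range scans cheap. The row keys below are made up for the example.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class SortedMapDemo {
    public static void main(String[] args) {
        // Like an HBase table, a TreeMap keeps keys in sorted
        // (lexicographic) order regardless of insertion order.
        TreeMap<String, String> table = new TreeMap<>();
        table.put("com.example/page3", "v3");
        table.put("com.example/page1", "v1");
        table.put("com.example/page2", "v2");
        table.put("org.other/page1",   "x1");

        // A range scan over one key prefix, analogous to an HBase Scan
        // with a start row and a stop row ('0' sorts just after '/').
        SortedMap<String, String> scan =
            table.subMap("com.example/", "com.example0");
        System.out.println(scan.keySet());
        // prints "[com.example/page1, com.example/page2, com.example/page3]"
    }
}
```

Ordered keys are also why HBase uses range partitions rather than hash partitions, as the comparison slide below notes.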
• cluster: 600 nodes, ~600TB
• Most clusters: 5-20 nodes, 100GB-4TB
• Writes: 1-3ms, 1k-10k writes/sec per node
• Reads: 0-3ms cached, 10-30ms disk
• 10-40k reads/second/node from cache
• Cell size: 0-3MB preferred
HBase compared
• Favors Consistency over Availability (but availability is good in practice!)
• Great Hadoop integration (very efficient bulk loads, MapReduce analysis)
• Ordered range partitions (not hash)
• Automatically shards/scales (just turn on more servers)
• Sparse column storage (not key-value)
HBase in Production
• Facebook (product release soon)
• StumbleUpon / su.pr
• Mozilla (receives crash reports)
• … many others