@tw1tt3rart TW1TT3Rart ┈┈┈┈┈◢◤┈┈┈┈┈┈┈┈ ┈┈◢▇▇▇▇▇◣┈┈┈┈┈┈ ┈┈▇▇▇▇▇◤┈┈THANK┈┈ ┈┈▇▇▇▇▇┈┈┈┈YOU┈┈ ┈┈◥▇▇▇▇◣┈┈┈STEVE┈ ┈┈┈◥▇◤◥▇◤┈┈┈┈┈┈ #ThankYouSteve #TwitterArt 6 Oct via web
“Creativity comes from constraint.” “Brevity is the soul of wit.”
What is the scale of Twitter?
500,000,000 Tweets / Day → 3,500,000,000 Tweets / Week
3.5B Tweets / Week works out to 6,000+ Tweets / Second in steady state (500M ÷ ~86,400 seconds per day). However, there are peaks!
What was wrong?
- Fragile monolithic Rails code base: everything from managing raw database and memcache connections to rendering the site and presenting the public APIs
- Throwing machines at the problem instead of engineering solutions
- Trapped in an optimization corner: trading readability and flexibility for performance
Whale Hunting Expeditions
We organized archaeology digs and whale hunting expeditions to understand large-scale failures.
Re-envision the system?
- We wanted big infrastructure wins in performance, reliability, and efficiency (reduce the machines needed to run Twitter by 10x)
- Failure is inevitable in distributed systems: we wanted to isolate failures across our infrastructure
- Cleaner boundaries with related logic in one place: a desire for a loosely coupled, service-oriented model at the systems level
Ruby VM Reflection
- Started to evaluate our front-end server tier: CPU, RAM, and network
- Rails machines were being pushed to the limit: CPU and RAM maxed out, but not network (200-300 requests/host)
- Twitter's usage was growing: it was going to take a lot of machines to keep up with the growth curve
The JVM Solution
- A level of trust in the JVM from previous experience
- The JVM is a mature, world-class platform
- Huge, mature ecosystem of libraries
- Polyglot possibilities (Java, Scala, Clojure, etc.)
Decomposing the Monolith
Created services based on our core nouns, as sketched below:
- Tweet service
- User service
- Timeline service
- DM service
- Social Graph service
- ...
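A hypothetical sketch of what one "core noun" service interface could look like; the trait, types, and method names here are illustrative assumptions, not Twitter's actual API:

```scala
import com.twitter.util.Future

// Hypothetical domain type for illustration only.
case class Tweet(id: Long, userId: Long, text: String)

// A "core noun" service owns one domain concept end to end.
// Methods return Futures so callers can compose them asynchronously.
trait TweetService {
  def getTweet(id: Long): Future[Option[Tweet]]
  def postTweet(userId: Long, text: String): Future[Tweet]
  def deleteTweet(id: Long): Future[Unit]
}
```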
[Architecture diagram, four layers: Routing → Presentation → Logic → Storage. TFE (a netty-based reverse proxy) routes HTTP traffic to presentation services (Monorail, API, Web, Search, Feature X, Feature Y), which call the logic services (Tweet Service, User Service, Timeline Service, SocialGraph Service, DM Service) over Thrift; those sit on storage: Tweet Store and User Store (MySQL), Flock, Cache (Memcached), and Redis, also reached over Thrift*.]
Twitter Stack
A peek at some of our technology: Finagle, Zipkin, Scalding, and Mesos
Services: Concurrency is Hard
- Decomposing the monolith: each team took slightly different approaches to concurrency
- Different failure semantics across teams: no consistent back-pressure mechanism
- Failure domains taught us the importance of a unified client/server library to deal with failure strategies and load balancing (see the Finagle sketch below)
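A minimal sketch of the unified client/server abstraction in Finagle, Twitter's RPC library: both sides of the wire are expressed as a `Service[Req, Rep]`, i.e. an asynchronous function. The host, port, and timeout values here are made-up examples:

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.{Await, Duration, Future}

object FinagleSketch extends App {
  // A Service is an asynchronous function: Request => Future[Response].
  // Servers implement it; clients consume it. Same abstraction both ways.
  val service = new Service[Request, Response] {
    def apply(req: Request): Future[Response] = {
      val rep = Response()
      rep.contentString = "hello from a service"
      Future.value(rep)
    }
  }

  // Expose the service over HTTP.
  val server = Http.serve(":8080", service)

  // Build a client with a uniform timeout policy; because the client is
  // also just a Service, retries, load balancing, and tracing compose in.
  val client: Service[Request, Response] =
    Http.client
      .withRequestTimeout(Duration.fromSeconds(1))
      .newService("localhost:8080")

  println(Await.result(client(Request("/"))).contentString)
  Await.ready(server)
}
```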
Tracing with Zipkin
Zipkin hooks into the transmission logic of Finagle and times each service operation, giving you a visual representation of where the time to fulfill a request went (sketch below). https://github.com/twitter/zipkin
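Because tracing rides along with Finagle, a service can attach its own annotations to the currently active span; a small sketch, where the annotation strings and the wrapper class are made up for illustration:

```scala
import com.twitter.finagle.Service
import com.twitter.finagle.http.{Request, Response}
import com.twitter.finagle.tracing.Trace
import com.twitter.util.Future

// Illustrative wrapper that adds custom annotations to the active
// Zipkin span; Finagle propagates the trace context across hops.
class TracedService(underlying: Service[Request, Response])
    extends Service[Request, Response] {
  def apply(req: Request): Future[Response] = {
    Trace.record("starting expensive lookup")    // timestamped event
    Trace.recordBinary("request.path", req.path) // key/value tag
    underlying(req)
  }
}
```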
Hadoop with Scalding
Services receive a ton of traffic and generate a ton of usage logs and debugging entries. Scalding is an open source Scala library that makes it easy to specify MapReduce jobs with the benefits of functional programming (example below)! https://github.com/twitter/scalding
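The canonical example, adapted from the Scalding tutorial: a word count over raw text written as ordinary functional transformations, with input and output paths supplied as job arguments:

```scala
import com.twitter.scalding._

// Counts word occurrences in a text file; run with
// --input and --output arguments, locally or on a Hadoop cluster.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String =>
      line.toLowerCase.split("\\s+")
    }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```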
Data Center Evils
The evils of single tenancy and static partitioning: different jobs have different utilization profiles. Can we do better?
[Diagram: statically partitioned data center, each cluster pinned to one job, running at 33% utilization while the rest sits at 0%]
Mesos, Linux and cgroups
- Apache Mesos: the kernel of the data center
- Obviates the need for virtual machines*
- Isolation via Linux cgroups (CPU, RAM, network, FS)
- Reshape clusters dynamically based on resources
- Multiple frameworks; scalability to 10,000s of nodes
Data Center Computing
Reduce CapEx/OpEx via efficient utilization of HW. http://mesos.apache.org
[Chart: consolidating jobs that each kept a static partition at 33% utilization (the rest at 0%) onto shared Mesos-managed hardware pushes utilization toward 100%; annotated "reduces CapEx and OpEx!" and "reduces latency!"]
How did it all turn out? Not bad... not bad at all... Where did the fail whale go?
Site Success Rate Today :)
[Chart of site success rate, 2006-2013: y-axis from 99._% to 100%; annotated "not a lot of traffic" near 2006, a dip at the 2010 World Cup, and "off the Monorail" approaching 100% by 2013]
Performance Today :)
Growth Continues Today...
- 2000+ Employees Worldwide
- 50% of Employees are Engineers
- 230M+ Active Users
- 500M+ Tweets per Day
- 35+ Languages Supported
- 75% of Active Users are on Mobile
- 100+ Open Source Projects
Concluding Thoughts Lessons Learned
Lesson #1: Embrace open source
- Best-of-breed solutions are open these days
- Learn from your peers' code and university research
- Don't only consume; give back to enrich the ecosystem: http://twitter.github.io
Lesson #2: Incremental change always wins
- Increase your chance of success by making small changes
- Small changes add up with minimized risk
- Loosely coupled microservices work