Joe Stein • Developer, Architect & Technologist • Founder & Principal Consultant => Big Data Open Source Security LLC - http://stealth.ly

Big Data Open Source Security LLC provides professional services and product solutions for the collection, storage, transfer, real-time analytics, batch processing and reporting of complex data streams, data sets and distributed systems. BDOSS is all about the "glue": helping companies not only figure out which Big Data infrastructure components to use, but also how to change their existing systems (or build new ones) to work with them.

• Apache Kafka Committer & PMC member
• Blog & Podcast - http://allthingshadoop.com
• Twitter @allthingshadoop
● LinkedIn - Apache Kafka is used at LinkedIn for activity stream data and operational metrics. This powers various products like LinkedIn Newsfeed, LinkedIn Today in addition to our offline analytics systems like Hadoop.
● Mozilla - Kafka will soon be replacing part of our current production system to collect performance and usage data from the end-user's browser for projects like Telemetry, Test Pilot, etc. Downstream consumers usually persist to either HDFS or HBase.
● Twitter - As part of their Storm stream processing infrastructure, e.g. this.
● Netflix - Real-time monitoring and event-processing pipeline.
● Tagged - Apache Kafka drives our new pub sub system which delivers real-time events for users in our latest game - Deckadence. It will soon be used in a host of new use cases including group chat and back end stats and log collection.
● Square - We use Kafka as a bus to move all system events through our various datacenters. This includes metrics, logs, custom events etc. On the consumer side, we output into Splunk, Graphite, Esper-like real-time alerting.
● Foursquare - Kafka powers online to online messaging, and online to offline messaging at Foursquare. We integrate with monitoring, production systems, and our offline log infrastructure, including Hadoop.
● Spotify - Kafka is used at Spotify as part of their log delivery system.
● StumbleUpon - Data collection platform for analytics.
● Pinterest - Kafka is used with Secor as part of their log collection pipeline.
● Coursera - At Coursera, Kafka powers education at scale, serving as the data pipeline for realtime learning analytics/dashboards.
● Uber
● Tumblr - See this.
● Box - At Box, Kafka is used for the production analytics pipeline & real time monitoring infrastructure. We are planning to use Kafka for some of the new products & features.
● Shopify - Access logs, A/B testing events, domain events ("a checkout happened", etc.), metrics, delivery to HDFS, and customer reporting. We are now focusing on consumers: analytics, support tools, and fraud analysis.
● Airbnb - Used in our event pipeline, exception tracking & more to come.
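Almost every entry above uses Kafka the same way: as a partitioned, append-only event bus where records with the same key (e.g. a user ID in an activity stream) always land in the same partition, preserving per-key ordering. As a rough sketch of that idea only (toy code, not the Kafka client API or any company's actual implementation):

```python
import hashlib

class Topic:
    """Toy in-memory model of a Kafka-style topic: an append-only log
    split into partitions, with records routed to a partition by key."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Kafka's default partitioner hashes the record key; md5 is used
        # here only to make the example deterministic across runs.
        digest = hashlib.md5(key.encode()).digest()
        p = int.from_bytes(digest[:4], "big") % len(self.partitions)
        self.partitions[p].append((key, value))
        return p  # index of the partition the record was appended to

activity = Topic("user-activity")
p1 = activity.send("user-42", "viewed profile")
p2 = activity.send("user-42", "sent message")
assert p1 == p2  # same key -> same partition, so per-user order holds
```

Because ordering is only guaranteed within a partition, choosing the partition key (user, session, device) is the main design decision in pipelines like the ones listed above.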
● Mate1.com Inc. - Apache Kafka is used at Mate1 as our main event bus that powers our news and activity feeds, automated review systems, and will soon power real time notifications and log distribution.
● AddThis - Apache Kafka is used at AddThis to collect events generated by our data network and broker that data to our analytics clusters and real-time web analytics platform.
● Urban Airship - At Urban Airship we use Kafka to buffer incoming data points from mobile devices for processing by our analytics infrastructure.
● Boundary - Apache Kafka aggregates high-flow message streams into a unified distributed pubsub service, brokering the data for other internal systems as part of Boundary's real-time network analytics infrastructure.
● Metamarkets - We use Kafka to ingest real-time event data, stream it to Storm and Hadoop, and then serve it from our Druid cluster to feed our interactive analytics dashboards. We've also built connectors for directly ingesting events from Kafka into Druid.
● Ancestry.com - Kafka is used as the event log processing pipeline for delivering better personalized product and service to our customers.
● Simple - We use Kafka at Simple for log aggregation and to power our analytics infrastructure.
● DataSift - Apache Kafka is used at DataSift as a collector of monitoring events and to track users' consumption of data streams in real time. http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html
● Gnip - Kafka is used in their twitter ingestion and processing pipeline.
● Loggly - Loggly is the world's most popular cloud-based log management. Our cloud-based log management service helps DevOps and technical teams make sense of the massive quantity of logs. Kafka is used as part of our log collection and processing infrastructure.
● Spongecell - We use Kafka to run our entire analytics and monitoring pipeline driving both real-time and ETL applications for our customers.
● Wooga - We use Kafka to aggregate and process tracking data from all our Facebook games (which are hosted at various providers) in a central location.
● RichRelevance - Real-time tracking event pipeline.
● VisualDNA - We use Kafka 1. as an infrastructure that helps us bring continuously tracking events from various datacenters into our central hadoop cluster for offline processing, 2. as a propagation path for data integration, 3. as a real-time platform for future inference and recommendation engines.
● SocialTwist - We use Kafka internally as part of our reliable email queueing system.
● Countandra - A hierarchical distributed counting engine, uses Kafka as a primary speedy interface as well as for routing events for cascading counting.
● FlyHajj.com - We use Kafka to collect all metrics and events generated by the users of the website.
● uSwitch - See this blog.
● InfoChimps - Kafka is part of the InfoChimps real-time data platform.
● Visual Revenue - We use Kafka as a distributed queue in front of our web traffic stream processing infrastructure (Storm).
● Ooyala - Kafka is used as the primary high speed message queue to power Storm and our real-time analytics/event ingestion pipelines.
● Sematext - In SPM (performance monitoring + alerting), Kafka is used for metrics collection and feeds SPM's in-memory data aggregation (OLAP cube creation) as well as our CEP/Alerts servers (see also: SPM for Kafka performance monitoring). In SA (search analytics) Kafka is used in search and click stream collection before being aggregated and persisted. In Logsene (log analytics) Kafka is used to pass logs and other events from front-end receivers to the persistent backend.
● Datadog - Kafka brokers data to most systems in our metrics and events ingestion pipeline. Different modules contribute and consume data from it, for streaming CEP (homegrown), persistence (at different "temperatures" in Redis, ElasticSearch, Cassandra, S3), or batch analysis (Hadoop).
● Wize Commerce - At Wize Commerce (previously, NexTag), Kafka is used as a distributed queue in front of Storm based processing for search index generation. We plan to also use it for collecting user generated data on our web tier, landing the data into various data sinks like Hadoop, HBase, etc.
● Quixey - At Quixey, The Search Engine for Apps, Kafka is an integral part of our eventing, logging and messaging infrastructure.
● LinkSmart - Kafka is used in production at LinkSmart as an event stream feeding Hadoop and custom real time systems.
● Outbrain - We use Kafka in production for real time log collection and processing, and for cross-DC cache propagation.
● SwiftKey - We use Apache Kafka for analytics event processing.
● LucidWorks Big Data - We use Kafka for syncing LucidWorks Search (Solr) with incoming data from Hadoop and also for sending LucidWorks Search logs back to Hadoop for analysis.
● Yeller - Yeller uses Kafka to process large streams of incoming exception data for its customers. Rate limiting, throttling and batching are all built on top of Kafka.
● Emerging Threats - Emerging Threats uses Kafka in our event pipeline to process billions of malware events for search indices, alerting systems, etc.
● Cloud Physics - Kafka is powering our high-flow event pipeline that aggregates over 1.2 billion metric series from 1000+ data centers for near-to-real time data center operational analytics and modeling.
● Hotels.com - Hotels.com uses Kafka as a pipeline to collect real time events from multiple sources and for sending data to HDFS.
● Helprace - Kafka is used as a distributed high speed message queue in our help desk software as well as our real-time event data aggregation and analytics.
● Graylog2 - Graylog2 is a free and open source log management and data analysis system. It's using Kafka as default transport for Graylog2 Radio. The use case is described here.
● Exponential - Exponential is using Kafka in production to power the events ingestion pipeline for real time analytics and log feed consumption.
● Livefyre - Uses Kafka for the real time notifications, analytics pipeline and as the primary mechanism for general pub/sub.
● Exoscale - Uses Kafka in production.
● Cityzen Data - Uses Kafka as well; we provide a platform for collecting, storing and analyzing machine data.
● Yieldbot - Yieldbot uses Kafka for real-time events, Camus for batch loading, and MirrorMakers for x-region replication.
● LivePerson - Using Kafka as the main data bus for all real time events.
● Retention Science - Click stream ingestion and processing.
● Strava - Powers our analytics pipeline, activity feeds denorm and several other production services.
● Criteo - Use Kafka in production for over a year for stream processing and log transfer (over 2M messages/s and growing).
● The Wikimedia Foundation - Uses Kafka as a log transport for analytics data from production webservers and applications. This data is consumed into Hadoop using Camus and by other processors of analytics data.
● OVH - Uses Kafka in production for over a year now, using it for event bus, data pipeline for anti-DDoS and more to come.
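A recurring pattern in the entries above is one log, many independent readers: the same events feed Storm, Hadoop (via Camus), search indices, and alerting side by side. That works because each consumer group tracks its own offset into the shared log, so no reader interferes with another. A minimal sketch of the offset idea only (toy code, not the Kafka consumer API):

```python
class ConsumerGroup:
    """Toy model of Kafka consumer-group offsets: each downstream
    system remembers its own position in the shared log, so batch
    loaders and real-time alerting read the same data independently."""

    def __init__(self, name):
        self.name = name
        self.offset = 0  # index of the next record to read

    def poll(self, log, max_records=10):
        batch = log[self.offset:self.offset + max_records]
        self.offset += len(batch)  # "commit" the offset after reading
        return batch

log = [f"event-{i}" for i in range(25)]   # the shared append-only log
hadoop = ConsumerGroup("hadoop-loader")   # slow batch consumer
alerts = ConsumerGroup("alerting")        # fast real-time consumer

assert hadoop.poll(log, 10) == [f"event-{i}" for i in range(10)]
assert alerts.poll(log, 3) == ["event-0", "event-1", "event-2"]
assert (hadoop.offset, alerts.offset) == (10, 3)  # independent progress
```

Because offsets are per-group state rather than broker-side delivery state, adding a new downstream system (as several companies above describe doing) means starting a new group at offset zero, with no change to producers or existing consumers.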
Questions? /******************************************* Joe Stein Founder, Principal Consultant Big Data Open Source Security LLC http://www.stealth.ly Twitter: @allthingshadoop ********************************************/