● Streaming API: subscribe to real-time feeds, from now onward
● REST Search API: search requests over past data (up to 1 week back)
Streaming API: public statuses from all users
● status/filter (track/location/follow)
  ○ up to 5000 follow user ids
  ○ up to 400 track keywords
  ○ up to 25 location boxes
  ○ rate limited
● status/sample
  ○ 1% of all public statuses (message id mod 100)
  ○ two status/sample streams will result in the same data
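The "(message id mod 100)" note explains why two sample streams carry the same data: if sampling is a deterministic function of the message id, every consumer sees the same subset. A minimal sketch of that idea follows; the function name `in_sample` and the residue value 0 are illustrative assumptions, not Twitter's documented rule.

```python
def in_sample(message_id: int, residue: int = 0) -> bool:
    """Keep a status when its id falls in one fixed residue class mod 100.

    Assumption: the residue (0 here) is chosen only for illustration.
    """
    return message_id % 100 == residue


# Two independent "streams" applying the same rule to the same ids
ids = range(1, 1001)
stream_a = [i for i in ids if in_sample(i)]
stream_b = [i for i in ids if in_sample(i)]
```

Because the rule depends only on the id, opening a second sample stream does not yield additional statuses.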
Streaming API: per-user streams
● User Streams
  ○ all data required to update a user's display
  ○ requires the user's OAuth token
  ○ statuses from followings, direct messages, mentions
  ○ cannot open a large number of user streams from the same host
● Site Streams
  ○ multiplexing of multiple User Streams
Streaming API: Firehose
Need more/full data? Only through partners
● gnip.com
● datasift.com
● filtering/tracking
● partial to full Firehose
What's the catch?
Streaming API: Firehose
Base Twitter data license
$0.10 per 1000 tweets
~$1 million/month for the full Firehose
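The two figures above imply a tweet volume, which can be checked with a little arithmetic. The sketch below uses only the numbers from the slide; the 30-day month and all names are assumptions made for illustration.

```python
PRICE_CENTS_PER_1000_TWEETS = 10   # $0.10 per 1000 tweets, from the slide
MONTHLY_COST_USD = 1_000_000       # approximate full Firehose cost, from the slide
DAYS_PER_MONTH = 30                # assumption


def implied_tweets_per_month(cost_usd: int, cents_per_1000: int) -> int:
    """Number of tweets that the given monthly cost pays for (integer math)."""
    return cost_usd * 100 * 1000 // cents_per_1000


tweets_per_month = implied_tweets_per_month(
    MONTHLY_COST_USD, PRICE_CENTS_PER_1000_TWEETS
)
tweets_per_day = tweets_per_month // DAYS_PER_MONTH

print(f"{tweets_per_month:,} tweets/month, about {tweets_per_day:,} tweets/day")
```

Under these assumptions the license price corresponds to roughly 10 billion tweets per month, or about 333 million per day.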
Streaming API: Firehose
Startup?
● REST API (HTTP request/response)
  ○ search query
  ○ geocode (lat, long, radius)
  ○ result type (mixed/recent/popular)
  ○ since id
● max 100 results per page (rpp) and 1500 results in total
● rate limited (~1 request/sec)
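The limits above bound how much a single query can return and how fast. The sketch below plans the page requests for one query; it makes no network calls, and the `page` key, the function name, and the constants' names are assumptions for illustration (only `rpp` and the numeric limits come from the slide).

```python
import math

MAX_RPP = 100                       # results per page, from the slide
MAX_RESULTS = 1500                  # results per query, from the slide
MIN_SECONDS_BETWEEN_REQUESTS = 1.0  # ~1 request/sec, from the slide


def page_plan(wanted: int, rpp: int = MAX_RPP) -> list:
    """Return one dict of paging parameters per request needed.

    The plan is capped at MAX_RESULTS no matter how many results are wanted.
    """
    rpp = min(rpp, MAX_RPP)
    reachable = min(wanted, MAX_RESULTS)
    pages = math.ceil(reachable / rpp)
    return [{"page": p, "rpp": rpp} for p in range(1, pages + 1)]


plan = page_plan(5000)  # asks for more than the API will give
min_duration = len(plan) * MIN_SECONDS_BETWEEN_REQUESTS
```

Exhausting one query therefore takes 15 requests, and at about one request per second at least 15 seconds.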
Twitter Geo
There is NO simple way to grab ALL tweets for a given region
Twitter Geo: Streaming API
● status/filter + location (bounding box)
  ○ only tweets with explicit coordinates
  ○ < 10% of all tweets
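A bounding-box filter can only match tweets that carry explicit coordinates, which is why it misses most of the traffic. A minimal sketch of such a test follows; the `BoundingBox` type, the function name, and the approximate box around Montreal are assumptions for illustration, and the check ignores boxes that cross the antimeridian.

```python
from typing import NamedTuple, Optional, Tuple


class BoundingBox(NamedTuple):
    west: float   # minimum longitude
    south: float  # minimum latitude
    east: float   # maximum longitude
    north: float  # maximum latitude


def matches_location(
    coordinates: Optional[Tuple[float, float]], box: BoundingBox
) -> bool:
    """True only when the tweet has (longitude, latitude) inside the box."""
    if coordinates is None:  # no explicit coordinates: never matches
        return False
    lon, lat = coordinates
    return box.west <= lon <= box.east and box.south <= lat <= box.north


# Approximate box around Montreal (illustrative values)
montreal_box = BoundingBox(west=-74.0, south=45.4, east=-73.4, north=45.7)
```

A tweet whose author merely lists a city in the profile, without coordinates, falls through the `None` branch.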
Twitter Geo: Streaming API
● Firehose
  ○ < 10% of all tweets contain explicit coordinates
  ○ must do reverse geocoding on user profile location
  ○ user profile location is free-form text
Twitter Geo: Search API
● geocode (lat, long, radius)
● tweets with explicit coordinates
● tweets reverse geocoded from user profile location
● location field: free-form text (Montreal / Montreal,Qc / Mtl / Mourial)
● false positives
● REST API: not for frequent polling
● rate limited (1 req/sec/ip)
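To make the free-form location problem concrete, here is a minimal sketch of matching a profile location against a hand-made alias table. The table, the function names, and the token-matching approach are assumptions for illustration, not how the Search API does its reverse geocoding.

```python
import re
import unicodedata
from typing import List, Optional

# Illustrative alias table built from the spellings on the slide
CITY_ALIASES = {"montreal": "Montreal", "mtl": "Montreal", "mourial": "Montreal"}


def tokens(text: str) -> List[str]:
    """Strip accents, lower-case, and split into alphabetic tokens."""
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.findall(r"[a-z]+", unaccented.lower())


def guess_city(profile_location: str) -> Optional[str]:
    """Return the first city whose alias appears as a token, else None."""
    for token in tokens(profile_location):
        if token in CITY_ALIASES:
            return CITY_ALIASES[token]
    return None
```

The sketch handles "Montreal,Qc", "Mtl", and "Montréal", but it also reports Montreal for "Montreal Road, Ottawa", which is exactly the kind of false positive the slide warns about.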
Twitter Geo Solutions?
That's your job!
Twitter Geo
● search API intelligent polling farm
  ○ adapt the polling interval to traffic to keep polling to a minimum
● streaming API status/filter/follow reader farm?
  ○ find N relevant users from the city; number of stream readers = N / 5000
  ○ must do reverse geocoding
  ○ user list needs dynamic updates
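Both ideas on this slide can be sketched in a few lines. The first function adapts a polling interval to traffic (halve it when a page comes back full, double it when nothing is new); the second splits a user list into groups of 5000, one per stream reader. The halving/doubling policy, the 300-second ceiling, and all names are assumptions for illustration; only the 5000-id, 100-result, and 1 request/sec limits come from the slides.

```python
from typing import List

MAX_FOLLOW_IDS = 5000  # follow ids per filter stream, from the slides
PAGE_SIZE = 100        # max results per search page, from the slides
MIN_INTERVAL = 1.0     # seconds; the 1 request/sec limit from the slides
MAX_INTERVAL = 300.0   # seconds; assumed ceiling


def next_interval(current: float, new_results: int) -> float:
    """Poll faster when a page comes back full, slower when it is empty."""
    if new_results >= PAGE_SIZE:   # full page: we may be falling behind
        proposed = current / 2
    elif new_results == 0:         # nothing new: back off
        proposed = current * 2
    else:                          # some traffic: keep the current pace
        proposed = current
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


def partition_users(user_ids: List[int], size: int = MAX_FOLLOW_IDS) -> List[List[int]]:
    """Split the user list into one group per stream reader."""
    return [user_ids[i:i + size] for i in range(0, len(user_ids), size)]


groups = partition_users(list(range(12_000)))  # 12,000 hypothetical user ids
```

With 12,000 users the partition yields three readers, since N / 5000 has to be rounded up. When the user list changes, only the affected groups need their streams reopened.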
Tweitgeist
[Diagram: Storm topology. Twitter → TwitterStream (spout) → ExtractMessage (bolt) → ExtractHashtag (bolt) → RollingCount (bolt) → Rank (bolt) → Merge (bolt) → UI. Stage labels: stream reader, message extract, hashtag extract, rolling counter, ranking, merging, plus two Redis queues. Edge labels: shuffle, shuffle, field (hashtag), field (hashtag), global.]
Shuffle grouping: tuples are randomly distributed across the bolt's tasks
Fields grouping: the stream is partitioned by the fields specified in the grouping
Global grouping: the entire stream goes to a single one of the bolt's tasks
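The three grouping definitions above can be illustrated with a small simulation that picks a task index for each tuple. This is plain Python, not Storm's API; the function names, the CRC32 hash, and the sample hashtags are assumptions for illustration.

```python
import random
import zlib
from collections import Counter


def shuffle_grouping(tuple_: dict, num_tasks: int, rng: random.Random) -> int:
    """Random task: spreads load evenly, no key affinity."""
    return rng.randrange(num_tasks)


def fields_grouping(tuple_: dict, num_tasks: int, fields: list) -> int:
    """Same field values always map to the same task (hash partitioning)."""
    key = "|".join(str(tuple_[f]) for f in fields)
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def global_grouping(tuple_: dict, num_tasks: int) -> int:
    """Everything goes to one task (index 0 in this sketch)."""
    return 0


# A rolling-count stage needs fields grouping on the hashtag, so that
# each task holds the complete count for the hashtags it owns.
NUM_TASKS = 4
tuples = [{"hashtag": h} for h in ["#storm", "#redis", "#storm", "#mtl", "#storm"]]
counters = [Counter() for _ in range(NUM_TASKS)]
for t in tuples:
    task = fields_grouping(t, NUM_TASKS, ["hashtag"])
    counters[task][t["hashtag"]] += 1
```

With shuffle grouping instead, the occurrences of one hashtag would be scattered across tasks and no single counter would be complete, which is why the counting and ranking edges are keyed on the hashtag and the final merge uses a global grouping.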