Update
By: Brian Klug, Li Fan
Presentation Overview:
–API we plan to use (Syntax and commands)
–Obtainable Data Types (Location, Text, Time, User, Reply)
–Infrastructure (Hardware, Storage Req's, Design)
–Tentative Work Plan (Timeline and Schedule)
API: Streaming API
Enables near-real-time access to a subset of public Twitter statuses.
–Currently in alpha test
–Access to further restricted resources is extremely limited and granted only after acceptance of an additional TOS document. We have applied for credentials which grant us access to these increased resources (namely a larger sampling, more statuses)
Features of streaming API
–Continual connection that streams statuses over HTTP. Opened indefinitely and only requires basic authentication for the most basic level
–Output data is in XML or JSON formats, both of which are easy to parse.
–Can focus on certain tracking predicates that, when specific enough, return all occurrences in the full Firehose stream, e.g. "track=basketball,football,baseball,footy,soccer".
Execute: curl -uAnyTwitterUser:Password
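The tracking predicate and the stream handling described above can be sketched in Python. This is a minimal sketch, not the presenters' code: the helper names `build_track_param` and `iter_statuses` are hypothetical, and it assumes that statuses arrive as one JSON object per line, which the slide does not state. It reads from any file-like object, so no endpoint URL is assumed.

```python
import io
import json
from urllib.parse import urlencode


def build_track_param(keywords):
    """Build the `track=` predicate string from a list of keywords."""
    cleaned = [k.strip() for k in keywords if k.strip()]
    # safe="," keeps the commas literal, matching the form shown on the slide.
    return urlencode({"track": ",".join(cleaned)}, safe=",")


def iter_statuses(stream):
    """Yield one parsed status per non-empty line of a text stream.

    Assumption: one JSON object per line. Blank lines and lines that
    fail to parse are skipped instead of aborting the loop.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


# Usage with an in-memory stand-in for the HTTP response body.
sample_stream = io.StringIO('{"text": "first"}\n\n{"text": "second"}\n')
sample_texts = [status["text"] for status in iter_statuses(sample_stream)]
```

Reading from a generic stream keeps the parsing logic separate from the connection, so the same function works on a live response or on a file of previously downloaded lines.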
Streaming API data
Example data:
Can you bring the script tomorrow? We can write in the APE if you're not busy.","favorited":false,"in_reply_to_screen_name":"FreedomProject","source":" TweetDeck ","created_at":"Fri Nov 20 06:37: ","in_reply_to_user_id": ,"in_reply_to_status_id": ,"geo":null,"user":{"favourites_count":0,"verified":false,"notifications":null,"profile_text_color":"34da43","time_zone":"Tijuana","profile_link_color":"e98907","description":"I'm a Robot created in Mexican soil, therefore my name is Mexican Robot","profile_background_image_url":" c10de6ac70ef2f637f8f62f26.jpg","created_at":"Mon Dec 22 07:34: ","profile_sidebar_fill_color":"b03636","profile_background_tile":false,"location":"Surfin' tubular Innernet waves","following":null,"profile_sidebar_border_color":"050e61","protected":false,"profile_image_url":" om/profile_images/ /jessicaavvy_normal.png","statuses_count":946,"followers_count":59,"name":"Mexican Robot","friends_count":173,"screen_name":"MexicanRobot","id": ,"geo_enabled":false,"utc_offset": ,"profile_background_color":"000000","url":"
Data Classes:
–Who the message is in response to, if anyone
–Client user agent
–Location-tagged, geo-aware data, if any
–Time of creation and time zone of poster
–Information about avatar, background, profile
–User metrics: Statuses posted, Followers, Friends
–User description: short user-defined string
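As an illustration of how the data classes listed above map onto fields of the example status, here is a small Python sketch. The function name `extract_data_classes` and the output keys are hypothetical; the input field names are the ones visible in the example data, and the sample values are limited to those that survive intact in it.

```python
def extract_data_classes(status):
    """Pull the data classes of interest out of one parsed status dict."""
    user = status.get("user") or {}
    return {
        "reply_to": status.get("in_reply_to_screen_name"),
        "client": status.get("source"),
        "geo": status.get("geo"),
        "created_at": status.get("created_at"),
        "time_zone": user.get("time_zone"),
        "location": user.get("location"),
        "statuses_count": user.get("statuses_count"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        "description": user.get("description"),
    }


# Sample built only from values that are intact in the example data.
sample_status = {
    "in_reply_to_screen_name": "FreedomProject",
    "geo": None,
    "user": {
        "time_zone": "Tijuana",
        "location": "Surfin' tubular Innernet waves",
        "statuses_count": 946,
        "followers_count": 59,
        "friends_count": 173,
        "screen_name": "MexicanRobot",
    },
}
sample_record = extract_data_classes(sample_status)
```

Using `.get` throughout means a status with missing fields yields `None` for those entries instead of raising, which suits data where many fields (such as `geo`) are often empty.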
Infrastructure
Streaming API expected volume: 3-4 million entries/day
Storage Consideration:
–Average total JSON example output size: ~1400 characters per message
–Messages are UTF-8; we'll assume most characters are 1 byte
–1400 characters/message * 1 byte/character * 3.5 million messages/day = 4.9 billion bytes/day ≈ 4.56 GiB/day
–1 year ≈ 1.6 TiB
Currently working on getting at least one server running Ubuntu Server in a VM to begin downloading data
–May require additional public IP addresses depending on rate limits, additional servers depending on load
Download first, parse later
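The storage estimate can be checked with a few lines of Python. The inputs are the slide's own figures (about 1400 characters per message, 1 byte per character, 3.5 million messages per day as the midpoint of 3-4 million); the variable names are illustrative, and the result is expressed in binary units (GiB, TiB), which is the convention under which the slide's 4.56 and 1.6 figures come out.

```python
CHARS_PER_STATUS = 1400        # average JSON size per message, from the slide
BYTES_PER_CHAR = 1             # assumption from the slide: mostly 1-byte UTF-8
STATUSES_PER_DAY = 3_500_000   # midpoint of the 3-4 million/day estimate

bytes_per_day = CHARS_PER_STATUS * BYTES_PER_CHAR * STATUSES_PER_DAY
gib_per_day = bytes_per_day / 2**30
tib_per_year = bytes_per_day * 365 / 2**40

print(f"{bytes_per_day:,} bytes/day")
print(f"{gib_per_day:.2f} GiB/day")
print(f"{tib_per_year:.1f} TiB/year")
```

In decimal units the same volume is 4.9 GB/day, or roughly 1.8 TB/year, so disk sizing should be clear about which convention is used.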
Tentative Timeline
Work Plan
–Continue investigating the use of RSS to download status updates from further in the past than the 15,000 we are allowed to go back using the streaming API
–1-2 weeks: test our environment and make sure everything is working well
  –Make sure our methodology for downloading from the stream is resistant to Twitter downtime as features are rolled in and out of the alpha test
  –Await possible response from Twitter regarding access to additional restricted resources (even higher rate firehose)
–2 weeks: explore how to parse the content into a DB, and whether this can realistically be done in real time in another process
–Additional time for data mining, research topics, etc.
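One way to make the download resistant to downtime, while keeping to "download first, parse later", is to append raw lines to a file and reconnect with capped exponential backoff. The sketch below is an assumption about how this could be done, not the presenters' method: `backoff_delays`, `download_with_retry`, the `connect` callable, and the delay values are all hypothetical.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=240.0):
    """Yield an endless sequence of exponentially growing delays, capped."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def download_with_retry(connect, sink, max_attempts=None, sleep=time.sleep):
    """Append raw lines from connect() to sink, reconnecting on failure.

    connect: hypothetical callable returning an iterable of raw text lines.
    sink: writable file-like object; lines are stored unparsed.
    max_attempts: stop after this many connections (None means run forever).
    """
    attempts = 0
    delays = backoff_delays()
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            for raw_line in connect():
                sink.write(raw_line)
                delays = backoff_delays()  # data arrived: reset the backoff
        except OSError:
            pass  # connection refused or dropped; fall through and retry
        # A clean end of stream is treated like a drop: wait, then reconnect.
        sleep(next(delays))
    return attempts
```

Passing `sleep` and `connect` as parameters keeps the retry logic testable without a network connection or real waiting.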