Presentation on theme: "Update By: Brian Klug, Li Fan Presentation Overview: API we plan to use (Syntax and commands) Obtainable Data Types (Location, Text, Time, User, Reply)"— Presentation transcript:

1 Update
By: Brian Klug, Li Fan
Presentation Overview:
–API we plan to use (syntax and commands)
–Obtainable data types (Location, Text, Time, User, Reply)
–Infrastructure (hardware, storage requirements, design)
–Tentative work plan (timeline and schedule)

2 API: Streaming API
Enables near-real-time access to a subset of public Twitter statuses.
–Currently in alpha test
–Access to further restricted resources is extremely limited and granted only after acceptance of an additional TOS document. We have applied for credentials which grant us access to these increased resources (namely a larger sampling, more statuses)
–http://apiwiki.twitter.com/Streaming-API-Documentation
Features of the streaming API:
–Continual connection that streams statuses over HTTP. Opened indefinitely, and only requires basic authentication at the most basic access level
–Output data is in XML or JSON format, both of which are easy to parse
–Can focus on certain tracking predicates that, when specific enough, return all occurrences in the full Firehose stream, e.g. "track=basketball,football,baseball,footy,soccer". Execute: curl -d @tracking http://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password
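The curl invocation above can be mirrored in a short script. This is a minimal sketch, assuming the slide's filter.json endpoint and its newline-delimited JSON output; build_track_body and parse_stream_lines are hypothetical helper names, not part of the Twitter API.

```python
import json

def build_track_body(keywords):
    """Build the POST body that curl reads from the @tracking file,
    e.g. "track=basketball,football,baseball,footy,soccer"."""
    return "track=" + ",".join(keywords)

def parse_stream_lines(lines):
    """The stream delivers one JSON status object per non-blank line;
    decode each into a dict and skip keep-alive blanks."""
    return [json.loads(line) for line in lines if line.strip()]
```

With the requests library, the same connection could be opened as `requests.post("http://stream.twitter.com/1/statuses/filter.json", data=build_track_body(keywords), auth=(user, password), stream=True)` and fed line by line into parse_stream_lines via `iter_lines()`.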

3 Streaming API data
Example data:
{"truncated": false,
 "text": "@FreedomProject Can you bring the script tomorrow? We can write in the APE if you're not busy.",
 "favorited": false,
 "in_reply_to_screen_name": "FreedomProject",
 "source": "TweetDeck",
 "created_at": "Fri Nov 20 06:37:58 +0000 2009",
 "in_reply_to_user_id": 20688076,
 "in_reply_to_status_id": 5882468251,
 "geo": null,
 "user": {"favourites_count": 0, "verified": false, "notifications": null,
  "profile_text_color": "34da43", "time_zone": "Tijuana", "profile_link_color": "e98907",
  "description": "I'm a Robot created in Mexican soil, therefore my name is Mexican Robot",
  "profile_background_image_url": "http://a3.twimg.com/profile_background_images/4329659/d2e513deb84e6fdc10de6ac70ef2f637f8f62f26.jpg",
  "created_at": "Mon Dec 22 07:34:02 +0000 2008",
  "profile_sidebar_fill_color": "b03636", "profile_background_tile": false,
  "location": "Surfin' tubular Innernet waves", "following": null,
  "profile_sidebar_border_color": "050e61", "protected": false,
  "profile_image_url": "http://a3.twimg.com/profile_images/515614231/jessicaavvy_normal.png",
  "statuses_count": 946, "followers_count": 59, "name": "Mexican Robot",
  "friends_count": 173, "screen_name": "MexicanRobot", "id": 18303131,
  "geo_enabled": false, "utc_offset": -28800, "profile_background_color": "000000",
  "url": "http://sharkwithwheels.webs.com"},
 "id": 5882552501}
Data classes:
–Who the message is in response to, if anyone
–Client user agent
–Location-tagged geo-aware data, if any
–Time of creation and time zone of poster
–Information about avatar, background, profile
–User metrics: statuses posted, followers, friends
–User description: short user-defined string
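The data classes listed above can be pulled out of one decoded status with plain dictionary access. A sketch against a trimmed copy of the example record; summarize is a hypothetical helper, not part of the API.

```python
import json

# Trimmed version of the slide's example status object.
sample = json.loads("""{
  "text": "@FreedomProject Can you bring the script tomorrow?",
  "in_reply_to_screen_name": "FreedomProject",
  "source": "TweetDeck",
  "created_at": "Fri Nov 20 06:37:58 +0000 2009",
  "geo": null,
  "user": {"screen_name": "MexicanRobot", "time_zone": "Tijuana",
           "statuses_count": 946, "followers_count": 59, "friends_count": 173,
           "description": "I'm a Robot created in Mexican soil"}
}""")

def summarize(status):
    """Collect the data classes of interest from one status dict."""
    u = status["user"]
    return {
        "reply_to": status.get("in_reply_to_screen_name"),  # who it answers, if anyone
        "client": status.get("source"),                     # client user agent
        "geo": status.get("geo"),                           # geo-aware data, if any
        "posted": status.get("created_at"),                 # time of creation
        "time_zone": u.get("time_zone"),                    # poster's time zone
        "metrics": (u["statuses_count"], u["followers_count"], u["friends_count"]),
        "bio": u.get("description"),                        # short user-defined string
    }
```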

4 Infrastructure
Streaming API expected volume: 3-4 million entries/day
Storage considerations:
–Average total JSON example output size: ~1400 characters
–Messages are UTF-8; we'll assume most characters are 1 byte
–1400 bytes/msg * 3.5 million msg/day ≈ 4.9 GB/day (4.56 GiB)
–1 year ≈ 1.6 terabytes
Currently working on getting at least one server running Ubuntu Server in a VM to begin downloading data
–May require additional public IP addresses depending on rate limits, and additional servers depending on load
Download first, parse later
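The storage estimate above is a straightforward back-of-envelope calculation; spelled out, it shows the 4.56 figure is the daily volume expressed in binary gigabytes (GiB):

```python
# Back-of-envelope check of the storage numbers above.
AVG_JSON_BYTES = 1400        # ~1400 UTF-8 chars/msg, assumed 1 byte each
MSGS_PER_DAY = 3.5e6         # midpoint of the 3-4 million entries/day estimate

bytes_per_day = AVG_JSON_BYTES * MSGS_PER_DAY    # 4.9e9 bytes
gib_per_day = bytes_per_day / 2**30              # ~4.56 GiB/day
tib_per_year = bytes_per_day * 365 / 2**40       # ~1.6 TiB/year
```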

5 Tentative Timeline
Work Plan:
–Continue investigating using RSS to download status updates from further in the past than the 15,000 statuses we are allowed to go back using the streaming API
–1-2 weeks: test our environment and make sure everything is working well
 Make sure our methodology for downloading from the stream is resistant to Twitter downtime as features are rolled in and out of the alpha test
 Await a possible response from Twitter regarding access to additional restricted resources (an even higher-rate firehose)
–2 weeks: explore how to parse the content into a DB, and whether this can realistically be done in real time in another process
–Additional time for data mining, research topics, etc.
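The parse-into-a-DB step above could run as a separate pass over the raw downloaded lines, keeping with the download-first, parse-later approach. A minimal sketch using SQLite; the table layout and load_into_db helper are assumptions for illustration, not a decided design.

```python
import json
import sqlite3

def load_into_db(conn, raw_lines):
    """Parse raw JSON status lines and load selected fields into SQLite.
    Runs separately from the downloader, so parsing never blocks the stream."""
    conn.execute("""CREATE TABLE IF NOT EXISTS statuses (
        id INTEGER PRIMARY KEY,
        screen_name TEXT,
        created_at TEXT,
        text TEXT)""")
    for line in raw_lines:
        s = json.loads(line)
        # INSERT OR IGNORE makes re-running a pass over the same file safe.
        conn.execute("INSERT OR IGNORE INTO statuses VALUES (?, ?, ?, ?)",
                     (s["id"], s["user"]["screen_name"], s["created_at"], s["text"]))
    conn.commit()
```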

