Background
“Dynamic” web
–Blogs (the most looked-up word on Merriam-Webster's website this year)
–RSS Feeds
Mass Media
RSS Feed
Content-syndication technology
–Provides a site's content for use by other services
Massively popular
The content (feed) consists of:
–Directed content itself
–Metadata: information about the content
RSS Feeds (Cont.)
Headlines
Links to other stories
Stripped of layout
Mainly to notify users of site updates
XML
For example:
–Blogs
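To make the "headlines and links in XML" idea concrete, here is a minimal sketch of reading an RSS 2.0 document with Python's standard library. The sample feed, its titles, and its URLs are invented for illustration, and `parse_items` is a hypothetical helper name, not part of the project described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; titles and URLs are made up for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <pubDate>Mon, 06 Dec 2004 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <pubDate>Tue, 07 Dec 2004 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of (title, link, pubDate) tuples from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("pubDate", default=""),
        ))
    return items


if __name__ == "__main__":
    for title, link, published in parse_items(SAMPLE_RSS):
        print(published, title, link)
```

Each item carries only a headline, a link, and some metadata such as the publication date; the layout of the original page is absent, which is what makes feeds easy to process automatically.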
How to Get Feeds
Feed Aggregators
–Web based: BlogLines
–Desktop: SharpReader, Straw
Search Engines
–Snewp
How to Get Feeds (cont.)
Registries
–Sites that list the details of thousands of feeds
–Tested and categorized for ease of use
–Offer tools and web services (XML-RPC)
–Syndic8.com: 170,000 RSS feeds
–Moreover.com
My Project
Lots of information on the Dynamic Web
Difficult to track
Aggregators and search engines don’t help very much because they:
–Are not automated
–Don’t consider recency
My Project (cont.)
Weight the word frequencies
Create a pattern
Understand what the world is paying attention to
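One possible way to weight word frequencies by recency is sketched below. The slides do not specify a weighting scheme, so the exponential decay, the `half_life_hours` parameter, and the function name are all assumptions made for illustration.

```python
import re
from collections import Counter


def weighted_word_frequencies(documents, half_life_hours=24.0):
    """Count words across documents, discounting older documents.

    documents: iterable of (text, age_hours) pairs.
    A document's words count for 1.0 when it is brand new and for half as
    much every `half_life_hours` after that (assumed decay scheme).
    """
    totals = Counter()
    for text, age_hours in documents:
        weight = 0.5 ** (age_hours / half_life_hours)
        for word in re.findall(r"[a-z']+", text.lower()):
            totals[word] += weight
    return totals


if __name__ == "__main__":
    docs = [("election news", 0.0), ("election results", 24.0)]
    for word, score in weighted_word_frequencies(docs).most_common():
        print(word, score)
```

With this scheme a word mentioned in many recent stories outranks one mentioned equally often in old stories, which is the "recency" property that aggregators and search engines were said to lack.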
Gathering RSS Feeds Internet Information Retrieval User Interface Database Backend
Gathering RSS Feeds
1. Download the updated RSS feed from a registry
Gathering RSS Feeds (cont.)
Process the RSS file to get the RSS feed for each source
Download the data from the links provided in each RSS feed
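The gathering steps can be sketched roughly as follows. This assumes the list of feed URLs has already been extracted from the registry download; `fetch`, `story_links`, and `gather` are hypothetical names, and the error handling is deliberately minimal.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def story_links(rss_bytes):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_bytes)
    return [item.findtext("link") for item in root.iter("item")
            if item.findtext("link")]


def gather(feed_urls, fetcher=fetch):
    """Map each feed URL to a list of (story_url, page_bytes) pairs."""
    pages = {}
    for feed_url in feed_urls:
        try:
            links = story_links(fetcher(feed_url))
        except (OSError, ET.ParseError):
            continue  # skip feeds that are unreachable or malformed
        pages[feed_url] = []
        for link in links:
            try:
                pages[feed_url].append((link, fetcher(link)))
            except OSError:
                pass  # skip stories that fail to download
    return pages
```

Passing the downloader in as `fetcher` keeps the network access separate from the parsing, so the pipeline can be exercised against canned data before it is pointed at real feeds.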
HTML XHTML Information Retrieval Information/Data Retrieval
Definition is somewhat loose
–DR: exact matching
–IR: partial matching, best match
Separating content from metadata
Retrieving data from content
–Specifically names
–Words
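The DR/IR distinction can be illustrated with a small sketch: data retrieval returns only exact matches, while information retrieval ranks partial matches and returns the best ones first. The Jaccard word-overlap score used here is an assumed stand-in for "best match", not a method named in the slides, and both function names are hypothetical.

```python
def exact_match(records, query):
    """Data retrieval: return only records identical to the query."""
    return [record for record in records if record == query]


def best_match(records, query):
    """Information retrieval: rank records by word overlap with the query.

    The score is the Jaccard similarity between the two word sets;
    records that share no word with the query are dropped.
    """
    query_words = set(query.lower().split())
    scored = []
    for record in records:
        record_words = set(record.lower().split())
        union = query_words | record_words
        score = len(query_words & record_words) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored]


if __name__ == "__main__":
    records = ["rss feed parser", "rss feed", "blog search engine"]
    print(exact_match(records, "rss feed"))
    print(best_match(records, "rss feed"))
```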
Separating Content from metadata
How to differentiate between the main content and the rest of the information
–HTML/XHTML tags
–Irrelevant information such as advertisements
–Main content of the page
Irregularity of content
–Some sites use comments to indicate the beginning and end of content
–Needs experimentation with different sites to find a pattern
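For sites that mark their content with comments, extraction might look like the sketch below. The marker strings `<!-- begin content -->` and `<!-- end content -->` are hypothetical; real sites use their own conventions, which is why experimentation per site is needed. When no markers are found, the sketch falls back to the text of the whole page.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def main_text(html, begin="<!-- begin content -->", end="<!-- end content -->"):
    """Return the plain text between the two marker comments.

    The default markers are assumptions for illustration only.
    """
    pattern = re.escape(begin) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, html, re.DOTALL)
    region = match.group(1) if match else html
    parser = _TextExtractor()
    parser.feed(region)
    parser.close()
    return " ".join(parser.parts)
```

Cutting the page down to the marked region first removes advertisements and navigation before the tags are stripped, so they never reach the word counts.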
Retrieving data from content
What to look for?
Words don’t tell much
–Ambiguity
Nouns are difficult to find
–Syntactic and semantic patterns
Names (people, places)
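As a rough illustration of a syntactic pattern for names, the sketch below treats any run of two or more consecutive capitalized words as a candidate name. This heuristic is an assumption, not the project's method: it misses single-word names and wrongly absorbs a capitalized sentence-initial word that happens to precede a name.

```python
import re
from collections import Counter

# Two or more consecutive capitalized words, e.g. "Tony Blair" or "New York".
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_names(text):
    """Count candidate names found by the capitalization heuristic."""
    matches = NAME_PATTERN.findall(text)
    # Normalize internal whitespace so line breaks do not split counts.
    return Counter(re.sub(r"\s+", " ", match) for match in matches)


if __name__ == "__main__":
    sample = ("The president met Tony Blair in New York yesterday. "
              "Reporters said Tony Blair spoke briefly.")
    for name, count in candidate_names(sample).most_common():
        print(count, name)
```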
Future work
Refining the RSS gathering
Doing research to improve the IR/DR processing
–Semantics
–Syntax
User interface