Background
“Dynamic” web
  – Blogs: the most looked-up word on Merriam-Webster's website this year
  – RSS feeds
Mass media

RSS Feed
Content-syndication technology
  – Provides a site's content for use by other services
Massively popular
The content (feed) consists of:
  – Directed content itself
  – Metadata: information about the content

RSS Feeds (cont.)
Headlines
Links to other stories
Stripped of layout
Mainly to notify users of site updates
XML
For example:
  – Blogs
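As a rough illustration of the points above (headlines, links, no layout, XML), here is a minimal sketch of reading an RSS 2.0 feed with Python's standard library. The sample feed, its titles, and its URLs are made up for the example, and `parse_feed` is a hypothetical helper name, not part of the project described in these slides.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document; titles and URLs are placeholders.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>Posts from an example blog</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Short summary of the first post.</description>
      <pubDate>Mon, 06 Dec 2004 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Short summary of the second post.</description>
      <pubDate>Tue, 07 Dec 2004 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return (channel_title, items) for an RSS 2.0 document.

    Each item is a dict holding the headline, the link to the full
    story, a short summary, and the publication date string.
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = []
    for item in channel.findall("item"):
        items.append({
            "headline": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return channel.findtext("title", default=""), items


if __name__ == "__main__":
    title, items = parse_feed(SAMPLE_RSS)
    print(title)
    for entry in items:
        print(entry["published"], entry["headline"], entry["link"])
```

The feed carries only headlines, links, and metadata such as dates; the full story still has to be fetched from each link.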

How to Get Feeds
Feed aggregators
  – Web based: BlogLines
  – Desktop: SharpReader, Straw
Search engines
  – Snewp

How to Get Feeds (cont.)
Registries
  – Sites that list the details of thousands of feeds
  – Tested and categorized for ease of use
  – Offer tools and web services (XML-RPC): http://www.syndic8.com/web_services/
  – Syndic8.com: 170,000 RSS feeds
  – Moreover.com

My Project
Lots of information on the dynamic web
Difficult to track
Aggregators and search engines don't help very much because they:
  – Are not automated
  – Don't consider recency

My Project (cont.)
Weight the word frequencies
Create a pattern
Understand what the world is paying attention to
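The slides do not say how the word frequencies are weighted. One possible reading, given the earlier point about recency, is to let each word occurrence count less as its article ages. The sketch below assumes an exponential decay with a configurable half-life; the function name, the half-life parameter, and the simple tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta


def weighted_word_frequencies(documents, now, half_life_hours=24.0):
    """Sum recency-weighted word counts over (published, text) pairs.

    Assumption: a word occurrence in an article that is
    `half_life_hours` old counts half as much as one published now.
    """
    weights = Counter()
    for published, text in documents:
        age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
        weight = 0.5 ** (age_hours / half_life_hours)
        # Very simple tokenizer: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            weights[word] += weight
    return weights


if __name__ == "__main__":
    now = datetime(2004, 12, 7, 12, 0)
    docs = [
        (now, "election results"),
        (now - timedelta(hours=24), "election recount"),
    ]
    for word, score in weighted_word_frequencies(docs, now).most_common():
        print(word, score)
```

Words that keep appearing in fresh articles rise to the top, which is one way to approximate "what the world is paying attention to".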

[Diagram labels: Gathering RSS Feeds, Internet, Information Retrieval, User Interface, Database Backend]

Gathering RSS Feeds
1. Download the updated RSS feed from a registry

Gathering RSS Feeds (cont.)
2. Process the RSS file to get the RSS feed for each source
   http://p.moreover.com/cgi-local/page?index_bookreviews+rss
3. Download the data from the links provided in each RSS feed
   http://c.moreover.com/click/here.pl?r240105547
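The gathering steps can be sketched roughly as follows. This is an illustration only, not the project's code: `fetch_url`, `item_links`, and `gather` are hypothetical names, the feed is assumed to be RSS 2.0 with `<item><link>` elements, and the fetch function is passed in as a parameter so the flow can be tried without network access.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_url(url, timeout=10):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def item_links(rss_text):
    """Return the story links listed in the <item> elements of a feed."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def gather(feed_url, fetch=fetch_url):
    """Download a feed, then download every page it links to.

    Returns a dict mapping each link to the downloaded page text.
    Links that fail with a network error are skipped.
    """
    pages = {}
    for link in item_links(fetch(feed_url)):
        try:
            pages[link] = fetch(link)
        except OSError:  # urllib's URLError is a subclass of OSError
            continue
    return pages
```

A call such as `gather("http://p.moreover.com/cgi-local/page?index_bookreviews+rss")` would follow the same flow, assuming that URL still serves a feed.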

Information/Data Retrieval
[Diagram labels: HTML, XHTML, Information Retrieval]

Definition is somewhat loose
  – DR (data retrieval): exact matching
  – IR (information retrieval): partial matching, best match
Separating content from metadata
Retrieving data from content
  – Specifically names
  – Words
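To make the DR/IR distinction concrete, here is a small sketch contrasting the two styles of matching. The scoring used for "best match" (number of shared words) is an assumption chosen for brevity, and both function names are hypothetical.

```python
def exact_match(query, documents):
    """Data-retrieval style: return only documents equal to the query."""
    return [doc for doc in documents if doc == query]


def best_match(query, documents):
    """Information-retrieval style: rank documents by shared words.

    Documents with no word in common with the query are dropped;
    ties keep their original order.
    """
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: -pair[0])  # stable sort, highest overlap first
    return [doc for _, doc in scored]


if __name__ == "__main__":
    docs = ["RSS feeds for blogs", "Mass media news", "Blogs and RSS"]
    print(exact_match("rss blogs", docs))
    print(best_match("rss blogs", docs))
```

The exact matcher either finds the record or returns nothing, while the ranked matcher returns partial matches in order of relevance.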

Separating Content from Metadata
How to differentiate between the main content and the rest of the information
  – HTML/XHTML tags
  – Irrelevant information such as advertisements
  – Main content of the page
Irregularity of content
  – Some sites use comments to indicate the beginning and end of content
  – Needs experimentation with different sites to find patterns
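A minimal sketch of the comment-marker idea follows: cut the page down to the region between two HTML comments when they are present, then strip the tags. The marker strings are hypothetical; as the slide notes, real sites differ and the markers would have to be found by experiment.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def main_content(html,
                 start_marker="<!-- begin content -->",
                 end_marker="<!-- end content -->"):
    """Return the text between the two markers, or all page text
    if the markers are not both present. Marker strings are assumed."""
    start = html.find(start_marker)
    if start != -1:
        end = html.find(end_marker, start + len(start_marker))
        if end != -1:
            html = html[start + len(start_marker):end]
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)
```

When a site has no such markers, the function falls back to the full page text, which still includes advertisements and navigation; that case needs a site-specific pattern.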

Retrieving Data from Content
What to look for?
Words don't tell much
  – Ambiguity
Nouns are difficult to find
  – Syntactic and semantic patterns
Names (people, places)
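One very simple syntactic pattern for names is a run of two or more capitalized words. The sketch below is only a heuristic chosen for illustration, not the method used in the project: it misses single-word names and can wrongly absorb a capitalized word at the start of a sentence.

```python
import re

# Two or more capitalized words in a row, separated by single spaces.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def find_candidate_names(text):
    """Return candidate names (people, places) in order of appearance."""
    return NAME_PATTERN.findall(text)


if __name__ == "__main__":
    sentence = "The senator met George Bush in New York last week."
    print(find_candidate_names(sentence))
```

Telling people from places, or resolving ambiguous words, needs the semantic patterns mentioned on the slide and is beyond what a regular expression can do.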

Future Work
Refining the RSS gathering
Doing research to improve the IR/DR processing
  – Semantics
  – Syntax
User interface
