
1 Background
“Dynamic” web
  – Blogs (the most looked-up word on Merriam-Webster's website this year)
  – RSS feeds
Mass media

2 RSS Feed
Content-syndication technology
  – Provides a site's content for use by other services
Massively popular
The content (feed) consists of:
  – The directed content itself
  – Metadata: information about the content

3 RSS Feeds (cont.)
Headlines
Links to other stories
Stripped of layout
Mainly to notify users of site updates
XML
For example:
  – Blogs
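As a rough illustration of the points above (a feed is XML, stripped of layout, carrying headlines, links, and metadata), here is a minimal sketch that reads an RSS 2.0 document with Python's standard library. The sample feed, its URLs, and the function name `parse_feed` are made up for this example and do not come from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, invented for illustration only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>Hypothetical feed used for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.com/posts/1</link>
      <description>Short summary of the first post.</description>
      <pubDate>Mon, 06 Dec 2004 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/posts/2</link>
      <description>Short summary of the second post.</description>
      <pubDate>Tue, 07 Dec 2004 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(rss_text):
    """Split a feed into channel-level metadata and a list of items."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    # Metadata: information about the feed as a whole.
    meta = {
        "title": channel.findtext("title"),
        "link": channel.findtext("link"),
    }
    # Items: headline, link to the full story, short description, date.
    items = [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
            "pubDate": item.findtext("pubDate"),
        }
        for item in channel.findall("item")
    ]
    return meta, items


if __name__ == "__main__":
    meta, items = parse_feed(SAMPLE_RSS)
    print(meta["title"])
    for entry in items:
        print(entry["title"], entry["link"])
```

Note that each item carries only a headline, a summary, and a link; the full story still has to be fetched from the linked page.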

5 How to Get Feeds
Feed aggregators
  – Web-based: BlogLines
  – Desktop: SharpReader, Straw
Search engines
  – Snewp

6 How to Get Feeds (cont.)
Registries
  – Sites that list the details of thousands of feeds
  – Tested and categorized for ease of use
  – Offer tools and web services (XML-RPC): http://www.syndic8.com/web_services/
  – Syndic8.com: 170,000 RSS feeds
  – Moreover.com

7 My Project
Lots of information on the dynamic web
Difficult to track
Aggregators and search engines don’t help much because they:
  – Are not automated
  – Don’t consider recency

8 My Project (cont.)
Weight the word frequencies
Create a pattern
Understand what the world is paying attention to
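The slides do not say how the word frequencies are weighted. One plausible reading, given the earlier complaint that existing tools ignore recency, is to let older documents count for less. The sketch below assumes an exponential decay with a configurable half-life; the decay scheme, the function name `weighted_frequencies`, and the `half_life_hours` parameter are all assumptions made for illustration, not the project's actual method.

```python
import re
from collections import Counter


def weighted_frequencies(docs, now, half_life_hours=24.0):
    """Sum word counts, discounting each document by its age.

    docs: iterable of (text, timestamp) pairs, timestamps in hours.
    now:  current time in the same units.
    A document that is one half-life old contributes half as much.
    """
    totals = Counter()
    for text, timestamp in docs:
        age = max(0.0, now - timestamp)
        weight = 0.5 ** (age / half_life_hours)
        # Very rough tokenizer: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            totals[word] += weight
    return totals


if __name__ == "__main__":
    docs = [
        ("tsunami relief", 100.0),  # published just now
        ("tsunami news", 76.0),     # published 24 hours ago
    ]
    for word, score in weighted_frequencies(docs, now=100.0).most_common():
        print(word, score)
```

Words that appear in many recent documents rise to the top, which is one simple way to approximate "what the world is paying attention to".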

9 Gathering RSS Feeds
Internet
Information Retrieval
User Interface
Database Backend

10 Gathering RSS Feeds
1. Download the updated RSS feed from a registry

11 Gathering RSS Feeds (cont.)
Process the RSS file to get the RSS feed for each source
  – http://p.moreover.com/cgi-local/page?index_bookreviews+rss
Download the data from the links provided in each RSS feed
  – http://c.moreover.com/click/here.pl?r240105547
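The gathering steps above can be sketched as a small two-stage loop: read a feed, then download each page it links to. This is only an outline under stated assumptions: it treats the feed as plain RSS 2.0, the names `fetch`, `item_links`, and `gather` are hypothetical, and the `fetcher` parameter exists only so the network call can be swapped out.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url, timeout=10):
    """Download a URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def item_links(rss_bytes):
    """Return the link of every item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_bytes)
    return [
        link.text.strip()
        for link in root.findall("./channel/item/link")
        if link.text
    ]


def gather(feed_url, fetcher=fetch):
    """Download a feed, then download every page it links to.

    Returns a dict mapping each item URL to the downloaded bytes.
    Pages that fail to download are skipped.
    """
    pages = {}
    for url in item_links(fetcher(feed_url)):
        try:
            pages[url] = fetcher(url)
        except OSError:  # urllib's URLError and timeouts derive from OSError
            continue
    return pages
```

A call such as `gather("http://p.moreover.com/cgi-local/page?index_bookreviews+rss")` would follow the same shape as the slide, although whether that URL still serves a feed is not something this sketch can promise.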

12 Information/Data Retrieval
HTML, XHTML
Information Retrieval

13 Definition is somewhat loose
  – DR: exact matching
  – IR: partial matching, best match
Separating content from metadata
Retrieving data from content, specifically:
  – Names
  – Words
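To make the DR/IR distinction concrete, the sketch below contrasts an exact-match lookup with a ranked partial match. The scoring used here (Jaccard overlap of word sets) is just one simple choice picked for illustration; the slides do not name a ranking method, and both function names are hypothetical.

```python
def exact_match(records, query):
    """Data retrieval: return only records identical to the query."""
    return [record for record in records if record == query]


def best_match(documents, query):
    """Information retrieval: rank documents by partial overlap.

    Score = |query words ∩ document words| / |query words ∪ document words|
    (Jaccard similarity). Documents sharing no word are dropped.
    """
    query_terms = set(query.lower().split())
    scored = []
    for doc in documents:
        doc_terms = set(doc.lower().split())
        shared = query_terms & doc_terms
        if shared:
            score = len(shared) / len(query_terms | doc_terms)
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


if __name__ == "__main__":
    docs = ["rss feed registry", "blog feed", "weather report"]
    print(exact_match(docs, "rss feed"))  # nothing matches exactly
    print(best_match(docs, "rss feed"))   # partial matches, best first
```

The exact-match call returns nothing because no record equals the query, while the best-match call still surfaces the two documents that share words with it.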

14 Separating Content from metadata
How to differentiate between the main content and the rest of the information
  – HTML/XHTML tags
  – Irrelevant information such as advertisements
  – Main content of the page
Irregularity of content
  – Some sites use comments to indicate the beginning and end of content
  – Needs experimentation with different sites to find a pattern
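For sites that do mark their main content with comments, a sketch along the following lines could pull out just the text between the markers while dropping tags and scripts. The marker strings "begin content" and "end content" are invented for this example; as the slide notes, real sites differ and the markers would have to be found by experiment.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect text that appears between two HTML comment markers."""

    def __init__(self, start_marker, end_marker):
        super().__init__()
        self.start_marker = start_marker
        self.end_marker = end_marker
        self.inside = False   # True while between the two markers
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_comment(self, data):
        marker = data.strip()
        if marker == self.start_marker:
            self.inside = True
        elif marker == self.end_marker:
            self.inside = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.inside and not self.skip_depth and text:
            self.chunks.append(text)


def extract_content(html, start_marker="begin content", end_marker="end content"):
    """Return the visible text found between the two comment markers."""
    parser = ContentExtractor(start_marker, end_marker)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Pages without such markers would need a different heuristic, which is exactly the irregularity the slide describes.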

15 Retrieving data from content
What to look for?
Words don’t tell much
  – Ambiguity
Nouns are difficult to find
  – Syntactic and semantic patterns
Names (people, places)
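As a very crude stand-in for name detection, one syntactic pattern is a run of two or more capitalized words. The sketch below counts such runs with a regular expression. It is a heuristic chosen for illustration, not the presentation's method: it misses single-word names and can be fooled by capitalized sentence openers and titles.

```python
import re
from collections import Counter

# Two or more consecutive words that each start with a capital letter.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_names(text):
    """Count multi-word capitalized phrases as candidate names."""
    return Counter(NAME_PATTERN.findall(text))


if __name__ == "__main__":
    text = (
        "Yesterday the mayor met Kofi Annan in New York. "
        "Kofi Annan spoke briefly."
    )
    for name, count in candidate_names(text).most_common():
        print(name, count)
```

Telling people from places, or resolving ambiguous words, would need the semantic patterns the slide mentions rather than surface capitalization alone.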

16 Future work
Refining the RSS gathering
Doing research to improve the IR/DR processing
  – Semantics
  – Syntax
User interface

