1
Background
“Dynamic” web
–Blogs: the most looked-up word on Merriam-Webster's internet site this year
–RSS Feeds
Mass Media
2
RSS Feed
Content-syndication technology
–Provides a site's content for use by other services
Massively popular
The content (feed) consists of:
–Directed content itself
–Metadata: information about the content
3
RSS Feeds (cont.)
Headlines
Links to other stories
Stripped of layout
Mainly to notify users of site updates
XML
For example:
–Blogs
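To make the split between content and metadata concrete, here is a minimal sketch that parses a small RSS 2.0 document with Python's standard library. The sample feed, its titles, and its URLs are invented for illustration, and the `parse_feed` helper is a hypothetical name, not part of the project described in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the titles and URLs are made up for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>Posts from an example blog</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <pubDate>Mon, 06 Dec 2004 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <pubDate>Tue, 07 Dec 2004 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return (channel metadata, list of item dicts) from RSS 2.0 text."""
    channel = ET.fromstring(xml_text).find("channel")
    # Feed-level metadata: information about the feed as a whole.
    metadata = {
        "title": channel.findtext("title"),
        "link": channel.findtext("link"),
        "description": channel.findtext("description"),
    }
    # Each item is a headline plus a link to the full story, with no layout.
    items = [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in channel.findall("item")
    ]
    return metadata, items


metadata, items = parse_feed(SAMPLE_RSS)
for entry in items:
    print(entry["published"], entry["title"], entry["link"])
```

The sketch assumes RSS 2.0, where `item` elements carry no XML namespace; RSS 1.0 (RDF) feeds would need namespace-aware lookups.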
5
How to Get Feeds
Feed Aggregators
–Web based: BlogLines
–Desktop: SharpReader, Straw
Search Engines
–Snewp
6
How to Get Feeds (cont.)
Registries
–Sites that list the details of thousands of feeds
–Tested and categorized for ease of use
–Offer tools and web services (XML-RPC): http://www.syndic8.com/web_services/
–Syndic8.com: 170,000 RSS Feeds
–Moreover.com
7
My Project
Lots of information on the Dynamic Web
Difficult to track
Aggregators and search engines don’t help very much because:
–Not automated
–Don’t consider recency
8
My Project (cont.)
Weight the word frequencies
Create a pattern
Understand what the world is paying attention to
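One possible way to weight word frequencies by recency is sketched below. The slides do not say which weighting the project uses; the exponential decay, the `half_life_hours` parameter, and the `weighted_word_counts` name are all assumptions made for illustration.

```python
import re
from collections import Counter


def weighted_word_counts(documents, half_life_hours=24.0):
    """Sum word occurrences, discounting older documents.

    documents: iterable of (text, age_in_hours) pairs.
    A document that is one half-life old counts half as much as a new one.
    The decay scheme is an assumption, not the method from the slides.
    """
    totals = Counter()
    for text, age_hours in documents:
        weight = 0.5 ** (age_hours / half_life_hours)
        for word in re.findall(r"[a-z']+", text.lower()):
            totals[word] += weight
    return totals


docs = [
    ("Election results dominate the election coverage", 0.0),
    ("Weather warning issued", 24.0),
]
for word, score in weighted_word_counts(docs).most_common(3):
    print(word, round(score, 2))
```

With a 24-hour half-life, a word from a day-old story contributes 0.5 per occurrence, so recent topics rise to the top of the ranking even when older stories used a word just as often.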
9
[Diagram labels: Gathering RSS Feeds, Internet, Information Retrieval, User Interface, Database Backend]
10
Gathering RSS Feeds
1. Download the updated RSS feed from a registry
11
Gathering RSS Feeds (cont.)
Process the RSS file to get the RSS feed for each source
http://p.moreover.com/cgi-local/page?index_bookreviews+rss
Download the data from the links provided in each RSS feed
http://c.moreover.com/click/here.pl?r240105547
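The gathering steps above can be sketched roughly as follows. This is not the project's actual code: the function names are hypothetical, the feeds are assumed to be RSS 2.0, and the list of per-source feed URLs is assumed to have already been extracted from the registry file, since the slides do not describe that file's layout.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch(url):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def item_links(rss_text):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("link") for item in root.iter("item") if item.findtext("link")]


def gather(feed_urls, fetcher=fetch):
    """Map each feed URL to {article URL: downloaded page text}.

    fetcher is injectable so the pipeline can be exercised without network access.
    """
    pages = {}
    for feed_url in feed_urls:
        links = item_links(fetcher(feed_url))
        pages[feed_url] = {link: fetcher(link) for link in links}
    return pages


# Offline demonstration with a fake fetcher backed by a dictionary.
fake_web = {
    "feed": "<rss><channel><item><link>a</link></item>"
            "<item><link>b</link></item></channel></rss>",
    "a": "page A",
    "b": "page B",
}
print(gather(["feed"], fetcher=fake_web.__getitem__))
```

Passing the fetcher as a parameter keeps the download step separate from the parsing step, which makes it easier to add caching or rate limiting later.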
12
Information/Data Retrieval
[Diagram labels: HTML, XHTML, Information Retrieval]
13
Definition is somewhat loose
–DR: Exact matching
–IR: Partial matching, best match
Separating content from metadata
Retrieving data from content
–Specifically names
–Words
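The contrast between exact matching (DR) and partial or best matching (IR) can be illustrated with a small sketch. The term-overlap score is a deliberately simple stand-in chosen for illustration; the slides do not specify a ranking method, and both function names are hypothetical.

```python
def exact_match(records, query):
    """Data retrieval: return only records identical to the query."""
    return [record for record in records if record == query]


def best_match(documents, query, top_n=3):
    """Information retrieval: rank documents by how many query terms they share.

    Term overlap is an illustrative scoring choice, not the project's method.
    """
    query_terms = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_terms & set(doc.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    # sort() is stable, so documents with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


docs = ["rss feed registry", "blog feed", "weather report"]
print(exact_match(docs, "feed"))
print(best_match(docs, "rss feed"))
```

The exact matcher finds nothing for the query "feed" because no record equals it, while the best-match ranker still returns the documents that share terms with "rss feed", ordered by overlap.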
14
Separating Content from metadata
How to differentiate between the main content and the rest of the information
–HTML/XHTML tags
–Irrelevant information such as advertisements
–Main content of the page
Irregularity of content
–Some sites use comments to indicate the beginning and end of content
–Needs experimentation with different sites to find a pattern
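For sites that do mark their content with comments, a sketch along these lines could pull out the main text and drop the tags. The marker strings `begin content` and `end content` are hypothetical; real sites use different markers, which is why the slide calls for experimenting with each site.

```python
from html.parser import HTMLParser


class MainContentExtractor(HTMLParser):
    """Collect text found between two marker comments, dropping all tags."""

    def __init__(self, start_marker, end_marker):
        super().__init__()
        self.start_marker = start_marker
        self.end_marker = end_marker
        self.inside = False
        self.chunks = []

    def handle_comment(self, data):
        marker = data.strip()
        if marker == self.start_marker:
            self.inside = True
        elif marker == self.end_marker:
            self.inside = False

    def handle_data(self, data):
        # Text nodes only; tags never reach this method.
        if self.inside:
            self.chunks.append(data)


def extract_main_content(html, start_marker="begin content", end_marker="end content"):
    """Return whitespace-normalized text between the marker comments."""
    parser = MainContentExtractor(start_marker, end_marker)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


page = (
    "<html><body><div>Ad: buy now</div>"
    "<!-- begin content --><p>Feeds are <b>popular</b>.</p><!-- end content -->"
    "<div>Footer</div></body></html>"
)
print(extract_main_content(page))
```

Text nodes are concatenated as they appear, so adjacent block elements with no whitespace between them will run together; a fuller version would insert separators at block-level tags.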
15
Retrieving data from content
What to look for?
Words don’t tell much
–Ambiguity
Nouns are difficult to find
–Syntactic and semantic patterns
Names (people, places)
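As a rough illustration of a syntactic pattern for names, the sketch below treats any run of two or more capitalized words as a candidate name. This heuristic is an assumption made for the example, not the approach taken in the project, and `candidate_names` is a hypothetical helper.

```python
import re
from collections import Counter

# Two or more consecutive capitalized words, e.g. a first and last name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_names(text):
    """Count capitalized word runs that may be names of people or places."""
    return Counter(NAME_PATTERN.findall(text))


text = "The president met Tony Blair in New York. After that, Tony Blair left."
for name, count in candidate_names(text).most_common():
    print(count, name)
```

The pattern is purely syntactic: it misses single-word names, and a capitalized word at the start of a sentence is absorbed into the match when a name follows it directly. Those failures are one reason semantic cues are needed as well.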
16
Future work
Refining the RSS gathering
Doing research to improve the IR/DR processing
–Semantics
–Syntax
User interface