Cobra: Content-based Filtering and Aggregation of Blogs and RSS Feeds Ian Rose 1, Rohan Murty 1, Peter Pietzuch 2, Jonathan Ledlie 1, Mema Roussopoulos.

Slides:



Advertisements
Similar presentations
FeedEx: Collaborative Exchange of News Feeds Seung Jun and Mustaque Ahamad Georgia Tech Information Security Center Georgia Institute of Technology.
Advertisements

For more information please send to or EFFICIENT QUERY SUBSCRIPTION PROCESSING.
Dynamic Replica Placement for Scalable Content Delivery Yan Chen, Randy H. Katz, John D. Kubiatowicz {yanchen, randy, EECS Department.
Alex Cheung and Hans-Arno Jacobsen August, 14 th 2009 MIDDLEWARE SYSTEMS RESEARCH GROUP.
Ningning HuCarnegie Mellon University1 Optimizing Network Performance In Replicated Hosting Peter Steenkiste (CMU) with Ningning Hu (CMU), Oliver Spatscheck.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Corona: A High Performance Publish-Subscribe System for the World Wide Web Authors: V. Ramasubramanian, R. Peterson and E.G. Sirer Cornell University Presenter:
DSPIN: Detecting Automatically Spun Content on the Web Qing Zhang, David Y. Wang, Geoffrey M. Voelker University of California, San Diego 1.
An Overview of Peer-to-Peer Networking CPSC 441 (with thanks to Sami Rollins, UCSB)
1 Oct 30, 2006 LogicSQL-based Enterprise Archive and Search System How to organize the information and make it accessible and useful ? Li-Yan Yuan.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
Information Retrieval in Practice
Ken Birman Cornell University. CS5410 Fall
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis Luis Gravano Computer Science Department Columbia.
Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.
FeedTree: Sharing Web Micronews with Peer-to-Peer Event Notification D. Sandler, A. Mislove, A. Post, P. Druschel Presented by: Andrew Sutton.
CSE 190: Internet E-Commerce Lecture 16: Performance.
King : Estimating latency between arbitrary Internet end hosts Krishna Gummadi, Stefan Saroiu Steven D. Gribble University of Washington Presented by:
Corona: A High Performance Publish-Subscribe System for the World Wide Web Cornell University, Ithaca, NY Networked System Design and Implementation (NSDI),
Locality-Aware Request Distribution in Cluster-based Network Servers Presented by: Kevin Boos Authors: Vivek S. Pai, Mohit Aron, et al. Rice University.
An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto.
Computer Science Cataclysm: Policing Extreme Overloads in Internet Applications Bhuvan Urgaonkar and Prashant Shenoy University of Massachusetts.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Web Search Engines and Information Retrieval on the World-Wide Web Torsten Suel CIS Department Overview: introduction.
Deduplication CSCI 572: Information Retrieval and Search Engines Summer 2010.
Web Search Created by Ejaj Ahamed. What is web?  The World Wide Web began in 1989 at the CERN Particle Physics Lab in Switzerland. The Web did not gain.
Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.
A Web Crawler Design for Data Mining
Basic Web Applications 2. Search Engine Why we need search ensigns? Why we need search ensigns? –because there are hundreds of millions of pages available.
Peer to Peer Research survey TingYang Chang. Intro. Of P2P Computers of the system was known as peers which sharing data files with each other. Build.
Overcast: Reliable Multicasting with an Overlay Network CS294 Paul Burstein 9/15/2003.
Master Thesis Defense Jan Fiedler 04/17/98
WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.
Full-Text Search in P2P Networks Christof Leng Databases and Distributed Systems Group TU Darmstadt.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
Chapter Chapter 3 Internet Agents. Chapter Contents Background Web Search Agents Information Filtering Agents Notification Agents Other Service.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
ECO-DNS: Expected Consistency Optimization for DNS Chen Stephanos Matsumoto Adrian Perrig © 2013 Stephanos Matsumoto1.
Module 10 Administering and Configuring SharePoint Search.
استاد : مهندس حسین پور ارائه دهنده : احسان جوانمرد Google Architecture.
Web Search Algorithms By Matt Richard and Kyle Krueger.
Curtis Spencer Ezra Burgoyne An Internet Forum Index.
PEERSPECTIVE.MPI-SWS.ORG ALAN MISLOVE KRISHNA P. GUMMADI PETER DRUSCHEL BY RAGHURAM KRISHNAMACHARI Exploiting Social Networks for Internet Search.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
AlvisP2P : Scalable Peer-to-Peer Text Retrieval in a Structured P2P Network Toan Luu, Gleb Skobeltsyn, Fabius Klemm, Maroje Puh, Ivana Podnar Zarko, Martin.
PROP: A Scalable and Reliable P2P Assisted Proxy Streaming System Computer Science Department College of William and Mary Lei Guo, Songqing Chen, and Xiaodong.
VLDB2005 CMS-ToPSS: Efficient Dissemination of RSS Documents Milenko Petrovic Haifeng Liu Hans-Arno Jacobsen University of Toronto.
Search Engine-Crawler Symbiosis: Adapting to Community Interests
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
1 Subscription Partitioning and Routing in Content-based Publish/Subscribe Networks Yi-Min Wang, Lili Qiu, Dimitris Achlioptas, Gautam Das, Paul Larson,
Peer-to-Peer Result Dissemination in High-Volume Data Filtering Shariq Rizvi and Paul Burstein CS 294-4: Peer-to-Peer Systems.
Program Assessment User Session Experts (PAUSE) Information Sessions: RSS & Subscription Services October , 2006.
11 Why tune relevance Because we want to find the one single best item, among a large group of possible candidates….
Scalable Data Scale #2 site on the Internet (time on site) >200 billion monthly page views Over 1 million developers in 180 countries.
09/13/04 CDA 6506 Network Architecture and Client/Server Computing Peer-to-Peer Computing and Content Distribution Networks by Zornitza Genova Prodanoff.
1 CS 430: Information Discovery Lecture 17 Web Crawlers.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
Data mining in web applications
Information Retrieval in Practice
Where it is today and how it is used.
Coral: A Peer-to-peer Content Distribution Network
Web Mining Ref:
Edge computing (1) Content Distribution Networks
Data Mining Chapter 6 Search Engines
Agenda What is SEO ? How Do Search Engines Work? Measuring SEO success ? On Page SEO – Basic Practices? Technical SEO - Source Code. Off Page SEO – Social.
Panagiotis G. Ipeirotis Luis Gravano
The Search Engine Architecture
Presentation transcript:

Cobra: Content-based Filtering and Aggregation of Blogs and RSS Feeds Ian Rose 1, Rohan Murty 1, Peter Pietzuch 2, Jonathan Ledlie 1, Mema Roussopoulos 1, Matt Welsh 1 1 Harvard School of Engineering and Applied Sciences 2 Imperial College London

Ian Rose – Harvard University NSDI Motivation Explosive growth of the “blogosphere” and other forms of RSS-based web content. Currently over 72 million weblogs tracked ( How can we provide an efficient, convenient way for people to access content of interest in near-real time?

Ian Rose – Harvard University NSDI Source:

Ian Rose – Harvard University NSDI Source:

Ian Rose – Harvard University NSDI

Ian Rose – Harvard University NSDI Challenges Scalability –How can we efficiently support large numbers of RSS feeds and users? Latency –How do we ensure rapid update detection? Provisioning –Can we automatically provision our resources? Network Locality –Can we exploit network locality to improve performance?

Ian Rose – Harvard University NSDI Current Approaches RSS Readers (Thunderbird) –topic-based (URL), inefficient polling model Topic Aggregators (Technorati) –topic-based (pre-defined categories) Blog Search Sites (Google Blog Search) –closed architectures, unknown scalability and efficiency of resource usage

Ian Rose – Harvard University NSDI Outline Architecture Overview –Services: Crawler, Filter, Reflector Provisioning Approach Locality-Aware Feed Assignment Evaluation Related & Future Work

Ian Rose – Harvard University NSDI General Architecture

Ian Rose – Harvard University NSDI Crawler Service 1.Retrieve RSS feeds via HTTP. 2.Hash full document & compare to last value. 3.Split document into individual articles. Hash each article & compare to last value. 4.Send each new article to downstream filters.

Ian Rose – Harvard University NSDI Filter Service 1.Receive subscriptions from reflectors and index for fast text matching (Fabret ’01). 2.Receive articles from crawlers and match each against all subscriptions. 3.Send articles that match  1 subscription to host reflectors.

Ian Rose – Harvard University NSDI Reflector Service 1.Receive subscriptions from web front-end; create article “hit queue” for each. 2.Receive articles from filters and add to the hit queues of matching subscriptions. 3.When polled by a client, return articles in hit queue as an RSS feed.

Ian Rose – Harvard University NSDI Hosting Model Currently, we envision hosting Cobra services in networked data centers. –Allows basic assumptions regarding node resources. –Node “churn” typically very infrequent. Adapting Cobra to a peer-to-peer setting may also be possible, but this is unexplored.

Ian Rose – Harvard University NSDI Provisioning We employ an iterative, greedy, heuristic to automatically determine the services required for specific performance targets.

Ian Rose – Harvard University NSDI Provisioning Algorithm: 1.Begin with minimal topology (3 services). 2.Identify a service violation (in-BW, out- BW, CPU, memory). 3.Eliminate the violation by “decomposing” service into multiple replicas, distributing load across them. 4.Continue until no violations remain.

Ian Rose – Harvard University NSDI Provisioning: Example BW: 25 Mbps Memory: 1 GB CPU: 4x subscriptions: 6M feeds: 600K

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Provisioning: Example

Ian Rose – Harvard University NSDI Locality-Aware Feed Assignment We focus on crawler-feed locality. Offline latency estimates between crawlers and web sources via King02 1. Cluster feeds to “nearby” crawlers. 1 Gummadi et al., King: Estimating Latency between Arbitrary Internet End Hosts

Ian Rose – Harvard University NSDI Evaluation Methodology Synthetic user queries: number of words per query based on Yahoo! search query data, actual words drawn from Brown corpus. List of 102,446 real feeds from syndic8.com Scale up using synthetic feeds, with empirically determined distributions for update rates and content sizes (based in part on Liu et al., IMC ’05).

Ian Rose – Harvard University NSDI Benefit of Intelligent Crawling One crawl of all 102,446 feeds over 15 minutes, using 4 crawlers. BW usage recorded for varying filtering levels. Overall, crawlers are able to reduce bw usage by 99.8% through intelligent crawling.

Ian Rose – Harvard University NSDI Locality-Aware Feed Assignment

Ian Rose – Harvard University NSDI Scalability Evaluation: BW Subs1M10M20M40M Feeds100K1M500K250K Total Nodes Crawlers1111 Filters Reflectors Four topologies evaluated on Emulab w/ synthetic feeds: Bandwidth usage scales well with feeds and users.

Ian Rose – Harvard University NSDI Intra-Network Latency Total user latency = crawl latency + polling latency + intra-network latency Overall, intra-network latencies are largely dominated by crawling and polling latencies.

Ian Rose – Harvard University NSDI Provisioner-Predicted Scaling

Ian Rose – Harvard University NSDI Related Work Traditional distributed pub/sub systems, e.g. Siena (Univ. of Colorado): –Address decentralized event matching and distribution. –Typically do not (directly) address overlay provisioning. –Often do not interoperate well with existing web infrastructure.

Ian Rose – Harvard University NSDI Related Work Corona (Cornell) is an RSS-specific pub/sub system –topic-based (subscribe to URLs) –Attempts to minimize both polling load on content servers (feeds) and update detection delay. –Does not specifically address scalability, in terms of feeds or subscriptions.

Ian Rose – Harvard University NSDI Future Work Many open directions: –evaluating real user subscriptions & behavior –more sophisticated filtering techniques (e.g. rank by relevance, proximity of query words in article) –subscription clustering on reflectors –how to discover new feeds & blogs

Ian Rose – Harvard University NSDI Thank you! Questions?

Ian Rose – Harvard University NSDI extra slides

Ian Rose – Harvard University NSDI The Naïve method… “Back of the envelope” approximations: –1 user polling 50M feeds every 60 minutes would use ~560 Mbps of bw –1 server serving 500M users Feeds every 60 minutes would use ~5.5 Gbps of bw

Ian Rose – Harvard University NSDI Comparison to Other Search Engines Created blogs on 2 popular blogging sites (LiveJournal and Blogger.com) Polled for our posts on Feedster, Blogdigger, Google Blog Search After 4 months: –Feedster & Blogdigger had no results (perhaps posts were spam filtered?) –Google latency varied from 83s to 6.6 hours (perhaps use of ping service?)

Ian Rose – Harvard University NSDI FeedTree Requires special client software. Relies on “good will” (donating BW) of participants.

Ian Rose – Harvard University NSDI Reflector Memory Usage

Ian Rose – Harvard University NSDI Match-Time Performance

Ian Rose – Harvard University NSDI Source: