1. Our Web – Part 0: Overview
COMP630L Topics in DB Systems: Managing Web Data
Fall 2007, Dr Wilfred Ng
2. Outline
Two important issues:
- Web Dynamics
- Search Engines
Who is the Web related to? Tim Berners-Lee? Bill Gates? Dik? Frederick? Wilfred? (March 11, 1890 – June 30, 1974)
3. Introduction
- The Web: the largest collection of (linked) resources (cf. the Memex machine in 1945, Xanadu in 1965, the Internet in 1990)
- Web search engines: locating and retrieving Web information
  - Crawler-based (Google, MSN Search, ...)
  - Human-powered (Yahoo! directory, Open Directory)
- The Web is very dynamic:
  - Dynamics of Web size
  - Dynamics of Web pages
  - Dynamics of Web link structure
4. Introduction (cont'd)
Dynamics of Web size:
- Almost anyone can publish almost anything on the Web at almost zero cost
- The Web grows at an exponential rate
- Challenge for search engines: scalability to cover a large part of the Web
5. Introduction (cont'd)
Dynamics of Web pages:
- Creation: new pages come into existence
  - New information needs to be captured by search engines
- Updates: content changes on a page (minor? major?)
  - Search engines should keep their local copies of pages fresh
- Deletion: existing pages can no longer be found
  - Search engines should detect deletions to avoid broken links
6. Introduction (cont'd)
Dynamics of Web link structure:
- Links are being established and removed constantly
- Important for search engines, which use the link structure to rank search results (e.g., authorities and hubs)
7. Introduction (cont'd)
Relationship between the three dimensions:
- Dynamics of Web size
- Dynamics of Web pages
- Dynamics of Web link structure
[Diagram: the Web plus one new page P, illustrating how the three dimensions interact]
8. Preliminaries
Basic search engine architecture:
[Diagram: the Web -> Crawler -> Indexer -> Searcher -> End Users]
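To make the three components concrete, here is a minimal sketch of the crawler/indexer/searcher pipeline using a toy in-memory inverted index. The function names, the tiny corpus, and the stubbed "crawler" are all illustrative assumptions, not part of the original slides; a real system fetches pages over HTTP and persists its index.

```python
# Toy crawler -> indexer -> searcher pipeline (illustrative sketch only).
from collections import defaultdict

# "Crawler": a stub that just yields (url, text) pairs from a given corpus;
# a real crawler would fetch pages over HTTP and follow links.
def crawl(seed_pages):
    for url, text in seed_pages.items():
        yield url, text

# "Indexer": build an inverted index from term to the set of URLs containing it.
def build_index(documents):
    index = defaultdict(set)
    for url, text in documents:
        for term in text.lower().split():
            index[term].add(url)
    return index

# "Searcher": return URLs containing every query term (boolean AND retrieval).
def search(index, query):
    results = None
    for term in query.lower().split():
        urls = index.get(term, set())
        results = urls if results is None else results & urls
    return results or set()

if __name__ == "__main__":
    pages = {
        "http://example.org/a": "web dynamics and search engines",
        "http://example.org/b": "search engines crawl the web",
    }
    index = build_index(crawl(pages))
    print(search(index, "search web"))   # both toy pages match
```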
9. Dynamics of Web Size
Two categories of the Web:
- Indexable Web (shallow Web): indexed by major engines
  - More than four billion pages by late 2003 [Google]; 8 billion in 2004, 20 billion in 2005, ??? now [Google]
- Non-indexable Web (deep Web): pages hidden behind search forms, with authorization requirements, etc.
  - At least 400 times larger than the indexable Web [Bergman00]
10. Web Size Study
- The Web is growing at an exponential rate
- Netcraft Web Server Survey Report (August 1995 – November 2004)
11. Search Engine Coverage Studies
Bharat and Broder [1997]:
- Generate random URLs from one search engine and check whether those pages are indexed by the other engines
- Tested on four search engines: AltaVista, Excite, Infoseek, HotBot
- Estimated Web size: 200 million pages
- The overlap between engines was very small
12. Search Engine Coverage Studies
Lawrence and Giles [1997]:
- Query-based sampling, using queries issued by scientists
- Tested on six major search engines: AltaVista, Excite, Infoseek, HotBot, Lycos, and Northern Light
- Estimated Web size: 320 million pages
- Single-engine coverage is limited: 34%
- Joint coverage increases significantly: 60%
Lawrence and Giles [1999]:
- Tested on 11 search engines
- Estimated Web size: 320 million -> 800 million
- Single-engine coverage: 34% -> 16%
13. Search Engine Coverage Studies
Summary:

  Study                      | Web Size    | Largest Engine       | Joint Coverage
  ---------------------------|-------------|----------------------|---------------
  Bharat and Broder (1997)   | 200 million | AltaVista (50%)      | 80%
  Lawrence and Giles (1997)  | 320 million | HotBot (34%)         | 60%
  Lawrence and Giles (1999)  | 800 million | Northern Light (16%) | 42%
14. Impact on Search Engines – Scalable Architecture
Google [Brin and Page 98]:
- Data structures: compact encoding and compression
- Distributed crawling system: crawlers run in parallel, and each crawler keeps hundreds of connections open
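As a rough illustration of keeping many fetches in flight at once, the sketch below uses a standard-library thread pool; the URL list and worker count are placeholders, and this is not Google's actual crawler design, which also needs politeness rules, DNS caching, URL scheduling, and so on.

```python
# Sketch of parallel page fetching with a thread pool (illustrative only).
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ["http://example.com/", "http://example.org/"]  # placeholder seed URLs

def fetch(url, timeout=10):
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except OSError as exc:           # network errors, timeouts, bad hosts
        return url, exc

# Each worker keeps its own connection; a production crawler runs hundreds of
# such connections and must also respect robots.txt and per-host crawl delays.
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, result in pool.map(fetch, URLS):
        size = len(result) if isinstance(result, bytes) else result
        print(url, "->", size)
```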
15. Impact on Search Engines – Metasearch Engines
- Combine the results of multiple engines to increase Web coverage
[Diagram: the metasearch engine forwards the query to Search Engine 1 ... Search Engine n (each with its own crawler, indexer, and searcher) and merges their results into the final results]
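A rough sketch of the result-merging step: each backend engine returns a ranked list of URLs, and the metasearcher combines them with a simple reciprocal-rank score. The hard-coded backend results are stand-ins, and real metasearch engines must also normalize scores, deduplicate, and handle per-engine quirks; this is not the method of any particular system.

```python
# Illustrative metasearch merging by reciprocal rank (not any engine's actual method).
from collections import defaultdict

def merge(result_lists):
    """result_lists: one ranked URL list per underlying search engine."""
    scores = defaultdict(float)
    for ranked_urls in result_lists:
        for rank, url in enumerate(ranked_urls, start=1):
            scores[url] += 1.0 / rank          # higher rank -> bigger contribution
    return sorted(scores, key=scores.get, reverse=True)

engine1 = ["u1", "u2", "u3"]   # placeholder results from "Search Engine 1"
engine2 = ["u3", "u1", "u4"]   # placeholder results from "Search Engine n"
print(merge([engine1, engine2]))   # u1 and u3, returned by both engines, rise to the top
```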
16. Impact on Search Engines – Special-purpose Search Engines
- It is not necessary to search the entire Web
- Special-purpose search engines focus on restricted domains and use a focused crawler:
  - Start with relevant seed pages
  - Score the extracted URLs according to relevance
  - Pick the URL with the highest score to crawl next (see the sketch below)
[Diagram: crawled pages P1–P5 feed newly extracted URLs into a priority queue ordered by relevance score]
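A skeleton of that focused-crawler loop: extracted URLs are scored for relevance and kept in a priority queue, and the highest-scoring URL is crawled next. The scoring function and page fetching are stubs I made up for illustration; `heapq` keeps the smallest element first, so scores are negated.

```python
# Focused crawler skeleton with a relevance-ordered priority queue (stubs are illustrative).
import heapq

def relevance(url):
    # Stub: a real focused crawler scores the URL, anchor text, and source page
    # against the target topic (e.g., with a text classifier).
    return 1.0 if "topic" in url else 0.1

def fetch_and_extract(url):
    # Stub: fetch the page and return its outgoing links.
    return []

def focused_crawl(seed_urls, limit=100):
    frontier = [(-relevance(u), u) for u in seed_urls]   # max-heap via negated scores
    heapq.heapify(frontier)
    seen = set(seed_urls)
    crawled = []
    while frontier and len(crawled) < limit:
        _, url = heapq.heappop(frontier)       # URL with the highest relevance score
        crawled.append(url)
        for link in fetch_and_extract(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return crawled

print(focused_crawl(["http://example.org/topic", "http://example.org/other"]))
```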
17. Dynamics of Web Pages – Characterizing Updates
Two measures [Lim02], treating a Web page as an ordered sequence of words:
- Distance measure: the degree of change, in [0, 1]
- Clusteredness measure: how spread out the changes are within a page, in [0, 1]
Findings: changes are generally small and clustered, so an incremental update is more efficient for search engines (an illustrative sketch follows)
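The slide does not reproduce the exact definitions from [Lim02], so the sketch below is only a stand-in for the idea: a distance in [0, 1] derived from word-level edit operations, and a crude clusteredness value measuring how much of the change falls in one contiguous block.

```python
# Illustrative stand-ins for page-change measures (NOT the exact definitions of [Lim02]).
from difflib import SequenceMatcher

def distance(old_text, new_text):
    """Degree of change in [0, 1]: 0 = identical, 1 = completely different."""
    old_words, new_words = old_text.split(), new_text.split()
    return 1.0 - SequenceMatcher(None, old_words, new_words).ratio()

def clusteredness(old_text, new_text):
    """Crude clusteredness in [0, 1]: share of changed words of the old page
    that fall within its single largest contiguous changed block."""
    old_words, new_words = old_text.split(), new_text.split()
    blocks = [i2 - i1 for tag, i1, i2, j1, j2
              in SequenceMatcher(None, old_words, new_words).get_opcodes()
              if tag != "equal"]                 # lengths of changed stretches
    changed = sum(blocks)
    return max(blocks) / changed if changed else 0.0

old = "search engines must keep their copies of web pages fresh"
new = "search engines should always keep their copies of web pages fresh"
print(distance(old, new), clusteredness(old, new))   # small, fully clustered change
```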
18. Impact of Web Page Dynamics on Search Engines
A typical way to study page dynamics from a search engine perspective:
1. Develop a model of how Web pages change
2. Propose update strategies to maximize freshness for search engines
   - Develop metrics to measure that freshness
19. Web Page Change Model Studies – Poisson Process Model
- Each page Pi is updated at an average rate λi
- Poisson process: X(t) is the number of changes of page P in (0, t]
- The random variable X(s+t) − X(s) has the Poisson distribution
  Pr[X(s+t) − X(s) = k] = e^(−λt) (λt)^k / k!,  for k = 0, 1, 2, ...
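The formula is easy to make concrete in a few lines of Python; the change rate and interval below are made-up example values.

```python
# Probability of exactly k changes to a page in time t under the Poisson model.
from math import exp, factorial

def poisson_pmf(k, lam, t):
    return exp(-lam * t) * (lam * t) ** k / factorial(k)

lam = 0.5   # assumed: the page changes 0.5 times per day on average
t = 7       # interval of one week
print(poisson_pmf(0, lam, t))                            # chance of no change all week
print(sum(poisson_pmf(k, lam, t) for k in range(3)))     # chance of at most 2 changes
```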
20. Poisson Process – Brewington and Cybenko Study
- Combines the effects of page creation and updates into a Poisson Web model
- (α, β)-currency characterizes how up-to-date a search engine is:
  - A stored page copy is β-current if the page has not changed between the last observation and β time units before now (β is a grace period)
  - A search engine is (α, β)-current if Pr(a page is β-current) >= α
- Required re-indexing period: T = f(α, β, λ)
  - For (0.95, 1 week)-currency: T = 18 days (800 million pages per day)
[Diagram: timeline showing the last observation at t0, the grace period β before now, and the re-indexing period T after t0]
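As a rough illustration of β-currency under the Poisson model (not the full Brewington–Cybenko derivation): a copy crawled τ time units ago is β-current if the page made no change during the first τ − β of those units, which happens with probability e^(−λ(τ−β)), and with probability 1 when τ <= β. The rate and times below are placeholder values.

```python
# Rough beta-currency probability under a Poisson change model
# (illustrative; not the full Brewington-Cybenko analysis).
from math import exp

def prob_beta_current(lam, tau, beta):
    """lam: change rate; tau: time since last crawl; beta: grace period."""
    exposed = max(tau - beta, 0.0)   # only changes made before the grace period hurt
    return exp(-lam * exposed)

lam = 1.0 / 30     # assumed: on average one change per 30 days
beta = 7           # one-week grace period
for tau in (3, 7, 18, 30):
    print(tau, round(prob_beta_current(lam, tau, beta), 3))
```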
21. Impact of Web Page Dynamics on Search Engines – Summary

  Study                    | Creation | Updates | Deletion | Freshness Metric
  -------------------------|----------|---------|----------|------------------
  Brewington and Cybenko   | √        | √       | X        | (α, β)-currency
  Cho and Garcia-Molina    | X        | √       | X        | freshness, age
  Edwards et al.           | √        | √       | X        | -
  Ntoulas, Cho and Olston  | √        | √       | √        | -
22. Dynamics of Web Link Structure – Web Link Structure Modeling
Bow-tie model of the Web link structure [Broder et al. 00]: four main components plus disconnected pages
- SCC (27.5%)
- IN (21.5%)
- OUT (21.5%)
- Tendrils and Tubes (21.5%)
- Others (8%)
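A sketch of how one might classify pages of a small link graph into these components: find the largest strongly connected component (SCC), then the pages that can reach it (IN) and the pages it can reach (OUT), lumping everything else together. This toy example uses the `networkx` library for graph reachability and is only an illustration, not how [Broder et al. 00] measured the real Web.

```python
# Toy bow-tie classification of a directed link graph (illustrative only).
import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "a"),   # a, b form the core SCC
                ("c", "a"),               # c reaches the core   -> IN
                ("b", "d"),               # core reaches d       -> OUT
                ("e", "f")])              # disconnected pages   -> Others

scc = max(nx.strongly_connected_components(G), key=len)
core_node = next(iter(scc))                       # any node of the SCC will do
out_side = nx.descendants(G, core_node) - scc     # reachable from the core
in_side = nx.ancestors(G, core_node) - scc        # can reach the core
others = set(G) - scc - in_side - out_side        # tendrils, tubes, disconnected

print("SCC:", scc, "IN:", in_side, "OUT:", out_side, "Others:", others)
```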
23. Dynamics of Web Link Structure Study
Only one existing study, Ntoulas, Cho and Olston [04]; over one year:
- Only 24% of the initial links were still available
- 25% new links were created every week
- The link structure is more dynamic than the pages themselves (8% new pages and 5% new content over the same year!)
- Search engines should therefore update link-based ranking metrics frequently
24. Link-based Ranking Metric – PageRank
PageRank [WWW98]: the main ranking metric of Google
Definition:
- Page A has pages T1, ..., Tn (authoritative sites) pointing to it
- C(A): the number of links going out of page A
- d: damping factor in (0, 1)
- PR(A) = (1 − d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
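A small iterative computation of the formula above, using the non-normalized (1 − d) form exactly as written on the slide; the toy graph, damping factor, and iteration count are just example values.

```python
# Iterative PageRank following the slide's formula (toy graph, illustrative).
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                       # initial guess
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = (pr[q] / len(links[q])          # PR(Ti) / C(Ti)
                        for q in pages if page in links[q])
            new_pr[page] = (1 - d) + d * sum(incoming)
        pr = new_pr
    return pr

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_graph))   # C accumulates the most PageRank in this toy graph
```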
25. Incremental Update of PageRank
- Full PageRank computations are too expensive to repeat from scratch
- Instead, incrementally compute approximations to PageRank [Chien02]
- Basic idea (see the rough sketch below):
  - Construct a subgraph of the Web containing a small neighborhood of the link changes
  - Model the rest of the Web graph as a single node
  - Compute PageRank on this subgraph
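A very rough sketch of that idea, not the actual algorithm of [Chien02]: keep the pages near the changed links, collapse every other page into one "rest of the Web" supernode while preserving link directions, and then run PageRank only on the resulting small graph. The graph, hop count, and supernode name are all invented for illustration.

```python
# Rough sketch of building the small graph for incremental PageRank approximation
# (illustrative; not the exact construction of [Chien02]).
SUPERNODE = "__REST_OF_WEB__"

def build_subgraph(full_links, changed_pages, hops=1):
    """Keep pages within `hops` links of the changed pages; collapse the rest
    of the Web into a single supernode, preserving link directions."""
    keep = set(changed_pages)
    for _ in range(hops):                       # grow a small neighborhood
        keep |= {dst for src in keep for dst in full_links.get(src, [])}
        keep |= {src for src, dsts in full_links.items()
                 if any(dst in keep for dst in dsts)}
    sub = {SUPERNODE: []}
    for src, dsts in full_links.items():
        s = src if src in keep else SUPERNODE
        for dst in dsts:
            t = dst if dst in keep else SUPERNODE
            sub.setdefault(s, []).append(t)
    return sub   # feed this into the PageRank sketch from the previous slide

web = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"], "E": ["A"]}
print(build_subgraph(web, changed_pages={"B"}))
```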
26. Conclusions
- The Web is dynamic in three dimensions, which poses serious challenges to search engines
- Search engines must cope with this high dynamism: scalable architectures, intelligent scheduling strategies, efficient update algorithms for ranking metrics, etc.
- Interesting directions for database people:
  - Data representation dynamics: XML
  - User dynamics: adaptive search
  - Deep Web dynamics: searchable? how?
- You should study COMP630L well
References