Slide 1
The Fragmented Web: Notes on Chapter 12. For In765. Judith Molka-Danielsen.
Slide 2
1. Virtual robots. Virtual robots (crawlers) read and index web pages; the Web would be hard to navigate without them. But some pages are never mapped. Simple search engines can return too much. Meta-search engines select hits across several engines: www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html (A minimal crawler sketch follows.)
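Below is a minimal sketch of what such a robot does: fetch a page, extract its anchor links, and queue them breadth-first for indexing. The seed URL and page limit are illustrative assumptions; a real crawler would also honor robots.txt, rate limits, and politeness rules.

```python
# Minimal crawler sketch: fetch pages, extract links, queue them.
# Standard library only; seed URL and max_pages are illustrative.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=5):
    """Breadth-first crawl from seed, visiting at most max_pages pages."""
    seen, queue = set(), [seed]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # unreachable pages simply stay unmapped
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative hrefs against the current page's URL.
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("https://example.com"))
```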
Slide 3
Steve Lawrence and C. Lee Giles attempted to measure the size of the Web in 1999: http://www.neci.nj.nec.com/homepages/lawrence/websize.html
Slide 4
2. Relevancy. Finding the “best” page is more important than finding the “most” pages. Notes on searching the Web: http://home.himolde.no/~molka/in350/week9y01.htm
Slide 5
Determining PageRank
http://www.whitelines.nl/html/google-page-rank.html#example
According to Sergey Brin and Lawrence (Larry) Page, co-founders of Google, the PR of a web page is calculated using this formula:
PR(A) = (1 - d) + d * SUM(PR(I) / C(I)), summed over every page I that links to A
Where:
– PR(A) is the PageRank of your page A.
– d is the damping factor, usually set to 0.85.
– PR(I) is the PageRank of a page I containing a link to page A.
– C(I) is the number of links off page I.
– PR(I) / C(I) is the PR value page A receives from page I.
– SUM(PR(I) / C(I)) is the sum of all PR values page A receives from pages that link to it.
In other words: the PR of page A is determined by the PR of every page I that links to it. For every page I that points to page A, the PR of page I is divided by the number of links from page I. These values are summed and multiplied by 0.85; finally 0.15 is added, and the result is the PR of page A.
What is your PageRank? http://www.klid.dk/pagerank.php?url=
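Because PR(A) depends on the PR of the pages linking to A, the formula is computed iteratively until the values settle. A minimal sketch, using a hypothetical three-page link graph (the graph, starting values, and iteration count are illustrative assumptions, not from the slide):

```python
# Iterative PageRank per the slide's formula:
# PR(A) = (1 - d) + d * SUM(PR(I) / C(I)) over pages I linking to A.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # SUM(PR(I) / C(I)) over every page I that links to a.
            incoming = sum(pr[i] / len(links[i])
                           for i in pages if a in links[i])
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, rank in pagerank(graph).items():
    print(f"{page}: {rank:.3f}")
```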
Slide 6
Search Engine Showdown: Total Size Estimates, by Greg R. Notess
http://www.searchengineshowdown.com/stats/sizeest.shtml
Data from: Dec. 31, 2002

Search Engine   Size Estimate (millions)   Claim (millions)
Google          3,033                      3,083
AlltheWeb       2,106                      2,112
AltaVista       1,689                      1,000
WiseNut         1,453                      1,500
HotBot          1,147                      3,000
MSN Search      1,018                      3,000
Teoma           1,015                      500
NLResearch      733                        125
Gigablast       275                        150

Relative size: AlltheWeb's reported size (2,106,156,957 pages) and the percentages come from the relative size showdown; the total size estimates are shown above.
Slide 7
Older Reports, with the Largest Three at That Time
March 2002: Google, WiseNut, AlltheWeb
August 2001: Google, Fast, WiseNut
April 2001: Google, Fast, MSN (Inktomi)
Oct. 2000: Fast, Google, Northern Light
July 2000: iWon, Google, AltaVista
April 2000: Fast, AltaVista, Northern Light
Feb. 2000: Fast, Northern Light, AltaVista
Jan. 2000: Fast, Northern Light, AltaVista
Nov. 1999: Northern Light, Fast, AltaVista
Sept. 1999: Fast, Northern Light, AltaVista
Aug. 1999: Fast, Northern Light, AltaVista
May 1999: Northern Light, AltaVista, Anzwers
March 1999: Northern Light, AltaVista, HotBot
January 1999: Northern Light, AltaVista, HotBot
August 1998: AltaVista, Northern Light, HotBot
May 1998: AltaVista, HotBot, Northern Light
February 1998: HotBot, AltaVista, Northern Light
October 1997: AltaVista, HotBot, Northern Light
September 1997: Northern Light, Excite, HotBot
June 1997: HotBot, AltaVista, Infoseek
October 1996: HotBot, Excite, AltaVista
Slide 8
Freshness

Search Engine   Newest Page Found   Rough Average   Oldest Page Found
MSN (Ink.)      1 day               4 weeks         51 days
HotBot (Ink.)   1 day               4 weeks         51 days
Google          2 days              1 month         165 days
AlltheWeb       1 day               1 month         599 days**
AltaVista       0 days              3 months        108 days
Gigablast       45 days             7 months        381 days
Teoma           41 days             2.5 months      81 days
WiseNut         133 days            6 months        183 days
Slide 9
Billions of Textual Documents Indexed, December 1995 to September 2003
http://searchenginewatch.com/reports/article.php/2156481
Slide 10
3. URLs are directed links: page A can link to page B without B linking back, so the Web forms a directed graph. Andrei Broder (2000). (A small reachability sketch follows.)
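A brief illustration of why direction matters, on a made-up four-page graph: everything is reachable when browsing from A, but from D you can reach nothing else, which is one reason some pages stay invisible to surfers and robots alike.

```python
# Reachability in a directed link graph (hypothetical example).
from collections import deque

links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}

def reachable(start):
    """Pages reachable from start by following links forward (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in links.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(reachable("A"))  # {'A', 'B', 'C', 'D'}
print(reachable("D"))  # {'D'} only: links point one way
```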
Slide 11
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf
[Chart from the report; the recoverable labels contrast database-driven/on-demand pages with static HTML pages.]
Slide 15
4. Defining web-based communities. 15% of web pages have links to opposing views; 60% have links to like views. Social segmentation is self-reinforcing. Beliefs and affiliations have become public information, represented in links and visits. Web-based communities are hard to identify: they have no boundaries, come in different sizes, and are organized differently. Pages with more internal links than outside links may be identified as a community (see the sketch below), but there is no efficient algorithm for finding such sets.
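A minimal sketch of that criterion on a hypothetical graph: count a candidate set's internal links against its outgoing links. Note this only checks a given set; as the slide says, efficiently finding such sets is the hard part.

```python
# "More internal links than outside links" community check (hypothetical graph).
links = {
    "A": ["B", "C"],   # A, B, C mostly link among themselves
    "B": ["A", "C"],
    "C": ["A", "X"],
    "X": ["Y"],        # outsiders
    "Y": ["X"],
}

def looks_like_community(members):
    """True if the member pages link to each other more than to outsiders."""
    internal = external = 0
    for page in members:
        for target in links.get(page, []):
            if target in members:
                internal += 1
            else:
                external += 1
    return internal > external

print(looks_like_community({"A", "B", "C"}))  # True: 5 internal vs 1 external
```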
Slide 16
Other points…
5. Technology can allow more control over individuals: identifying them, tracking them. Yet web topology (the architecture that emerges as authors self-select where to link) limits our actions in browsing (some pages are invisible) more than the code (attempts at control, laws) does.
6. The Internet Archive has been maintained since 1996 by Brewster Kahle; some data will never go away. http://www.archive.org/ (Try the WayBack Machine; a query sketch follows.)
7. The Web is complex and self-organized. The authors started by looking at its macrostructure; the last chapters will look at smaller groupings.
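For point 6, a sketch of querying the Internet Archive's public Wayback availability endpoint for the closest archived snapshot of a URL. The endpoint and JSON field names are my reading of the archive.org documentation, so treat them as assumptions to verify.

```python
# Ask the Wayback Machine's availability API for the closest snapshot.
# Endpoint and response fields are assumptions based on archive.org docs.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def closest_snapshot(url):
    query = urlencode({"url": url})
    with urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

print(closest_snapshot("example.com"))
```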