1 Datamining the Internet: Alexa
Brewster Kahle, President, Alexa Internet
2 To Answer Any Question...
- Know a lot
- Know what is important
- Be right enough
Alexa: The web navigation service that learns from people
3 Know a Lot: Other Repositories
- Library of Alexandria: 800GB (400k
- Library of Congress: 20TB (20M books, ASCII)
- Dialog Information Service: 3-5TB
- Video Store: 8TB (5k videos, 1GB/hr)
- Public Branch Library: 3TB (300k scanned books)
- Radio Station: 1TB (15k hrs of music)
- ...
Alexa’s Internet Archive: 10TB
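The per-item sizes implied by the figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the names `REPOSITORIES` and `mb_per_item` are made up for this example, and it assumes decimal units (1 TB = 1,000,000 MB). The Library of Alexandria entry is left out because its item count is cut off in the source.

```python
# Repository sizes (TB) and item counts as quoted on the slide.
# Names are hypothetical; units assume 1 TB = 1_000_000 MB.
MB_PER_TB = 1_000_000

REPOSITORIES = {
    "Library of Congress (books, ASCII)": (20, 20_000_000),
    "Video Store (videos)": (8, 5_000),
    "Public Branch Library (scanned books)": (3, 300_000),
    "Radio Station (hours of music)": (1, 15_000),
}

def mb_per_item(total_tb, item_count):
    """Average size of one item in megabytes."""
    return total_tb * MB_PER_TB / item_count

for name, (tb, count) in REPOSITORIES.items():
    print(f"{name}: {mb_per_item(tb, count):,.1f} MB each")
```

Under these assumptions an ASCII book comes to about 1 MB, a scanned book to about 10 MB, and a video to about 1.6 GB, which at the slide's 1GB/hr is roughly 1.6 hours of footage.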
4 Know a Lot: Gathering
- Web snapshot on T3 in 20 days
- Users’ paths essential as well
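As a rough sanity check on "T3 in 20 days", the sketch below computes an upper bound on how much data a T3 line can carry in that time. It assumes the full T3 line rate of 44.736 Mbit/s is usable around the clock, with no protocol overhead or idle time; the function name `max_terabytes` is hypothetical.

```python
# Upper bound on data moved over a T3 line in a given number of days.
# Assumption: the full line rate (44.736 Mbit/s) is available the whole time.
T3_BITS_PER_SECOND = 44_736_000
SECONDS_PER_DAY = 86_400

def max_terabytes(days, bits_per_second=T3_BITS_PER_SECOND):
    """Maximum transfer volume in decimal terabytes (1 TB = 1e12 bytes)."""
    total_bytes = bits_per_second / 8 * SECONDS_PER_DAY * days
    return total_bytes / 1e12

print(f"{max_terabytes(20):.2f} TB")
```

The bound comes out a little under 10 TB, the same order of magnitude as the archive sizes quoted in the deck.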
5 8 Terabytes so far
6 Web Stats
- 1 million sites, doubling every 6 months (millions of authors)
- More videos, dynamic pages, Java etc.
- 15 links on each page
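To make "doubling every 6 months" concrete, here is a minimal projection sketch. It assumes the doubling rate stays constant, which the slide does not claim; `projected_sites` is a hypothetical helper, and the results are extrapolations, not figures from the talk.

```python
def projected_sites(months_from_now, sites_now=1_000_000, doubling_months=6):
    """Exponential growth: the count doubles once per doubling period."""
    return sites_now * 2 ** (months_from_now / doubling_months)

for months in (6, 12, 24):
    print(f"after {months} months: {projected_sites(months):,.0f} sites")
```

At a steady six-month doubling, 1 million sites would become 4 million after one year and 16 million after two.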
7 Storage
Snapshot of the Web on tape jukebox costs $80k
8 Knowing What Is Important: Mining the WWW for Quality
- Content: 100 million pages
- Link Structure: 750 million links
- Usage paths: many 100 million hits
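One simple way to turn link structure into a quality signal is to count how many other pages link to each page. The sketch below is only an illustration of that idea, not a description of how Alexa actually mines links; the function name and the sample URLs are invented.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count inbound links per target page.

    links: iterable of (source_url, target_url) pairs.
    Self-links are ignored so a page cannot vote for itself.
    """
    return Counter(target for source, target in links if source != target)

# Made-up sample data for illustration.
sample_links = [
    ("a.example/home", "b.example/guide"),
    ("c.example/list", "b.example/guide"),
    ("b.example/guide", "a.example/home"),
    ("b.example/guide", "b.example/guide"),  # self-link, ignored
]

counts = inbound_link_counts(sample_links)
for url, n in counts.most_common():
    print(f"{url}: {n} inbound links")
```

A raw count like this is easy to compute at scale, but it treats every linking page as equally trustworthy, so it is best read as a starting point.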
9 Be Right Enough: Being Useful
- Competition
  - Directories: the biggest links to < 1% of web pages
  - Search engines: returning thousands of hits (sometimes millions)
- Trends
  - Move to “channels” of less content, but good
  - Limit crawling (50M pages and holding)
10 Be Right Enough: Alexa
- Where am I?
- Where do I want to go?
- Alexa:
- “Can I trust this information?”
- What should I look at next?
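The question "What should I look at next?" can be approached by learning from the paths other users took. The sketch below counts page-to-page transitions in a set of browsing paths and suggests the most common next pages. It is an assumption about how such a feature could work, not the service's actual algorithm; all names and the sample paths are hypothetical.

```python
from collections import Counter, defaultdict

def build_next_page_table(paths):
    """Map each page to a Counter of the pages users visited right after it."""
    table = defaultdict(Counter)
    for path in paths:
        for current, following in zip(path, path[1:]):
            table[current][following] += 1
    return table

def suggest_next(table, page, limit=3):
    """Return up to `limit` pages most often visited after `page`."""
    return [url for url, _ in table.get(page, Counter()).most_common(limit)]

# Made-up browsing paths for illustration.
paths = [
    ["ford-home", "ford-mustang", "independent-mustang"],
    ["ford-home", "ford-mustang"],
    ["ford-mustang", "independent-mustang"],
    ["ford-home", "travel-agents"],
]

table = build_next_page_table(paths)
print(suggest_next(table, "ford-home"))
print(suggest_next(table, "ford-mustang"))
```

Counting only immediate transitions keeps the table small; a fuller treatment might also weight longer paths or recency.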
14 Travel Agents
15 Condé Nast Traveler
16 Ford Vehicles Homepage
17 Ford’s Mustang Page
18 Independent Mustang Page
19 Surrealism Page
20 Women Surrealists
21 Archive in action
22 Alexa Conclusion