1 Datamining the Internet: Alexa Brewster Kahle President, Alexa Internet

Presentation transcript:

1 Datamining the Internet: Alexa Brewster Kahle President, Alexa Internet

2 To Answer Any Question...
• Know a lot
• Know what is important
• Be right enough
Alexa: The web navigation service that learns from people

3 Know a Lot: Other Repositories
• Library of Alexandria: 800GB (400k
• Library of Congress: 20TB (20M books, ASCII)
• Dialog Information Service: 3-5TB
• Video Store: 8TB (5k videos, 1GB/hr)
• Public Branch Library: 3TB (300k scanned books)
• Radio Station: 1TB (15k hrs of music)
• ...
Alexa’s Internet Archive: 10TB

4 Know a Lot: Gathering
• Web snapshot on a T3 line in 20 days
• Users’ paths essential as well
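
As a rough sanity check on the "T3 in 20 days" figure, the sketch below estimates how much data a T3 line could carry in that time. It assumes the nominal T3 (DS3) line rate of 44.736 Mbit/s and full utilization, neither of which the slides state; the function and variable names are illustrative only.

```python
# Back-of-the-envelope estimate: data moved over a T3 line in 20 days.
# Assumptions (not from the slides): nominal T3/DS3 rate of 44.736 Mbit/s,
# 100% utilization, and decimal terabytes (1 TB = 1e12 bytes).
T3_BITS_PER_SECOND = 44_736_000
SECONDS_PER_DAY = 86_400


def bytes_transferred(days, bits_per_second=T3_BITS_PER_SECOND, utilization=1.0):
    """Return the number of bytes a link can carry in the given number of days."""
    return bits_per_second / 8 * SECONDS_PER_DAY * days * utilization


snapshot_bytes = bytes_transferred(20)
snapshot_terabytes = snapshot_bytes / 1e12
print(f"{snapshot_terabytes:.2f} TB")
```

Under these assumptions the result is a little under 10 TB, the same order of magnitude as the archive sizes quoted on the neighboring slides. Real throughput would be lower once protocol overhead and idle time are taken into account.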

5 8 Terabytes so far

6 Web Stats
• 1 million sites, doubling every 6 months (millions of authors)
• More videos, dynamic pages, Java, etc.
• 15 links on each page
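
To make the doubling claim concrete, here is a minimal sketch that projects the site count forward, assuming the 6-month doubling rate holds steadily (an extrapolation, not a claim from the slides). The function name and the chosen time points are illustrative.

```python
# Project the number of web sites under steady exponential growth.
# Starting point (1 million sites) and doubling period (6 months) are from
# the slide; assuming the rate stays constant is an extrapolation.
def projected_sites(months, initial_sites=1_000_000, doubling_months=6):
    """Return the projected site count after the given number of months."""
    return initial_sites * 2 ** (months / doubling_months)


for months in (0, 6, 12, 24):
    print(f"after {months:2d} months: {projected_sites(months):,.0f} sites")
```

At this rate the count grows 4x per year, so any fixed crawl or storage budget falls behind quickly.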

7 Storage
Snapshot of the Web on a tape jukebox costs $80k

8 Knowing What Is Important: Mining the WWW for Quality
• Content: 100 million pages
• Link structure: 750 million links
• Usage paths: many hundreds of millions of hits

9 Be Right Enough: Being Useful
• Competition
  - Directories: the biggest links to < 1% of web pages
  - Search engines: returning 1000s of hits (sometimes millions)
• Trends
  - Move to “channels” of less content, but good
  - Limit crawling (50M pages and holding)

10 Be Right Enough: Alexa
• Where am I?
• Where do I want to go?
• Alexa:
• “Can I trust this information?”
• What should I look at next?
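
One simple way to approach "What should I look at next?" from usage paths is to count which page visitors most often go to after the current one. The sketch below illustrates that idea only; it is not Alexa's actual method, which the slides do not describe. The page names and sample paths are hypothetical.

```python
from collections import Counter


def build_transitions(paths):
    """Count how often each page is followed directly by each other page."""
    transitions = {}
    for path in paths:
        for current_page, next_page in zip(path, path[1:]):
            transitions.setdefault(current_page, Counter())[next_page] += 1
    return transitions


def suggest_next(transitions, page, limit=3):
    """Return up to `limit` pages most often visited right after `page`."""
    followers = transitions.get(page, Counter())
    return [next_page for next_page, _ in followers.most_common(limit)]


# Hypothetical usage paths (one list of page names per browsing session).
sample_paths = [
    ["ford/home", "ford/mustang", "fan/mustang"],
    ["ford/home", "ford/mustang", "fan/mustang"],
    ["ford/home", "ford/mustang", "ford/dealers"],
]

transitions = build_transitions(sample_paths)
suggestions = suggest_next(transitions, "ford/mustang")
print(suggestions)
```

Counting only direct successors keeps the example small; a real service would also have to weigh content and link structure, as slide 8 suggests.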

11

12

13

14 Travel Agents

15 Conde Nast Travel

16 Ford Vehicles Homepage

17 Ford’s Mustang Page

18 Independent Mustang Page

19 Surrealism Page

20 Women Surrealists

21 Archive in action

22 Alexa Conclusion