Advanced Internet Systems CSE 454 Daniel Weld. To do Add picture of original MT Add greg little or casting words flowchart Discussion included qualifications,

Slides:



Advertisements
Similar presentations
The Internet.
Advertisements

LABELING IMAGES LUIS VON AHN CARNEGIE MELLON UNIVERSITY.
The Inside Story Christine Reilly CSCI 6175 September 27, 2011.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
THE ESP GAME, & PEEKABOOM LUIS VON AHN CARNEGIE MELLON UNIVERSITY.
Crowdsourcing 04/11/2013 Neelima Chavali ECE 6504.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
WISE: Large Scale Content-Based Web Image Search Michael Isard Joint with: Qifa Ke, Jian Sun, Zhong Wu Microsoft Research Silicon Valley 1.
Lecture 26: Vision for the Internet CS6670: Computer Vision Noah Snavely.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
Internet Enabled Human Computation CSE 454 Daniel Weld.
© nCode 2000 Title of Presentation goes here - go to Master Slide to edit - Slide 1 Anatomy of a Large-Scale Hypertextual Web Search Engine ECE 7995: Term.
Advanced Internet Systems CSE 454 Daniel Weld. Project Presentations - Monday 10:30 - BestBet 10:45 - WikiTruthiness 11:00 - TwitEvents 11:15 - Freshipes.
CAPTCHA & THE ESP GAME SHAH JAYESH CS575SPRING 2008.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page Distributed Systems - Presentation 6/3/2002 Nancy Alexopoulou.
1 The anatomy of a Large Scale Search Engine Sergey Brin,Lawrence Page Dept. CS of Stanford University.
Copyright © D.S.Weld8/8/2015 9:34 PM1 CSE Case Studies Design of Alta Vista Based on a talk by Mike Burrows.
An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto.
Creating a Web Page HTML, FrontPage, Word, Composer.
Search engine structure Web Crawler Page archive Page Analizer Control Query resolver ? Ranker text Structure auxiliary Indexer.
Notes to Teachers 1.Northstar digital literacy assessment or an alternate assessment should be done at the start of each new unit. To access the assessments,
Wasim Rangoonwala ID# CS-460 Computer Security “Privacy is the claim of individuals, groups or institutions to determine for themselves when,
Lecture #32 WWW Search. Review: Data Organization Kinds of things to organize –Menu items –Text –Images –Sound –Videos –Records (I.e. a person ’ s name,
Mrs. Beth Cueni Carnegie Mellon
XHTML Introductory1 Linking and Publishing Basic Web Pages Chapter 3.
XP New Perspectives on Browser and Basics Tutorial 1 1 Browser and Basics Tutorial 1.
Tutorial 1: Browser Basics.
FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Crawling Slides adapted from
Chapter 8 Browsing and Searching the Web. Browsing and Searching the Web FAQs: – What’s a Web page? – What’s a URL? – How does a browser work? – How do.
Lesson 1 What Is the World Wide Web?. Objectives Upon completion of this lesson, you should be able to: Explain what the World Wide Web is and how it.
Copyright © Weld /14/2015 7:25 PM1 CSE 454 Crawlers & Indexing.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Chapter 9 Publishing and Maintaining Your Site. 2 Principles of Web Design Chapter 9 Objectives Understand the features of Internet Service Providers.
Chapter 8 Browsing and Searching the Web. 2Practical PC 5 th Edition Chapter 8 Getting Started In this Chapter, you will learn: − What is a Web page −
McLean HIGHER COMPUTER NETWORKING Lesson 7 Search engines Description of search engine methods.
Copyright © D.S.Weld10/25/ :38 AM1 CSE 454 Index Compression Alta Vista PageRank.
ITCS373: Internet Technology Lecture 5: More HTML.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical.
استاد : مهندس حسین پور ارائه دهنده : احسان جوانمرد Google Architecture.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Publication Spider Wang Xuan 07/14/2006. What is publication spider Gathering publication pages Using focused crawling With the help of Search Engine.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
1 MSRBot Web Crawler Dennis Fetterly Microsoft Research Silicon Valley Lab © Microsoft Corporation.
Lawrence Snyder University of Washington, Seattle © Lawrence Snyder 2004.
1/16/20161 Introduction to Graphs Advanced Programming Concepts/Data Structures Ananda Gunawardena.
Human Computation (aka Crowdsourcing) LUIS VON AHN Slides taken from a talk by.
1 More About HTML Images and Links. 22 Objectives You will be able to Include images in your HTML page. Create links to other pages on your HTML page.
Machine Learning & Datamining CSE 454. © Daniel S. Weld 2 Project Part 1 Feedback Serialization Java Supplied vs. Manual.
1 CS 430: Information Discovery Lecture 17 Web Crawlers.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
1 UNIT 13 The World Wide Web. Introduction 2 Agenda The World Wide Web Search Engines Video Streaming 3.
General Architecture of Retrieval Systems 1Adrienn Skrop.
The Anatomy of a Large-Scale Hypertextual Web Search Engine (The creation of Google)
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
CSE 454 Indexing. Todo A bit repetitive – cut some slides Some inconsistencie – eg are positions in the index or not. Do we want nutch as case study instead.
THE ESP GAME, AND OTHER STUFF
Advanced Internet Systems
Chapter 8 Browsing and Searching the Web
Search Engines and Search techniques
Electronic Communication
CSE 454 Crawlers 9/22/ :09 PM.
Crawlers: Nutch CSE /12/2018 5:08 AM.
Mrs. Beth Cueni Carnegie Mellon
Based on a talk by Mike Burrows
Based on a talk by Mike Burrows
CSE 454 Crawlers & Indexing.
Educational Computing
Based on a talk by Mike Burrows
Presentation transcript:

Advanced Internet Systems CSE 454 Daniel Weld

To do Add picture of original MT Add greg little or casting words flowchart Discussion included qualifications, contracts,

CSE 454 Overview Inverted IndiciesHTTP, HTML, Scaling & Crawling Cryptography & Security

Transfer of Confidential Data You (client)Merchant (server) I want to make a purchase. Here is my RSA public key E and a cert and a “nonce”. My Credit Card and your nonce are E(“ ”, nonce).

Cyptography Symmetric + asymmetric ciphers Stream + block ciphers; 1-way hash Z=Y X mod N Z=Y X mod N DNS, HTTP, HTML Get, put, post Get, put, post Cookies, log file analysis Cookies, log file analysis

Structure of Mercator Spider 1. Remove URL from queue 2. Simulate network protocols & REP 3. Read w/ RewindInputStream (RIS) 4. Has document been seen before? (checksums and fingerprints) 5. Extract links 6. Download new URL? 7. Has URL been seen before? 8. Add URL to frontier Document fingerprints

Common Types of Clusters Simple Web Farm Search Engine Cluster Inktomi (2001) Supports programs (not users) Persistent data is partitioned across servers:  capacity, but  data loss if server fails Layer 4 switches From: Brewer Lessons from Giant-Scale Services

CSE 454 Overview Inverted Indicies Search Engines HTTP, HTML, Scaling & Crawling Cryptography & Security

The Precision / Recall Tradeoff Precision Proportion of selected items that are correct Recall Proportion of target items that were selected Precision-Recall curve Shows tradeoff tn fptpfn Tuples returned by System Correct Tuples Recall Precision AuC

10 Vector Space Representation Dot Product as Similarity Metric TF-IDF for Computing Weights w ij = f(i,j) * log(N/n i ) Where q = … word i … N = |docs| n i = |docs with word i | How Process Efficiently? documents terms q djdj t1t1 t2t2 Copyright © Weld

2/25/2016 1:34 PM11 Thinking about Efficiency Clock cycle: 2 GHz Typically completes 2 instructions / cycle ~10 cycles / instruction, but pipelining & parallel execution Thus: 4 billion instructions / sec Disk access: 1-10ms Depends on seek distance, published average is 5ms Thus perform 200 seeks / sec (And we are ignoring rotation and transfer times) Disk is 20 Million times slower !!! 20 Million times

12 Inverted Files for Multiple Documents DOCID OCCUR POS 1 POS “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document LEXICON OCCURENCE INDEX One method. Alta Vista uses alternative … Copyright © Weld

13 How Inverted Files are Created Crawler Repository Scan Forward Index Sort Sorted Index Scan NF (docs) Lexicon Inverted File List ptrs to docs Copyright © Weld

2/25/2016 1:34 PM14 AltaVista Basic Framework Flat 64-bit address space Index Stream Readers: Loc, Next, Seek, Prev Constraints Let E be ISR for word enddoc Constraints for conjunction a AND b prev(E)  loc(A) loc(A)  loc(E) prev(E)  loc(B) loc(B)  loc(E) b a e b a b e b prev(E) loc(A) loc(E)loc(B)

CSE 454 Overview Inverted Indicies Adverts Search Engines HTTP, HTML, Scaling & Crawling Cryptography & Security

A-B testing Nine Changes in Site Above

CSE 454 Overview Inverted Indicies Supervised Learning Information Extraction Web TablesParsing & POS Tags Adverts Search Engines HTTP, HTML, Scaling & Crawling Cryptography & Security Open IE

Landscape of IE Techniques: Models Lexicons Alabama Alaska … Wisconsin Wyoming Abraham Lincoln was born in Kentucky. member? Sliding Window Abraham Lincoln was born in Kentucky. Classifier which class? Try alternate window sizes: Boundary Models Abraham Lincoln was born in Kentucky. Classifier which class? BEGINENDBEGINEND BEGIN Context Free Grammars Abraham Lincoln was born in Kentucky. NNPVPNPVNNP NP PP VP S Most likely parse? Finite State Machines Abraham Lincoln was born in Kentucky. Most likely state sequence? Slides from Cohen & McCallum Classify Pre-segmented Candidates Abraham Lincoln was born in Kentucky. Classifier which class? Any of these models can be used to capture words, formatting or both.

What is Open Information Extraction?

Self-Supervised Learning from WIkipedia [Wu et.al. CIKM’07] Ben is living in Paris. Extractor (~60-90% precision)

Cool UIs (Zoetrope & Revisiting) CSE 454 Overview Inverted Indicies Supervised Learning Information Extraction Web TablesParsing & POS Tags Adverts Search Engines HTTP, HTML, Scaling & Crawling Cryptography & Security Open IE Human Comp

Self-Supervised Learning from WIkipedia [Wu et.al. CIKM’07] Ben is living in Paris. Extractor (~60-90% precision)

Popup Interface

How Motivate People to Help? Pay them…

Powerset

Your sentence is: The term silver dollar is often used for any large white metal coin issued by the United States with a face value of one dollar ; although purists insist that a dollar is not silver unless it contains some of that metal. Enter one term per box. $0.05

Fast & Cheap, but is it Good? [Snow et al. EMNLP-08]

How Cheap + Fast? In our experiment we ask for 10 annotations each of the full 30 word pairs, at an offered price of $0.02 for each set of 30 annotations (or, equivalently, at the rate of 1500 annotations per USD). The most surprising aspect of this study was the speed with which it was completed; the task of 300 annotations was completed by 10 annotators in less than 11 minutes … 1724 annotations / hour. [Snow et al. EMNLP-08]

Who are those Turkers?

Motivating People Money Fun

IMAGE SEARCH ON THE WEB USES FILENAMES AND HTML TEXT Slides by Luis von Ahn

ACCESSIBILITY LESS THAN 10% OF THE WEB IS ACCESSIBLE TO THE VISUALLY IMPAIRED REASON:MOST IMAGES DON’T HAVE A CAPTION Slides by Luis von Ahn

LABELING IMAGES WITH WORDS STILL A COMPLETELY OPEN PROBLEM FACE MAN SUPER SEXY Slides by Luis von Ahn

DESIDERATA A METHOD THAT CAN LABEL ALL IMAGES ON THE WEB FAST AND CHEAP Slides by Luis von Ahn

TWO-PLAYER ONLINE GAME PARTNERS DON’T KNOW EACH OTHER AND CAN’T COMMUNICATE OBJECT OF THE GAME: TYPE THE SAME WORD THE ONLY THING IN COMMON IS AN IMAGE THE ESP GAME Slides by Luis von Ahn

PLAYER 1PLAYER 2 GUESSING: CARGUESSING: BOY GUESSING: CAR SUCCESS! YOU AGREE ON CAR SUCCESS! YOU AGREE ON CAR GUESSING: KID GUESSING: HAT THE ESP GAME Slides by Luis von Ahn

© 2004 Carnegie Mellon University, all rights reserved. Patent Pending. Slides by Luis von Ahn

MANY PEOPLE PLAY OVER 20 HOURS A WEEK 3.2 MILLION LABELS WITH 22,000 PLAYERS THE ESP GAME IS FUN Slides by Luis von Ahn

LABELING THE ENTIRE WEB INDIVIDUAL GAMES IN YAHOO! AND MSN AVERAGE OVER 10,000 PLAYERS AT A TIME 5000 PEOPLE PLAYING SIMULTANEOUSLY CAN LABEL ALL IMAGES ON GOOGLE IN 30 DAYS! Slides by Luis von Ahn

9 BILLION MAN-HOURS OF SOLITAIRE WERE PLAYED IN 2003 EMPIRE STATE BUILDING PANAMA CANAL 7 MILLION MAN-HOURS (6.8 HOURS OF SOLITAIRE) 20 MILLION MAN-HOURS (LESS THAN A DAY OF SOLITAIRE) Slides by Luis von Ahn

GWAP Problem?

Motivating People Money Fun Altruism Esteem Self-Interest

Altruism Self-Esteem

Motivating People Money Fun Altruism Esteem Self-Interest

Motivating Vision Next-Generation Search = Information Extraction + Ontology + Inference Which German Scientists Taught at US Universities? … Albert Einstein was a German- born theoretical physicist … … Einstein was a guest lecturer at the Institute for Advanced Study in New Jersey … New Jersey is a state in the Northeastern region of the United States …

Next-Generation Search Information Extraction … Ontology Physicist (x)  Scientist(x)… Inference Einstein = Einstein… Scalable Means Self-Supervised … Albert Einstein was a German- born theoretical physicist … … New Jersey is a state in the Northeastern region of the United States …