Web Archaeology
Raymie Stata, Compaq Systems Research Center


What is Web Archaeology?
– The study of the content of the Web: exploring the Web, sifting through data, making valuable discoveries
– Difficult! Because the Web is: boundless, dynamic, radically decentralized

Some recent results
– Empirical studies: quality of almost-breadth-first crawling; structure of the Web; random walks (size of search engines)
– Improving the Web experience: better and more precise search results; surfing assistants and tools
– Data mining: technologies for “page scraping”

Tools for “Web scale” research (a four-layer stack)
– Apps (use data): search quality, crawl quality, duplicate elimination, Web characterization
– Feature databases (access subsets of the data fast): full-text index, shingleprints, connectivity, term vectors
– Data storage (store and access web pages): Myriad
– Data collection (download web pages): Mercator, Web Language

Mercator and Atrax: Web-scale crawling

The Mercator web crawler
– A high-performance web crawler: downloads and processes web pages
– Written entirely in Java: runs on any Java-capable platform
– Extensible: wide range of possible configurations; users can plug in new modules at run time

Mercator design points
– Extensible: well-chosen extensibility points; framework for configuration
– Multiple threads with synchronous I/O (vs. a single thread with asynchronous I/O)
– Checkpointing: allows crawls to be restarted; modules export prepare/commit operations
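A minimal sketch (in Python, not Mercator's actual Java code) of the multiple-threads/synchronous-I/O design point: each worker thread blocks on its own download, so concurrency comes from the thread count rather than from async I/O. The `fetch` function here is a hypothetical stand-in for a synchronous HTTP download.

```python
import queue
import threading

frontier = queue.Queue()   # shared URL frontier
results = []
results_lock = threading.Lock()

def fetch(url):
    # Hypothetical stand-in for a blocking HTTP download.
    return f"<html>contents of {url}</html>"

def worker():
    while True:
        url = frontier.get()
        if url is None:          # sentinel: shut this worker down
            frontier.task_done()
            break
        page = fetch(url)        # synchronous (blocking) I/O
        with results_lock:
            results.append((url, len(page)))
        frontier.task_done()

def crawl(urls, num_threads=4):
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for u in urls:
        frontier.put(u)
    for _ in threads:            # one sentinel per worker
        frontier.put(None)
    for t in threads:
        t.join()
    return sorted(results)
```

With synchronous I/O, each thread's logic stays straight-line simple; scaling the crawl is a matter of raising the thread count.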

System Architecture

Crawl quality

Atrax, a distributed version of Mercator
– Distributes load across a cluster of crawlers
– Partitions data structures across crawlers
– No central bottleneck
– Network bandwidth is the limiting factor
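The partitioning idea can be sketched as follows; the hash-by-host scheme is an assumption for illustration, not necessarily the exact function Atrax used.

```python
import hashlib

def owner(url, num_crawlers):
    """Assign a URL to one crawler in the cluster by hashing its host."""
    host = url.split("/")[2] if "://" in url else url.split("/")[0]
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % num_crawlers
```

Because every URL on a given host routes to the same crawler, per-host politeness state stays local and no central coordinator sits in the crawl loop.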

Performance of Atrax vs. Mercator

Myriad, a new project
– A very large, archival storage system: scalable to a petabyte
– With function shipping: supports data mining

Myriad requirements
– Large (up to 10K disks)
– Commodity hardware (low cost)
– Easy to manage
– Easy to use (queries vs. code)
– Fault tolerance and containment: no backups, tape or otherwise

Two phases of the Myriad project
– Define the service-level interface: implemented to run on collections of files; testing and tuning
– Build a scalable implementation: cluster storage and processing; being designed now, prototype in the summer; won’t describe today

New service-level interface
– Stack: Applications → Storage service (file systems, databases, Myriad) → Blocks
– Better suited to this problem and scale
– Supports “function shipping”

Myriad interface
– Single-table database
– Stored vs. virtual columns: virtual columns are computed by injected code
– Bulk input of new records
– Management of the code defining virtual columns
– Output via select/project queries: user-defined code runs implicitly; support for repeatable random sampling

Example Myriad query
[samplingprob=0.1, samplingseed= ]
select name, length
where insertionDate < Date(00/01/01) && mimeType == “text/html”;
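One way repeatable random sampling could be implemented (an assumption for illustration, not the documented Myriad mechanism) is to hash each record key together with the sampling seed and keep the record when the hash falls below the sampling probability; the same (seed, probability) pair then always selects the same records.

```python
import hashlib

def sampled(key, sampling_prob, sampling_seed):
    """Deterministically decide whether a record is in the sample."""
    h = hashlib.sha1(f"{sampling_seed}:{key}".encode()).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    u = int.from_bytes(h[:8], "big") / 2**64
    return u < sampling_prob
```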

Model for large-scale data mining
– Step 1: make an extract. Do a data-parallel select and project; don’t do any sorts, joins, or groupings.
– Step 2: put the extract into a high-power analysis tool for the sorts, joins, and groupings.
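In miniature, the two-step model might look like this (the page records are hypothetical; step 1 is shown serially here but is embarrassingly parallel across the store):

```python
from collections import defaultdict

pages = [
    {"name": "a.html", "length": 120, "mimeType": "text/html"},
    {"name": "b.gif",  "length": 900, "mimeType": "image/gif"},
    {"name": "c.html", "length": 340, "mimeType": "text/html"},
]

# Step 1: select/project only -- no sorts, joins, or groupings.
extract = [(p["mimeType"], p["length"]) for p in pages
           if p["mimeType"] == "text/html"]

# Step 2: group/aggregate the (much smaller) extract in an analysis tool.
total = defaultdict(int)
for mime, length in extract:
    total[mime] += length
```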

Feature databases
– URL DB: URL → pgid
– Host DB: pgid → hostid
– Link DB: out: pgid → pgid*; in: pgid → pgid*
– Term vector DB: pgid → term vector

URL database: prefix compression
– Sort the URLs, then store each one as a shared-prefix length plus the suffix that differs from the previous URL
– (Slide shows example URLs such as wi-us.com/~amigo/links/index.htm and tri.re.kr/~khshim/internet/bookmark.html alongside their prefix-compressed encodings)
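The scheme can be sketched as follows (`compress`/`decompress` are illustrative names): each entry records how many leading characters the URL shares with its predecessor, plus the differing suffix.

```python
def compress(urls):
    """Prefix-compress a sorted list of URLs into (shared_len, suffix) pairs."""
    out, prev = [], ""
    for u in urls:
        n = 0
        while n < min(len(u), len(prev)) and u[n] == prev[n]:
            n += 1
        out.append((n, u[n:]))
        prev = u
    return out

def decompress(entries):
    """Rebuild the URL list from (shared_len, suffix) pairs."""
    urls, prev = [], ""
    for n, suffix in entries:
        u = prev[:n] + suffix
        urls.append(u)
        prev = u
    return urls
```

Sorting first is what makes this pay off: adjacent URLs on the same host share long prefixes, so most entries store only a short suffix.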

URL compression
– Prefix compression: 44 → 14.5 bytes/URL; fast to decompress (~10 µs)
– ZIP compression: 14.5 → 9.2 bytes/URL; slow to decompress (~80 µs)

Term vector basics
– The basic abstraction for information retrieval; useful for measuring the “semantic” similarity of text
– (Slide shows a page-by-term table of counts) A row in the table is a “term vector”; columns are word stems and phrases
– Trying to capture “meaning”

Compressing term vectors
– Sparse representation: only store columns with non-zero counts
– Lossy representation: only store “important” columns
– “Importance” determined by: count of the term on the page (high ==> important); number of pages with the term (low ==> important)
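A sketch of a sparse, lossy term vector built from these two importance signals, using a TF-IDF-style score (the exact weighting is an assumption, not the one used in the talk):

```python
import math
from collections import Counter

def term_vector(words, doc_freq, num_pages, k=3):
    """Keep only the k most 'important' terms, stored sparsely."""
    counts = Counter(words)
    # Importance rises with on-page count, falls with document frequency.
    scored = {
        t: c * math.log(num_pages / (1 + doc_freq.get(t, 0)))
        for t, c in counts.items()
    }
    top = sorted(scored, key=scored.get, reverse=True)[:k]
    return {t: counts[t] for t in top}   # sparse: only the kept columns
```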

TVDB Builder

Applications
– Categorizing pages
– Topic distillation
– Filtering pages
– Identifying languages
– Identifying running text
– Relevance feedback (“more like this”)
– Abstracting pages

Categorization
– “Bulls take over”

How to categorize a page
– Off line: collect a training set of pages per category (~30K); combine the training pages into category vectors (~10K terms per category vector)
– On line: use the term vector DB to look up the page’s vector; find the category vector that best matches it
  – Use a Bayesian classifier to match vectors
  – Give no category if the match is not “definitive”
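The on-line matching step can be sketched with a cosine-similarity matcher standing in for the Bayesian classifier the slide mentions; the 0.3 "definitive" threshold is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize(page_vec, category_vecs, threshold=0.3):
    """Return the best-matching category, or None if no match is definitive."""
    best, score = None, 0.0
    for cat, vec in category_vecs.items():
        s = cosine(page_vec, vec)
        if s > score:
            best, score = cat, s
    return best if score >= threshold else None
```

For the "Bulls take over" headline, the category vectors disambiguate: a sports page vector shares terms like "game", a finance one shares "market".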

Topic drift in topic distillation
– Some Web IR algorithms have this structure: compute a “seed set” for a query; find a neighborhood by following links; rank this neighborhood
– Topic drift (a problem): the neighborhood graph includes off-topic nodes, e.g. “Download Microsoft Explorer” links drag in the MS home page

Avoid topic drift with term vectors
– Combine the term vectors of the seed set into a topic vector
– Detecting topic drift in neighboring nodes: compare the topic vector with each node’s term vector (an inner product works fine), then expunge or down-weight off-topic nodes
– Integration of the feature databases helps!
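The topic-vector filtering can be sketched as follows; the score threshold is a hypothetical tuning parameter.

```python
def combine(vectors):
    """Sum the seed pages' term vectors into a single topic vector."""
    topic = {}
    for v in vectors:
        for t, c in v.items():
            topic[t] = topic.get(t, 0) + c
    return topic

def filter_drift(topic, nodes, threshold):
    """Keep only neighborhood nodes whose inner product with the topic is high enough."""
    keep = []
    for name, vec in nodes:
        score = sum(vec[t] * topic.get(t, 0) for t in vec)  # inner product
        if score >= threshold:
            keep.append(name)
    return keep
```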

Link database
– Goals: fit the links into RAM (fast lookup); build in 24 hours
– Applications: page ranking, Web structure, mirror-site detection, related-page detection

Link storage: baseline design
– (Diagram: an id-indexed “starts” array pointing into a flat array of link lists)

Link storage: deltas
– (Diagram: the flat link array now stores deltas between successive link ids instead of the ids themselves)

Link storage: compression
– Variable-length encoding of the link deltas (4-, 8-, and 12-bit codes shown on the slide)
– Result: 1.7 bytes/link
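The delta-plus-variable-length idea can be sketched with a standard byte-oriented varint (the slide's actual 4/8/12-bit codes are not reproduced here): sort each page's outgoing link ids, store differences between consecutive ids, and spend only as many bytes as each delta needs.

```python
def encode_links(link_ids):
    """Delta-encode sorted link ids, then varint-encode each delta."""
    out, prev = bytearray(), 0
    for lid in sorted(link_ids):
        delta = lid - prev
        prev = lid
        while True:                      # 7 payload bits/byte, high bit = "more"
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                break
    return bytes(out)

def decode_links(data):
    """Invert encode_links: varint-decode deltas, then prefix-sum them."""
    ids, cur, shift, delta = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += delta
            ids.append(cur)
            delta, shift = 0, 0
    return ids
```

Since most links point within a site, sorted ids cluster and most deltas are small, which is what pushes the cost down toward a byte or two per link.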

LDBng

The future of Web Archaeology
– Driving applications
  – Web search: “finding things on the web”; page classification (topic, community, type); purpose-specific search
  – Web “asset management” (what’s on my site?)
  – Automated information extraction (price robots)
– A multi-billion-page web
– Dynamics