How to Cha-Cha: Looking under the hood of the Cha-Cha Intranet Search Engine. Marti Hearst, SIMS SIMposium, April 21, 1999

This Talk Overview of goals System implementation details Not: –UI evaluation –related work –etc.

People Principals: Mike Chen and Marti Hearst Early coding: Jason Hong Early UI evaluation: Jimmy Lin, Mike Chen Current UI evaluation: Shiang-Ling Chen

Cha-Cha Goals Better Intranet search –integrate searching and browsing –provide context for search results –familiarize users with the site structure UI –minimal browser requirements: a widely usable HTML interface –build on user familiarity with existing systems

Intranet Search Documents used in a large, diverse Intranet, e.g., University.edu, Corporation.com, Government.gov Hypothesis: it is meaningful to group search results according to organizational structure

Searching Earthquakes at UCB: Standard Way

Searching Earthquakes at UCB with Cha-Cha

Cha-Cha and Source Selection Shows available sources; sources are major web sites. The user may want to navigate the source rather than go directly to the search hits. Gives hints about the relative importance of the various sources. Reveals the structure of the site while tightly integrating this structure with search. Users tell us anecdotally that the outline view is useful for finding starting points.

System Overview Collect shortest paths for each page. –Global paths: from the root of the domain –Local paths: from the root of the server –Select “the best” path based on the query User interaction with the system (Cha-Cha and Cheshire): 1. the user sends a query to Cha-Cha 2. Cha-Cha passes the query to Cheshire 3. Cheshire returns the hits 4. Cha-Cha selects paths & generates HTML 5. the HTML results page is returned to the user

Current Status Over 200,000 pages indexed About 2,500 queries per weekday Less than 3 sec/query on average Five subdomains using it as their site search engine –eecs millennium project –sims –law –career center

Cha-Cha Preprocessing

Overview of Cha-Cha Preprocessing Crawl the entire Intranet –Store copies of pages locally –200,000 pages on the UCB Intranet Revisit all the pages (on disk) –Create metadata for each page –Compute the shortest hyperlink path from a root page to every web page (both global and local paths) Index all the pages –Using Cheshire II (Ray Larson, SIMS) –Index full text, titles, and shortest paths separately

Web Crawling Algorithm Start with a list of servers to crawl –for UCB, simply start with www.berkeley.edu Restrict the crawl to certain domain(s) –*.berkeley.edu Obey the no-robots (robots exclusion) standard Follow hyperlinks only –do not read local filesystems Links are placed on a queue; traversal is breadth-first

Web Crawling Algorithm (cont.) Interpret the HTML on each web page and record the text of the page in a file on disk. –Make a list of all the pages that this page links to (outlinks) –Follow those links one at a time, repeating this procedure for each page found, until no unexplored pages are left. Links are placed on a queue, so traversal is breadth-first; URLs that have already been crawled are stored in an in-memory hash table to avoid repeats.
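A minimal sketch of this breadth-first crawl in Python; fetch, extract_links, and save_text are placeholders for the real crawler's HTTP, HTML-parsing, and storage code, and the domain restriction is passed in as a suffix (names are illustrative, not the actual Cha-Cha implementation):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, fetch, extract_links, save_text, allowed_suffix=".berkeley.edu"):
    """Breadth-first crawl restricted to one domain.

    fetch(url) -> HTML string or None; extract_links(html, base_url) -> hrefs;
    save_text(url, html) stores a local copy of the page text on disk.
    """
    queue = deque(seed_urls)   # FIFO frontier => breadth-first traversal
    seen = set(seed_urls)      # in-memory hash table of URLs already enqueued, to avoid repeats
    while queue:
        url = queue.popleft()
        html = fetch(url)      # the real crawler also obeys the robots exclusion standard here
        if html is None:
            continue
        save_text(url, html)   # record the page text in a file on disk
        for href in extract_links(html, url):
            absolute = urljoin(url, href)                      # follow hyperlinks only
            if absolute in seen:
                continue
            if not urlparse(absolute).netloc.endswith(allowed_suffix):
                continue                                       # restrict crawl to *.berkeley.edu
            seen.add(absolute)
            queue.append(absolute)
```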

Custom Web Crawler Special considerations –full coverage: web search engines don’t go very deep, and they skip problematic sites –search on “Berdahl” at snap: 430 hits –search on “Berdahl” on Cha-Cha: XXX hits –solution: tag each URL with a retry counter; if the server is down, put the URL at the end of the queue and decrement the retry counter; if the counter reaches 0, give up on the URL
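A sketch of the retry policy just described, with (url, retries_left) queue entries; MAX_RETRIES and the fetch helper are assumptions for illustration (the actual retry limit is not given in the talk):

```python
from collections import deque

MAX_RETRIES = 3  # illustrative value; not specified in the talk

def crawl_with_retries(seed_urls, fetch):
    """fetch(url) returns the page HTML, or None if the server is down."""
    queue = deque((url, MAX_RETRIES) for url in seed_urls)
    pages = {}
    while queue:
        url, retries_left = queue.popleft()
        html = fetch(url)
        if html is not None:
            pages[url] = html
        elif retries_left > 0:
            # server down: push the URL to the back of the queue
            # with one fewer retry remaining
            queue.append((url, retries_left - 1))
        # else: retry counter exhausted, give up on this URL
    return pages
```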

Custom Web Crawler Special considerations –servers with multiple names, e.g., info.berkeley.edu is the same server reachable under another hostname –solution: hash the home page of each server into a table; whenever a new server is found, compare its home page to those in the table; if it is a duplicate, record the new server’s name as being the same as the original server’s
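A minimal sketch of that duplicate-server check, assuming the raw home-page content is hashed directly (the real comparison may normalize pages first; the names here are illustrative):

```python
import hashlib

canonical_by_hash = {}  # home-page content hash -> canonical server name
alias_of = {}           # server name -> canonical server name

def register_server(server_name, home_page_html):
    """Map a newly discovered server onto an already-known one when their
    home pages are identical; otherwise it becomes its own canonical name."""
    digest = hashlib.sha1(home_page_html.encode("utf-8")).hexdigest()
    if digest in canonical_by_hash:
        alias_of[server_name] = canonical_by_hash[digest]   # duplicate: alias of the original
    else:
        canonical_by_hash[digest] = server_name
        alias_of[server_name] = server_name
    return alias_of[server_name]
```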

Cha-Cha Metadata Information about web pages –Title –Length –Inlinks –Outlinks –Shortest paths from a root home page

Metafile Generator Main task: find shortest-path information –Two passes: global and local Global pass: –start with the main home page H (www.berkeley.edu) –find the shortest path from H to every page in the system for each page, keep track of how far it is from H and of the path that got you there store this information in a disk-based storage manager (we use Sleepycat’s Berkeley DB) if a page is re-encountered via a path with a shorter distance, record that distance and the new path –when this is done, write out a metafile for each page
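The global pass amounts to a breadth-first shortest-path computation over the hyperlink graph. A sketch in Python, using an in-memory dict where Cha-Cha uses a disk-based Berkeley DB table; outlinks_of and the root URL are assumed inputs taken from the crawl data:

```python
from collections import deque

def global_paths(root, outlinks_of):
    """outlinks_of(url) -> list of URLs that `url` links to (from the crawl).
    Returns {url: (distance_from_root, path_from_root)}."""
    best = {root: (0, [root])}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        dist, path = best[url]
        for target in outlinks_of(url):
            # breadth-first order means the first visit is via a shortest path;
            # the second condition also handles a later, shorter rediscovery,
            # replacing the stored distance and path as the slide describes
            if target not in best or dist + 1 < best[target][0]:
                best[target] = (dist + 1, path + [target])
                queue.append(target)
    return best
```

The local pass described next runs the same computation once per server, rooted at that server's home page, and stores its results in a separate database.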

Metafile Generator (cont.) Local pass: –start with a list of all the servers found during the crawl –for each server S, find the shortest path from S to every page in the system do this the same way as in the global pass, but store the results in a different database when done, write out a metafile for each page, in a different directory than the one used for the global pass

Metafile Generator (cont.) Combine local and global path information Purpose: –locality should “trump” global paths, but not all local pages are reachable locally –example: the shortest global path from the campus home page to my home page runs campus home page -> search.berkeley.edu -> cha-cha.berkeley.edu -> my home page, but we want my home page to appear under the SIMS faculty listing –solution: let local trump global

Metafile Generator (cont.) Combine local and global path information How to do it: –go through the metafiles in the global directory –for each metafile: if there is already a metafile for that URL in the local directory, skip this metafile; otherwise (there is no metafile for this URL locally), copy the metafile into the local directory Why not just use local metafiles? –some pages are not linked to within their own domain, e.g., a student association hosted within a particular student’s domain
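A sketch of that merge over two directories of metafiles, assuming one metafile per page with matching filenames in the global and local directories (the talk does not specify the naming scheme):

```python
import shutil
from pathlib import Path

def merge_metafiles(global_dir, local_dir):
    """Local paths trump global ones: copy a global metafile into the
    local directory only when no local metafile exists for that page."""
    global_dir, local_dir = Path(global_dir), Path(local_dir)
    for meta in global_dir.iterdir():
        target = local_dir / meta.name
        if target.exists():
            continue                 # a local path already covers this page; keep it
        shutil.copy2(meta, target)   # page only reachable globally; fall back to the global path
```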

Sample Cha-Cha Metadata file (the SGML markup did not survive in this transcript; only the text content is visible: “Welcome to SIMS”, null)

Cha-Cha Metadata File, cont. (markup not preserved; the visible field values are: 2, 1, “Welcome to UC Berkeley”, “UC Berkeley Teaching Units”, 0, /projects/cha-cha/development/data/done/text/)
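Since the markup of the sample was lost, here is a hedged reconstruction of the kind of record a metafile holds, expressed as a Python dict keyed by the element names from the METAFILE declaration in the DTD shown on the following slides; every value is an illustrative guess, not the actual record:

```python
# Field names follow the METAFILE element of the Cha-Cha DTD;
# all values below are illustrative placeholders.
sample_metafile = {
    "URL": "http://www.sims.berkeley.edu/",      # hypothetical page
    "TITLE": "Welcome to SIMS",                  # title text visible on the slide
    "DATE": "19990421",                          # date format unknown
    "SIZE": 4096,
    "INLINKCOUNT": 2,
    "INLINKS": ["..."],                          # omitted here
    "OUTLINKCOUNT": 1,
    "OUTLINKS": ["..."],                         # omitted here
    "DEPTH": 1,
    "SHORTESTPATHSCOUNT": 1,
    "SHORTESTPATHS": [["Welcome to UC Berkeley", "Welcome to SIMS"]],  # path as page titles
    "MIRRORCOUNT": 0,
    "MIRRORURLS": [],
    "TYPE": "HTML",
    "DOMAIN": "sims.berkeley.edu",
    "FILE": "/projects/cha-cha/development/data/done/text/",  # path as shown (truncated) on the slide
}
```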

CHESHIRE II Search back-end for Cha-Cha –Ray Larson et al., ASIS ’95, JASIS ’96 The CHESHIRE II system: full-service full-text search Client/server architecture Z39.50 IR protocol Interprets documents written in SGML Probabilistic ranking Flexible data representation

CHESHIRE II (cont.) A big advantage of Cheshire: –we don’t have to write a special parser for each special document type –instead, we simply create one DTD and the system takes care of parsing the metafiles for us A related advantage: –we can create indexes on individual components of the document, which allows efficient title search, home-page search, and domain-based search without extra programming

Cha-Cha Document Type Definition
<!SGML "ISO 8879:1986" --
CHARSET
BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 9 UNUSED UNUSED UNUSED UNUSED
BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
DESCSET UNUSED UNUSED

Cha-Cha DTD, cont. (parts omitted)
<!doctype METADATA [
<!-- This is a DTD for metadata records extracted from the HTML files in the cha-cha system. The tagging is simple, with nothing particular about it. The structure has been kept flat within the individual records. The only somewhat interesting thing is the TEXT-REF tag, which is used to contain a reference to the full text of the entry stored in raw HTML form. -->

Cha-Cha DTD, cont. (parts omitted)
<!ELEMENT METAFILE - - (URL, TITLE, DATE, SIZE, INLINKCOUNT, INLINKS, OUTLINKCOUNT, OUTLINKS, DEPTH?, SHORTESTPATHSCOUNT?, SHORTESTPATHS?, MIRRORCOUNT?, MIRRORURLS?, TYPE?, DOMAIN?, FILE?)>

Cha-Cha Online Processing

Responding to the User Query User searches on “pam samuelson” Search Engine looks up documents indexed with one or both terms in its inverted index Search Engine looks up titles and shortest paths in the metadata index User Interface combines the information and presents the results as HTML
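A sketch of this online flow; cheshire_search, lookup_metadata, and render_outline stand in for the real Cheshire query, metadata-index lookup, and HTML-generation steps (all three names are assumptions for illustration):

```python
def respond_to_query(query, cheshire_search, lookup_metadata, render_outline):
    """Glue logic for one request: get ranked hits from the Cheshire back end,
    attach each hit's title and shortest paths from the metadata index,
    then hand everything to the UI layer to render the HTML outline."""
    hits = cheshire_search(query)       # ranked list of page URLs matching the query terms
    results = []
    for url in hits:
        meta = lookup_metadata(url)     # e.g., {"title": ..., "shortest_paths": [...]}
        results.append({
            "url": url,
            "title": meta["title"],
            "paths": meta["shortest_paths"],
        })
    return render_outline(results)      # HTML listing, grouped by shortest path
```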

Building the Outline View Main issue: how to combine shortest paths –There are approximately three shortest paths per web page –We assume users do not want to see the same page multiple times Strategy: –Group hits together within the hierarchy –Try to avoid showing subhierarchies with singleton hits This assumption is based in part on evidence from our earlier clustering research that relevant documents tend to cluster near one another

Building the Outline View (cont.) Goals of the algorithm: –(I) Group (recursively) as many pages together within a subhierarchy as possible; avoid (recursively) branches that terminate in only one hit (leaf) –(II) Remove as many internal nodes as possible while still retaining at least one valid path to every leaf –(III) Remove as many edges as possible while retaining at least one path to every leaf

Building the Outline View (cont.) To achieve these goals we need a non-standard graph algorithm –To do it properly, every possible subset of nodes at depth D would have to be considered to determine the minimal subset that covers all nodes at depth D+1 –This is inefficient: it would require 2^k checks for k nodes at depth D Instead, we use a heuristic approach that approximates the optimal result

Building the Outline View (cont.) First, a top-down pass –record depth of each node and the number of children it links to directly Second, a bottom-up pass –identify the deepest nodes (the leaves) –D <- the set of nodes that are parents of leaves –Sort D ascending according to how many active children they link to at depth D+1 –A node is active if it has not been eliminated

Building the Outline View (cont.) Bottom-up pass, continued –every node is a candidate to be eliminated –the nodes with the fewest children are eliminated first, because of goal (I) –for each candidate C: if C links to one or more active nodes at depth D+1 that are not covered by any other active node, then C cannot be eliminated; otherwise, C is removed from the active list After level D is complete, there are no active nodes at depth D that cover only nodes also covered by another node at depth D
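A sketch of this bottom-up pruning heuristic, assuming each node's depth and children are known from the top-down pass; the data layout (plain dicts) and traversal details are illustrative, not the actual Cha-Cha code:

```python
from collections import defaultdict

def prune_outline(nodes, children_of, depth_of):
    """Bottom-up heuristic: at each level, try to eliminate internal nodes
    (fewest active children first) whenever every child they cover is still
    covered by some other active node at that level.

    nodes: iterable of node ids; children_of[n]: list of child ids;
    depth_of[n]: depth from the root.  Returns the set of active nodes."""
    active = set(nodes)
    max_depth = max(depth_of[n] for n in nodes)

    by_depth = defaultdict(list)            # internal nodes grouped by depth
    for n in nodes:
        if children_of[n]:
            by_depth[depth_of[n]].append(n)

    for d in range(max_depth - 1, -1, -1):  # deepest internal level first
        candidates = sorted(
            (n for n in by_depth[d] if n in active),
            key=lambda n: sum(1 for c in children_of[n] if c in active),
        )                                   # fewest active children first (goal I)
        for cand in candidates:
            # cand may be removed only if each of its active children is also
            # covered by some *other* active node at this depth
            covered_elsewhere = all(
                any(c in children_of[other]
                    for other in by_depth[d]
                    if other != cand and other in active)
                for c in children_of[cand] if c in active
            )
            if covered_elsewhere:
                active.discard(cand)
    return active
```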

Building the Outline View (cont.) Retaining rank ordering –Build up the tree by first placing the highest-ranked hit (leaf) into the tree –As more leaves are added, more parts of the hierarchy are added, but the order in which the parts of the hierarchy were added is retained When the hierarchy has been built, it is traversed to create the HTML listing
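A small sketch of this rank-preserving build, assuming each hit arrives with its (pruned) path of ancestor page titles; Python dicts preserve insertion order, which is what keeps branches introduced by higher-ranked hits above later ones (names are illustrative):

```python
def build_outline(ranked_hits):
    """ranked_hits: list of (path, title) pairs in rank order, where `path`
    is the list of ancestor page titles leading down to the hit.
    Children appear in the order they were first added, so branches
    introduced by higher-ranked hits stay above lower-ranked ones."""
    tree = {}
    for path, title in ranked_hits:
        node = tree
        for label in path:
            node = node.setdefault(label, {})   # create the branch on first use
        node[title] = {}                        # the hit itself becomes a leaf
    return tree

def to_html(tree, indent=0):
    """Traverse the finished hierarchy and emit a nested HTML list."""
    lines = []
    for label, children in tree.items():
        lines.append("  " * indent + "<li>" + label)
        if children:
            lines.append("  " * indent + "<ul>")
            lines.append(to_html(children, indent + 1))
            lines.append("  " * indent + "</ul>")
        lines.append("  " * indent + "</li>")
    return "\n".join(lines)
```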

Summary Better user interfaces for search should: –Help users understand starting points/sources –Place the results of a search into an organizing context One (of many) approaches –Cha-Cha: simultaneously browse and search intranet site context Future work –Special handling for short queries –Spelling-correction suggestions –Smarter paths