WIRED Week 4 Syllabus Review Readings Overview - Web IR Chapter - Brin & Page - Google - Kobayashi & Takeda – Overview Search Engine Optimization Assignment.


WIRED Week 4 Syllabus Review Readings Overview - Web IR Chapter - Brin & Page - Google - Kobayashi & Takeda – Overview Search Engine Optimization Assignment Overview & Scheduling Projects and/or Papers Discussion - Idea Pitch - Group Formation

Web makes IR an everyday activity Search Engines Search Interfaces The openness of the Web changes everything - Access - Technological progress - Expectation - Credibility - Networks and Networking

How Much Information Out there? UC Berkeley project Center for the Digital Future Pew Internet & American Life Project What kinds of information is it? What formats? - Information = Web pages? - Now - Future Who creates it? Why do they publish it? Content and Context

Investigation of Web Documents Are Web documents different? - Structure - HTML & other markup Common tags - Content - “information” & commerce Readability Usability - Context - “sociological insights”, & spam Links - Interest - topics, titles, keywords, file types - Interface - browsers (& crawlers) Older study, what’s new? - More multimedia - XHTML & XML - AJAX, REST, SOAP, Web 2.0?

Statistical Profiles of Highly-rated Web sites A Quality Checker - Good design makes better Web pages Look at popular pages & see what makes them popular We know good pages when we see (use) them - Different types of Web page structures Elements - Text, links & graphics (& their formatting) - Accessibility, Size, errors, nav links (scent) - Architecture of site What makes these pages good for searching?

Content - Organizing & Accessing Distributed Data(base) Dynamic Data - Mobile - Ephemeral Huge Volume Unstructured and Redundant Quality Heterogeneous - Languages - Code pages

Measuring the Web How would you measure? - Size (crawling) - Surveys - Hits & Metering - Bandwidth use What do numbers mean? - Number of Hosts? - Number of Sites? - Number of Pages? Accurate +/- a lot

The Web is a Bowtie? Structure - pass from any node of IN through SCC to any node of OUT - hanging off IN & OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC - a TENDRIL hanging off from IN can be hooked into a TENDRIL leading into OUT, forming a TUBE - a passage from a portion of IN to a portion of OUT without touching SCC. Broder et al. 2000
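
The bowtie regions can be separated with two reachability passes. The following is a minimal sketch, assuming a small in-memory link graph and one node already known to sit in the SCC; the names `reachable`, `bowtie` and `core_node` are illustrative and do not come from Broder et al. Tendrils, tubes and disconnected pages are lumped together as OTHER.

```python
from collections import deque

def reachable(start, adjacency):
    """Return every node reachable from start by following adjacency lists."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bowtie(links, core_node):
    """links: {page: [pages it links to]}; core_node is assumed to lie in the SCC."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    # Reverse graph: who links to each page
    reverse = {}
    for src, targets in links.items():
        for t in targets:
            reverse.setdefault(t, []).append(src)
    forward = reachable(core_node, links)      # SCC plus OUT
    backward = reachable(core_node, reverse)   # SCC plus IN
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,   # tendrils, tubes, disconnected pages
    }
```

A page linked from IN that never reaches the SCC lands in OTHER, which matches the slide's description of a tendril.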

Web Search Engines Independent of IR model Distributed index and servers - Crawler - Query server - Indexer Crawlers and Spiders - Centralized control, Coordinated, Refresh, Filtering - Not the main problem Queries - Interface, processing, results Indexing - Data normalization, load balancing, data sharing

Harvesting Not just Web data - Caching, Duplication, Normalization Armies of crawlers Filtering collected data Gatherers - Collects and extracts on various schedules - Works with several brokers Brokers - Indexes and interfaces to queries - Works with other Brokers and Gatherers Topical Agents?

Web Crawling Issues Follow chains of URLs to gather more URLs Extract index (content) from each page Lather-Rinse-Repeat Update crawler to-do list Associate frequency of crawls Breadth or Depth first? Endless looping Duplicate pages/sites Changed page (or not really?) Dynamically generated pages Intranet pages Markup language getting in the way NOROBOTS What should a crawler get?
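
Several of these issues (the to-do list, breadth versus depth first, endless looping, duplicate pages, robots exclusion) show up even in a toy crawler. The following is a minimal sketch, not a production design: `fetch_links` and `allowed` are hypothetical callbacks supplied by the caller, so the example runs without touching the network.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit

def normalize(url):
    """Drop the fragment and lowercase scheme and host so trivial variants match."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def crawl(seeds, fetch_links, allowed=lambda url: True, max_pages=100):
    """Breadth-first crawl; fetch_links(url) returns the URLs linked from url."""
    frontier = deque(normalize(s) for s in seeds)  # the crawler to-do list
    seen = set(frontier)                           # guards against endless looping
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()                   # popleft() is breadth first; pop() would be depth first
        if not allowed(url):                       # stand-in for a robots exclusion check
            continue
        visited.append(url)
        for link in fetch_links(url):
            link = normalize(link)
            if link not in seen:                   # skip duplicates and already queued pages
                seen.add(link)
                frontier.append(link)
    return visited
```

Changed pages, crawl frequency and dynamically generated pages are left out; they need per-URL state (last fetch time, content hash) that this sketch does not keep.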

Indexing the Web Inverted File Index - Sorted words with pointers to location(s) & page(s) - Pointers are the focus (inversion) What about pages and sites? - Massive redundancy on well-organized sites Navigation Topics Content “State of the art indexing techniques” = 30% of text (not page) size. p 383 How can you tune an index for massively changing documents?
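
An inverted file can be illustrated in a few lines. This is a minimal in-memory sketch with made-up page ids; real engines compress the pointer lists and keep them on disk, which the sketch ignores.

```python
import re
from collections import defaultdict

def build_inverted_index(pages):
    """pages: {page_id: text}. Returns {word: {page_id: [word positions]}}, sorted by word."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(page_id, []).append(position)
    return dict(sorted(index.items()))

def lookup(index, *words):
    """Return the sorted page ids that contain every query word (Boolean AND)."""
    postings = [set(index.get(w.lower(), {})) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

pages = {
    "p1": "Web search engines crawl the Web",
    "p2": "Search engines rank pages",
}
index = build_inverted_index(pages)
print(index["web"])                        # positions of "web" on each page
print(lookup(index, "search", "engines"))  # pages containing both words
```

The query is answered from the pointer lists alone; the page text is never rescanned.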

Ranking Boolean and Vector models mostly used - Why? - Works from the index, not the text Which ranking methods are best? - Datasets - Syntaxes - Users & Testing

Ranking Methods TF-IDF - Simple, smaller data sets Boolean Spread - Degrees of match - Within a document - Set of documents - Links between documents (meta docs?) Vector Spread - Standard cosine between query and index (to document) - Links with answer or pointing to answer Most Cited
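
TF-IDF weighting and the standard cosine between query and document can be sketched as follows. The weighting used here (raw term frequency times log(N/df)) is one common variant chosen for illustration; the slides do not say which variant is meant.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return document indexes ordered by descending cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {term: tf * idf.get(term, 0.0) for term, tf in Counter(tokenize(query)).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _score, i in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

Ties are broken by document order, so documents sharing no term with the query keep their original order at the bottom of the list.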

Is Web ranking different? Links are the difference that makes the difference - Internal links on a page - Internal links on a site - Relationships between sites - Link freshness Kleinberg’s HITS method (1998) - Hypertext Induced Topic Search - Number of pages that point to (processed) query - Authorities (relevant content by links) - Hubs (links to varied authorities)
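
The mutual reinforcement between hubs and authorities can be sketched as an iterative update. This is a minimal version run on a whole toy graph; Kleinberg's method first builds a query-specific subgraph, which is skipped here, and the iteration count is an arbitrary assumption.

```python
import math

def l2_normalize(scores):
    """Scale scores so the sum of squares is 1 (keeps values from growing without bound)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (authority, hub) score dicts."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}
        auth = l2_normalize(new_auth)
        hub = l2_normalize(new_hub)
    return auth, hub
```

A target listed twice in a page's link list is counted twice, which is one concrete answer to how multiple links from one page to another get weighted.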

Problems with Hubs & Authorities Are more links always better? What about pages without many outgoing links? How do you count multiple links from within one page to another? Do automatically generated sites/pages have an advantage? - CMS systems may have linking “fingerprints” - Metadata How varied are the link weights? - Simple counts - Modified by other IR measures

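Anatomy of a LS Web Search Engine Initial Google Design PageRank - PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1 ... Tn are the pages that link to A, C(T) is the number of links going out of T, and d is the damping factor - “A model of user behavior”: the probability of a random surfer visiting a page is its PageRank; the damping factor d models boredom (the surfer gives up following links and jumps to a random page) - Pages point to a page - Highly ranked pages point to a page - Anchor text is mined (the label for the link) - Proximity included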
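
The PageRank formula on this slide can be iterated directly. The following is a minimal sketch on a toy graph; d = 0.85 is the value suggested in the Brin & Page paper, while the iteration count and the choice to let pages without outgoing links simply pass on nothing are simplifying assumptions.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Iterates PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}           # the (1-d) term for every page
        for src, targets in links.items():
            if targets:                                # pages with no out-links pass on nothing
                share = d * pr[src] / len(targets)     # d * PR(T) / C(T)
                for t in targets:
                    new_pr[t] += share
        pr = new_pr
    return pr

toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(toy_links).items()):
    print(page, round(score, 3))
```

In this toy graph C collects links from both A and B and ends up ranked highest, and A outranks B because its only in-link comes from the highly ranked C.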

Anatomy 2 Repository of page content Document index - Forward (sorted) - Inverted (sorter) Lexicon of words & pointers Hit Lists of word occurrence(s) Crawlers Ranking Feedback of selection (~)

Popularity? Do you always want the most popular information source? - Talk Radio - New York Times Bestseller List - “Lincoln’s Doctors Dog” - “The C.S.I. Diet and Cookbook” Trend or Fad? Blogs, Editorials and Propaganda vs. “Facts”? Result Diversity Death of the Mid-List

Next Generation Web Search Search works well now (80%), but what’s next? We need to be user-focused, not data-focused How do we match search to the task? - Is it all about speed? - How could metadata support search tasks? Best search is browsing? - Faceted Search? Faceted Search - Suggesting = browsing for interfaces Cooking Related results Specialized interfaces Natural language queries (question answering) “Real world” metadata Context, personalization, query specifics

Metasearch Issues One place for everything? First or Last place to look? Better or different interface? Combined, sorted results would be best - How to sort? - Sorting for different types of queries Syntax Errors State Information (monitoring) Copyright issues (robots) User, content and interface mismatches/challenges
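
One possible answer to "How to sort?" is rank fusion. The sketch below uses reciprocal rank fusion, a technique swapped in here for illustration and not named in the slides; the constant k = 60 is a conventional default, and the function name `fuse` is hypothetical.

```python
from collections import defaultdict

def fuse(result_lists, k=60):
    """Merge ranked result lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for a stable order
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

Only rank positions are used, so the engines' incompatible relevance scores never need to be compared, which sidesteps part of the mismatch problem listed above.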

Web Searching Metaphors How do people visualize the Web? Is Browsing better? Do we need new metaphors for using the Web? - Searching - Browsing - What else?

Assignments Read weekly Primary Readings & Participate in class discussions 10% - 1 page summaries Re-design Search Results interface 10% Web (log) analytics 20% Future of Search (“Google 2010”) (5 page paper) 10% Web Information Retrieval System Evaluation & Presentation 20% Main Project or Paper 30%

Re-design Search Results interface Choose a search engine (not Google) and re-design the query AND result page interfaces - Snap, Live, Ask, Technorati, Clusty, & many others… Discuss what search features are and their interfaces - Highlight the good & the bad (or hard to understand or use) - Use your own perspective as a novice user or habitual user of the search engine Sketch, Photoshop &/or re-build the HTML pages to show your improved interface designs - Explain why you made the interface (& feature) changes - Illustrate how people would use the new interface Compare to other search engines or search tools & interfaces to give context to your re-design

System Evaluation & Presentation - 5 page written evaluation of a Web IR System - technology overview (how it works) - a brief history of the development of this type of system (why it works better) - intended uses for the system (who, when, why) - (your) examples or case studies of the system in use & its overall effectiveness

Future of Search paper How can (Web) IR be better? - Better IR models - Better User Interfaces More to find vs. easier to find Scriptable applications New interfaces for applications New datasets for applications