WIRED Week 4 Syllabus Review Readings Overview - Web IR Chapter - Brin & Page - Google - Kobayashi & Takeda – Overview Search Engine Optimization Assignment Overview & Scheduling Projects and/or Papers Discussion - Idea Pitch - Group Formation
Web makes IR an everyday activity Search Engines Search Interfaces The openness of the Web changes everything - Access - Technological progress - Expectation - Credibility - Networks and Networking
How Much Information Out there? UC Berkeley project Center for the Digital Future Pew Internet & American Life Project What kinds of information is it? What formats? - Information = Web pages? - Now - Future Who creates it? Why do they publish it? Content and Context
Investigation of Web Documents Are Web documents different? - Structure - HTML & other markup Common tags - Content - “information” & commerce Readability Usability - Context - “sociological insights”, & spam Links - Interest - topics, titles, keywords, file types - Interface - browsers (& crawlers) Older study, what’s new? - More multimedia - XHTML & XML - AJAX, REST, SOAP, Web 2.0?
Statistical Profiles of Highly-rated Web sites A Quality Checker - Good design makes better Web pages Look at popular pages & see what makes them popular We know good pages when we see (use) them - Different types of Web page structures Elements - Text, links & graphics (& their formatting) - Accessibility, Size, errors, nav links (scent) - Architecture of site What makes these pages good for searching?
Content - Organizing & Accessing Distributed Data(base) Dynamic Data - Mobile - Ephemeral Huge Volume Unstructured and Redundant Quality Heterogeneous - Languages - Code pages
Measuring the Web How would you measure? - Size (crawling) - Surveys - Hits & Metering - Bandwidth use What do numbers mean? - Number of Hosts? - Number of Sites? - Number of Pages? Accurate +/- a lot
The Web is a Bowtie? Structure - One can pass from any node of IN through SCC to any node of OUT - Hanging off IN & OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC - A TENDRIL hanging off IN can hook into a TENDRIL leading into OUT, forming a TUBE: a passage from a portion of IN to a portion of OUT without touching SCC. (Broder et al., 2000)
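A minimal sketch of how the bowtie regions could be computed on a toy link graph. It assumes we already know one seed page inside the SCC; the function names, the seed, and the example edges are illustrative, not taken from Broder et al. Nodes that are neither upstream nor downstream of the SCC (tendrils, tubes, disconnected pieces) are lumped together as OTHER.

```python
from collections import defaultdict, deque


def reachable(adj, start):
    """Return every node reachable from start by following adj (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bowtie(edges, seed):
    """Split nodes into SCC, IN, OUT and OTHER, assuming seed lies in the SCC."""
    fwd = defaultdict(set)   # page -> pages it links to
    bwd = defaultdict(set)   # page -> pages that link to it
    nodes = set()
    for src, dst in edges:
        fwd[src].add(dst)
        bwd[dst].add(src)
        nodes.update((src, dst))
    forward = reachable(fwd, seed)    # SCC plus OUT
    backward = reachable(bwd, seed)   # SCC plus IN
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,
    }


# Toy graph: a, b, c form the core; i feeds it; o is fed by it;
# t hangs off IN and t2 leads into OUT (tendrils).
EDGES = [("a", "b"), ("b", "c"), ("c", "a"),
         ("i", "a"), ("c", "o"), ("i", "t"), ("t2", "o")]
parts = bowtie(EDGES, "a")
```

Two reachability passes (one forward, one on the reversed graph) are enough here because the SCC is exactly the set of nodes that both reach and are reached from the seed.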
Web Search Engines Independent of IR model Distributed index and servers - Crawler - Query server - Indexer Crawlers and Spiders - Centralized control, Coordinated, Refresh, Filtering - Not the main problem Queries - Interface, processing, results Indexing - Data normalization, load balancing, data sharing
Harvesting Not just Web data - Caching, Duplication, Normalization Armies of crawlers Filtering collected data Gatherers - Collects and extracts on various schedules - Works with several brokers Brokers - Indexes and interfaces to queries - Works with other Brokers and Gatherers Topical Agents?
Web Crawling Issues Follow chains of URLs to gather more URLs Extract index (content) from each page Lather-Rinse-Repeat Update crawler to-do list Associate frequency of crawls Breadth or Depth first? Endless looping Duplicate pages/sites Changed page (or not really?) Dynamically generated pages Intranet pages Markup language getting in the way NOROBOTS What should a crawler get?
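The crawl loop above (follow URLs, update the to-do list, avoid endless looping) can be sketched as a breadth-first traversal. This is an illustration under stated assumptions: `fetch` and `allowed` are hypothetical callables supplied by the caller (a real crawler would fetch over HTTP and consult robots.txt), and the pages live in an in-memory dictionary so the example runs without a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, allowed=lambda url: True, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    frontier = deque([seed])   # the crawler's to-do list
    seen = {seed}              # guards against endless looping
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # page missing or unreachable
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments so that
            # trivially different URLs are not crawled twice.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and allowed(link):
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical three-page site used only for this example.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/private/x">x</a>',
    "http://example.test/b": '<a href="/a">A again</a>',
}
visited = crawl("http://example.test/", PAGES.get,
                allowed=lambda u: "/private/" not in u)
```

Swapping `popleft()` for `pop()` would turn the same loop into a depth-first crawl, which is one way to explore the breadth-or-depth question on the slide.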
Indexing the Web Inverted File Index - Sorted words with pointers to location(s) & page(s) - Pointers are the focus (inversion) What about pages and sites? - Massive redundancy on well-organized sites Navigation Topics Content “State of the art indexing techniques” = 30% of text (not page) size. p 383 How can you tune an index for massively changing documents?
Ranking Boolean and Vector models mostly used - Why? - Works from the index, not the text Which ranking methods are best? - Datasets - Syntaxes - Users & Testing
Ranking Methods TF-IDF - Simple, smaller data sets Boolean Spread - Degrees of match - Within a document - Set of documents - Links between documents (meta docs?) Vector Spread - Standard cosine between query and index (to document) - Links with answer or pointing to answer Most Cited
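As a rough illustration of TF-IDF weighting combined with the standard cosine between a query and each document, the sketch below ranks three tiny documents. The weighting variant (raw term frequency times log(N/df)), the whitespace tokenizer, and the sample texts are assumptions; many other TF-IDF variants exist.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({doc_id: {word: weight}}, idf) using tf * log(N / df)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(w for words in tokenized.values() for w in set(words))
    idf = {w: math.log(n_docs / count) for w, count in df.items()}
    vectors = {
        d: {w: tf * idf[w] for w, tf in Counter(words).items()}
        for d, words in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return ids of matching documents, best cosine score first."""
    vectors, idf = tfidf_vectors(docs)
    q = {w: tf * idf.get(w, 0.0)
         for w, tf in Counter(query.lower().split()).items()}
    scored = [(cosine(q, vec), d) for d, vec in vectors.items()]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]


DOCS = {
    "a": "web crawler follows links",
    "b": "ranking uses links between pages",
    "c": "cosine ranking of query and document",
}
ranking = rank("ranking links", DOCS)
```

Document "b" contains both query terms and comes first; "a" edges out "c" only because it is shorter, which shows how length normalization in the cosine affects the order.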
Is Web ranking different? Links are the difference that makes the difference - Internal links on a page - Internal links on a site - Relationships between sites - Link freshness Kleinberg’s HITS method (1998) - Hypertext Induced Topic Search - Number of pages that point to (processed) query - Authorities (relevant content by links) - Hubs (links to varied authorities)
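The hub and authority idea can be sketched as a mutual-reinforcement loop: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. This is a simplified illustration, assuming the link graph passed in is already the query-focused set of pages; the function name, iteration count, and example graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a {page: [targets]} graph."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that point here.
        auth = {n: sum(hub[src] for src, targets in links.items()
                       if n in targets)
                for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the pages this one points to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


LINKS = {
    "hub1": ["x", "y", "z"],
    "hub2": ["x", "y"],
    "lonely": ["z"],
    "x": [],
}
hub, auth = hits(LINKS)
```

Here "x" and "y" end up as the strongest authorities because two good hubs point to them, while "z" is supported by one good hub and one weak one.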
Problems with Hubs & Authorities Are more links always better? What about pages without many outgoing links? How do you count multiple links from within one page to another? Do automatically generated sites/pages have an advantage? - CMSs may have linking “fingerprints” - Metadata How varied are the link weights? - Simple counts - Modified by other IR measures
Anatomy of a Large-Scale Web Search Engine Initial Google Design PageRank - PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1 ... Tn are the pages that link to A and C(T) is the number of links going out of T - “A model of user behavior”: the probability of a random surfer visiting a page is its PageRank; the damping factor d models boredom - Pages point to a page - Highly ranked pages point to a page - Anchor text is mined (the label for the link) - Proximity included
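A minimal sketch of iterating the PageRank formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) on a four-page toy graph. The damping factor 0.85, the fixed iteration count, and the example links are assumptions for illustration; this is not the production algorithm, and it uses the unnormalized form of the formula, so the scores do not sum to 1.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over a {page: [targets]} graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page T passes on PR(T) split across its C(T) out-links.
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(LINKS)
best_first = sorted(scores, key=scores.get, reverse=True)
```

Page C ranks first because three pages point to it, and A benefits from being the only page C links to. Page D has no incoming links, so it keeps only the baseline (1-d).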
Anatomy 2 Repository of page content Document index - Forward (sorted) - Inverted (sorter) Lexicon of words & pointers Hit Lists of word occurrence(s) Crawlers Ranking Feedback of selection (~)
Popularity? Do you always want the most popular information source? - Talk Radio - New York Times Bestseller List - “Lincoln’s Doctor’s Dog” - “The C.S.I. Diet and Cookbook” Trend or Fad? Blogs, Editorials and Propaganda vs. “Facts”? Result Diversity Death of the Mid-List
Next Generation Web Search Search works well now (80%), but what’s next? We need to be user-focused, not data-focused How do we match search to the task? - Is it all about speed? - How could metadata support search tasks? Best search is browsing? - Faceted Search? Faceted Search - Suggesting = browsing for interfaces Cooking Related results Specialized interfaces Natural language queries (question answering) “Real world” metadata Context, personalization, query specifics
Metasearch Issues One place for everything? First or Last place to look? Better or different interface? Combined, sorted results would be best - How to sort? - Sorting for different types of queries Syntax Errors State Information (monitoring) Copyright issues (robots) User, content and interface mismatches/challenges
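One possible answer to "How to sort?" when combining results from several engines is to merge by rank position rather than by each engine's incomparable scores. The sketch below uses reciprocal rank fusion, a technique brought in here for illustration and not named on the slide; the constant k=60 and the sample result lists are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists: each result earns 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for stability.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical top-3 results from three different engines.
ENGINE_RESULTS = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
merged = reciprocal_rank_fusion(ENGINE_RESULTS)
```

Because only rank positions are used, the merge does not need the engines to expose or agree on relevance scores, though it ignores the query type, which the slide flags as a separate sorting problem.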
Web Searching Metaphors How do people visualize the Web? Is Browsing better? Do we need new metaphors for using the Web? - Searching - Browsing - What else?
Assignments Read weekly Primary Readings & Participate in class discussions 10% - 1 page summaries Re-design Search Results interface 10% Web (log) analytics 20% Future of Search (“Google 2010”) (5 page paper) 10% Web Information Retrieval System Evaluation & Presentation 20% Main Project or Paper 30%
Re-design Search Results interface Choose a search engine (not Google) and re-design the query AND result page interfaces - Snap, Live, Ask, Technorati, Clusty, & many others… Discuss what search features are and their interfaces - Highlight the good & the bad (or hard to understand or use) - Use your own perspective as a novice user or habitual user of the search engine Sketch, Photoshop &/or re-build the HTML pages to show your improved interface designs - Explain why you made the interface (& feature) changes - Illustrate how people would use the new interface Compare to other search engines or search tools & interfaces to give context to your re-design
System Evaluation & Presentation - 5 page written evaluation of a Web IR System - technology overview (how it works) - a brief history of the development of this type of system (why it works better) - intended uses for the system (who, when, why) - (your) examples or case studies of the system in use & its overall effectiveness
Future of Search paper How can (Web) IR be better? - Better IR models - Better User Interfaces More to find vs. easier to find Scriptable applications New interfaces for applications New datasets for applications