Sample Test Questions These are designed to help answer the questions on Exam 2 PLEASE do not ask me to answer any of the questions.

Slides:



Advertisements
Similar presentations
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 7: Scoring and results assembly.
Advertisements

Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Chapter 5: Introduction to Information Retrieval
Introduction to Information Retrieval
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Natural Language Processing WEB SEARCH ENGINES August, 2002.
Link Analysis David Kauchak cs160 Fall 2009 adapted from:
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
SLIDE 1IS 202 – FALL 2004 Lecture 13: Midterm Review Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am -
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
Evaluating the Performance of IR Sytems
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Slide 1 EE3J2 Data Mining EE3J2 Data Mining - revision Martin Russell.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Information Retrieval
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking in IR.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
INF 141 COURSE SUMMARY Crista Lopes. Lecture Objective Know what you know.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
The Internet 8th Edition Tutorial 4 Searching the Web.
Chapter 6: Information Retrieval and Web Search
1 Automatic Classification of Bookmarked Web Pages Chris Staff Second Talk February 2007.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
Ranking CSCI 572: Information Retrieval and Search Engines Summer 2010.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Knowledge and Information Retrieval Dr Nicholas Gibbins 32/4037.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Search Engine Optimization
CC Procesamiento Masivo de Datos Otoño Lecture 12: Conclusion
Sampath Jayarathna Cal Poly Pomona
Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.
Clustering of Web pages
Text Based Information Retrieval
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
The Anatomy of a Large-Scale Hypertextual Web Search Engine
אחזור מידע, מנועי חיפוש וספריות
Basic Information Retrieval
Text Categorization Assigning documents to a fixed set of categories
Data Mining Chapter 6 Search Engines
Agenda What is SEO ? How Do Search Engines Work? Measuring SEO success ? On Page SEO – Basic Practices? Technical SEO - Source Code. Off Page SEO – Social.
The Search Engine Architecture
Discussion Class 9 Google.
Introduction to Search Engines
Presentation transcript:

Sample Test Questions These are designed to help answer the questions on Exam 2 PLEASE do not ask me to answer any of the questions

CSCI 572 Review of Concepts/Algs/Etc. Presented This Semester CSCI 572 Review of Concepts/Algs/Etc. Presented This Semester * indicates presentation in second half of the semester Concepts (cont'd) *Centroids (classification) *Dendrogram (clustering) *Pay-Per-Click Auctions *Ontologies: WordNet, FreeBase, KnowledgeGraph Data Structures Suffix trees (autocomplete) Priority Queues (ranking) *B-trees (holding the dictionary) *Tries (spelling correction) Concepts Precision (ranking) Recall (ranking) tp, fp, fn, tn formulation Harmonic Mean (ranking) F Measure (ranking) Mean Average Precision Discounted Cumulative Gain *Ontologies *WordNet: hyponym, meronym, etc Algorithms Porters (stemming) Soundex (phonemes) Block Sort-Based Indexing Single-Pass In-Memory Indexing Logarithmic Merge Indexing Computing Cosine Scores *Map/Reduce *Google cloud *PageRank *HITS (ranking) *Levenshtein (spell correction) *K-Means Clustering *Rocchio (classification) *K-Nearest Neighbor (classification)

CSCI 572 Review of Concepts/Algs/Etc. Presented This Semester Techniques Crawling (distributed, techniques) Crawling: robots.txt, depth vs. breadth Crawling: URL representation, threading Stemming, Stop words, Lemmatization Inverted index, positional inverted index N-grams Cryptographic hashing (de-duplication) Shingling (de-duplication) Jaccard Similarity Document Boolean Model Document Vector Model Cosine similarity Euclidean distance *Champion lists, net-score *Edit distance Techniques (cont'd) TF-IDF (matching) *Purity Index (clustering) *Rand Index (clustering) *Microformats (snippets) Other Zipf’s Law Heaps Law *Spelling Correction program (Norvig) *Edit distance *Auctions

Questions on PageRank If PR(A) represents the PageRank of node A in directed graph G and C(A) is the number of outgoing links of A, define the PageRank formula Make sure you know how to compute the PageRank for simple graphs like the ones below What is a typical value for the damping factor in the PageRank calculation?

Questions on HITS What are the names given to the two types of nodes determined by HITS? Is the HITS algorithm carried out once the query is entered or after the query results have been determined? What sort of graph is produced by HITS? How does the HITS algorithm compute the Authority score? How does the HITS algorithm compute the Hub score?

Search Engine Advertising What is the difference between Google’s AdWords and AdSense programs? Approximately how much money did Google take in on advertising last year: 500 million, 1 billion, 10 billion, 50 billion, 100 billion? What does SEO stand for? Does Google accept banner ads? Describe the auction process Google uses to determine the placement of ads Describe the four possible forms of keyword matching that Google offers

Map/Reduce Questions What is Hadoop? Name two companies offering cloud services. For the word occurrences problem describe what the map phase does For the word occurrences problem describe what the reduce phase does Describe how Map/Reduce handles failure of a node

Knowledge Systems Questions Describe the difference between a taxonomy and an ontology Describe three types of knowledge typically held by KnowledgeBases Is Wikipedia an ontology? Is WordNet a KnowledgeBase? WordNet uses the terms synset, hyponym, meronym and holonym; define them The WikiMedia Foundation sponsors Wikipedia and WikiData; what is WikiData and how does it relate to Wikipedia?

Query Processing Questions Five strategies were mentioned for speeding up the answering of queries; describe each one in two or three sentences Does Google get more queries from desktop computers or from mobile devices? Does Google use human testers to evaluate search results?

Homework #4 Does Lucene come bundled with Solr? What sort of inverted index is produced by Lucene? What is the purpose of the Lucene tokenizers? Solr uses two types of scoring to produce search results, what are they? What port does Solr run on? Define a database shard What language is Solr written in?

Spell Checking and Correction Name three types of spelling errors There are two approaches for correcting a word that is not in the dictionary, what are they? How are n-grams used to solve the spelling correction problem? Define edit distance What is the name of the dictionary used in Norvig’s spelling corrector program? Does Norvig’s spelling corrector program only use corrections of edit distance 1 or does it use both edit distance 1 and 2? What are the computing time and space requirements for the Levenshtein algorithm?

Snippets Are snippets always taken from the body of a web page? Does the location of the sentence that contains the query keywords matter when it comes to choose which sentence to display as a snippet? Does Google use meta information in a web page to determine a snippet? Define ”rich snippets” Where can you find the specification for rich snippets? Is the schema.org formalism an object hierarchy?

Clustering Name a search engine other than Google/Bing/Yahoo that emphasizes clustering of results Describe the difference between hard clustering and soft clustering What definition of similarity is used by the k-means clustering algorithm? Describe the optimal k-means optimization problem; is it NP-Hard? How are the k means typically chosen in the k-means algorithm? What is the computing time for the k-means algorithm? Describe the agglomerative clustering algorithm What is a dendrogram?

Question Answering Name the four “W” questions used by question/answering systems Do question answering systems use search engine results to help answer questions? What are the TREC conferences? There are five general approaches to question answering; describe each one Define the Mean Reciprocal Rank

Classification How does classification differ from clustering? Define a centroid of a set of n points, each point a vector of length m The Rocchio algorithm has two phases, what are they? In the Rocchio algorithm how are the centroids determined? Describe the k-nearest neighbor algorithm What is the contiguity hypothesis? How is k chosen in the k-nearest neighbor algorithm? What is a Voroni diagram?