How Do We Find Information?

Key Questions
- What are we looking for?
- How do we find it?
- Why is it difficult?

“A prudent question is one-half of wisdom” (Francis Bacon)

Search Engines 2

What are we looking for?
- We are looking for X.
  - Q&A: population of China
  - Known-item search: “Catcher in the Rye”
- Looking for something like/about X.
  - General/background info: Taliban
  - Collection development: IR literature
  - Similar to (known) X: like “Catcher in the Rye”
  - WhatyoumacallX: “the rye-boy story”
- Looking for something
  - Problem resolution: how can we fight terrorism?
  - Knowledge development: what is IR?
- Looking
  - Need something, but don’t know what: what’s it all about?
  - Serendipity: Web surfing

How do we find it?
- Brute force search
  - Easy to build, maintain, and use
  - Searcher does all the work; hard to get satisfaction
- Organize/structure the data (Information Organization)
  - Intuitive to use
  - Hard to build and maintain
  - Knowledge of builder’s language & organization structure is crucial
- Use a search tool (Information Retrieval)
  - Easier to build and maintain: less manipulation of data
  - Sometimes works, sometimes not (helps to know the language of the data)
- Ask the experts (Expert System)
  - Easy and satisfying to use (by definition)
  - “Expert” knowledge is transitory, hard to encapsulate
- Go with the crowd (User Ratings > Recommender System > PageRank)
  - Relatively easy to build and maintain
  - Limited utility: doesn’t work with “unpopular” X
- Zen-Fusion search
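
To make the first option concrete, here is a minimal sketch of brute force search: every document is scanned for every query, with no index and no organization. The function name, the sample documents, and the rule that all query words must match are assumptions made for illustration, not something the slide specifies.

```python
def brute_force_search(documents, query):
    """Return ids of documents that contain every query word (case-insensitive).

    Nothing is precomputed: each call rescans the whole collection, which is
    why this is easy to build but leaves the work to the searcher.
    """
    terms = query.lower().split()
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits


# Hypothetical toy collection.
documents = {
    "a": "The Catcher in the Rye",
    "b": "Introduction to information retrieval",
    "c": "Rye bread recipes",
}

hits = brute_force_search(documents, "rye")
print(hits)
```

The query "rye" matches both the novel and the bread recipes, which illustrates the "hard to get satisfaction" point: the searcher still has to sort out which hit is wanted.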

Information Seeking Process: Dynamic, Interactive, Iterative
- User
  - What am I looking for?
    - Identification of info. need
  - How do I find it?
    - Query formulation
- Intermediary
  - What are we looking for?
    - Discovery of user’s information need
    - Query representation
  - Where is it?
    - Query-document matching
- Information
  - What is it?
    - Collection
    - Classification
  - How is it found?
    - Data structure
    - Representation

IR vs. IO
- Information Organization
  - Add structure & annotation
- Information Retrieval
  - Create a searchable index
- Information Access
  - Retrieve information
- Data Mining
  - Discover knowledge

Information Retrieval
- Representation (raw data to searchable index)
  - indexing, term weighting
- Query Formulation
  - “What is IR?”
- Search Results
  - (ranked) document list

Raw data:
  D1: wd1 wd2 wd3
  D2: wd2 wd4 wd2 wd3
  D3: wd1 wd4

Searchable index (term counts per document):
        D1  D2  D3
  wd1    1   0   1
  wd2    1   2   0
  wd3    1   1   0
  wd4    0   1   1

Search results (ranked):
  1. D2
  2. D1
  3. D3
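
The pipeline on this slide (raw data, searchable index, ranked list) can be sketched in a few lines. The three documents are taken from the slide; everything else is an assumption for illustration: the function names, the use of raw term frequency as the term weight, and the query "wd2 wd3", which the slide does not state but which happens to reproduce its D2, D1, D3 ranking under this scoring.

```python
from collections import Counter

# Raw data, as shown on the slide.
docs = {
    "D1": "wd1 wd2 wd3",
    "D2": "wd2 wd4 wd2 wd3",
    "D3": "wd1 wd4",
}


def build_index(docs):
    """Build a term -> {doc_id: term frequency} index (a sparse term-document matrix)."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def rank(index, query, doc_ids):
    """Score each document by the summed frequency of the query terms it contains.

    Raw tf is an assumed weighting; ties are broken by document id.
    """
    scores = {doc_id: 0 for doc_id in doc_ids}
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))


index = build_index(docs)
ranking = rank(index, "wd2 wd3", docs)  # assumed query
print(index["wd2"])
print(ranking)
```

The index is built once and reused for every query, which is the trade-off against brute force scanning: some work up front in exchange for cheaper matching later.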

Information Organization
- Representation (raw data to organized data)
  - NLP & Machine Learning
- Query Formulation
  - “What is IR?”
- Search Results
  - document groups

Natural Language Processing (NLP)
- Research area, technique, tool for
  - Knowledge Discovery, Data Mining
- Lexical analysis using Part-of-Speech (POS) tagging
- Sentence parsing

Machine Learning
- Research area, technique, tool for
  - Information Organization, Knowledge Discovery, Data Mining
- Information Organization via
  - Supervised Learning (Automatic Classification)
  - Unsupervised Learning (Clustering)

[Figure: two panels, Classification and Clustering, each showing Class 1 and Class 2]

Clustering (IO for IR)
- Document Clustering
  - Cluster Hypothesis: documents having similar contents tend to be relevant to the same query
  - Rank clusters by query-cluster similarity: cluster documents based on vector similarity
  - Post-retrieval clustering: Scatter-Gather
- Keyword Clustering
  - Automatic thesaurus construction: query expansion
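
The "rank clusters by query-cluster similarity" step can be sketched as follows. This is an illustration under stated assumptions, not a full clustering system: the clusters are given by hand rather than computed, documents are plain term-frequency vectors, each cluster is represented by the centroid of its documents, and similarity is cosine. All names and sample texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average a list of sparse vectors into one cluster representative."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def rank_clusters(clusters, query):
    """Order cluster names by cosine similarity between the query and each centroid."""
    q = vectorize(query)
    scores = {
        name: cosine(q, centroid([vectorize(d) for d in members]))
        for name, members in clusters.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, hand-made clusters.
clusters = {
    "retrieval": ["search engine index", "query index ranking"],
    "cooking": ["rye bread recipe", "bread oven recipe"],
}

order = rank_clusters(clusters, "index ranking query")
print(order)
```

Under the cluster hypothesis, a searcher would then look inside the top-ranked cluster first instead of scanning a flat document list.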

Classification (IO for IR)
- Document Categorization
  - classify documents into manually defined categories
  - supports hierarchical browsing, query expansion via relevance feedback
- Document Indexing
  - assign keywords to documents
  - automatic indexing with controlled vocabulary, metadata generation
- Document Filtering
  - e.g., news delivery, spam filtering
- Query Classification
  - collection selection
  - algorithm selection
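
As an illustration of document filtering by supervised classification, here is a minimal sketch of a spam filter. The slide does not name a method; multinomial Naive Bayes with add-one smoothing is swapped in here as one common choice. The function names, the "spam"/"ham" labels, and the four training messages are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per class and documents per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability under Naive Bayes.

    Words never seen in training are ignored; seen words use add-one smoothing.
    """
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical manually labeled training data.
training = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report and agenda", "ham"),
]

model = train(training)
label = classify(model, "claim your cash prize")
print(label)
```

The same train-then-classify shape applies to the other uses on the slide, such as document categorization or query classification; only the labels and the training data change.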