Slide 1: CSC 9010: Text Mining Applications: Information Retrieval
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu, Paula.Matuszek@gmail.com
(610) 647-9789
©2012 Paula Matuszek
Slide 2: Knowledge
- Knowledge is captured in large quantities and many forms
- Much of that knowledge is in unstructured text:
  - books, journals, papers
  - letters
  - web pages, blogs, tweets
- This is a very old process:
  - accelerated greatly with the invention of the printing press
  - and again with the invention of computers
  - and again with the advent of the web
- Thus the increasing importance of text mining!
Slide 3: A Question!
- People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want. They have an information need.
- Hopefully there is some information somewhere that will satisfy that need.
- At its most general, information retrieval is the process of finding the information that meets that need.
Slide 4: Basic Information Retrieval
- Simplest approach:
  - Knowledge is organized into chunks (pages)
  - The goal is to return appropriate chunks
- Not a new problem
- But some new solutions:
  - Web search engines
  - Text mining includes this process as well:
    - still dealing with lots of unstructured text
    - finding the appropriate "chunk" can be viewed as a classification problem
Slide 5: Search Engines
- The goal of a search engine is to return appropriate chunks
- The steps involved include:
  - asking a question
  - finding answers
  - evaluating answers
  - presenting answers
- The value of a search engine depends on how well it does all of these.
Slide 6: Asking a Question
- Queries reflect some information need
- Query syntax needs to allow that information need to be expressed:
  - Keywords
  - Combining terms:
    - Simple: "required", NOT (+ and -)
    - Boolean expressions with and/or/not and nested parentheses
    - Variations: strings, NEAR, capitalization
  - The simplest syntax that works
  - Typically more acceptable if predictable
- Another set of problems arises when the information isn't text: graphics, music
(A small sketch of simple Boolean query evaluation follows below.)
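To make the "+ and -" query style concrete, here is a minimal, hedged sketch of evaluating simple Boolean keyword queries against a toy inverted index. The index contents, document IDs, and function names are invented for illustration, not taken from the slides.

```python
# Toy inverted index: term -> set of document IDs containing it.
index = {
    "text":   {1, 2, 3},
    "mining": {1, 3},
    "music":  {4},
}
all_docs = {1, 2, 3, 4}

def search(required, excluded):
    """AND together the +terms, then drop documents containing any -term."""
    results = set(all_docs)
    for term in required:
        results &= index.get(term, set())
    for term in excluded:
        results -= index.get(term, set())
    return results

print(search(required=["text", "mining"], excluded=["music"]))  # {1, 3}
```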
Slide 7: Finding the Information
- The goal is to retrieve all relevant chunks. This is too time-consuming to do in real time, so search engines index pages.
- Two basic approaches:
  - Index and classify by hand
  - Automate
- For BOTH approaches, deciding what to index on (e.g., what counts as a keyword) is a significant issue:
  - stemming?
  - stopwords?
  - capitalization?
Slide 8: Indexing by Hand
- Indexing by hand involves having a person look at information items and assign them to categories:
  - Assumes a taxonomy of categories exists
  - Each document can go into multiple categories
  - Creates high-quality indices
  - Expensive to create
  - Supports hierarchical browsing for retrieval as well as search
- Inter-rater reliability is an issue; it requires training and checking to get consistent category assignment
Slide 9: Indexing by Hand (continued)
- For focused collections, even very large ones, this is feasible:
  - Medline
  - ACM papers
  - The NY Times hand-indexes all abstracts
- For the web as a whole, it is not yet feasible
- Evolving solutions:
  - social bookmarking: delicious, reddit, digg
  - hash tags: twitter, Google+
  - Sometimes pre-structured; more often completely freeform
  - In some domains these become a folksonomy
Slide 10: Automated Indexing
- Automated indexing involves parsing documents to pull out keywords and creating a table that links keywords to documents:
  - Doesn't require any predefined categories or keywords
  - Can cover a much higher proportion of the information available
  - Can update more quickly
  - Much lower quality, so it is important to have some kind of relevance ranking
Slide 11: Or IE-Based
- Good information extraction tools can be used to extract the important terms:
  - Using gazetteers and ontologies to identify terms
  - Using named entity and other rules to assign categories
- I2E OnDemand is a good example
Slide 12: Automating Search
- Always involves balancing factors:
  - Recall vs. precision:
    - Which is more important varies with the query and with coverage
  - Speed, storage, completeness, timeliness:
    - Query response needs to be fast
    - Documents searched need to be current
  - Ease of use vs. power of queries:
    - Full Boolean queries are very rich, very confusing
    - Simplest is "and"ing together keywords: fast, straightforward
Slide 13: Search Engine Basics
- A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
- A parser processes each web page and extracts individual words.
- An indexer creates/updates a hash table which connects words with documents.
- A searcher uses the hash table to retrieve documents based on words.
- A ranking system decides the order in which to present the documents: their relevance.
(A minimal sketch of this pipeline follows below.)
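The pipeline above can be sketched in a few lines of standard-library Python. This is an illustration only: the regex-based "parser" and the absence of robots.txt handling, politeness delays, and error recovery are simplifications, and all names here are invented.

```python
# Minimal sketch of spider -> parser -> indexer -> searcher.
import re
import urllib.request
from collections import defaultdict

index = defaultdict(set)   # word -> set of URLs (the "hash table")
visited = set()

def crawl(url, depth=1):
    if url in visited:
        return
    visited.add(url)
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    text = re.sub(r"<[^>]+>", " ", html)          # parser: strip tags
    for word in re.findall(r"[a-z]+", text.lower()):
        index[word].add(url)                      # indexer: word -> documents
    if depth > 0:                                 # spider: follow links
        for link in re.findall(r'href="(http[^"]+)"', html):
            crawl(link, depth - 1)

def search(word):
    return index.get(word, set())                 # searcher (no ranking here)
```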
Slide 14: Selecting Relevant Documents
- Assume:
  - we already have a corpus of documents defined
  - the goal is to return a subset of those documents
  - individual documents have been separated into individual files
- The remaining components must parse, index, find, and rank documents.
- The traditional approach is based on the words in the documents (it predates the web).
Slide 15: Extracting Lexical Features
- Process a string of characters:
  - assemble characters into tokens (tokenizer)
  - choose tokens to index in place (a problem for the www)
- A standard lexical analysis problem
- Tools: a lexical analyser generator, such as lex, or tokenizers such as the NLTK and GATE tokenizers
Slide 16: Lexical Analyser
- The basic idea is a finite state machine
- Defined by triples of (input state, transition token, output state)
- Must be very efficient; it gets used a LOT
[State diagram: states 0, 1, 2, with transitions labeled blank, A-Z, and blank/EOF]
(This machine is sketched in code below.)
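One plausible reading of that three-state machine as code, a minimal sketch: state 0 is "between tokens", A-Z characters accumulate a token, and a blank or EOF emits it. The restriction to A-Z follows the slide's simplified alphabet; a real tokenizer would handle a richer character set.

```python
# Three-state tokenizer: state 0 = between tokens, state 1 = inside a token,
# blank/EOF = emit the token and return to state 0.
def tokenize(text):
    tokens, current, state = [], [], 0
    for ch in text.upper() + " ":          # trailing blank acts as EOF
        if state == 0:
            if "A" <= ch <= "Z":           # A-Z: start a new token
                current.append(ch)
                state = 1
        elif state == 1:
            if "A" <= ch <= "Z":           # A-Z: still inside the token
                current.append(ch)
            else:                          # blank/EOF: emit the token
                tokens.append("".join(current))
                current, state = [], 0
    return tokens

print(tokenize("Vita cars CARS"))  # ['VITA', 'CARS', 'CARS']
```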
Slide 17: Design Issues for the Lexical Analyser
- Punctuation:
  - treat as whitespace?
  - treat as characters?
  - treat specially?
- Case:
  - fold?
- Digits:
  - assemble into numbers?
  - treat as characters?
  - treat as punctuation?
Slide 18: Lexical Analyser Output
- The output of the lexical analyser is a string of tokens
- All remaining operations work on these tokens
- We have already thrown away some information; this makes processing more efficient, but limits the power of our search:
  - we can't distinguish "VITA" from "Vita"
  - this can be somewhat remedied at the relevance step
Slide 19: Stemming
- Additional processing at the token level
- Turns words into a canonical form:
  - "cars" into "car"
  - "children" into "child"
  - "walked" into "walk"
- Decreases the total number of different tokens to be processed
- Decreases the precision of a search, but increases its recall
- Stemmers are available in NLTK, GATE, and elsewhere (example below)
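A short example with NLTK's Porter stemmer; the word list is arbitrary. Note that a rule-based stemmer handles regular inflections like "cars" and "walked", but an irregular form such as "children" → "child" generally needs a dictionary-based lemmatizer instead.

```python
# Porter stemming with NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cars", "walked", "walking", "children"]:
    print(word, "->", stemmer.stem(word))
# cars -> car, walked -> walk, walking -> walk, children -> children
# ("children" is untouched: irregular forms need a lemmatizer)
```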
Slide 20: Noise Words (Stop Words)
- Function words that contribute little or nothing to meaning
- Very frequent words:
  - If a word occurs in every document, it is not useful in choosing among documents
  - However, we need to be careful, because this is corpus-dependent
- Often implemented as a discrete list (example below)
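A minimal example of filtering tokens against NLTK's English stop-word list; the token list is invented, and the list must be fetched once with nltk.download("stopwords").

```python
# Stop-word removal using NLTK's discrete English stop-word list.
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
tokens = ["the", "cat", "sat", "on", "the", "mat"]
content = [t for t in tokens if t not in stop]
print(content)  # ['cat', 'sat', 'mat']
```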
Slide 21: Example Corpora
- We are assuming a fixed corpus. Some sample corpora:
  - Medline abstracts
  - Email (anyone's email)
  - The Reuters corpus
  - The Brown corpus
- Documents have textual fields and structured attributes:
  - Textual: free, unformatted, no meta-information
  - Structured: additional information beyond the content
Slide 22: Structured Attributes for Medline
- PubMed ID
- Author
- Year
- Keywords
- Journal
Slide 23: Textual Fields for Medline
- Abstract:
  - Reasonably complete, standard academic English
  - Captures the basic meaning of the document
- Title:
  - Short, formalized
  - Captures the most critical part of the meaning
  - A proxy for the abstract
Slide 24: Structured Fields for Email
- To, From, Cc, Bcc
- Dates
- Content type
- Status
- Content length
- Subject (partially)
Slide 25: Textual Fields for Email
- Subject:
  - Format is structured; content is arbitrary
  - Captures the most critical part of the content
  - A proxy for the content, but may be inaccurate
- Body of the email:
  - Highly irregular, informal English
  - The entire document, not a summary
  - Spelling and grammar irregularities
  - Structure and length vary
Slide 26: Indexing
- We now have a tokenized, stemmed sequence of words
- The next step is to parse each document, extracting index terms
  - Assume that each token is a word and that we don't want to recognize any structures more complex than single words
- When all documents have been processed, create the index
Slide 27: Basic Indexing Algorithm
- For each document in the corpus:
  - get the next token
  - save the posting in a list: doc ID, frequency
- For each token found in the corpus:
  - calculate the number of documents and the total frequency
  - sort by frequency
- The result is the inverted index (sketched below)
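The algorithm above, sketched in Python over an invented two-document corpus: one pass builds postings of (doc ID, frequency) per token, then per-token statistics are computed and sorted.

```python
# Build an inverted index: token -> {doc_id: frequency}.
from collections import Counter, defaultdict

corpus = {
    1: "the cat sat on the mat",
    2: "the dog sat",
}

inverted_index = defaultdict(dict)
for doc_id, text in corpus.items():
    for token, freq in Counter(text.split()).items():
        inverted_index[token][doc_id] = freq   # save the posting

# Per-token statistics: (token, #docs, total frequency), sorted by frequency.
stats = sorted(
    ((tok, len(p), sum(p.values())) for tok, p in inverted_index.items()),
    key=lambda t: t[2], reverse=True)
print(stats[0])  # ('the', 2, 3)
```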
Slide 28: Fine Points
- Dynamic corpora require incremental algorithms
- Higher-resolution data (e.g., character position)
- Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
- Document-type-specific processing:
  - In HTML, we want to ignore tags
  - In email, we may want to ignore quoted material
Slide 29: Choosing Keywords
- We don't necessarily want to index on every word:
  - Takes more space for the index
  - Takes more processing time
  - May not improve our resolving power
- How do we choose keywords?
  - Manually
  - Statistically
- Exhaustivity vs. specificity
Slide 30: Manually Choosing Keywords
- Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants:
  - "best" match
  - captures new terms easily
  - easiest for the person choosing the keywords
- Constrained vocabulary: hand-crafted ontologies:
  - can include hierarchical and other relations
  - more consistent
  - easier for searching; a possible "magic bullet" search
Slide 31: Examples of Constrained Vocabularies
- ACM Computing Classification System (www.acm.org/class/1998):
  - H: Information Systems
    - H3: Information Storage and Retrieval
      - H3.3: Information Search and Retrieval
        - Clustering
        - Information Filtering
        - Query formulation
        - Relevance feedback
        - etc.
- Medline Headings (www.nlm.nih.gov/mesh/MBrowser.html):
  - L: Information Science
    - L01: Information Science
      - L01.700: Medical Informatics
        - L01.700.508: Medical Informatics Applications
          - L01.700.508.280: Information Storage and Retrieval
            - MedlinePlus [L01.700.508.280.730]
Slide 32: Automated Vocabulary Selection
- Frequency: Zipf's Law
  - In a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table. Within one corpus, words with middle frequencies are typically "best".
  - We have used this in NLTK classification, ignoring the most frequent terms when creating the bag of words (BOW).
- Document-oriented representation bias: lots of keywords per document
- Query-oriented representation bias: only the "most typical" words; assumes that we are comparing across documents
(A mid-frequency selection sketch follows below.)
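A minimal sketch of mid-frequency keyword selection: rank words by frequency, then drop the very frequent (near stop words) and the very rare. The file name "corpus.txt" and the cutoff fractions are arbitrary, hypothetical choices, not values from the slides.

```python
# Keep the mid-frequency band of the vocabulary as candidate keywords.
from collections import Counter

tokens = open("corpus.txt").read().lower().split()  # hypothetical corpus file
ranked = [w for w, _ in Counter(tokens).most_common()]  # most frequent first

top_cut = len(ranked) // 10       # drop the top 10%: too common to resolve
bottom_cut = len(ranked) // 2     # drop the bottom half: too rare to matter
keywords = ranked[top_cut:bottom_cut]
```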
Slide 33: Choosing Keywords (continued)
- "Best" depends on actual use: if a word occurs in only one document, it may be very good for retrieving that document, but it is not very effective overall.
- Words which have no resolving power within a corpus may be the best choices across corpora.
Slide 34: Keyword Choice for the WWW
- We don't have a fixed corpus of documents
- New terms appear fairly regularly, and they are likely to be common search terms
- The queries people want to make are wide-ranging and unpredictable
- Therefore: we can't limit keywords, except possibly to eliminate stop words
- Even stop words are language-dependent, so determine the language first
Slide 35: Comparing and Ranking Documents
- Once our search engine has retrieved a set of documents, we may want to:
  - Rank them by relevance:
    - Which are the best fit to my query?
    - This involves determining what the query is about and how well the document answers it
  - Compare them:
    - "Show me more like this."
    - This involves determining what the document is about.
Slide 36: Determining Relevance by Keyword
- The typical web query consists entirely of keywords.
- Retrieval can be binary: present or absent.
- More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?
- Simple strategies:
  - How many times does the word occur in the document?
  - How close to the head of the document does it occur?
  - If there are multiple keywords, how close together are they?
Slide 37: Keywords for Relevance Ranking
- Count: repetition is an indication of emphasis
  - Very fast (usually already in the index)
  - A reasonable heuristic
  - Unduly influenced by document length
  - Can be "stuffed" by web designers
- Position: lead paragraphs summarize content
  - Requires more computation
  - Also a reasonable heuristic
  - Less influenced by document length
  - Harder to "stuff"; a page can only have a few keywords near the beginning
Slide 38: Keywords for Relevance Ranking (continued)
- Proximity for multiple keywords:
  - Requires even more computation
  - Obviously relevant only if there are multiple keywords
  - The effectiveness of the heuristic varies with the information need; it is typically either excellent or not very helpful at all
  - Very hard to "stuff"
- All keyword methods:
  - Are computationally simple and adequately fast
  - Are effective heuristics
  - Typically perform as well as in-depth natural language methods for standard search
(A combined scoring sketch follows below.)
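A hedged sketch combining the three heuristics of the last two slides (count, position, proximity) into one score. The weighting scheme and constants are invented for illustration, not taken from the slides.

```python
# Score a tokenized document against query terms using count, earliest
# position, and the spread between the terms' first occurrences.
def score(doc_tokens, query_terms):
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t]
                 for t in query_terms}
    if not all(positions.values()):        # a query term is missing entirely
        return 0.0
    count = sum(len(p) for p in positions.values())       # repetition
    earliest = min(p[0] for p in positions.values())      # position heuristic
    spread = (max(p[0] for p in positions.values()) -     # proximity heuristic
              min(p[0] for p in positions.values()))
    return count + 10.0 / (1 + earliest) + 5.0 / (1 + spread)

doc = "text mining finds patterns in large text collections".split()
print(score(doc, ["text", "mining"]))  # 15.5
```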
Slide 39: Comparing Documents
- "Find me more like this one" really means that we are using the document as a query.
- This requires that we have some conception of what a document is about overall.
- It depends on the context of the query. We need to:
  - Characterize the entire content of this document
  - Discriminate between this document and others in the corpus
- This is basically a document classification problem.
Slide 40: Describing an Entire Document
- So what is a document about?
- TF*IDF: we can simply list keywords in order of their TF*IDF values (example below)
- The document is about all of them to some degree: it is at some point in some vector space of meaning
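A minimal TF*IDF sketch. The slides name the measure but not a specific variant; this uses the common tf × log(N/df) form over an invented three-document corpus.

```python
# TF*IDF as tf * log(N / df): term frequency within a document, weighted
# down by how many documents contain the term.
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))  # document frequency

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

scores = tfidf(tokenized[0])
print(sorted(scores, key=scores.get, reverse=True)[:3])  # ['cat', 'mat', 'the']
```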
Slide 41: Vector Space
- Any corpus has a defined set of terms (the index)
- These terms define a knowledge space
- Every document is somewhere in that knowledge space: it is or is not about each of those terms
- Consider each term as a dimension. Then:
  - We have an n-dimensional vector space, where n is the number of terms (very large!)
  - Each document is a point in that vector space
- The document's position in this vector space can be treated as what the document is about.
Slide 42: Similarity Between Documents
- How similar are two documents?
  - Measures of association: how much do the feature sets overlap?
  - Simple matching coefficient: takes exclusions into account
  - Cosine similarity: the similarity of the angle between the two document vectors; not sensitive to vector length (example below)
- These are the same basic similarity ideas as in classification and clustering
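A minimal cosine similarity sketch over bag-of-words counts: cos(a, b) = a·b / (|a||b|). Dividing by the vector lengths is what makes the measure insensitive to document length; the two example documents are invented.

```python
# Cosine similarity between two documents as bag-of-words count vectors.
import math
from collections import Counter

def cosine(doc_a, doc_b):
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine("the cat sat", "the cat sat on the mat"))  # ~0.82
```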
Slide 43: Additional Search Engine Issues
- Freshness: how often to revisit documents
- Eliminating duplicate documents
- Eliminating multiple documents from one site
- Providing good context
- Non-content-based features: citation graphs (the basis of PageRank)
- Search engine optimization
Slide 44: Beyond Simple Search
- Information extraction on queries to recognize some common patterns:
  - airline flights
  - tracking numbers
- Rich IE systems like I2E
- Taxonomy browsing
Slide 45: Beyond Unstructured Text
- "Improved" search
- Specific types of text
- Non-text search constraints
- Faceted search
- Searching non-text information assets
- Personalizing search
Slide 46: Improved Search
- Modern search engines tweak your query a lot. Google, for instance, says it will normally:
  - suggest spelling corrections and alternative spellings
  - personalize your search by using information such as sites you've visited before
  - include synonyms of your search terms
  - find results that match terms similar to those in your query
  - stem your search terms
- (http://support.google.com/websearch/bin/answer.py?hl=en&p=g_verb&answer=1734130)
Slide 47: Domain of Documents
- It may be desirable to limit a search to specific types of document
- Google gives you, among other things:
  - news
  - books
  - blogs
  - recipes
  - patents
  - discussions
- This may be based on the source
- Or it may be text mining (classification) at work :-)
Slide 48: Non-Text Constraints
- We may know some things about documents that are not captured in the tokens:
  - Meta-information included in the document:
    - date created and modified
    - author, department
    - keywords or tags
  - Information that can be determined by examining the document:
    - language
    - images included?
    - reading level
Slide 49: Faceted Search
- Faceted search: constrain the search along several attributes
- A research topic in information retrieval, especially over the last ten years:
  - Flamenco project at Berkeley (flamenco.berkeley.edu)
  - CiteSeer project at Penn State (citeseer.ist.psu.edu)
- Labeling with facets has the same issues as hand-indexing, except when you already have the information in a database somewhere
- It has become popular primarily for online commerce.
Slide 50: Faceted Search Examples
- Search Amazon for "rug":
  - Generally applicable facets: Department, Amazon Prime Eligible, Average Customer Review, Price, Discount, Availability
  - Object-specific facets: size, material, pattern, style, theme, color
- Flamenco Fine Arts demo: http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/famuseum/Flamenco
Slide 51: Searching Non-Text Assets
- Our information chunks might not be text:
  - images, sounds, videos, maps
- Search over images, sounds, and videos is often based on proxy information: captions, tags
- Information extraction is useful for maps
- Still an active research area, typically at the intersection of information retrieval and machine learning (http://www.gwap.com/gwap/)
Slide 52: Personalized Search
- The searches you have done in the past tell a lot about your information needs
- If a search engine has and makes use of that information, it can improve your searches:
  - web page history
  - cookies
  - explicit sign-in
- Relevance can be influenced by pages you've visited and by click-throughs
- It may go beyond this to use information such as your blog posts, friend links, etc.
- This can be a privacy issue
Slide 53: Summary
- Information retrieval is the process of finding, and providing to the user, chunks of information which are relevant to some information need
- Where the chunks are text, free-text search tools are the common approach
- Text mining tools such as document classification and information extraction can improve the relevance of search results
- This is not new with the web, but the web has had a massive impact on the area
- It continues to evolve, rapidly