Text Mining In InQuery
  Vasant Kumar, Peter Richards
  August 25th, 1999
History
  InQuery was originally a research product from the Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst
  A commercial-strength InQuery API from Sovereign Hill Software
  InQuery 5.0 with LCA and a graphical user interface
Outline
  Text mining
  Text mining using “local context analysis” (LCA)
  Text mining using “top concepts”
  Concept recognizers
  Demonstration of LCA and “top concepts”
  Q & A
Text Mining
  Helps find the needle in the haystack
  Query expansion
  Discovers interesting relationships between concepts
  Discovers characteristics of the database
Concepts
  Words
  Noun phrases
  People names
  Company names
  User-defined concepts
Local Context Analysis (LCA)
  Associates a query with a ranked list of concepts for each of several concept types (noun phrases, people names, etc.)
  Concept association is done on the fly
    – no complex databases need to be created
    – changes to the database are immediately taken into account
Background
  Unit of retrieval is a passage (local context), in contrast to a document in regular search
  A passage is a window of n words
  Overlapping passages are used
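The windowing described above can be sketched as follows. This is a minimal illustration, not InQuery's implementation: the function name `split_into_passages`, the `step` parameter, and the default half-window overlap are assumptions, since the slides give only the window length n and the fact that passages overlap.

```python
def split_into_passages(words, n, step=None):
    """Split a word list into overlapping windows of at most n words.

    Returns (start_offset, window) pairs. `step` is the distance between
    the starts of consecutive windows; the half-window default is an
    assumption, not something the slides specify.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if step is None:
        step = max(1, n // 2)
    if step <= 0:
        raise ValueError("step must be positive")
    if not words:
        return []

    passages = []
    start = 0
    while True:
        passages.append((start, words[start:start + n]))
        if start + n >= len(words):
            break  # this window already reaches the end of the document
        start += step
    return passages


if __name__ == "__main__":
    doc = "a b c d e f g".split()
    for offset, window in split_into_passages(doc, n=4, step=2):
        print(offset, window)
```

A smaller `step` gives more overlap and therefore more candidate passages per document, at the cost of more scoring work.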
LCA Process
  Generate candidate passages (sub-documents)
  Extract concepts and their statistics
  Apply LCA algorithm to rank the concepts for each concept type
Step 1: Generate Candidate Passages
  The documents are split into passages (virtual sub-documents)
  Evaluate the query on these passages to generate a weight for each passage
  Rank the passages
  Select the top m passages
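Step 1 can be sketched as a score-rank-select loop. The scoring function here is a deliberately simple stand-in (a count of query-term occurrences); the slides do not say how InQuery weights a passage, so `score_passage` and `top_passages` are hypothetical names used for illustration only.

```python
import heapq


def score_passage(passage_words, query_terms):
    """Toy weight: number of query-term occurrences in the passage.

    An assumed stand-in for whatever weight the retrieval engine assigns.
    """
    terms = {t.lower() for t in query_terms}
    return sum(1 for w in passage_words if w.lower() in terms)


def top_passages(passages, query_terms, m):
    """Return up to m (passage_index, weight) pairs, best first.

    Passages with zero weight are dropped; ties go to the earlier passage.
    """
    scored = []
    for pid, words in enumerate(passages):
        weight = score_passage(words, query_terms)
        if weight > 0:
            scored.append((weight, pid))
    best = heapq.nlargest(m, scored, key=lambda sp: (sp[0], -sp[1]))
    return [(pid, weight) for weight, pid in best]


if __name__ == "__main__":
    passages = [
        ["text", "mining", "tools"],
        ["cooking", "recipes"],
        ["mining", "text", "mining"],
    ]
    print(top_passages(passages, ["text", "mining"], m=2))
```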
Step 2: Extract Concepts
  Extract every passage in the candidate passage list from its document
  Each passage in the candidate list is passed through a set of “concept recognizers” to extract the respective concept lists
  Generate passage-level statistics for all concepts and query terms
Step 3: Apply LCA Algorithm
  Generate local context statistics for concepts and query terms (specific to the set of candidate passages)
  Use the LCA algorithm to generate weights for concepts; both the passage-level and local-context-level statistics are used
  Rank the concepts and select the top n
  The above steps are repeated for all concept types
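The slides do not give the LCA weighting formula, so the sketch below substitutes a simplified co-occurrence score to show the shape of the step: passage-level counts are aggregated over the candidate set, and a concept scores highly only if it co-occurs with every query term. The function name `rank_concepts`, the smoothing constant `delta`, and the formula itself are assumptions, not InQuery's algorithm.

```python
import math
from collections import Counter


def rank_concepts(passages, query_terms, top_n, delta=0.1):
    """Rank concepts of one concept type by co-occurrence with the query.

    passages: list of (term_counts, concept_counts) dict pairs, one per
    candidate passage (the passage-level statistics).
    Score (an assumed simplification, not the actual LCA formula):
        product over query terms t of (delta + log(1 + co(c, t)))
    where co(c, t) sums tf(t, p) * cf(c, p) over candidate passages p.
    """
    co = {}  # concept -> Counter of co-occurrence totals per query term
    for term_counts, concept_counts in passages:
        for concept, cf in concept_counts.items():
            per_term = co.setdefault(concept, Counter())
            for t in query_terms:
                per_term[t] += term_counts.get(t, 0) * cf

    scores = {}
    for concept, per_term in co.items():
        score = 1.0
        for t in query_terms:
            score *= delta + math.log1p(per_term[t])
        scores[concept] = score

    # Highest score first; concept name breaks ties deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    candidates = [
        ({"text": 1, "mining": 1}, {"query expansion": 1, "database": 1}),
        ({"text": 1}, {"database": 1, "weather": 1}),
        ({"mining": 2}, {"query expansion": 1}),
    ]
    for concept, score in rank_concepts(candidates, ["text", "mining"], top_n=2):
        print(f"{concept}: {score:.3f}")
```

Using a product rather than a sum means a concept that never appears alongside one of the query terms is pushed toward the bottom, with `delta` keeping its score from collapsing to exactly zero. To cover several concept types, the same function would be called once per type with that type's concept counts.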
Text Mining Using Top Concepts
  Retrieve documents
  Extract concepts from each document using “concept recognizers”
  Generate the most frequently occurring concepts for all concept types
  Persist the most frequently occurring concepts
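A minimal sketch of the “top concepts” flow: run each recognizer over each retrieved document, count concept occurrences per concept type, keep the k most frequent, and serialize the result. The recognizer interface (a function from text to a list of concepts), the name `top_concepts`, and JSON as the persistence format are all assumptions made for illustration.

```python
import json
from collections import Counter


def top_concepts(documents, recognizers, k):
    """Return the k most frequent concepts for each concept type.

    recognizers: dict mapping a concept-type name to a function that
    takes document text and returns a list of concepts (assumed API).
    """
    counts = {ctype: Counter() for ctype in recognizers}
    for text in documents:
        for ctype, recognize in recognizers.items():
            counts[ctype].update(recognize(text))
    return {ctype: c.most_common(k) for ctype, c in counts.items()}


if __name__ == "__main__":
    docs = ["text mining text", "mining text"]
    # Toy recognizer: every lowercased whitespace-separated word is a concept.
    recognizers = {"words": lambda text: text.lower().split()}
    result = top_concepts(docs, recognizers, k=2)
    # "Persist" step: here the result is only serialized to a JSON string.
    print(json.dumps(result))
```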
Noun Phrase Recognizer
  Tokenization to generate words
  Part-of-speech tagging (noun, verb, etc.)
  Select noun phrases
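The three stages can be illustrated with a toy pipeline. The tagger below is a lookup in a tiny hand-made lexicon (`TOY_LEXICON`, hypothetical) rather than a real part-of-speech tagger, and the phrase pattern (a run of adjectives and nouns that ends in a noun) is an assumed simplification; the slides do not describe InQuery's tagger or grammar.

```python
import re

# Hypothetical mini-lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "local": "ADJ", "ranked": "ADJ",
    "context": "NOUN", "analysis": "NOUN", "list": "NOUN",
    "concepts": "NOUN", "query": "NOUN",
    "associates": "VERB",
    "to": "PREP", "of": "PREP",
}


def tokenize(text):
    """Stage 1: split text into lowercase alphabetic words."""
    return re.findall(r"[A-Za-z]+", text.lower())


def tag(tokens):
    """Stage 2: attach a part-of-speech tag to each token."""
    return [(tok, TOY_LEXICON.get(tok, "OTHER")) for tok in tokens]


def noun_phrases(text):
    """Stage 3: select maximal adjective/noun runs that end in a noun."""
    phrases, current = [], []
    # The sentinel END tag flushes a run that reaches the end of the text.
    for tok, pos in tag(tokenize(text)) + [("", "END")]:
        if pos in ("ADJ", "NOUN"):
            current.append((tok, pos))
            continue
        # The run has ended: drop trailing adjectives, keep what remains.
        while current and current[-1][1] != "NOUN":
            current.pop()
        if current:
            phrases.append(" ".join(t for t, _ in current))
        current = []
    return phrases


if __name__ == "__main__":
    sentence = "Local context analysis associates a query to a ranked list of concepts"
    print(noun_phrases(sentence))
```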
Other Recognizers
  Company and people name recognizers
    – based on pattern matching rules
    – use external lists of names for normalization and additional evidence
  User-defined recognizer
    – uses a user-provided list of concepts (single/multiword)
    – generates a state machine
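One plausible way to turn a user-provided concept list into a state machine is a word-level trie scanned with longest-match semantics, sketched below. The slides say only that a state machine is generated, so the trie representation and the names `build_state_machine` and `recognize` are assumptions.

```python
ACCEPT = None  # key marking an accepting state; never collides with a word


def build_state_machine(concepts):
    """Build a word-level trie from single-word and multiword concepts.

    Each state is a dict mapping the next word to the next state; an
    accepting state also stores the matched concept under ACCEPT.
    """
    root = {}
    for concept in concepts:
        state = root
        for word in concept.lower().split():
            state = state.setdefault(word, {})
        state[ACCEPT] = concept
    return root


def recognize(words, machine):
    """Scan the words left to right, emitting the longest match at each position."""
    found = []
    i = 0
    while i < len(words):
        state, match, match_end = machine, None, i
        j = i
        while j < len(words) and words[j].lower() in state:
            state = state[words[j].lower()]
            j += 1
            if ACCEPT in state:
                match, match_end = state[ACCEPT], j
        if match is not None:
            found.append(match)
            i = match_end  # resume scanning after the matched concept
        else:
            i += 1
    return found


if __name__ == "__main__":
    machine = build_state_machine(["text mining", "text", "local context analysis"])
    words = "Text mining uses local context analysis on text".split()
    print(recognize(words, machine))
```

Longest-match scanning means that “text mining” is reported as one concept rather than as “text” followed by an unmatched word, while a standalone “text” is still recognized.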
Demonstration of LCA and Top Concepts in InQuery 5.1