©2002 Paula Matuszek iMiner Introduction

©2002 Paula Matuszek iMiner from IBM l Text Mining tool with multiple components l Text Analysis tools include –Language Identification Tool –Feature Extraction Tool –Summarizer Tool –Topic Categorization Tool –Clustering Tools

©2002 Paula Matuszek iMiner for Text 2 l Basic technology includes: –authority file with terms –heuristics for extracting additional terms –heuristics for extracting other features –Dictionaries with parts of speech –Partial parsing for part-of-speech tagging –Significance measure for terms: Information Quotient (IQ). l Knowledge base cannot be directly expanded by end user l Strong machine-learning component

©2002 Paula Matuszek Language Identification l Can analyze –an entire document –a text string input from the command line l Currently handles about a dozen languages l Can be trained; the ML tool takes input text in the language to be learned l Determines the approximate proportion of each language in bilingual documents

©2002 Paula Matuszek Language Identification l Basically treated as a categorization problem, where each language is a category l Training documents are processed to extract terms. l Importance of terms for categorization is determined statistically l Dictionaries of weighted terms are used to determine language of new documents
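
A minimal sketch of that dictionary-based scoring, assuming invented term weights (iMiner's actual dictionaries and statistics are not public): the document is scored against each language's weighted terms, and normalized scores double as rough proportions for bilingual text.

```python
# Score a document against per-language dictionaries of weighted terms;
# the language with the highest total weight wins. All weights are invented.
LANGUAGE_TERMS = {
    "english": {"the": 3.0, "and": 2.5, "of": 2.5, "is": 2.0},
    "german":  {"der": 3.0, "und": 2.5, "die": 3.0, "ist": 2.0},
    "french":  {"le": 3.0, "et": 2.5, "la": 3.0, "est": 2.0},
}

def identify_language(text):
    tokens = text.lower().split()
    scores = {lang: sum(weights.get(t, 0.0) for t in tokens)
              for lang, weights in LANGUAGE_TERMS.items()}
    # Normalized scores give a rough per-language proportion
    # for bilingual documents.
    total = sum(scores.values()) or 1.0
    proportions = {lang: s / total for lang, s in scores.items()}
    return max(scores, key=scores.get), proportions

print(identify_language("der Hund und die Katze"))   # ('german', ...)
```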

©2002 Paula Matuszek Feature Extraction l Locate and categorize relevant features in text l Some features are themselves of interest l Also starting point for other tools like classifiers, categorizers. l Features may or may not be “meaningful” to a person l Goal is to find aspects of a document which somehow characterize it

©2002 Paula Matuszek Name Extraction l Extracting Proper Names –People, places, organizations –Valuable clues to subject of text l Dictionaries of canonical forms l Additional names extracted from documents –Parsing finds tokens –Additional parsing groups tokens into noun phrases –Rules identify tokens which are names –Variant groups are assigned a canonical name which is the most explicit variant found in document

©2002 Paula Matuszek Examples for Name Extraction l “This subject is taught by Paula Matuszek.” –Recognize Paula as a first name of a person –Recognize Matuszek as a capitalized word following a first name. –Therefore “Paula Matuszek” is probably the name of a person. l “This subject is taught by Villanova University.” –Recognize Villanova as a probable name based on capitalization. –Recognize University as a term which normally names an institution. –Therefore “Villanova University” is probably the name of an institution. l “This subject is taught by Howard University” –BOTH of these sets of rules could apply, so rules need to be prioritized to determine the more likely parse.
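
The prioritization can be sketched as rules with saliences, so that when both rule sets fire (the “Howard University” case) the higher-priority one wins. The word lists and salience values below are invented for illustration:

```python
FIRST_NAMES = {"Paula", "Howard"}                     # toy authority file
INSTITUTION_WORDS = {"University", "Institute", "Corporation"}

# Each rule: (salience, test over the two words, label). When several
# rules match, the highest salience decides the parse.
RULES = [
    (60, lambda a, b: b in INSTITUTION_WORDS, "institution"),
    (50, lambda a, b: a in FIRST_NAMES, "person"),
]

def classify_name(phrase):
    words = phrase.split()
    if len(words) != 2 or not all(w[0].isupper() for w in words):
        return None
    matches = [(sal, label) for sal, test, label in RULES if test(*words)]
    return max(matches)[1] if matches else None

print(classify_name("Paula Matuszek"))        # person
print(classify_name("Villanova University"))  # institution
print(classify_name("Howard University"))     # institution (salience 60 > 50)
```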

©2002 Paula Matuszek Other Rule Examples l Dr., Mr., Ms. are titles, and titles followed by capitalized words frequently indicate names. If followed by only one word, it’s the last name l Capitalized word followed by a single capitalized letter followed by a capitalized word is probably FN MI LN (first name, middle initial, last name). l Nouns can be names. Verbs can’t.

©2002 Paula Matuszek Abbreviation/Acronym Extraction l Fruitful source of variants for names and terms l Existing dictionary of common terms l Name followed by “(” [A-Z]+ “)” probably gives an abbreviation. l Conventions regarding word-internal case and prefixes: “MSDOS” matches “MicroSoft DOS”, “GB” matches “gigabyte”.
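
A hedged sketch of the parenthesized-abbreviation pattern, using a regular expression as a stand-in for whatever matching machinery iMiner actually uses:

```python
import re

# A capitalized name followed by "(" CAPITALS ")" probably introduces
# an abbreviation, per the pattern on the slide above.
ABBREV = re.compile(r'((?:[A-Z]\w+\s+)+)\(\s*([A-Z]{2,})\s*\)')

def find_abbreviations(text):
    pairs = []
    for m in ABBREV.finditer(text):
        name, abbrev = m.group(1).strip(), m.group(2)
        # Sanity check: the abbreviation should match the long form's initials.
        initials = "".join(word[0] for word in name.split())
        if abbrev in initials:
            pairs.append((name, abbrev))
    return pairs

print(find_abbreviations("The Modern Language Association (MLA) met."))
```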

©2002 Paula Matuszek Number Extraction l Useful primarily to improve performance of other extractors. l Variant expressions of numbers –One thousand three hundred and twenty seven –thirteen twenty seven –1327 l Other numeric expressions –twenty-seven percent –27% l Base forms are easy; most of the effort is in handling variants and determining the canonical form based on rules
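
A toy converter for the easy base forms, illustrating why variants are the hard part: the vocabulary is deliberately tiny, and spoken-style variants like “thirteen twenty seven” would need extra rules.

```python
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "twenty": 20,
         "thirty": 30, "forty": 40, "fifty": 50}
SCALES = {"hundred": 100, "thousand": 1000}

def words_to_number(phrase):
    """Map a spelled-out number to its canonical integer form."""
    total, current = 0, 0
    for word in phrase.lower().replace("-", " ").replace(" and ", " ").split():
        if word in UNITS:
            current += UNITS[word]
        elif word in SCALES:
            current = max(current, 1) * SCALES[word]
            if SCALES[word] >= 1000:       # close out a completed group
                total += current
                current = 0
    return total + current

print(words_to_number("one thousand three hundred and twenty seven"))  # 1327
print(words_to_number("twenty-seven"))                                  # 27
```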

©2002 Paula Matuszek Date Extraction l Absolute and relative dates l Produces canonical form. –“March 27, 2002” becomes 2002/03/27 –“tomorrow” becomes ref+0000/00/01 –“a year ago” becomes ref-0001/00/00 l Similar techniques and issues as for numbers
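
A small sketch producing the canonical forms above; the month table is truncated and the rules are stand-ins for iMiner's actual ones:

```python
import re

# Absolute dates become YYYY/MM/DD; relative expressions become
# ref+/-YYYY/MM/DD offsets, matching the examples on the slide.
RELATIVE = {
    "today": "ref+0000/00/00",
    "tomorrow": "ref+0000/00/01",
    "yesterday": "ref-0000/00/01",
    "a year ago": "ref-0001/00/00",
}
MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4}  # truncated

def canonical_date(expr):
    expr = expr.lower().strip()
    if expr in RELATIVE:
        return RELATIVE[expr]
    m = re.match(r"(\w+) (\d{1,2}), (\d{4})", expr)
    if m and m.group(1) in MONTHS:
        return f"{int(m.group(3)):04d}/{MONTHS[m.group(1)]:02d}/{int(m.group(2)):02d}"
    return None

print(canonical_date("March 27, 2002"))  # 2002/03/27
print(canonical_date("a year ago"))      # ref-0001/00/00
```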

©2002 Paula Matuszek Money Extraction l Recognizes currencies and produces canonical representation l Uses the number extractor l Examples –“twenty-seven dollars” becomes “27 dollars USA” –“DM 27” becomes “27 marks Germany”

©2002 Paula Matuszek Term Extraction l Identify other important terms found in text l Other major lexical clue for subject, especially if repeated. l May use output from other extractors in rules l Recognizes common lexical variants and reduces to canonical form -- stemming l Machine learning is much more important here

©2002 Paula Matuszek Term Extraction l Dictionary with parts of speech info for English l Pattern matching to find noun phrase structure typical of technical terms. l Feature repositories: –Authority dictionary: canonical forms, variants, correct feature map. Used BEFORE heuristics –Residue dictionary: complex feature type (name, term, pattern). Used AFTER heuristics l Authority and residue dictionaries trained

©2002 Paula Matuszek Information Quotient l Each feature (word, phrase, name) extracted is assigned an information quotient l Represents the significance of the feature in the document l TF-IDF: Term frequency-Inverse Document Frequency l Position information l Stop words
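
The exact IQ formula is proprietary, but its TF-IDF core can be sketched directly (the smoothing choice here is one common convention, not necessarily IBM's, and a full IQ would also fold in position information and stop-word filtering):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Term frequency x inverse document frequency over a token corpus.
    corpus is a list of token lists."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)     # document frequency
    idf = math.log(len(corpus) / (1 + df))           # +1 smoothing
    return tf * idf

corpus = [["text", "mining", "tools"], ["mining", "gold"], ["text", "editor"]]
print(tf_idf("mining", corpus[0], corpus))
```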

©2002 Paula Matuszek Feature Extraction Demo l Tool may be used for highlighting, etc., on documents to be displayed l Features extracted also form the basis for other tools l Note that this is not full information extraction, although it is a starting point

©2002 Paula Matuszek Other Features l Feature Extractor also identifies other features used by other text analysis tools: –sentence boundaries –paragraph boundaries –document tags –document structure –collection statistics

©2002 Paula Matuszek Summarizer Tools l Collection of sentences extracted from document l Characteristic of document content l Works best for well-structured documents l Can specify length l Must apply feature extraction first

©2002 Paula Matuszek Summarizer l Feature extractor run first l Words are ranked l Sentences are ranked l Highest ranked sentences are chosen l Configurable: for length of sentence, for word salience l Works best when document is part of a collection

©2002 Paula Matuszek Word Ranking l Words are scored if they –appear in structures such as titles and captions –occur more often in the document than in the collection (word salience) –occur more than once in the document l Score is –the salience if > threshold: tf*idf (by default) –a weighting factor if the word occurs in a title, heading, or caption

©2002 Paula Matuszek Sentence Ranking l Sentences are scored according to relevance and position in the document. l Sum of –Scores of individual words –Proximity of sentence to the beginning of its paragraph –“Bonus” for the final sentence in a long paragraph and the final paragraph in a long document –Proximity of paragraph to the beginning of the document l All configurable
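
A compact sketch combining the word ranking and sentence ranking just described. The position bonus and weights are invented, and a real implementation would also handle the paragraph-level bonuses:

```python
import math
from collections import Counter

def summarize(sentences, corpus_freq, n_docs, length=2):
    """Rank sentences by summed word salience (tf*idf) plus a position
    bonus, then return the top `length` in original order."""
    scored = []
    for pos, sent in enumerate(sentences):
        words = sent.lower().split()
        tf = Counter(words)
        salience = sum(
            (tf[w] / len(words)) * math.log(n_docs / (1 + corpus_freq.get(w, 0)))
            for w in tf
        )
        position_bonus = 1.0 / (1 + pos)    # earlier sentences score higher
        scored.append((salience + position_bonus, pos, sent))
    top = sorted(sorted(scored, reverse=True)[:length], key=lambda t: t[1])
    return [sent for _, _, sent in top]

sents = ["iMiner extracts features from text.",
         "It was raining.",
         "Extracted features drive summarization and clustering."]
print(summarize(sents, corpus_freq={"it": 80, "was": 85}, n_docs=100))
```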

©2002 Paula Matuszek Summarization Examples l Examples from IBM documentation

©2002 Paula Matuszek Some Common Statistical Measures (a brief digression) l TF x IDF l Pairwise and multiple-word phrase counts l Some other common statistical measures: –information gain: how many bits of information we gain by knowing that a term is present in a document –mutual information: how strongly a term’s occurrence is associated with a document category –term strength: how likely a term is to occur in both of two closely-related documents

©2002 Paula Matuszek Topic Categorization Tool l Assign documents to predetermined categories l Must first be trained –Training tool creates category scheme –Dictionary that stores significant vocabulary statistics l Output is list of possible categories and probabilities for each document l Can filter initial schema for faster processing

©2002 Paula Matuszek Features Used for Categorizing l Linguistic Features –Uses the features extracted by Feature Extraction tool l N-Grams –letter groupings and short words. –Can be used for non-English, because it doesn’t depend on heuristics –Used by Language categorizer

©2002 Paula Matuszek Document Categorizing l Individual document is analyzed for features l Features are compared to those determined for categories: –terms present/absent –IQ of terms –frequencies –document structure

©2002 Paula Matuszek Document Categorization l Important issue is determining which features! High dimensionality is expensive. l Ideally you want a small set of features which is –present in all documents of one category –absent in all other documents l In actuality, not that clean. So: –use features with relatively high separation –eliminate features which correlate very highly with other features (to reduce the dimension space)
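
A rough sketch of both filters, with made-up thresholds: keep features whose in-category vs. out-of-category presence differs enough, and drop any feature whose document-occurrence pattern duplicates one already kept:

```python
def select_features(docs, labels, target, min_separation=0.5):
    """docs: list of feature sets; labels: category per doc."""
    feats = sorted({f for d in docs for f in d})
    in_cat = [d for d, y in zip(docs, labels) if y == target]
    out_cat = [d for d, y in zip(docs, labels) if y != target]
    kept = []
    for f in feats:
        p_in = sum(f in d for d in in_cat) / max(len(in_cat), 1)
        p_out = sum(f in d for d in out_cat) / max(len(out_cat), 1)
        if p_in - p_out < min_separation:     # not enough separation
            continue
        # Redundancy check: skip f if it occurs in exactly the same
        # documents as a feature we already kept.
        pattern = tuple(f in d for d in docs)
        if any(pattern == tuple(k in d for d in docs) for k in kept):
            continue
        kept.append(f)
    return kept

docs = [{"gene", "protein"}, {"gene", "dna"}, {"stock", "market"}]
print(select_features(docs, ["bio", "bio", "finance"], "bio"))
```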

©2002 Paula Matuszek Categorization Demo l Typically categorization is a component in a system which then “does something” with the categorized documents l Ambiguous documents (not assigned to any one category with high probability) often indicate a new category evolving.

©2002 Paula Matuszek Clustering Tools l Organize documents without pre-existing categories l Hierarchical clustering –creates a tree where each leaf is a document, each cluster is positioned under the most similar cluster one step up l Binary Relational clustering –Creates a flat set of clusters with each document assigned to its best fit and relations between clusters captured

©2002 Paula Matuszek Hierarchical Clustering l Input is a set of documents l Output is a dendrogram –Root –Intermediate levels –Leaves –Links to actual documents l Slicing is used to create a manageable HTML tree

©2002 Paula Matuszek Steps in Hierarchical Clustering l Select linguistic preprocessing technique: determines “similarity” l Cluster documents: create dendrogram based on similarity l Define shape of tree with slicing technique and produce HTML output

©2002 Paula Matuszek Linguistic Preprocessing l Determining similarity between documents and clusters: how do we define “similar”? –Lexical affinity. Does not require any preprocessing –Linguistic features. Requires that the feature extractor be run first. l iMiner is either/or; you cannot combine the two methods of determining similarity

©2002 Paula Matuszek Clustering: Lexical Affinities l Lexical affinities: groups of words which appear frequently close together –created “on the fly” during a clustering task –word pairs –stemming and other morphological analysis –stop words l Results in documents with textual similarity being clustered together
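
One plausible reading of lexical affinities as windowed word-pair counts; the window size and stop list below are invented:

```python
from collections import Counter

def lexical_affinities(tokens, window=5, stop_words=frozenset({"the", "a", "of"})):
    """Count word pairs occurring within `window` content tokens of each
    other; frequent pairs serve as the document's affinity descriptors."""
    content = [t.lower() for t in tokens if t.lower() not in stop_words]
    pairs = Counter()
    for i, w in enumerate(content):
        for other in content[i + 1 : i + window]:
            if other != w:
                pairs[tuple(sorted((w, other)))] += 1
    return pairs.most_common()

text = "text mining tools support text mining of large text collections".split()
print(lexical_affinities(text)[:3])   # ('mining', 'text') should rank first
```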

©2002 Paula Matuszek Clustering: Linguistic Features l Linguistic features: Use features extracted by the feature extraction tool –Names of organizations –Domain Technical Terms –Names of Individuals l Can allow focusing on specific areas of interest l Best if you have some idea what you are interested in

©2002 Paula Matuszek Hierarchical Clustering Steps l Put each document in a cluster, characterized by its lexical or linguistic features l Merge the two most similar clusters l Continue until all clusters are merged
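
These three steps are classic agglomerative clustering. Here is a small sketch using Jaccard similarity over feature sets (iMiner's actual similarity measure is not documented here); the merge history is the dendrogram:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def agglomerate(docs):
    """Bottom-up clustering: repeatedly merge the two most similar
    clusters, recording each merge as one level of the dendrogram."""
    clusters = [({i}, feats) for i, feats in enumerate(docs)]
    history = []
    while len(clusters) > 1:
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]),
        )
        ids_j, feats_j = clusters.pop(j)    # pop j first: j > i
        ids_i, feats_i = clusters.pop(i)
        history.append((ids_i, ids_j, jaccard(feats_i, feats_j)))
        clusters.append((ids_i | ids_j, feats_i | feats_j))
    return history

docs = [{"gene", "protein"}, {"gene", "dna"}, {"stock", "market"}]
for left, right, sim in agglomerate(docs):
    print(left, "+", right, f"(similarity {sim:.2f})")
```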

©2002 Paula Matuszek Hierarchical Clustering: Slicing l The dendrogram is too big to be useful l Slicing reduces the size of the tree by merging clusters if they are “similar enough”. –top threshold: collapse any tree which exceeds it –bottom threshold: group under the root any cluster which falls below it –Remaining clusters make a new tree –# of steps sets the depth of the tree

©2002 Paula Matuszek Typical Slicing Parameters l Bottom –start around 5% or 10% similar –90% would mean only virtually identical documents get grouped l Top –good default is 90% –if you want only near-identical documents, set to 100% l Depth: –Typically 2 to 10 –Two would give you duplicates and the rest
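
The essential effect of a bottom threshold, undoing weak merges to leave a flat grouping, can be sketched against the merge history from the clustering example above. The union-find bookkeeping is an implementation convenience, not iMiner's algorithm, and thresholds are the percentages from the slide:

```python
def slice_clusters(n_docs, history, bottom=0.10):
    """Keep only merges at or above `bottom` similarity; everything
    else stays a separate cluster grouped under the root."""
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for left, right, sim in history:
        if sim >= bottom:                       # keep only strong merges
            parent[find(min(left))] = find(min(right))
    groups = {}
    for d in range(n_docs):
        groups.setdefault(find(d), set()).add(d)
    return list(groups.values())

# Merge history in the form produced by agglomerate() above.
history = [({0}, {1}, 0.33), ({0, 1}, {2}, 0.0)]
print(slice_clusters(3, history, bottom=0.10))   # [{0, 1}, {2}]
```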

©2002 Paula Matuszek Binary Relational Clustering l Binary Relational clustering –Creates a flat set of clusters –Each document assigned to its best fit –Relations between clusters captured l Similarity based on features extracted by Feature Extraction tool

©2002 Paula Matuszek Relational Clustering: Document Similarity l Based on comparison of descriptors –Frequent descriptors across collection given more weight: priority to wide topics –Rare descriptors given more weight: large number of very focused clusters –Both, with rare descriptors given slightly higher weight: relatively focused topics but fewer clusters l Descriptors are binary: present or absent
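
A sketch of the descriptor-weighting choice: idf-style weights favor rare descriptors (many focused clusters), while plain document-frequency weights favor wide topics. Both formulas are illustrative guesses, and descriptors are binary, as the slide says:

```python
import math

def descriptor_weights(docs, favor="rare"):
    n = len(docs)
    df = {}
    for d in docs:
        for f in d:
            df[f] = df.get(f, 0) + 1
    if favor == "rare":
        return {f: math.log(n / c) + 1 for f, c in df.items()}
    return {f: c / n for f, c in df.items()}   # favor frequent descriptors

def similarity(a, b, weights):
    """Weighted overlap between two binary descriptor sets."""
    shared = sum(weights[f] for f in a & b)
    total = sum(weights[f] for f in a | b)
    return shared / total if total else 0.0

docs = [{"gene", "protein"}, {"gene", "dna"}, {"stock", "market"}]
w = descriptor_weights(docs, favor="rare")
print(similarity(docs[0], docs[1], w))
```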

©2002 Paula Matuszek Relational Clustering l Descriptors are features extracted by the feature extraction tool. l Similarity threshold: at 100%, only identical documents are clustered l Max # of clusters: overrides the similarity threshold to get the number of clusters specified

©2002 Paula Matuszek Binary Relational Clustering Outputs l Outputs are –clusters: topics found, importance of topics, degree of similarity in cluster –links: sets of common descriptors between clusters

©2002 Paula Matuszek Clustering Demo l Patents from “class 395”: information processing system organization l 10% for top, 1% for bottom, total of 5 slices l lexical affinity

©2002 Paula Matuszek Summary l iMiner has a rich set of text mining tools l Product is well-developed, stable l No explicit user-modifiable knowledge base -- uses automated techniques and built-in KB to extract relevant information l Can be deployed to new domains without a lot of additional work l BUT not as effective in many domains as a tool with a good KB l No real information extraction capability

©2002 Paula Matuszek Information Extraction Overview l Given a body of text: extract from it some well-defined set of information l MUC (Message Understanding Conference) evaluations l Typically draws heavily on NLP l Three main components: –Domain knowledge base –Extraction engine –Knowledge model

©2002 Paula Matuszek Information Extraction Domain Knowledge Base l Terms: enumerated list of strings which are all members of some class. –“January”, “February” –“Smith”, “Wong”, “Martinez”, “Matuszek” –“lysine”, “alanine”, “cysteine” l Classes: general categories of terms –Month names, Last names, Amino acids –Capitalized nouns –Verb phrases

©2002 Paula Matuszek Domain Knowledge Base l Rules: LHS, RHS, salience l Left Hand Side (LHS): a pattern to be matched, written as relationships among terms and classes l Right Hand Side (RHS): an action to be taken when the pattern is found l Salience: priority of this rule (weight, strength, confidence)

©2002 Paula Matuszek Some Rule Examples: l &lt;firstname&gt; &lt;capitalized-word&gt; => &lt;person-name&gt; l &lt;person-name&gt; “born” &lt;date&gt; => print “Birthdate”, &lt;person-name&gt;, &lt;date&gt; l &lt;number&gt; &lt;street-name&gt; &lt;city&gt; => create address database record l &lt;number&gt; “/” &lt;number&gt; “/” &lt;number&gt; => create date database record (50) l &lt;month&gt; “/” &lt;day&gt; “/” &lt;year&gt; => create date database record (60) l &lt;capitalized-letter&gt; “.” => &lt;middle-initial&gt; l &lt;person-name&gt; &lt;relationship-verb&gt; &lt;person-name&gt; => create “relationship” database record
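
A toy version of the LHS/RHS/salience machinery: regexes stand in for the class-and-term patterns above, and the actions build database-style records. Patterns, saliences, and record fields are invented:

```python
import re

# A rule is (salience, LHS pattern, RHS action). Higher-salience rules
# are tried first, mirroring the prioritized dates above.
RULES = [
    (60, re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})"),
     lambda m: {"type": "date", "month": m.group(1),
                "day": m.group(2), "year": m.group(3)}),
    (50, re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b"),
     lambda m: {"type": "date", "month": m.group(1),
                "day": m.group(2), "year": "19" + m.group(3)}),
]

def extract(text):
    records = []
    for salience, lhs, rhs in sorted(RULES, reverse=True, key=lambda r: r[0]):
        for m in lhs.finditer(text):
            records.append((salience, rhs(m)))
    return records

print(extract("Born 3/27/1972."))   # only the salience-60 rule matches
```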

©2002 Paula Matuszek Generic KB l Generic KB: KB likely to be useful in many domains –names –dates –places –organizations l Almost all systems have one l Limited by cost of development: it takes about 200 rules to define dates reasonably well, for instance.

©2002 Paula Matuszek Domain-specific KB l We mostly can’t afford to build a KB for the entire world. l However, most applications are fairly domain-specific. l Therefore we build domain-specific KBs which identify the kind of information we are interested in. –Protein-protein interactions –airline flights –terrorist activities

©2002 Paula Matuszek Domain-specific KBs l Typically start with the generic KBs l Add terminology l Figure out what kinds of information you want to extract l Add rules to identify it l Test against documents which have been human-scored to determine precision and recall for individual items.

©2002 Paula Matuszek Knowledge Model l We aren’t looking for documents, we are looking for information. What information? l Typically we have a knowledge model or schema which identifies the information components we want and their relationships l Typically looks very much like a DB schema or object definition

©2002 Paula Matuszek Knowledge Model Examples l Personal records –Name –First name –Middle Initial –Last Name –Birthdate –Month –Day –Year –Address
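
Rendered as an object definition, the personal-record schema above might look like this (a hypothetical sketch; the field names follow the slide, the values are placeholders):

```python
from dataclasses import dataclass

@dataclass
class PersonRecord:
    """Knowledge-model schema for the personal-record example."""
    first_name: str
    middle_initial: str
    last_name: str
    birth_month: int
    birth_day: int
    birth_year: int
    address: str

rec = PersonRecord("Jane", "Q", "Doe", 3, 27, 1970, "123 Main St.")
print(rec)
```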

©2002 Paula Matuszek Knowledge Model Examples l Protein Inhibitors –Protein name (class?) –Compound name (class?) –Pointer to source –Cache of text –Offset into text

©2002 Paula Matuszek Knowledge Model Examples l Airline Flight Record –Airline –Flight l Number l Origin l Destination l Date »Status »departure time »arrival time

©2002 Paula Matuszek Summary l Text mining below the document level l NOT typically interactive, because it’s slow (1 to 100 meg of text/hr) l Typically builds up a DB of information which can then be queried l Uses a combination of term- and rule-driven analysis and NLP parsing. l AeroText: very good system developed by LMCO; we will get a complete demo on March 26.