Slide 1: iMiner Introduction
©2002 Paula Matuszek
Slide 2: iMiner from IBM
- Text mining tool with multiple components
- Text analysis tools include:
  - Language Identification Tool
  - Feature Extraction Tool
  - Summarizer Tool
  - Topic Categorization Tool
  - Clustering Tools
- http://www-4.ibm.com/software/data/iminer/fortext/index.html
- http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23engl/im4t23engl1.htm
Slide 3: iMiner for Text 2
- Basic technology includes:
  - an authority file of terms
  - heuristics for extracting additional terms
  - heuristics for extracting other features
  - dictionaries with part-of-speech information
  - partial parsing for part-of-speech tagging
  - a significance measure for terms: the Information Quotient (IQ)
- The knowledge base cannot be directly expanded by the end user
- Strong machine-learning component
Slide 4: Language Identification
- Can analyze:
  - an entire document
  - a text string input from the command line
- Currently handles about a dozen languages
- Can be trained: a machine-learning tool takes input text in the language to be learned
- Determines approximate proportions in bilingual documents
Slide 5: Language Identification
- Basically treated as a categorization problem, where each language is a category
- Training documents are processed to extract terms
- The importance of terms for categorization is determined statistically
- Dictionaries of weighted terms are used to determine the language of new documents (see the sketch below)
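To make the approach concrete, here is a minimal Python sketch of scoring a text against per-language weighted term dictionaries. The toy words and weights are hypothetical illustrations, not iMiner's actual dictionaries.

```python
# Hypothetical per-language dictionaries of statistically weighted terms.
LANGUAGE_DICTIONARIES = {
    "english": {"the": 0.9, "and": 0.7, "of": 0.6},
    "german":  {"der": 0.9, "und": 0.8, "die": 0.7},
    "french":  {"le": 0.8, "et": 0.7, "les": 0.6},
}

def identify_language(text: str) -> dict[str, float]:
    """Score each language by summing the weights of its terms found in
    the text, then normalize so the scores sum to 1 (which gives rough
    proportions for bilingual documents)."""
    tokens = text.lower().split()
    scores = {
        lang: sum(weights.get(tok, 0.0) for tok in tokens)
        for lang, weights in LANGUAGE_DICTIONARIES.items()
    }
    total = sum(scores.values()) or 1.0
    return {lang: s / total for lang, s in scores.items()}

print(identify_language("the cat und der hund"))
```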
Slide 6: Feature Extraction
- Locates and categorizes relevant features in text
- Some features are themselves of interest
- Features are also the starting point for other tools such as classifiers and categorizers
- Features may or may not be "meaningful" to a person
- The goal is to find aspects of a document which somehow characterize it
Slide 7: Name Extraction
- Extracting proper names:
  - people, places, organizations
  - valuable clues to the subject of a text
- Dictionaries of canonical forms
- Additional names are extracted from documents:
  - parsing finds tokens
  - additional parsing groups tokens into noun phrases
  - rules identify tokens which are names
  - variant groups are assigned a canonical name: the most explicit variant found in the document
Slide 8: Examples for Name Extraction
- "This subject is taught by Paula Matuszek."
  - Recognize Paula as a first name of a person.
  - Recognize Matuszek as a capitalized word following a first name.
  - Therefore "Paula Matuszek" is probably the name of a person.
- "This subject is taught by Villanova University."
  - Recognize Villanova as a probable name based on capitalization.
  - Recognize University as a term which normally names an institution.
  - Therefore "Villanova University" is probably the name of an institution.
- "This subject is taught by Howard University."
  - BOTH of these sets of rules could apply, so the rules need to be prioritized to determine the more likely parse.
Slide 9: Other Rule Examples
- Dr., Mr., Ms. are titles, and titles followed by capitalized words frequently indicate names. If followed by only one word, that word is the last name.
- A capitalized word followed by a single capitalized letter followed by another capitalized word is probably FN MI LN (first name, middle initial, last name).
- Nouns can be names. Verbs can't. (See the sketch below.)
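A minimal sketch of how such prioritized rules might be coded. The patterns, salience values, and the tiny first-name dictionary are all hypothetical; iMiner's actual rule set is not shown in these slides.

```python
import re

FIRST_NAMES = {"Paula", "Howard", "Owen"}   # toy first-name dictionary

# (salience, pattern, label): higher salience wins when rules overlap.
RULES = [
    (60, re.compile(r"\b(?:Dr|Mr|Ms)\.\s+([A-Z][a-z]+)\b"), "person"),
    (50, re.compile(r"\b[A-Z][a-z]+\s+[A-Z]\.\s+[A-Z][a-z]+\b"), "person"),  # FN MI LN
    (40, re.compile(r"\b[A-Z][a-z]+\s+(?:University|Institute|College)\b"), "institution"),
]

def extract_names(text: str):
    matches = []
    for salience, pattern, label in RULES:
        for m in pattern.finditer(text):
            matches.append((salience, m.group(0), label))
    # Dictionary rule: a known first name followed by a capitalized word.
    for m in re.finditer(r"\b([A-Z][a-z]+)\s+[A-Z][a-z]+\b", text):
        if m.group(1) in FIRST_NAMES:
            matches.append((55, m.group(0), "person"))
    return sorted(matches, reverse=True)    # highest salience first

print(extract_names("This subject is taught by Paula Matuszek at Howard University."))
```

On "Howard University" both the first-name rule and the institution rule fire; the salience ordering decides which parse is preferred.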
Slide 10: Abbreviation/Acronym Extraction
- A fruitful source of variants for names and terms
- Existing dictionary of common terms
- A name followed by "(" [A-Z]+ ")" probably gives an abbreviation (see the sketch below)
- Conventions regarding word-internal case and prefixes: "MSDOS" matches "MicroSoft DOS", "GB" matches "gigabyte"
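A minimal sketch of the parenthesized-acronym convention, using a simple initial-letter matching heuristic. This is an illustrative assumption, not iMiner's actual matching rules.

```python
import re

def find_abbreviations(text: str) -> dict[str, str]:
    """Map each parenthesized acronym to the preceding name whose word
    initials spell it out."""
    out = {}
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        acro = m.group(1)
        words = text[:m.start()].split()
        # Take as many trailing words as the acronym has letters and
        # check that their initials spell the acronym.
        cand = words[-len(acro):]
        if len(cand) == len(acro) and all(w[0].upper() == c for w, c in zip(cand, acro)):
            out[acro] = " ".join(cand)
    return out

print(find_abbreviations("The World Health Organization (WHO) issued a report."))
```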
Slide 11: Number Extraction
- Useful primarily to improve the performance of other extractors
- Variant expressions of numbers:
  - one thousand three hundred and twenty seven
  - thirteen twenty seven
  - 1327
- Other numeric expressions:
  - twenty-seven percent
  - 27%
- Base forms are easy; most of the effort goes into variants and determining the canonical form based on rules (see the sketch below)
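As a small worked example, here is a sketch of reducing a spelled-out number to a canonical integer. It handles only the fully spelled-out form; real extractors cover many more variants (such as "thirteen twenty seven").

```python
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
         "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
         "twelve": 12, "thirteen": 13, "twenty": 20, "thirty": 30}
SCALES = {"hundred": 100, "thousand": 1000, "million": 1_000_000}

def words_to_number(phrase: str) -> int:
    """'one thousand three hundred and twenty seven' -> 1327."""
    total, current = 0, 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in SCALES:
            scale = SCALES[word]
            current = max(current, 1) * scale
            if scale >= 1000:        # commit a completed thousands group
                total += current
                current = 0
        else:
            raise ValueError(f"unknown number word: {word}")
    return total + current

print(words_to_number("one thousand three hundred and twenty seven"))  # 1327
```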
Slide 12: Date Extraction
- Absolute and relative dates
- Produces a canonical form:
  - March 27, 1997 → 1997/03/27
  - tomorrow → ref+0000/00/01
  - a year ago → ref-0001/00/00
- Similar techniques and issues as for numbers (see the sketch below)
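A minimal sketch of reducing absolute and relative dates to the canonical forms shown above. The ref+/ref- notation follows the slide; the regex and the small relative-date table are illustrative assumptions.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

RELATIVE = {"today": "ref+0000/00/00", "tomorrow": "ref+0000/00/01",
            "yesterday": "ref-0000/00/01", "a year ago": "ref-0001/00/00"}

def canonical_date(expr: str) -> str:
    expr = expr.strip().lower()
    if expr in RELATIVE:
        return RELATIVE[expr]
    m = re.fullmatch(r"([a-z]+)\s+(\d{1,2}),\s*(\d{4})", expr)  # "March 27, 1997"
    if m and m.group(1) in MONTHS:
        return f"{m.group(3)}/{MONTHS[m.group(1)]:02d}/{int(m.group(2)):02d}"
    raise ValueError(f"unrecognized date: {expr}")

print(canonical_date("March 27, 1997"))  # 1997/03/27
print(canonical_date("tomorrow"))        # ref+0000/00/01
```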
Slide 13: Money Extraction
- Recognizes currencies and produces a canonical representation
- Uses the number extractor
- Examples:
  - "twenty-seven dollars" → "27.000 dollars USA"
  - "DM 27" → "27.000 marks Germany"
Slide 14: Term Extraction
- Identifies other important terms found in text
- The other major lexical clue to subject, especially if a term is repeated
- May use output from other extractors in its rules
- Recognizes common lexical variants and reduces them to a canonical form (stemming)
- Machine learning is much more important here
Slide 15: Term Extraction
- Dictionary with part-of-speech information for English
- Pattern matching to find the noun-phrase structure typical of technical terms
- Feature repositories:
  - Authority dictionary: canonical forms, variants, correct feature map. Used BEFORE the heuristics.
  - Residue dictionary: complex feature types (name, term, pattern). Used AFTER the heuristics.
- Both the authority and residue dictionaries are trained
Slide 16: Information Quotient
- Each feature (word, phrase, name) extracted is assigned an information quotient (IQ)
- Represents the significance of the feature in the document
- TF-IDF: term frequency times inverse document frequency (see the sketch below)
- Position information
- Stop words
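As a worked example, a minimal TF*IDF computation over a toy collection. This shows plain TF*IDF only; the position and stop-word factors that also feed the IQ are omitted.

```python
import math

def tf_idf(term: str, doc: list[str], collection: list[list[str]]) -> float:
    """Term frequency in the document times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in collection if term in d)
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

docs = [["text", "mining", "tools"], ["mining", "gold"], ["text", "text", "tools"]]
print(tf_idf("mining", docs[0], docs))
```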
Slide 17: Feature Extraction Demo
- The tool may be used for highlighting, etc., on documents to be displayed
- The features extracted also form the basis for other tools
- Note that this is not full information extraction, although it is a starting point
- http://www-4.ibm.com/software/data/iminer/fortext/extract/extractDemo.html
Slide 18: Other Features
- The Feature Extractor also identifies other features used by other text analysis tools:
  - sentence boundaries
  - paragraph boundaries
  - document tags
  - document structure
  - collection statistics
Slide 19: Summarizer Tools
- A summary is a collection of sentences extracted from the document
- Characteristic of the document's content
- Works best for well-structured documents
- The summary length can be specified
- Feature extraction must be applied first
Slide 20: Summarizer
- The feature extractor is run first
- Words are ranked
- Sentences are ranked
- The highest-ranked sentences are chosen
- Configurable for sentence length and word salience
- Works best when the document is part of a collection
Slide 21: Word Ranking
- Words are scored if they:
  - appear in structures such as titles and captions
  - occur more often in the document than in the collection (word salience)
  - occur more than once in a document
- The score is:
  - the salience, if above a threshold: tf*idf by default
  - a weighting factor if the word occurs in a title, heading, or caption
(see the sketch below)
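A minimal sketch of the scoring rule above. The threshold, weighting factor, and repetition bonus are hypothetical values; `salience` is the tf*idf value from the previous slide.

```python
def word_score(salience: float, count_in_doc: int, in_title: bool,
               threshold: float = 0.05, title_weight: float = 2.0) -> float:
    # Score is the salience only if it clears the threshold.
    score = salience if salience > threshold else 0.0
    if in_title:
        score *= title_weight       # weighting factor for titles/captions
    if count_in_doc > 1:
        score += 0.1 * salience     # small bonus for repeated occurrence
    return score

print(word_score(salience=0.12, count_in_doc=3, in_title=True))
```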
Slide 22: Sentence Ranking
- Sentences are scored according to their relevance and position in the document
- The score is the sum of:
  - the scores of the individual words
  - proximity of the sentence to the beginning of its paragraph
  - a "bonus" for the final sentence of a long paragraph and the final paragraph of a long document
  - proximity of the paragraph to the beginning of the document
- All configurable (see the sketch below)
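A minimal sketch of sentence ranking as a sum of word scores plus position bonuses. Plain word frequencies stand in for the salience scores of the previous slide, and the bonus values are hypothetical; the long-paragraph/final-sentence bonus is omitted for brevity.

```python
def rank_sentences(paragraphs: list[list[str]]) -> list[tuple[float, str]]:
    """paragraphs is a list of paragraphs, each a list of sentences."""
    words = [w.lower().strip(".,") for p in paragraphs for s in p for w in s.split()]
    freq = {w: words.count(w) for w in set(words)}
    ranked = []
    for p_idx, paragraph in enumerate(paragraphs):
        for s_idx, sentence in enumerate(paragraph):
            score = sum(freq[w.lower().strip(".,")] for w in sentence.split())
            score += 2.0 / (s_idx + 1)   # proximity to start of its paragraph
            score += 1.0 / (p_idx + 1)   # proximity to start of the document
            ranked.append((score, sentence))
    return sorted(ranked, reverse=True)

doc = [["Text mining finds patterns in text.", "It is widely used."],
       ["Mining tools extract features.", "Features characterize text."]]
for score, sent in rank_sentences(doc)[:2]:   # the two highest-ranked sentences
    print(round(score, 2), sent)
```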
Slide 23: Summarization Examples
- Examples from the IBM documentation
- http://www-4.ibm.com/software/data/iminer/fortext/summarize/summarizeDemo.html
Slide 24: Some Common Statistical Measures (a brief digression)
- TF x IDF
- Pairwise and multiple-word phrase counts
- Some other common statistical measures:
  - information gain: how many bits of information we gain by knowing that a term is present in a document (see the sketch below)
  - mutual information: how much knowing that a term occurs tells us about the document
  - term strength: the likelihood that a term will occur in both of two closely related documents
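A minimal worked example of information gain for a term over a labeled collection, using the standard entropy-difference definition the bullet summarizes; the toy documents and labels are invented for illustration.

```python
import math

def entropy(labels: list[str]) -> float:
    return -sum((p := labels.count(l) / len(labels)) * math.log2(p)
                for l in set(labels))

def information_gain(term: str, docs: list[set[str]], labels: list[str]) -> float:
    """Entropy of the labels minus the entropy after splitting the
    collection on whether the term is present."""
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    split = sum(len(part) / len(labels) * entropy(part)
                for part in (with_term, without) if part)
    return entropy(labels) - split

docs = [{"stocks", "bonds"}, {"stocks", "gold"}, {"soccer", "goal"}]
labels = ["finance", "finance", "sports"]
print(information_gain("stocks", docs, labels))   # ~0.918 bits
```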
Slide 25: Topic Categorization Tool
- Assigns documents to predetermined categories
- Must first be trained:
  - the training tool creates the category scheme
  - a dictionary stores significant vocabulary statistics
- Output is a list of possible categories, with probabilities, for each document
- The initial schema can be filtered for faster processing
Slide 26: Features Used for Categorizing
- Linguistic features:
  - uses the features extracted by the Feature Extraction tool
- N-grams:
  - letter groupings and short words
  - can be used for non-English text, because it doesn't depend on heuristics
  - used by the language categorizer
Slide 27: Document Categorizing
- Each individual document is analyzed for features
- The features are compared to those determined for the categories:
  - terms present/absent
  - IQ of terms
  - frequencies
  - document structure
Slide 28: Document Categorization
- An important issue is determining which features to use: high dimensionality is expensive
- Ideally you want a small set of features which is:
  - present in all documents of one category
  - absent in all other documents
- In actuality it is not that clean, so:
  - use features with relatively high separation
  - eliminate any feature which correlates very highly with another feature (to reduce the dimension space)
(a sketch of both steps follows)
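A minimal sketch of the two pruning steps: keep features with high separation between the target category and the rest, then drop features that agree almost perfectly with a feature already kept. The thresholds and the binary presence vectors are hypothetical.

```python
def _overlap(a: list[int], b: list[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

def select_features(feature_vectors: dict[str, list[int]],
                    labels: list[str],
                    min_separation: float = 0.6,
                    max_overlap: float = 0.95) -> list[str]:
    kept: list[str] = []
    for name, vec in feature_vectors.items():
        in_cat = [v for v, l in zip(vec, labels) if l == "target"]
        out_cat = [v for v, l in zip(vec, labels) if l != "target"]
        separation = abs(sum(in_cat) / max(len(in_cat), 1)
                         - sum(out_cat) / max(len(out_cat), 1))
        if separation < min_separation:
            continue                      # weak separator: discard
        if any(_overlap(vec, feature_vectors[k]) > max_overlap for k in kept):
            continue                      # nearly duplicates a kept feature
        kept.append(name)
    return kept

features = {"stocks": [1, 1, 0, 0], "bonds": [1, 1, 0, 0], "goal": [0, 0, 1, 1]}
labels = ["target", "target", "other", "other"]
print(select_features(features, labels))  # ['stocks', 'goal']; 'bonds' is redundant
```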
Slide 29: Categorization Demo
- Typically categorization is a component in a system which then "does something" with the categorized documents
- Ambiguous documents (not assigned to any one category with high probability) often indicate a new category evolving
- http://www-4.ibm.com/software/data/iminer/fortext/categorize/categorize.
Slide 30: Clustering Tools
- Organize documents without pre-existing categories
- Hierarchical clustering:
  - creates a tree where each leaf is a document and each cluster is positioned under the most similar cluster one step up
- Binary relational clustering:
  - creates a flat set of clusters, with each document assigned to its best fit and relations between clusters captured
Slide 31: Hierarchical Clustering
- Input is a set of documents
- Output is a dendrogram:
  - root
  - intermediate levels
  - leaves
  - links to the actual documents
- Slicing is used to create a manageable HTML tree
Slide 32: Steps in Hierarchical Clustering
- Select a linguistic preprocessing technique: this determines "similarity"
- Cluster the documents: create a dendrogram based on similarity
- Define the shape of the tree with a slicing technique and produce HTML output
Slide 33: Linguistic Preprocessing
- Determining similarity between documents and clusters: how do we define "similar"?
  - Lexical affinity: does not require any preprocessing
  - Linguistic features: requires that the feature extractor be run first
- iMiner is either/or; you cannot combine the two methods of determining similarity
Slide 34: Clustering: Lexical Affinities
- Lexical affinities: groups of words which frequently appear close together
  - created "on the fly" during a clustering task
  - word pairs
  - stemming and other morphological analysis
  - stop words
- The result is that documents with textual similarity are clustered together
Slide 35: Clustering: Linguistic Features
- Linguistic features: use the features extracted by the feature extraction tool
  - names of organizations
  - domain technical terms
  - names of individuals
- Allows focusing on specific areas of interest
- Best if you have some idea what you are interested in
Slide 36: Hierarchical Clustering Steps
- Put each document in its own cluster, characterized by its lexical or linguistic features
- Merge the two most similar clusters
- Continue until all clusters are merged (see the sketch below)
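A minimal sketch of the bottom-up merging loop, with Jaccard overlap of feature sets standing in for iMiner's similarity measure (which is not specified here).

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def agglomerate(docs: dict[str, frozenset]):
    # Each document starts as its own cluster: (member names, combined features).
    clusters = [({name}, feats) for name, feats in docs.items()]
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        pairs = [(jaccard(c1[1], c2[1]), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        sim, i, j = max(pairs)
        (m1, f1), (m2, f2) = clusters[i], clusters[j]
        merges.append((m1 | m2, round(sim, 2)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((m1 | m2, f1 | f2))
    return merges   # the merge order traces out the dendrogram

docs = {"d1": frozenset({"text", "mining"}),
        "d2": frozenset({"text", "mining", "tools"}),
        "d3": frozenset({"gold", "mining"})}
print(agglomerate(docs))
```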
Slide 37: Hierarchical Clustering: Slicing
- The full dendrogram is too big to be useful
- Slicing reduces the size of the tree by merging clusters that are "similar enough" (see the sketch below):
  - top threshold: collapse any subtree which exceeds it
  - bottom threshold: group under the root any cluster which falls below it
  - the remaining clusters make a new tree
  - the number of steps sets the depth of the tree
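One possible reading of the two thresholds, as a sketch: subtrees above the top threshold are collapsed into single clusters, and nodes below the bottom threshold are dissolved so their children hang directly off the root. The tree encoding, threshold semantics, and values are assumptions, not iMiner's actual data structures.

```python
def collect(node):
    """Gather all leaf documents under a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(collect(c) for c in node[1]))

def slice_tree(node, top=0.90, bottom=0.05):
    """node is (similarity, children); children are nodes or leaf strings.
    Returns a flat list of document clusters."""
    sim, children = node
    if sim >= top:                     # similar enough: collapse the subtree
        return [collect(node)]
    clusters = []
    for child in children:
        if isinstance(child, str):     # a leaf document
            clusters.append({child})
        elif child[0] < bottom:        # too dissimilar: promote its children
            clusters.extend(slice_tree(child, top, bottom))
        else:
            clusters.append(collect(child))
    return clusters

tree = (0.04, [(0.95, ["d1", "d2"]), (0.50, ["d3", (0.92, ["d4", "d5"])])])
print(slice_tree(tree))   # [{'d1', 'd2'}, {'d3', 'd4', 'd5'}]
```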
Slide 38: Typical Slicing Parameters
- Bottom:
  - start around 5% or 10% similarity
  - 90% would mean only virtually identical documents get grouped
- Top:
  - a good default is 90%
  - if you want only truly identical documents, set it to 100%
- Depth:
  - typically 2 to 10
  - two would give you the duplicates and the rest
Slide 39: Binary Relational Clustering
- Binary relational clustering:
  - creates a flat set of clusters
  - each document is assigned to its best fit
  - relations between clusters are captured
- Similarity is based on the features extracted by the Feature Extraction tool
Slide 40: Relational Clustering: Document Similarity
- Based on a comparison of descriptors:
  - frequent descriptors across the collection given more weight: priority to wide topics
  - rare descriptors given more weight: a large number of very focused clusters
  - both, with rare descriptors given slightly higher weight: relatively focused topics but fewer clusters
- Descriptors are binary: present or absent (see the sketch below)
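A minimal sketch of descriptor-weighted similarity between binary descriptor sets. The three weighting modes mirror the bullets above, but the weight formulas themselves are invented for illustration.

```python
def descriptor_weight(descriptor: str, collection: list[set],
                      mode: str = "both") -> float:
    df = sum(1 for doc in collection if descriptor in doc)
    frequent = df / len(collection)
    rare = 1 - frequent
    if mode == "frequent":                  # priority to wide topics
        return frequent
    if mode == "rare":                      # many very focused clusters
        return rare
    return 0.45 * frequent + 0.55 * rare    # "both", rare weighted slightly higher

def similarity(a: set, b: set, collection: list[set], mode: str = "both") -> float:
    """Sum the weights of the descriptors two documents share."""
    return sum(descriptor_weight(d, collection, mode) for d in a & b)

docs = [{"text", "mining"}, {"text", "tools"}, {"gold", "mining"}]
print(similarity(docs[0], docs[2], docs))   # weight of the shared 'mining'
```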
Slide 41: Relational Clustering
- Descriptors are the features extracted by the feature extraction tool
- Similarity threshold: at 100%, only identical documents are clustered
- Maximum number of clusters: overrides the similarity threshold to produce the number of clusters specified
Slide 42: Binary Relational Clustering Outputs
- Outputs are:
  - clusters: the topics found, the importance of the topics, and the degree of similarity within each cluster
  - links: sets of common descriptors between clusters
Slide 43: Clustering Demo
- Patents from "class 395": information processing system organization
- 10% for top, 1% for bottom, a total of 5 slices
- Lexical affinity
- http://www-4.ibm.com/software/data/iminer/fortext/cluster/clusterDemo.html
Slide 44: Summary
- iMiner has a rich set of text mining tools
- The product is well-developed and stable
- No explicit user-modifiable knowledge base: it uses automated techniques and a built-in KB to extract relevant information
- Can be deployed to new domains without a lot of additional work
- BUT it is not as effective in many domains as a tool with a good KB
- No real information extraction capability
Slide 45: Information Extraction Overview
- Given a body of text, extract from it some well-defined set of information
- MUC conferences
- Typically draws heavily on NLP
- Three main components:
  - domain knowledge base
  - extraction engine
  - knowledge model
Slide 46: Information Extraction Domain Knowledge Base
- Terms: an enumerated list of strings which are all members of some class
  - "January", "February"
  - "Smith", "Wong", "Martinez", "Matuszek"
  - "lysine", "alanine", "cysteine"
- Classes: general categories of terms
  - month names, last names, amino acids
  - capitalized nouns
  - verb phrases
Slide 47: Domain Knowledge Base
- Rules have an LHS, an RHS, and a salience:
  - Left Hand Side (LHS): a pattern to be matched, written as relationships among terms and classes
  - Right Hand Side (RHS): an action to be taken when the pattern is found
  - Salience: the priority of this rule (weight, strength, confidence)
Slide 48: Some Rule Examples
- <…> => <…>
- <…> => print "Birthdate", <…>, <…>
- <…> => create address database record
- <…> "/" <…> "/" <…> => create date database record (50)
- <…> "/" <…> "/" <…> => create date database record (60)
- <…> "." <…> => <…>
- <…> => create "relationship" database record
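A minimal sketch of LHS => RHS rules with salience, in the spirit of the examples above. The specific patterns, actions, and salience values here are hypothetical; regexes stand in for the term/class relationships of a real KB.

```python
import re

RULES = [
    # (salience, LHS pattern, RHS action)
    (60, re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"),
     lambda m: {"type": "date", "month": m.group(1), "day": m.group(2),
                "year": m.group(3)}),
    (50, re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+) was born\b"),
     lambda m: {"type": "birth", "person": m.group(1)}),
]

def extract(text: str) -> list[dict]:
    records = []
    # Apply rules in salience order: higher-priority rules fire first.
    for salience, lhs, rhs in sorted(RULES, key=lambda r: r[0], reverse=True):
        for m in lhs.finditer(text):
            records.append(rhs(m))   # the RHS action fires on each LHS match
    return records

print(extract("John Smith was born on 3/27/1960."))
```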
Slide 49: Generic KB
- A generic KB is one likely to be useful in many domains:
  - names
  - dates
  - places
  - organizations
- Almost all systems have one
- Limited by the cost of development: it takes about 200 rules to define dates reasonably well, for instance
Slide 50: Domain-specific KB
- We mostly can't afford to build a KB for the entire world
- However, most applications are fairly domain-specific
- Therefore we build domain-specific KBs which identify the kind of information we are interested in:
  - protein-protein interactions
  - airline flights
  - terrorist activities
Slide 51: Domain-specific KBs
- Typically start with the generic KBs
- Add terminology
- Figure out what kinds of information you want to extract
- Add rules to identify it
- Test against documents which have been human-scored, to determine precision and recall for individual items
Slide 52: Knowledge Model
- We aren't looking for documents; we are looking for information. What information?
- Typically we have a knowledge model, or schema, which identifies the information components we want and their relationships
- Typically looks very much like a DB schema or object definition
Slide 53: Knowledge Model Examples
- Personal records:
  - Name
    - First name
    - Middle initial
    - Last name
  - Birthdate
    - Month
    - Day
    - Year
  - Address
Slide 54: Knowledge Model Examples
- Protein inhibitors:
  - Protein name (class?)
  - Compound name (class?)
  - Pointer to source
  - Cache of text
  - Offset into text
Slide 55: Knowledge Model Examples
- Airline flight record (see the sketch below):
  - Airline
  - Flight
    - Number
    - Origin
    - Destination
    - Date
      - Status
      - Departure time
      - Arrival time
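A minimal sketch of the flight record as an object definition, illustrating the slide's point that a knowledge model looks like a DB schema or class. The slide gives only the field names; the field types and the sample values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Flight:
    number: str
    origin: str
    destination: str
    date: str            # canonical yyyy/mm/dd form
    status: str
    departure_time: str
    arrival_time: str

@dataclass
class AirlineFlightRecord:
    airline: str
    flight: Flight

record = AirlineFlightRecord(
    airline="Acme Air",
    flight=Flight("AA123", "PHL", "ORD", "1997/03/27",
                  "on time", "08:15", "09:40"))
print(record.flight.destination)
```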
Slide 56: Summary
- Information extraction is text mining below the document level
- NOT typically interactive, because it's slow (1 to 100 MB of text per hour)
- Typically builds up a DB of information which can then be queried
- Uses a combination of term- and rule-driven analysis and NLP parsing
- AeroText: a very good system developed by LMCO; we will get a complete demo on March 26