Published by Lisa Carson. Modified over 9 years ago.
1
Some Commercial Text Mining Systems Xuanhui Wang UIUC March 29th, 2007
2
Why Text Mining? A large portion of all available information today exists in the form of unstructured texts (information overload). –Books, magazine articles, research papers, product manuals, memorandums, e-mails, and of course the Web, all contain textual information in the natural language form. A lot of critical information is in the textual format –The voice of customers -- customer email, customer complaints –Product reviews Thus, making correct decisions often requires analyzing large volumes of textual information – Business Intelligence
3
Text Mining (From Wikipedia) Refers generally to the process of deriving high-quality information from text. High-quality information is typically derived through the divining of patterns and trends through means such as statistical pattern learning. Process –structuring the input text –deriving patterns within the structured data –finally, evaluating and interpreting the output Tasks –text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and entity relation modeling
4
Structuring the input text: Information Extraction Named Entity recognition (NE) –Finds and classifies names, places, etc. Coreference resolution (CO) –Identifies identity relations between entities. Template Element construction (TE) –Adds descriptive information to NE results (using CO). Template Relation construction (TR) –Finds relations between TE entities. Scenario Template production (ST) –Fits TE and TR results into specified event scenarios.
5
Dummy Example “The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.” NE discovers that the entities present are the rocket, Tuesday, Dr. Head and We Build Rockets Inc. CO discovers that “It” refers to the rocket. TE discovers that the rocket is shiny red and that it is Head’s brainchild. TR discovers that Dr. Head works for We Build Rockets Inc. ST discovers that there was a rocket launching event in which the various entities were involved.
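The five stages on this slide can be sketched as a toy pipeline over the example text. This is a minimal illustration under loud assumptions: the gazetteer, the regular expressions, the relation name `works_for` and the event name `rocket_launch` are all hypothetical and hand-written for this one passage. Real IE systems use far richer linguistic analysis.

```python
import re

TEXT = ("The shiny red rocket was fired on Tuesday. "
        "It is the brainchild of Dr. Big Head. "
        "Dr. Head is a staff scientist at We Build Rockets Inc.")

# Hypothetical hand-written gazetteer: surface form -> entity type.
GAZETTEER = {
    "rocket": "ARTIFACT",
    "Tuesday": "DATE",
    "Dr. Big Head": "PERSON",
    "Dr. Head": "PERSON",
    "We Build Rockets Inc.": "ORGANIZATION",
}

def named_entities(text):
    """NE: find gazetteer entries (case-sensitive), in order of appearance."""
    hits = []
    for surface, etype in GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            hits.append((m.start(), surface, etype))
    return [(surface, etype) for _, surface, etype in sorted(hits)]

def coreference(entities):
    """CO: map mentions to a canonical mention using two toy rules."""
    canon = {}
    persons = [s for s, t in entities if t == "PERSON"]
    for s in persons:
        # Person mentions sharing the same last token are treated as one person.
        canon[s] = next(p for p in persons if p.split()[-1] == s.split()[-1])
    artifacts = [s for s, t in entities if t == "ARTIFACT"]
    if artifacts:
        # Crude pronoun rule: "It" refers to the last artifact found.
        canon["It"] = artifacts[-1]
    return canon

def template_elements(text):
    """TE: attach a description to the rocket (one toy pattern)."""
    m = re.search(r"The ([\w ]+) rocket\b", text)
    return {"rocket": {"description": m.group(1)}} if m else {}

def template_relations(text, canon):
    """TR: extract an employment relation, resolving the person via CO."""
    m = re.search(r"(Dr\. [\w ]+?) is a staff scientist at ([\w ]+ Inc\.)", text)
    if not m:
        return []
    person = canon.get(m.group(1), m.group(1))
    return [(person, "works_for", m.group(2))]

def scenario_template(text, relations):
    """ST: fill a launch-event template from the earlier results."""
    m = re.search(r"rocket was fired on (\w+)", text)
    if not m:
        return None
    return {"event": "rocket_launch", "date": m.group(1),
            "participants": [r[0] for r in relations]}

entities = named_entities(TEXT)
canon = coreference(entities)
elements = template_elements(TEXT)
relations = template_relations(TEXT, canon)
event = scenario_template(TEXT, relations)
```

Each stage consumes the output of the earlier ones, which is why CO runs before TR: the relation is attached to the canonical person mention rather than to the shortened alias.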
6
Some Systems: Attensity, Inxight, Anderson, ClearForest, TextAnalyst, Linguamatics
7
Attensity http://www.attensity.com/ Founded in early 2000 Culmination of over a decade of research in computational linguistics at the University of Utah The technology allows users to extract and analyze facts like who, what, where, when and why Allows users to drill down to understand people, places and events and how they are related It then creates output in XML and in a structured relational data format that is fused with existing structured data
8
Architecture
9
Attensity: Information Extraction Engine The foundation of all the applications Target extraction –When you know what you are looking for –Entity and event definitions –Creating rules and dictionaries specific to your particular domain –Graphical user interface that allows users to rapidly create definitions Exhaustive extraction –When you are trying to understand what is in your text and you don't exactly know what you are looking for
10
Attensity: Applications Discovery –Mining relations: uncover who, what, where, when, and why Analytics –Lets users drill down –Visualization tools to slice, dice and analyze important facts –Aggregations of facts Text search –Allows approximate matching of query words –Seamlessly combined with the text analysis Classify –Enables users to define document groups Alert –Provides timely visibility into frequent and emerging issues –For product problems, triggers emails or notifications http://www.attensity.com/www/products/applications.php
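The slide does not say how Attensity implements approximate matching of query words. As a stand-in, the sketch below uses Python's standard `difflib.get_close_matches` to tolerate misspelled query terms. The function name `approximate_search`, the sample documents and the 0.8 cutoff are assumptions made for illustration.

```python
import difflib

def approximate_search(docs, query, cutoff=0.8):
    """Return the documents that contain a word similar to `query`.

    Similarity is difflib's ratio on lowercased words; `cutoff` is the
    minimum ratio (0..1) for a word to count as a match.
    """
    hits = []
    for doc in docs:
        words = doc.lower().split()
        if difflib.get_close_matches(query.lower(), words, n=1, cutoff=cutoff):
            hits.append(doc)
    return hits

docs = ["Warranty claim rejected", "Engine noise reported"]
found = approximate_search(docs, "warrenty")  # misspelled query
```

Lowering the cutoff admits looser matches, which raises recall at the cost of precision.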
11
Examples Using Attensity Attensity boasts customers within Global 2000 organizations as well as government agencies Warranty Improvement –reviewing warranty data contained in unstructured, text-based sources such as technician reports, customer surveys and dealer-provided information (to reduce warranty cost) Understand Voice of the Customer –analyzing both structured and unstructured data to detect product problems and gauge customer satisfaction Government Intelligence –identifying suspicious activities and relationships, detecting threats to improve homeland security, and monitoring the Internet to uncover illegal activities –improving the reliability and supportability of a variety of military vehicles, weapons and components by converting unstructured data from service notes and repair logs into relational tables
12
Inxight http://www.inxight.com/ Founded in 1997 Spun out from Xerox PARC Based on 25+ years of research at Xerox PARC Able to “read” text in more than 30 languages Aims to take information search, retrieval and analysis to an entirely new level.
13
Components Federated & Desktop Search –Supports hundreds of high-value information sources through a single, user-friendly interface. –Search results are automatically clustered on the fly by extracting and analyzing the most relevant people, places and events. –Provides alerts on new information (be alerted when competitors' websites change, or monitor a single web page to track changes in a product’s price). –Supports different types of search functionality ("More Like This" searching). –Includes a Google Desktop search extender. Text Analysis –Extracts the "who," "what," "where" and "when" in each document (more than 35 types of information). –Automated entity, concept, event and relation extraction, categorization and summarization
14
Components Cont’d Data Cleansing –Human experts can review and clean the extracted data Visualization –Relationship: StarTree –Trend: TableLens –Timeline: TimeWall –Several demos: http://www.inxight.com/products/vizserver/
15
Examples Using Inxight Customers: More than 350 Global 2000 customers. Application areas: Financial Data Analysis, Crime Analysis, Pharmaceutical Research
16
Anderson Designed especially for analyzing customer behavior Market Research –Collecting external business information (from customers, competitors, and the market) –Qualitative (answers the “why”) vs Quantitative (answers the “how much/many”) –Hybrid Business Intelligence –Collecting and analyzing internal business information –Focus on business transactions and communications –Sales data, supply logs, financial records
17
ClearForest http://www.clearforest.com/ Tagging Engine –Information extraction –Document categorization Analytics –Improve Early Warning Visibility: Include text-based information to better assess and trigger organizational responses. –Discover Insights: Identify trends, patterns, and complex inter-document relationships within large text collections. –Create Links with Structured Data: Incorporation enhances quality of business intelligence by forging links not previously possible. –Become an Expert: Rapidly comprehend and synthesize complex issues before making key decisions. See the simple demo –Automatically identifies the people, companies, organizations, geographies and products on the web page
18
TextAnalyst Based on a semantic network –a list of the most important words from the text and the relations between them Functionalities –Textbase Navigation: concepts in the semantic network are linked to sentences, and sentences to documents. –Topic Structure: transforms the semantic network into a tree-like list of nested topics –Clustering: eliminates the weak links in the topic structure –Summarization: uses the semantic network to score sentences.
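The slide does not describe how TextAnalyst builds its semantic network, so the sketch below swaps in a simple word co-occurrence graph as a stand-in. Words that co-occur in a sentence are linked, weak links can be pruned (the clustering idea), and sentences are scored by the summed weights of their words (the summarization idea). The stopword list and all function names are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not TextAnalyst's).
STOPWORDS = {"the", "a", "an", "of", "is", "are", "and", "to", "in", "on", "it"}

def content_words(sentence):
    """Lowercased alphabetic words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]

def build_network(sentences):
    """Edges are unordered word pairs; weight = number of shared sentences."""
    edges = Counter()
    for s in sentences:
        for a, b in combinations(sorted(set(content_words(s))), 2):
            edges[(a, b)] += 1
    return edges

def prune(edges, min_weight):
    """Drop weak links, keeping only edges with weight >= min_weight."""
    return {e: w for e, w in edges.items() if w >= min_weight}

def node_weights(edges):
    """A word's weight is the total weight of the edges touching it."""
    weights = Counter()
    for (a, b), w in edges.items():
        weights[a] += w
        weights[b] += w
    return weights

def summarize(sentences, k=1):
    """Return the k sentences whose words carry the most network weight."""
    weights = node_weights(build_network(sentences))
    def score(s):
        return sum(weights[w] for w in set(content_words(s)))
    return sorted(sentences, key=score, reverse=True)[:k]

sentences = [
    "Text mining extracts patterns from text.",
    "Text mining supports business decisions.",
    "Cats sleep.",
]
network = build_network(sentences)
strong = prune(network, min_weight=2)
summary = summarize(sentences, k=2)
```

Sentences built from well-connected words rank highest, so the off-topic sentence drops out of the two-sentence summary.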
19
Linguamatics Interactive information extraction (I2E) –Powerful queries (John Smith is the chairman of which company?) –Graphical interface –Structured output –http://www.linguamatics.com/technology/ie/search_results.html Can take existing ontologies –Synonyms and Canonicalisation –Class information: providing sub- and super-classes (In the Life Science domain, relationships between protein families can point to potential relationships between specific proteins.) –Balancing precision and recall: by moving up/down the hierarchy
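The effect of moving up or down an ontology can be sketched with a toy class hierarchy. This is not I2E's query language or API. The tiny ontology, the synonym table and the function names are hypothetical, chosen only to show how a broader class retrieves more documents (higher recall) and a narrower one retrieves fewer (higher precision).

```python
# Hypothetical mini-ontology: class -> direct subclasses or members.
ONTOLOGY = {
    "kinase": ["tyrosine kinase", "serine/threonine kinase"],
    "tyrosine kinase": ["EGFR", "SRC"],
    "serine/threonine kinase": ["AKT1"],
}

# Synonyms mapped to one canonical name (canonicalisation).
SYNONYMS = {"ErbB1": "EGFR", "HER1": "EGFR", "c-Src": "SRC"}

def canonical(term):
    """Map a synonym to its canonical form; other terms pass through."""
    return SYNONYMS.get(term, term)

def descendants(term):
    """The term itself plus everything below it in the hierarchy."""
    found = {term}
    for child in ONTOLOGY.get(term, []):
        found |= descendants(child)
    return found

def search(docs, term):
    """Documents mentioning the term, a synonym, or any descendant."""
    wanted = descendants(canonical(term))
    return [d for d in docs
            if any(canonical(tok) in wanted for tok in d.split())]

docs = [
    "HER1 binds ligand",
    "SRC phosphorylates target",
    "AKT1 regulates growth",
    "water is wet",
]
narrow = search(docs, "EGFR")            # most specific query
middle = search(docs, "tyrosine kinase") # one level up
broad = search(docs, "kinase")           # top of the hierarchy
```

Moving the query up one class at a time widens the result set from one document to two to three. That widening is the precision/recall trade-off the slide refers to.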
20
Commonalities Information extraction is very important for commercial text mining systems. Both structured and unstructured data are considered and combined for analysis. Alerts are considered very important. Search and mining are highly integrated.
21
An IE Toolkit: GATE General Architecture for Text Engineering –Developed at the University of Sheffield since 1995 –More than 10 years old –Free open source software –Implemented in Java –Used in language analysis contexts including Information Extraction in English, Greek, Spanish, Swedish, German, Italian and French –Easily pluggable and used in a lot of other projects –Also provides an interface as a standalone application –Pretty slow and memory consuming
22
IE in GATE Named ANNIE: A Nearly-New Information Extraction System (Show the pdf file for some examples) Components: Tokeniser, Gazetteer, Sentence Splitter, Part of Speech Tagger, Semantic Tagger, Orthographic Coreference (OrthoMatcher), Pronominal Coreference
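The idea of a pipeline whose stages each add annotations to a shared document can be sketched in a few lines of Python. This is not GATE's Java API. The document structure, the annotation tuples and the feature names are assumptions for illustration, and only the first three listed components are imitated, each very naively.

```python
import re

def tokeniser(doc):
    """Add one Token annotation per word or punctuation mark."""
    for m in re.finditer(r"\w+|[^\w\s]", doc["text"]):
        doc["annotations"].append(
            ("Token", m.start(), m.end(), {"string": m.group()}))

def gazetteer(doc, lists):
    """Add a Lookup annotation for every entry of every word list."""
    for list_name, entries in lists.items():
        for entry in entries:
            pattern = r"\b" + re.escape(entry) + r"\b"
            for m in re.finditer(pattern, doc["text"]):
                doc["annotations"].append(
                    ("Lookup", m.start(), m.end(), {"list": list_name}))

def sentence_splitter(doc):
    """Split on ., ! or ? followed by whitespace or end of text.

    Naive on purpose: abbreviations such as "Dr." would be split wrongly.
    """
    start = 0
    for m in re.finditer(r"[.!?](?:\s+|$)", doc["text"]):
        doc["annotations"].append(("Sentence", start, m.start() + 1, {}))
        start = m.end()

def run_pipeline(text, lists):
    """Run the stages in the order listed on the slide."""
    doc = {"text": text, "annotations": []}
    tokeniser(doc)
    gazetteer(doc, lists)
    sentence_splitter(doc)
    return doc

def spans(doc, ann_type):
    """Text covered by each annotation of the given type, in stored order."""
    return [doc["text"][s:e] for t, s, e, _ in doc["annotations"]
            if t == ann_type]

doc = run_pipeline(
    "The rocket was fired on Tuesday. GATE was developed in Sheffield.",
    {"date": ["Tuesday"], "location": ["Sheffield"]},
)
```

Because every stage only appends to the annotation list, later stages such as a semantic tagger could read the Token and Lookup annotations without re-parsing the text.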
23
Thanks