Some Commercial Text Mining Systems Xuanhui Wang UIUC March 29th, 2007.

Why Text Mining?
A large portion of all available information today exists in the form of unstructured text (information overload).
  – Books, magazine articles, research papers, product manuals, memorandums, e-mails, and of course the Web all contain textual information in natural-language form.
A lot of critical information is in textual format.
  – The voice of the customer: customer e-mails, customer complaints
  – Product reviews
Thus, making correct decisions often requires analyzing large volumes of textual information: Business Intelligence

Text Mining (From Wikipedia)
Refers generally to the process of deriving high-quality information from text.
High-quality information is typically derived through the divining of patterns and trends through means such as statistical pattern learning.
Process
  – Structuring the input text
  – Deriving patterns within the structured data
  – Evaluating and interpreting the output
Tasks
  – Text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and entity relation modeling

Structuring the input text → Information Extraction
Named Entity recognition (NE)
  – Finds and classifies names, places, etc.
Coreference resolution (CO)
  – Identifies identity relations between entities.
Template Element construction (TE)
  – Adds descriptive information to NE results (using CO).
Template Relation construction (TR)
  – Finds relations between TE entities.
Scenario Template production (ST)
  – Fits TE and TR results into specified event scenarios.

Dummy Example
“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”
NE discovers that the entities present are the rocket, Tuesday, Dr. Head and We Build Rockets Inc.
CO discovers that “it” refers to the rocket.
TE discovers that the rocket is shiny red and that it is Head’s brainchild.
TR discovers that Dr. Head works for We Build Rockets Inc.
ST discovers that there was a rocket launching event in which the various entities were involved.
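The five stages can be walked through on the rocket example with a toy Python sketch. Everything here is an assumption made for illustration: the gazetteer, the regular expressions, and the function names are hand-written for this one passage and do not reflect how any of the systems discussed here works.

```python
import re

TEXT = ("The shiny red rocket was fired on Tuesday. "
        "It is the brainchild of Dr. Big Head. "
        "Dr. Head is a staff scientist at We Build Rockets Inc.")

# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {
    "rocket": "ARTIFACT",
    "Tuesday": "DATE",
    "Dr. Big Head": "PERSON",
    "Dr. Head": "PERSON",
    "We Build Rockets Inc.": "ORGANIZATION",
}

def split_sentences(text):
    # Split on a period followed by whitespace, except after the title "Dr".
    return re.split(r"(?<!Dr)\.\s+", text)

def find_entities(text):
    """NE: look up gazetteer entries that occur in the text."""
    return {name: kind for name, kind in GAZETTEER.items() if name in text}

def resolve_coreference(text, entities):
    """CO: link "It" to an artifact in the preceding sentence, and a short
    person name to a longer one ending in the same word."""
    links = {}
    sentences = split_sentences(text)
    for previous, current in zip(sentences, sentences[1:]):
        if current.startswith("It "):
            for name, kind in entities.items():
                if kind == "ARTIFACT" and name in previous:
                    links["It"] = name
    people = [name for name, kind in entities.items() if kind == "PERSON"]
    for short in people:
        for full in people:
            if len(full) > len(short) and full.split()[-1] == short.split()[-1]:
                links[short] = full
    return links

def build_elements(text, links):
    """TE: attach descriptive attributes to entities, using CO links."""
    elements = {}
    match = re.search(r"The ((?:\w+ )+)rocket", text)
    if match:
        elements.setdefault("rocket", {})["description"] = match.group(1).strip()
    match = re.search(r"It is the brainchild of (Dr\. [\w ]+?)\.", text)
    if match and "It" in links:
        elements.setdefault(links["It"], {})["brainchild_of"] = match.group(1)
    return elements

def build_relations(text, links):
    """TR: extract an employment relation between a person and an organization."""
    match = re.search(r"(Dr\. \w+) is a staff scientist at ([\w ]+ Inc\.)", text)
    if not match:
        return []
    person = links.get(match.group(1), match.group(1))
    return [(person, "works_for", match.group(2))]

def build_scenario(text, relations):
    """ST: fit the results into a hand-written launch-event template."""
    match = re.search(r"(\w+) was fired on (\w+)", text)
    if not match:
        return None
    return {"event": "launch", "object": match.group(1), "date": match.group(2),
            "participants": [person for person, _, _ in relations]}

def run_pipeline(text):
    entities = find_entities(text)
    links = resolve_coreference(text, entities)
    relations = build_relations(text, links)
    return {"entities": entities, "coreference": links,
            "elements": build_elements(text, links),
            "relations": relations,
            "scenario": build_scenario(text, relations)}
```

Calling `run_pipeline(TEXT)` returns one dictionary with an entry per stage, which makes it easy to see how each stage builds on the output of the earlier ones.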

Some Systems
Attensity
Inxight
Anderson
ClearForest
TextAnalyst
Linguamatics

Attensity
Founded in early 2000
Culmination of over a decade of research in computational linguistics at the University of Utah
The technology allows users to extract and analyze facts such as who, what, where, when and why
Allows users to drill down to understand people, places and events and how they are related
Produces output in XML and in a structured relational data format that is fused with existing structured data

Architecture

Attensity: Information Extraction Engine
The foundation of all the applications
Target extraction
  – When you know what you are looking for
  – Entity and event definitions
  – Creating rules and dictionaries specific to your particular domain
  – Graphical user interface that allows users to rapidly create definitions
Exhaustive extraction
  – When you are trying to understand what is in your text and you don't know exactly what you are looking for

Attensity: Applications
Discovery
  – Mining relations: uncover who, what, where, when, and why
Analytics
  – Supports users in drilling down
  – Visualization tools to slice, dice and analyze important facts
  – Aggregations of facts
Text search
  – Allows approximate matching of query words
  – Seamlessly combined with the text analysis
Classify
  – Enables users to define document groups
Alert
  – Provides timely visibility into frequent and emerging issues
  – Product problems trigger e-mails or notifications
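The idea of approximate matching of query words can be illustrated with Python's standard `difflib` module. This is only a sketch of the general technique, not Attensity's method; the vocabulary and the similarity cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical index vocabulary for a warranty-analysis collection.
VOCABULARY = ["warranty", "transmission", "brake", "engine", "customer"]

def approximate_match(query_word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return up to three vocabulary words whose similarity to the
    (lower-cased) query word is at least `cutoff`, best match first."""
    return difflib.get_close_matches(query_word.lower(), vocabulary,
                                     n=3, cutoff=cutoff)
```

A misspelled query such as `approximate_match("waranty")` can then be mapped to an indexed word before the search runs, while a word with no close neighbour yields an empty list.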

Examples Using Attensity
Attensity counts Global 2000 organizations and government agencies among its customers
Warranty Improvement
  – Reviewing warranty data contained in unstructured, text-based sources such as technician reports, customer surveys and dealer-provided information (to reduce warranty cost)
Understand Voice of the Customer
  – Using both structured and unstructured data to detect product problems and gauge customer satisfaction
Government Intelligence
  – Identifying suspicious activities and relationships, detecting threats to improve homeland security, and monitoring the Internet to uncover illegal activities
  – Improving the reliability and supportability of a variety of military vehicles, weapons and components by converting unstructured data from service notes and repair logs into relational tables

Inxight
Founded in 1997
Spun out from Xerox PARC
Based on 25+ years of research at Xerox PARC
Able to “read” text in more than 30 languages
Claims to take information search, retrieval and analysis to an entirely new level

Components
Federated & Desktop Search
  – Supports hundreds of high-value information sources through a single, user-friendly interface
  – Search results are automatically clustered on the fly by extracting and analyzing the most relevant people, places and events
  – Provides alerts on new information (be alerted when competitors' websites change; monitor a single web page to track changes in a product’s price)
  – Supports different types of search functionality (e.g., "More Like This" searching)
  – Has a Google Desktop Search extender
Text Analysis
  – Extracts the "who," "what," "where" and "when" in each document (more than 35 types of information)
  – Automated entity, concept, event and relation extraction, categorization and summarization

Components Cont’d
Data Cleansing
  – Human experts can review and clean the extracted data
Visualization
  – Relationship → StarTree
  – Trend → TableLens
  – Timeline → TimeWall
  – Several demos:

Examples Using Inxight
Customers: more than 350 Global 2000 customers
Financial Data Analysis
Crime Analysis
Pharmaceutical Research

Anderson
Designed especially for analyzing customer behavior
Market Research
  – Collecting external business information (from customers, competitors, and the market)
  – Qualitative (answers the “why”) vs. quantitative (answers the “how much/many”)
  – Hybrid
Business Intelligence
  – Collecting and analyzing internal business information
  – Focus on business transactions and communications
  – Sales data, supply logs, financial records

ClearForest
Tagging Engine
  – Information extraction
  – Document categorization
Analytics
  – Improve Early Warning Visibility: Include text-based information to better assess and trigger organizational responses.
  – Discover Insights: Identify trends, patterns, and complex inter-document relationships within large text collections.
  – Create Links with Structured Data: Incorporation enhances the quality of business intelligence by forging links not previously possible.
  – Become an Expert: Rapidly comprehend and synthesize complex issues before making key decisions.
See the simple demo
  – Automatically identifies the people, companies, organizations, geographies and products on a web page

TextAnalyst
Based on a semantic network
  – A list of the most important words in the text and the relations between them
Functionalities
  – Textbase Navigation: concepts in the semantic network are connected to sentences, and from there to documents.
  – Topic Structure: transforms the semantic network into a tree-like list of nested topics.
  – Clustering: eliminates weak links in the topic structure.
  – Summarization: uses the semantic network to score sentences.
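A rough sketch of network-based sentence scoring follows. It assumes the semantic network can be approximated by word co-occurrence within sentences, which is a simplification introduced here; the slides do not describe TextAnalyst's actual algorithm, and the stopword list, weights and threshold are all illustrative.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on", "for", "from"}

def tokenize(sentence):
    """Lower-case the sentence and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def build_network(sentences):
    """Return (word weights, link weights): a word's weight is the number of
    sentences containing it; a link's weight is the number of sentences in
    which both words co-occur."""
    word_weight, link_weight = Counter(), Counter()
    for sentence in sentences:
        words = sorted(set(tokenize(sentence)))
        word_weight.update(words)
        link_weight.update(combinations(words, 2))
    return word_weight, link_weight

def prune_weak_links(link_weight, min_weight=2):
    """Clustering step: drop links seen in fewer than `min_weight` sentences."""
    return {pair: w for pair, w in link_weight.items() if w >= min_weight}

def summarize(sentences, k=1):
    """Score each sentence by the average weight of its words and return
    the top k sentences in their original order."""
    word_weight, _ = build_network(sentences)
    scores = []
    for sentence in sentences:
        words = set(tokenize(sentence))
        scores.append(sum(word_weight[w] for w in words) / max(len(words), 1))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Sentences built from words that recur across the text score higher, so off-topic sentences drop out of the summary first.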

Linguamatics
Interactive information extraction (I2E)
  – Powerful queries (John Smith is the chairman of which company?)
  – Graphical interface
  – Structured output
Can take existing ontologies
  – Synonyms and canonicalisation
  – Class information: providing sub- and super-classes (in the life science domain, relationships between protein families can point to potential relationships between specific proteins)
  – Balancing precision and recall: by moving up/down the hierarchy
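How moving up or down a class hierarchy trades precision against recall can be shown with a small Python sketch. The taxonomy, document annotations, and function names below are hypothetical and are not taken from I2E.

```python
# Hypothetical child -> parent taxonomy.
PARENT = {
    "MAPK1": "protein kinase",
    "CDK2": "protein kinase",
    "protein kinase": "enzyme",
    "lipase": "enzyme",
    "enzyme": "protein",
}

# Hypothetical documents, each annotated with the terms it mentions.
DOCUMENTS = {
    "d1": {"MAPK1"},
    "d2": {"CDK2"},
    "d3": {"lipase"},
    "d4": {"protein kinase"},
}

def descendants(term):
    """Return the term together with everything below it in the taxonomy."""
    result, frontier = {term}, [term]
    while frontier:
        current = frontier.pop()
        for child, parent in PARENT.items():
            if parent == current and child not in result:
                result.add(child)
                frontier.append(child)
    return result

def search(term, documents=DOCUMENTS):
    """Retrieve documents mentioning the term or any of its sub-classes."""
    terms = descendants(term)
    return [doc_id for doc_id, mentions in documents.items() if terms & mentions]

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If the kinase documents (d1, d2, d4) are the relevant ones, querying the leaf term "MAPK1" is precise but misses most of them, "protein kinase" retrieves exactly the relevant set, and stepping up to "enzyme" keeps full recall while pulling in the unrelated lipase document.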

Commonalities
Information extraction is very important for commercial text mining systems
They consider and combine both structured and unstructured data for analysis
Alerts are considered very important
Search and mining are highly integrated

An IE Toolkit: GATE
General Architecture for Text Engineering
  – University of Sheffield, since 1995
  – More than 10 years old
  – Free open source software
  – Implemented in Java
  – Used in language analysis contexts including Information Extraction in English, Greek, Spanish, Swedish, German, Italian and French
  – Easily pluggable and used in a lot of other projects
  – Provides an interface as a standalone application
  – Pretty slow and memory consuming

IE in GATE
Named ANNIE: a Nearly-New Information Extraction System (show the PDF file for some examples)
Tokeniser
Gazetteer
Sentence Splitter
Part of Speech Tagger
Semantic Tagger
Orthographic Coreference (OrthoMatcher)
Pronominal Coreference
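The component list reads as a pipeline in which each stage adds annotations to a shared document. The Python sketch below mimics that architecture for the first three stages only; it is not GATE's Java API, and the classes, the gazetteer entries, and the annotation and feature names are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str                      # e.g. "Token", "Lookup", "Sentence"
    start: int                     # character offset where the span begins
    end: int                       # character offset just past the span
    features: dict = field(default_factory=dict)

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def covered(self, annotation):
        """Return the text span an annotation refers to."""
        return self.text[annotation.start:annotation.end]

def tokeniser(doc):
    """Annotate every word and every punctuation mark as a token."""
    for m in re.finditer(r"\w+|[^\w\s]", doc.text):
        doc.annotations.append(Annotation("Token", m.start(), m.end()))

# Hypothetical gazetteer list: entry -> major type.
GAZETTEER = {"Sheffield": "location", "Tuesday": "date"}

def gazetteer(doc):
    """Annotate every occurrence of a gazetteer entry."""
    for entry, major_type in GAZETTEER.items():
        for m in re.finditer(re.escape(entry), doc.text):
            doc.annotations.append(
                Annotation("Lookup", m.start(), m.end(), {"majorType": major_type}))

def sentence_splitter(doc):
    """Annotate sentences, ending each at '.', '!' or '?' followed by
    whitespace or the end of the text."""
    start = 0
    for m in re.finditer(r"[.!?](?:\s+|$)", doc.text):
        doc.annotations.append(Annotation("Sentence", start, m.start() + 1))
        start = m.end()
    if start < len(doc.text):
        doc.annotations.append(Annotation("Sentence", start, len(doc.text)))

# Later stages (POS tagger, semantic tagger, coreference) are omitted here.
PIPELINE = [tokeniser, gazetteer, sentence_splitter]

def run(text):
    doc = Document(text)
    for stage in PIPELINE:
        stage(doc)
    return doc
```

Because every stage only reads the text and appends to the same annotation list, stages can be added, removed or reordered without changing the others, which is the pluggability the slide refers to.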

Thanks