Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert Presented By: Jake Happs, 4.11.01.



Overview: Reasons for Text Mining; Special Tasks in Mining Text; Disambiguating Proper Names; Application Types; Customer Intelligence.

Reasons for Text Mining: Corporate Knowledge “Ore”; Exploiting the Knowledge in Text; The Value of Mining Text; Typical Applications.

Corporate Knowledge “Ore”: Insurance claims; News articles; Web pages; Patent portfolios; Customer complaint letters; Contracts; Transcripts of phone calls with customers; Technical documents.

Exploiting Textual Knowledge: Knowledge Discovery; Knowledge Management.

Value of Text Mining: Rapid digestion of large corporate documents, faster than human knowledge brokers; Objective and customizable analysis; Automation of routine tasks.

Typical Applications: Summarizing documents; Monitoring relations among people, places, and organizations; Organizing documents by content; Organizing indices for search and retrieval; Retrieving documents by content.

Special Tasks in Mining Text: Interpreting Natural Language; Comparison with Data Mining; Extracting Terminology and Relations; Classifying Documents.

Interpreting Natural Language: Extracting terminology; Extracting relations; Summarizing documents; Extracting models.

Comparison of Procedures. Data Mining: Identify data sets; select features manually; prepare data; analyze distribution. Text Mining: Identify documents; extract features; select features by algorithm; prepare data; analyze distribution.
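
The extra text-mining steps (extract features, then select features by algorithm) can be illustrated with a minimal Python sketch. It is not taken from the presentation: the tokenizer, the document-frequency threshold `min_docs`, and the sample sentences are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def extract_features(text):
    """Feature extraction: turn raw text into a sparse term-count vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def select_features(vectors, min_docs=2):
    """Feature selection by algorithm (assumed rule): keep terms that
    occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for vec in vectors:
        doc_freq.update(vec.keys())  # count each term once per document
    return {term for term, n in doc_freq.items() if n >= min_docs}


# Hypothetical sample documents.
docs = [
    "The claim was denied after the accident.",
    "Customer filed a claim for water damage.",
    "The patent covers a water filter.",
]
vectors = [extract_features(d) for d in docs]
vocabulary = select_features(vectors)
```

Each vector stores only the terms that actually occur in its document, which is why text feature vectors are high-dimensional but sparsely populated.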

Terminology and Relations: What Terminology Is; Classes of Terms; Instances of Relations; Canonical Forms.

What Terminology Is: Function words; General-purpose content words and phrases; Technical content words and phrases; Relations.

Classes of Terminology: Proper names; Technical phrases; Abbreviations and acronyms.

Instances of Relations: Facts; Dates; Currency values; Percentages; Other measurements.

Canonical Forms: Numbers convert to normal form. Dates convert to normal form. Inflected forms convert to common form. Alternative names convert to explicit form.
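
As a rough illustration of canonical forms, the Python sketch below normalizes dates, numbers, and alternative names. The accepted date formats, the choice of ISO 8601 as the normal form, and the alias table are assumptions for this example, not details from the presentation; reducing inflected forms (stemming or lemmatization) is left out.

```python
from datetime import datetime

# Assumed input formats; "%m/%d/%Y" assumes US month/day ordering.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y")

# Hypothetical alias table mapping alternative names to an explicit form.
ALIASES = {
    "IBM": "International Business Machines",
    "Big Blue": "International Business Machines",
}


def canonical_date(text):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def canonical_number(text):
    """Strip thousands separators and return a float, or None on failure."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def canonical_name(text):
    """Map a known alternative name to its explicit form; else return it unchanged."""
    return ALIASES.get(text, text)
```

With these helpers, "April 11, 2001", "11 April 2001", and "4/11/2001" all map to the same string, so later counting and comparison steps treat them as one value.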

Classifying Documents: Hierarchical clustering; Binary relational clustering; Supervised learning.

Disambiguating Proper Names: Principles of Nominator Design; The Process in Nominator.

Principles of Nominator Design: Apply heuristics to strings, instead of interpreting semantics. The unit of context for extraction is a document. The unit of context for aggregation is a corpus. The heuristics represent English naming conventions.

Extracting Proper Names: Tokenize the words in a document. Build a list of candidate names in the document. Break candidates into smaller names. Group names into equivalence classes. Aggregate classes from multiple documents.

Candidate Names: Extract all sequences of capitalized tokens. Exclude adjectives of provenance (e.g. Mr., Dr.). Exclude certain non-name acronyms (e.g. M.D., Ph.D.). Include numerals, unless following a preposition, comma, date, or number. Ignore words in section titles. Exclude initial adverbs in sentences.
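
The first rule, extracting all sequences of capitalized tokens, can be sketched in Python as follows. This is an illustrative approximation, not Nominator's implementation: the tokenizer is an assumption, and none of the exclusion rules listed above are applied.

```python
import re


def candidate_names(text):
    """Collect maximal runs of adjacent capitalized word tokens."""
    # Assumed tokenizer: alphabetic words, digit runs, single punctuation marks.
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)
    candidates, current = [], []
    for token in tokens:
        if token[0].isupper():
            current.append(token)
        else:
            # Any non-capitalized token (word, number, punctuation) ends a run.
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


sample = "Yesterday the board of Acme Corp met Jane Smith in New York."
names = candidate_names(sample)
# names == ["Yesterday", "Acme Corp", "Jane Smith", "New York"]
```

The spurious candidate "Yesterday" shows why the later rules matter: capitalized words at the start of a sentence are not necessarily names.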

Splitting Candidates: Apply heuristics to conjunctions, prepositions, and possessives. Reconstruct shared words.

Building Equivalence Classes: Discard non-recurring initial words of sentences. Unify variants with heuristics. Pick a canonical name for each class. Categorize each class with heuristics. Map the canonical name to its variants. Map each variant to its canonical name.
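
A minimal Python sketch of grouping names into equivalence classes within one document follows. The unification heuristic used here (a name joins a class when all of its words appear in that class's longest variant) and the first-match tie-breaking are assumptions for illustration; the presentation does not specify Nominator's actual heuristics, and categorization is omitted.

```python
def build_classes(names):
    """Group name variants and build the two lookup maps."""
    classes = []  # each class is a list of variants, canonical (longest) first
    # Process longer names first so they become the canonical names.
    for name in sorted(set(names), key=lambda n: (-len(n.split()), n)):
        words = set(name.split())
        for variants in classes:
            if words <= set(variants[0].split()):
                variants.append(name)  # assumed rule: first matching class wins
                break
        else:
            classes.append([name])
    canonical_to_variants = {v[0]: v for v in classes}
    variant_to_canonical = {name: v[0] for v in classes for name in v}
    return canonical_to_variants, variant_to_canonical


# Hypothetical candidate names from a single document.
found = ["Jane Smith", "Smith", "Acme Corp", "Acme", "Jane Smith"]
canonical_to_variants, variant_to_canonical = build_classes(found)
```

Here `variant_to_canonical["Smith"]` resolves to "Jane Smith", and `canonical_to_variants["Acme Corp"]` lists both "Acme Corp" and "Acme".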

Aggregating Classes: Merge classes that share a variant in separate documents. Both the type and the spelling of the variant must agree. Replace uncertain categories with certain ones.
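
The aggregation rules can be sketched in Python as below. The data layout, the category labels, and the use of "UNKNOWN" for an uncertain category are assumptions. Treating an uncertain category as compatible with any certain one is also an assumption. This single-pass version does not merge two existing entries that a later class bridges.

```python
def aggregate(doc_classes):
    """Merge per-document classes into corpus-level classes.

    doc_classes: iterable of (category, variants) pairs, one per
    equivalence class found in any document.
    """
    merged = []
    for category, variants in doc_classes:
        variants = set(variants)
        target = None
        for entry in merged:
            same_type = (category == entry["category"]
                         or "UNKNOWN" in (category, entry["category"]))
            shares_spelling = bool(variants & entry["variants"])
            if same_type and shares_spelling:
                target = entry
                break
        if target is None:
            merged.append({"category": category, "variants": variants})
        else:
            target["variants"] |= variants
            # Replace an uncertain category with a certain one.
            if target["category"] == "UNKNOWN":
                target["category"] = category
    return merged


# Hypothetical classes collected from three documents.
corpus = aggregate([
    ("PERSON", {"Jane Smith", "Smith"}),
    ("UNKNOWN", {"Smith", "J. Smith"}),
    ("ORG", {"Smith"}),
])
```

In this example the first two classes merge into one PERSON class, while the ORG class stays separate because its type disagrees even though the spelling "Smith" matches.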

Application Types: Knowledge Discovery (Clustering); Information Distillation (Categorization).

Knowledge Discovery

Information Distillation

Customer Intelligence: Goals; Process.

Customer Intelligence Goals: What do customers want and need? What do customers think of the company?

Customer Intelligence Process: Start from a corpus of communications with customers. Cluster the documents to identify issues. Characterize the clusters to identify the conditions for problems. Assign new messages to appropriate clusters.
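
The last two steps, characterizing clusters and assigning new messages to them, might look like the following Python sketch. It assumes the clustering step has already produced labeled groups; the cluster labels, sample messages, and the cosine-similarity assignment rule are illustrative assumptions rather than the method described in the presentation.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Sparse bag-of-words vector for one message."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def centroid(texts):
    """Characterize a cluster by the summed term counts of its messages."""
    total = Counter()
    for text in texts:
        total.update(vectorize(text))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical clusters, as if produced by an earlier clustering step.
clusters = {
    "billing": ["invoice charged twice", "wrong amount on invoice"],
    "delivery": ["package arrived late", "late delivery of package"],
}
centroids = {label: centroid(texts) for label, texts in clusters.items()}


def assign(message):
    """Assign a new message to the cluster with the most similar centroid."""
    vec = vectorize(message)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

For example, `assign("my invoice shows the wrong amount")` returns "billing", and `centroids["delivery"].most_common(2)` gives a crude description of what that cluster is about.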

Summary: Reasons for Text Mining; Special Tasks in Mining Text; Disambiguating Proper Names; Customer Intelligence.

Exam Question #1: Name an example of each of the two main classes of applications of text mining. – Knowledge Discovery: Discovering a common customer complaint among much feedback. – Information Distillation: Filtering future comments into pre-defined categories.

Exam Question #2: How does the procedure for text mining differ from the procedure for data mining? – Text mining adds a feature extraction step. – It is not feasible to have humans select the features. – The feature vectors are high-dimensional and sparsely populated.

Exam Question #3: In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text? – The program does not perform in-depth syntactic or semantic analyses of texts.

Questions & Answers