Practical Text Mining Ronen Feldman Information Systems Department

Presentation transcript:

Practical Text Mining Ronen Feldman Information Systems Department School of Business Administration Hebrew University, Jerusalem, ISRAEL Ronen.Feldman@huji.ac.il

Background Rapid proliferation of information available in digital format People have less time to absorb more information

The Information Landscape. Unstructured (textual): 80%. Structured (databases): 20%. Problem: lack of tools to handle unstructured data.

TM != Search. Search: find documents matching the query; the result is long lists of documents, with the actual information buried inside the documents. Text mining: display information relevant to the query; extract information from within the documents and aggregate over the entire collection.

Seeing the Forest for the Trees. Text Mining. Input: Documents. Output: Patterns, Connections, Profiles, Trends.

Let Text Mining Do the Legwork for You Find Material Read Understand Consolidate Absorb / Act

What Is Unique in Text Mining? Feature extraction. Very large number of features that represent each of the documents. The need for background knowledge. Even patterns supported by a small number of documents may be significant. Huge number of patterns, hence the need for visualization and interactive exploration.

Document Types. Structured documents: output from CGI. Semi-structured documents: seminar announcements, job listings, ads. Free format documents: news, scientific papers.

Text Representations Character Trigrams Words Linguistic Phrases Non-consecutive phrases Frames Scripts Role annotation Parse trees
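The two simplest representations on this slide, character trigrams and words, can be sketched in a few lines. This is a minimal illustration only: the function names `char_trigrams` and `word_tokens`, the lowercasing, and the punctuation handling are assumptions made for the example, not part of the original deck.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def char_trigrams(text: str) -> Counter:
    """Count overlapping character trigrams after lowercasing and
    collapsing runs of whitespace (both are illustrative choices)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))

def word_tokens(text: str) -> list[str]:
    """Split on whitespace and strip surrounding punctuation."""
    stripped = (word.strip(PUNCTUATION) for word in text.lower().split())
    return [word for word in stripped if word]

trigrams = char_trigrams("Text mining")
words = word_tokens("Text mining finds patterns.")
```

Richer representations from the slide (linguistic phrases, frames, parse trees) need a linguistic analyzer and are not covered by this sketch.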

General Architecture (diagram). Components shown: Web Crawlers (Agents); File Based Connector; RDBMS Connector; Programmatic API (SOAP web service); Tagging Platform with Entity, fact & event extraction, Categorizer, Headline Generation, Language ID; Tags API; Control API; Console; ANS collection; Output API; Output to RDBMS and XML/Other DB; Analytic Server; Search Index; DB; Analytics; Enterprise Client to ANS. Speaker note: talk about NASD, MOD and Kodak.

The Language Analysis Stack Events & Facts Domain Specific Entities Candidates, Resolution, Normalization Basic NLP Noun Groups, Verb Groups, Numbers Phrases, Abbreviations Metadata Analysis Title, Date, Body, Paragraph Sentence Marking Language Specific Morphological Analyzer POS Tagging (per word) Stem, Tense, Aspect, Singular/Plural Gender, Prefix/Suffix Separation Tokenization

Components of IE System

Intelligent Auto-Tagging
Source text: (c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson ……. The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States. "The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited." ……
Extracted tags:
<Facility>Finsbury Park Mosque</Facility>
<Country>England</Country>
<Country>France</Country>
<Country>England</Country>
<Country>Belgium</Country>
<Country>United States</Country>
<Person>Abu Hamza al-Masri</Person>
<PersonPositionOrganization> <OFFLEN OFFSET="3576" LENGTH="33" /> <Person>Abu Hamza al-Masri</Person> <Position>chief cleric</Position> <Organization>Finsbury Park Mosque</Organization> </PersonPositionOrganization>
<City>London</City>
<PersonArrest> <OFFLEN OFFSET="3814" LENGTH="61" /> <Person>Abu Hamza al-Masri</Person> <Location>London</Location> <Date>1999</Date> <Reason>his alleged involvement in a Yemen bomb plot</Reason> </PersonArrest>

Business Tagging Example
Source text: SAP Acquires Virsa for Compliance Capabilities. By Renee Boucher Ferguson, April 3, 2006. Honing its software compliance skills, SAP announced April 3 the acquisition of Virsa Systems, a privately held company that develops risk management software. Terms of the deal were not disclosed. SAP has been strengthening its ties with Microsoft over the past year or so. The two software giants are working on a joint development project, Mendocino, which will integrate some MySAP ERP (enterprise resource planning) business processes with Microsoft Outlook. The first product is expected in 2007. "Companies are looking to adopt an integrated view of governance, risk and compliance instead of the current reactive and fragmented approach," said Shai Agassi, president of the Product and Technology Group and executive board member of SAP, in a statement. "We welcome Virsa employees, partners and customers to the SAP family."
Extracted tags:
<Topic>BusinessNews</Topic>
<Acquisition offset="494" length="130"> <Company_Acquirer>SAP</Company_Acquirer> <Company_Acquired>Virsa Systems</Company_Acquired> <Status>known</Status> </Acquisition>
<Company>SAP</Company>
<Company>Virsa Systems</Company>
<IndustryTerm>risk management software</IndustryTerm>
<Company>Microsoft</Company>
<Product>MySAP ERP</Product>
<Product>Microsoft Outlook</Product>
<Person>Shai Agassi</Person>
<PersonProfessional offset="2789" length="92"> <Position>president of the Product and Technology Group and executive board member</Position> </PersonProfessional>

Professional: Name: Shai Agassi; Company: SAP; Position: President of the Product and Technology Group and executive board member
Acquisition: Acquirer: SAP; Acquired: Virsa Systems
Company: SAP
Person: Shai Agassi
Company: Virsa Systems
IndustryTerm: risk management software
Company: Microsoft
Product: MySAP ERP
Product: Microsoft Outlook
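To illustrate how tagged output of this kind could be turned into structured records, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `<Annotations>` wrapper element and the `to_records` helper are assumptions introduced for the example; the deck does not show how the tagging platform actually packages or exposes its output.

```python
import xml.etree.ElementTree as ET

# A few tags from the business example, wrapped in a hypothetical
# <Annotations> root so the fragment parses as one XML document.
tagged = """<Annotations>
  <Company>SAP</Company>
  <Company>Virsa Systems</Company>
  <Person>Shai Agassi</Person>
  <Acquisition offset="494" length="130">
    <Company_Acquirer>SAP</Company_Acquirer>
    <Company_Acquired>Virsa Systems </Company_Acquired>
    <Status>known</Status>
  </Acquisition>
</Annotations>"""

def to_records(xml_text):
    """Return (tag, value) pairs: a string for simple entity tags,
    a dict of child fields for nested fact/event tags."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root:
        children = list(element)
        if children:
            fields = {child.tag: (child.text or "").strip() for child in children}
            records.append((element.tag, fields))
        else:
            records.append((element.tag, (element.text or "").strip()))
    return records

records = to_records(tagged)
```

Stripping whitespace while reading the fields also normalizes values such as the trailing space in the acquired-company tag.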

Leveraging Content Investment. Text mining over: Any type of content: unstructured textual content (current focus); structured data, audio, video (future). In any format: documents, PDFs, e-mails, articles, etc.; "raw" or categorized; formal, informal, or a combination. From any source: WWW, file systems, news feeds, etc.; single source or combined sources.

Link Analysis in Textual Networks

Running Example

Kamada and Kawai’s (KK) Method

Finding the shortest Path (from Atta)
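The figure for this slide did not survive, so the following is only a sketch of the general idea: in an unweighted network of co-occurring entities, a shortest path from a start node can be found with breadth-first search. The graph below is entirely hypothetical (placeholder node names apart from the start node named on the slide), and `shortest_path` is an illustrative helper, not the method used in the deck.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency-list graph.
    Returns the list of nodes on one shortest path, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                if neighbor == goal:
                    # Walk the predecessor links back to the start.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None

# Hypothetical link network; edges stand for co-occurrence in documents.
links = {
    "Atta": ["Entity B", "Entity C"],
    "Entity B": ["Atta", "Entity D"],
    "Entity C": ["Atta"],
    "Entity D": ["Entity B", "Entity E"],
    "Entity E": ["Entity D"],
}
path = shortest_path(links, "Atta", "Entity E")
```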

A better Visualization

Summary Diagram

Information Extraction Theory and Practice

What is Information Extraction? IE does not indicate which documents a user needs to read; rather, it extracts pieces of information that are salient to the user's needs. Links between the extracted information and the original documents are maintained to allow the user to reference context. The kinds of information that systems extract vary in detail and reliability. Named entities such as persons and organizations can be extracted with accuracy of roughly 90% or higher, but named entities alone do not provide the attributes, facts, or events that those entities have or participate in.

Relevant IE Definitions Entity: an object of interest such as a person or organization. Attribute: a property of an entity such as its name, alias, descriptor, or type. Fact: a relationship held between two or more entities such as Position of a Person in a Company. Event: an activity involving several entities such as a terrorist act, airline crash, management change, new product introduction.
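The four definitions map naturally onto a small data model. The sketch below is one possible encoding, assuming Python dataclasses; the class and field names are illustrative choices, and the sample values are taken from the business tagging example earlier in the deck (treating the acquisition as an event is an assumption made for the example).

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """An object of interest; attributes hold properties such as alias or descriptor."""
    name: str
    type: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Fact:
    """A relationship held between two or more entities."""
    relation: str
    arguments: dict  # role name -> Entity or literal value

@dataclass
class Event:
    """An activity involving several entities."""
    kind: str
    participants: dict  # role name -> Entity

agassi = Entity("Shai Agassi", "Person")
sap = Entity("SAP", "Company")
virsa = Entity("Virsa Systems", "Company", {"descriptor": "privately held company"})

position = Fact(
    "PersonPositionCompany",
    {"person": agassi, "company": sap,
     "position": "president of the Product and Technology Group"},
)
acquisition = Event("Acquisition", {"acquirer": sap, "acquired": virsa})
```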

IE Accuracy by Information Type. Entities: 90-98%. Attributes: 80%. Facts: 60-70%. Events: 50-60%.

MUC Conferences
Conference  Year  Topic
MUC 1       1987  Naval Operations
MUC 2       1989  Naval Operations
MUC 3       1991  Terrorist Activity
MUC 4       1992  Terrorist Activity
MUC 5       1993  Joint Venture and Micro Electronics
MUC 6       1995  Management Changes
MUC 7       1997  Space Vehicles and Missile Launches

Applications of Information Extraction. Routing of information. Infrastructure for IR and for categorization (higher level features). Event-based summarization. Automatic creation of databases and knowledge bases.

Approaches for Building IE Systems: Knowledge Engineering Approach. Rules are crafted by linguists in cooperation with domain experts. Most of the work is done by inspecting a set of relevant documents. Fine-tuning the rule set can take a lot of time. The best results were achieved with KB-based IE systems. Skilled/gifted developers are needed. A strong development environment is a MUST!
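As a rough illustration of what a hand-crafted rule looks like, the sketch below encodes a single acquisition pattern as a regular expression and applies it to the sentence from the business tagging example. The pattern, the `extract_acquisition` helper, and the simplistic notion of a company name (a run of capitalized words) are assumptions for the example; real rule sets in a knowledge engineering system are far larger and are written in dedicated rule languages.

```python
import re

# Hypothetical rule: "<Acquirer> announced [<Month> <day>] the acquisition of <Acquired>"
NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"
ACQUISITION_RULE = re.compile(
    rf"(?P<acquirer>{NAME}) announced (?:[A-Z][a-z]+ \d{{1,2}} )?"
    rf"the acquisition of (?P<acquired>{NAME})"
)

def extract_acquisition(sentence):
    """Return acquirer/acquired names if the rule fires, else None."""
    match = ACQUISITION_RULE.search(sentence)
    if match is None:
        return None
    return {"acquirer": match.group("acquirer"), "acquired": match.group("acquired")}

sentence = (
    "Honing its software compliance skills, SAP announced April 3 the "
    "acquisition of Virsa Systems, a privately held company that develops "
    "risk management software."
)
result = extract_acquisition(sentence)
```

A single rule like this misses paraphrases such as "X acquires Y" or "Y was bought by X", which is why tuning a rule set against a set of relevant documents takes so much time.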

Approaches for Building IE Systems: Automatically Trainable Systems. The techniques are based on pure statistics and almost no linguistic knowledge. They are language independent. The main input is an annotated corpus. Building the rules needs relatively little effort; however, creating the annotated corpus is extremely laborious. A huge number of training examples is needed in order to achieve reasonable accuracy. Hybrid approaches can utilize the user input in the development loop.

Sentiment Analysis from User Forums Ronen Feldman Information Systems Department School of Business Administration Hebrew University, Jerusalem, ISRAEL Ronen.Feldman@huji.ac.il

Research Objective. Can we use the Web as a marketing research playground? Uncovering market structure from information consumers are posting on the web. An example of the rapidly growing area of sentiment mining.

What are we going to do? Text mine consumer postings. Use a network analysis framework and other methods of analysis to reveal the underlying market structure.

Example Applications. Three applications: running shoes (“professionals” community); sedan cars (mature and common market); iPhone (innovation, pre-during-after launch).

Text Examples
"I have some experience with the Burn, and I race in the T4 Racers. Both of them have a narrow heel and are slightly wider in the forefoot. The most noticeable thing about the T4s (for most people) is the arch. It has never bothered me, but some people are really annoyed by it. You can cut the arch out of the insole if it bothers you. The Burn's arch is not as pronounced."
"Honda Accords and Toyota Camrys are nice sedans, but hardly the best car on the road (for many people). It's just that they are very compentant in their price range. So, a love fest of the best selling may not tell you what is "best". That depends very much on what is important to you. A car could have a quirk, that you would just love, but not be popular to many people. Thus, the best car for you might not sell many. If you are looking for resale value, then it might be a factor."

The Car Models Network

MDS of Brands Lift
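The slide does not define lift, so the sketch below assumes the standard co-occurrence definition: lift(A, B) = P(A and B) / (P(A) P(B)), estimated from the share of posts that mention each brand. A lift above 1 means two brands are mentioned together more often than chance would predict. The posts, the brand list, and the `brand_lift` helper are hypothetical illustrations, and the naive substring matching stands in for the entity extraction step.

```python
from collections import Counter
from itertools import combinations

def brand_lift(posts, brands):
    """Estimate pairwise lift from post-level co-mentions.
    Returns {(brand_a, brand_b): lift} for pairs that co-occur at least once."""
    n_posts = len(posts)
    single = Counter()
    pair = Counter()
    for post in posts:
        text = post.lower()
        mentioned = sorted(b for b in brands if b.lower() in text)
        single.update(mentioned)
        pair.update(combinations(mentioned, 2))
    return {
        (a, b): n_posts * count / (single[a] * single[b])
        for (a, b), count in pair.items()
    }

# Hypothetical forum posts, for illustration only.
posts = [
    "Honda and Toyota are both reliable.",
    "I test drove a Honda and a Toyota.",
    "My BMW is fun to drive.",
    "Thinking about a BMW or a Honda.",
]
lift = brand_lift(posts, ["Honda", "Toyota", "BMW"])
```

A matrix of such lift values can be converted to dissimilarities and passed to a multidimensional scaling routine to obtain a two-dimensional brand map.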

Model-Term Analysis – 2 Mode Network

Most Stolen Cars Analysis
The National Insurance Crime Bureau (NICB®) has compiled a list of the 10 vehicles most frequently reported stolen in the U.S. in 2005: 1) 1991 Honda Accord; 2) 1995 Honda Civic; 3) 1989 Toyota Camry; 4) 1994 Dodge Caravan; 5) 1994 Nissan Sentra; 6) 1997 Ford F150 Series; 7) 1990 Acura Integra; 8) 1986 Toyota Pickup; 9) 1993 Saturn SL; 10) 2004 Dodge Ram Pickup.
Top 10 cars mentioned with "stealing" phrases in our data ("Stolen", "Steal", "Theft"): 1) Honda Accord (165); 2) Honda Civic (101); 3) Toyota Camry (71); 4) Nissan Maxima (69); 5) Acura TL (58); 6) Infiniti G35 (44); 7) BMW 3-Series (40); 8) Hyundai Sonata (26); 9) Nissan Altima (25); 10) Volkswagen Passat (23).
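A ranking of this kind can be approximated by counting posts that mention both a car model and one of the listed phrases. The sketch below assumes simple whole-word matching on "stolen", "steal", and "theft" and naive substring matching for model names; the posts and the `theft_mentions` helper are hypothetical, and the deck does not describe the exact matching procedure used.

```python
import re
from collections import Counter

# The three phrases named on the slide, matched as whole words.
THEFT_PATTERN = re.compile(r"\b(stolen|steal|theft)\b", re.IGNORECASE)

def theft_mentions(posts, models):
    """Count, per model, the posts that mention it together with a theft phrase."""
    counts = Counter()
    for post in posts:
        if not THEFT_PATTERN.search(post):
            continue
        lowered = post.lower()
        for model in models:
            if model.lower() in lowered:
                counts[model] += 1
    return counts

# Hypothetical forum posts, for illustration only.
posts = [
    "My Honda Accord was stolen last year.",
    "Honda Accord theft rates worry me, unlike my Toyota Camry.",
    "The Toyota Camry rides smoothly.",
]
counts = theft_mentions(posts, ["Honda Accord", "Toyota Camry"])
```

Note that whole-word matching does not catch inflections such as "stealing"; a production version would need stemming or a wider phrase list.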