Text mining michel.bruley@teradata.com Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …


Information context
- A large amount of information is available in textual form, in databases and online sources.
- At this scale, manual analysis and effective extraction of useful information are not possible.
- Automatic tools are therefore needed to analyze large text collections.

Text mining definition
The objective of text mining is to exploit the information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.
The results can be important both for:
- the analysis of the collection, and
- providing intelligent navigation and browsing methods.

Text mining pipeline
Diagram labels: unstructured text (implicit knowledge); information retrieval; information extraction; knowledge discovery; semantic metadata; structured content (explicit knowledge); semantic search / data mining.

Text mining process (an iterative and interactive process)
- Text preprocessing: syntactic/semantic text analysis
- Features generation: bag of words
- Features selection: simple counting, statistics
- Text/data mining: classification (supervised learning), clustering (unsupervised learning)
- Analyzing results: mapping/visualization, result interpretation
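The features generation and features selection steps can be illustrated with a small sketch. It is not taken from the presentation: the tokenizer, the function names, and the `min_doc_count` threshold are assumptions chosen for illustration, and "simple counting" is interpreted here as keeping the words that occur in at least a given number of documents.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Features generation: map a document to its word counts."""
    return Counter(tokenize(text))


def select_features(bags, min_doc_count=2):
    """Features selection by simple counting (assumed rule): keep the
    words that occur in at least `min_doc_count` documents."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return sorted(w for w, n in doc_freq.items() if n >= min_doc_count)


def vectorize(bag, vocabulary):
    """Turn a bag of words into a count vector over the vocabulary."""
    return [bag[w] for w in vocabulary]


docs = [
    "Text mining finds patterns in text",
    "Data mining finds patterns in data",
    "Clustering groups similar documents",
]
bags = [bag_of_words(d) for d in docs]
vocabulary = select_features(bags)
vectors = [vectorize(b, vocabulary) for b in bags]
print(vocabulary)
print(vectors)
```

The resulting vectors are the kind of structured input that a classification or clustering step would then consume.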

Text mining actors: publishers, analysts, libraries
Publishers: enriched content
- Annotation tools
- Tools for authors
- New applications based on annotation layers
- Richer cross linking based on content…
Analysts: empowers them
- Annotating research output
- Hypothesis generation
- Summarisation of findings
- Focused semantic search…
Libraries
- Linking between institutional repositories
- Access to richer metadata
- Aggregation
- Aids to subject analysis/classification …

Challenges in text mining
- The data collection is "free text" and is not well organized (semi-structured or unstructured).
- There is no uniform access over all sources; each source has separate storage and algebra. Examples: email, databases, applications, web.
- A quintuple heterogeneity: semantic, linguistic, structure, format, size of unit information.
- Learning techniques for processing text typically need annotated training data.
- XML as the common model. It allows:
  - manipulating data with standards;
  - mining that becomes more like data mining.
- RDF is emerging as a complementary model.
- The more structure you can explore, the better you can do mining.

Data source administration
Diagram labels: Intranet; Internet; on-line databank; information provider; file system; databases; EDMS; web crawling; format filter; XML normalisation (subject, author, text corpora, keywords).
Input data system: this part of the system handles the collection of the data.
- Getting data from the internet with a crawler
- Getting data from online vendors
- Getting data from the internal data banks
Regarding the input format (physical and logical), data are physically reformatted into HTML and then loaded into an XML format.
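As a rough sketch of the XML normalisation step, the snippet below wraps one collected document in an XML record carrying the four fields named on the slide (subject, author, text, keywords). The element names and the `normalise` function are hypothetical; the presentation does not specify the actual schema.

```python
import xml.etree.ElementTree as ET


def normalise(subject, author, text, keywords):
    """Build an XML record for one document (element names are assumed)."""
    doc = ET.Element("document")
    ET.SubElement(doc, "subject").text = subject
    ET.SubElement(doc, "author").text = author
    ET.SubElement(doc, "text").text = text
    keyword_list = ET.SubElement(doc, "keywords")
    for keyword in keywords:
        ET.SubElement(keyword_list, "keyword").text = keyword
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(doc, encoding="unicode")


record = normalise(
    subject="Text mining",
    author="Unknown",
    text="Free text collected by the crawler.",
    keywords=["text mining", "XML"],
)
print(record)
```

Once every source is mapped to the same record shape, downstream tools can query documents uniformly regardless of where they were collected.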

Text mining tasks
Diagram labels:
- Text analysis tools: feature extraction, categorization, summarization, clustering
- Feature extraction: name extraction, term extraction, abbreviation extraction, relationship extraction
- Clustering: hierarchical clustering, binary relational clustering
- Web searching: text search engine, NetQuestion Solution, web crawler
1. Feature extraction tools: recognize significant vocabulary items in documents and measure their importance to the document content.
2. Clustering tools: clustering is used to segment a document collection into subsets, called clusters.
3. Summarization tool: summarization is the process of condensing a source text into a shorter version that preserves its information content.
4. Categorization tool: categorization is used to assign objects to predefined categories, or classes from a taxonomy.
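The slide does not say how a feature extraction tool measures the importance of a vocabulary item. One common choice, used here purely as an assumption, is TF-IDF weighting: a term scores high when it is frequent in one document and rare in the rest of the collection. The function names below are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    bags = [Counter(tokenize(d)) for d in docs]
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


def top_terms(doc_weights, k=3):
    """The k highest-weighted terms of one document (ties broken by name)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    "the claim was denied",
    "the claim was approved",
    "the policy was renewed",
]
weights = tf_idf(docs)
print(top_terms(weights[0], k=1))
```

Words that appear in every document, such as "the" in this toy collection, get a weight of zero, which is why they drop out of the top terms without any stop-word list.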

Information extraction
Diagram labels: link analysis; query log analysis; metadata extraction; keyword ranking; intelligent match; duplicate elimination.
- Extract domain-specific information from natural language text
- Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
  - Constructed by hand
  - Automatically learned from hand-annotated training data
- Need a semantic lexicon (dictionary of words with semantic category labels)
  - Typically constructed by hand
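A hand-built dictionary of extraction patterns can be sketched with regular expressions. In this illustration the slot `<x>` is approximated as a run of capitalized words, which is a simplifying assumption, and the pattern names and the `extract` function are hypothetical.

```python
import re

# The slot <x> from the slide, approximated as one or more capitalized words.
SLOT = r"(?P<x>[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)"

# Hand-constructed dictionary: pattern name -> compiled extraction pattern.
PATTERNS = {
    "traveled_to": re.compile(r"\btraveled to " + SLOT),
    "president_of": re.compile(r"\bpresidents? of " + SLOT),
}


def extract(text):
    """Return (pattern name, slot filler) pairs found in the text."""
    results = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((name, match.group("x")))
    return results


text = "The minister traveled to New Zealand. She met the president of France there."
print(extract(text))
```

A semantic lexicon would add a further check on the slot filler, for example that the filler of `traveled_to` is labeled as a location.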

Document collections treatment Categorization Clustering

Text Mining example: Obama vs. McCain http://services.alphaworks.ibm.com/manyeyes/view/SWhH8QsOtha6qL3F~y5HQ2~

Aster Data position for Text Analysis
- Data acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)
- Pre-processing: perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)
- Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)
- Analytic applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)
Diagram labels: Aster Data fit; third-party tools fit.
Aster Data value: massive scalability of text storage and processing, functions for text processing, flexibility to develop diverse custom analytics and incorporate third-party libraries.

Aster Data Value for Text Analytics
- Ability to store and process massive volumes of text data
  - Massively parallel data stores and massively parallel analytics engine
  - SQL-MapReduce framework enables in-database processing for specialized text analytics tools
- Tools and extensibility for processing diverse text data
  - SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
  - Pre-built functions for text processing
- Flexible platform for building and processing diverse analytics
  - SQL-MapReduce framework enables creation of flexible, reusable analytics
  - Embedded MapReduce processing engine for high-performance analytics

Aster Data Capabilities for Text Data
Pre-built SQL-MapReduce functions for text processing
- Data transformation utilities
  - Pack: compress multi-column data into a single column
  - Unpack: extract nested data for further analysis
- Web log analysis
  - Sessionization: identify unique browsing sessions in clickstream data
- Text analysis
  - Text parser: general tool for tokenizing, stemming, and counting text data
  - nGram: split text into component parts (words & phrases)
  - Levenshtein distance: compute “distance” between words
Diagram labels: custom and packaged analytics; Aster Data nCluster; App (×6); Aster Data Analytic Foundation; SQL; SQL-MapReduce; data.
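To make three of these functions concrete, here is a plain Python sketch of what n-gram splitting, Levenshtein distance, and sessionization compute. It is not the Aster Data SQL-MapReduce implementation or its API: the function names, the `(user, timestamp)` input shape, and the 30-minute session timeout are all assumptions.

```python
def ngrams(text, n):
    """Split text into word n-grams (n=1 gives words, n>1 gives phrases)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def sessionize(clicks, timeout=1800):
    """Number each user's sessions in (user, timestamp_seconds) clicks.
    Assumed rule: a gap longer than `timeout` starts a new session."""
    last_seen, counters, sessions = {}, {}, []
    for user, ts in sorted(clicks):
        if user not in last_seen or ts - last_seen[user] > timeout:
            counters[user] = counters.get(user, 0) + 1
        last_seen[user] = ts
        sessions.append((user, ts, counters[user]))
    return sessions


print(ngrams("text mining finds patterns", 2))
print(levenshtein("kitten", "sitting"))
print(sessionize([("a", 0), ("a", 100), ("a", 5000), ("b", 50)]))
```

In a massively parallel setting, each of these would typically run per row or per partition (for example, per user for sessionization), which is the kind of workload the SQL-MapReduce framework is described as supporting.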