

CIS 467: Data Mining
Department of Computer Information Systems, Faculty of Information Technology
Yarmouk University, Jordan
Instructors: Dr. Qasem Al-Radaideh, Dr. Samer Samara

Text Mining
Main source: several sources from the Internet

Mining Text Data: An Introduction
Data mining / knowledge discovery can be applied to several kinds of data:
- Structured data: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years)
- Multimedia: Loans($200K, [map], ...)
- Free text: "Frank Rizzo bought his home from Lake View Real Estate in [...]. He paid $200,000 under a 15-year loan from MW Financial."
- Hypertext: "Frank Rizzo bought this home from Lake View Real Estate in [...]" (the same sentence with embedded links)

Text Mining Definition and Motivation
Motivation: approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation).
- 10%: structured numerical or coded information
- 90%: unstructured or semi-structured information
Definition: text mining is the process of compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers, and to discover relationships between related facts that span wide domains of inquiry.
Sources of textual information:
- News articles
- Web pages
- Patent portfolios
- Books
- Customer communications
- Contracts
- Technical documents
- Insurance claims
- Scientific articles
- Plus add your own!

Text Databases and IR
Text databases (document databases):
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, messages, Web pages, library databases, etc.
- The data stored is usually semi-structured
- Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
Information retrieval:
- A field developed in parallel with database systems
- Information is organized into (a large number of) documents
- Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
Text mining (or information extraction):
- Extract from the text what the document means

Search vs. Discovery
                            Search (goal-oriented)    Discover (opportunistic)
Structured data             Data retrieval            Data mining
Unstructured data (text)    Information retrieval     Text mining

Text Mining Applications
- Marketing: discover distinct groups of potential buyers according to a text-based user profile (e.g., Amazon)
- Industry: identify groups of competitors' web pages (e.g., competing products and their prices)
- Job seeking: identify parameters in searching for jobs (e.g., [...])
- Biomedical data: extract pieces of evidence from article titles in the biomedical literature:
  "stress is associated with migraines"
  "stress can lead to loss of magnesium"
  "calcium channel blockers prevent some migraines"
  "magnesium is a natural calcium channel blocker"

Text Mining Process
- Text preprocessing: syntactic/semantic text analysis; tokenization and text clean-up; text tagging
- Feature generation: bag of words
- Feature selection: simple counting; statistics
- Text/data mining: classification (supervised learning); clustering (unsupervised learning)
- Analyzing results
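The first stages listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the function names, and the `min_docs` threshold are hypothetical, and the sample documents are the biomedical titles quoted earlier in the slides. Feature selection is done here by simple counting of how many documents contain a term.

```python
import re
from collections import Counter

# Hypothetical stop list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "with"}

def tokenize(text):
    """Text preprocessing: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def clean(tokens):
    """Text clean-up: drop stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(tokens):
    """Feature generation: map each term to its count in the document."""
    return Counter(tokens)

def select_features(bags, min_docs=2):
    """Feature selection by simple counting: keep terms found in at least min_docs documents."""
    doc_freq = Counter(term for bag in bags for term in bag)
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)

docs = [
    "Stress is associated with migraines",
    "Stress can lead to loss of magnesium",
    "Magnesium is a natural calcium channel blocker",
]
bags = [bag_of_words(clean(tokenize(d))) for d in docs]
features = select_features(bags)
print(features)  # ['magnesium', 'stress']
```

The selected features would then feed the classification or clustering step.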

Text Mining Process
[Figure: the three-step text mining process]

Text Mining Terminology
- Unstructured or semistructured data
- Corpus (and corpora)
- Terms
- Concepts
- Stemming
- Stop words (and include words)
- Synonyms (and polysemes)
- Tokenizing
- Term dictionary
- Word frequency
- Part-of-speech tagging (POS)
- Morphology
- Term-by-document matrix (TDM); occurrence matrix
- Singular Value Decomposition (SVD); Latent Semantic Indexing (LSI)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

Natural Language (NL) Processing and Text Mining
[Figure labels: Unstructured Text (text corpus in natural language); Natural Language Processing (grammatical parsing); Structured Text; Text Data Preparation (text DB, regular expressions, indices, term-document matrices); Analyzed Structured Text; Text Mining]
Linguistics studies natural language: the words and the rules that we use to form meaningful utterances (expressions).
Computer programs for NL processing use grammatical rules (parsing NL text) to mimic human communication and convert NL into structured text for further analysis.

Bag-of-Tokens: Example
Documents: "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …"
Feature extraction turns each document into a token set:
nation: 5, civil: 1, war: 2, men: 2, died: 4, people: 5, Liberty: 1, God: 1, …
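A minimal sketch of this feature extraction, assuming a simple regular-expression tokenizer (the function name `bag_of_tokens` is hypothetical). It is run only on the excerpt quoted on the slide, so the counts it prints are smaller than the slide's figures, which presumably come from the full speech.

```python
import re
from collections import Counter

excerpt = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation, or"
)

def bag_of_tokens(text):
    """Return token -> count, ignoring case, punctuation, and word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

bag = bag_of_tokens(excerpt)
for token in ("nation", "civil", "war", "men", "liberty"):
    print(token, bag[token])
```

Because only counts are kept, any two documents with the same words in a different order produce the same bag.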

Natural Language Processing: Illustrative Example
Sentence: "A dog is chasing a boy on the playground"
- Lexical analysis (part-of-speech tagging): A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun
- Syntactic analysis (parsing): [A dog] Noun Phrase; [is chasing] Complex Verb; [a boy] Noun Phrase; [on the playground] Prep Phrase; these combine into a Verb Phrase and then the Sentence
- Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
- Inference: the rule Scared(x) if Chasing(_,x,_), added to the facts above, yields Scared(b1)
- Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back…
(Taken from ChengXiang Zhai, CS 397cxz, Fall 2003)
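The inference step in this example can be written out as a tiny rule application. The sketch below assumes a hypothetical representation of facts as (predicate, arguments) tuples; it is not how any particular NLP system stores them.

```python
# Facts produced by the semantic analysis, stored as (predicate, arguments) tuples.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}

def infer_scared(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) to every Chasing fact."""
    return {("Scared", (args[1],)) for pred, args in facts if pred == "Chasing"}

derived = infer_scared(facts)
print(derived)  # {('Scared', ('b1',))}
```

The second argument of Chasing is the one being chased, so the rule binds x to b1.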

Text Mining Tasks
Text mining can address the same basic tasks as data mining (text classification, text clustering, text summarization, document association, etc.).
The difference is that text is more difficult to mine than structured data.
E.g., in document clustering, we can transform the documents into vectors using text processing techniques and the vector model, and then apply a clustering algorithm to the documents.
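To make the vector model concrete, here is a hedged sketch that turns documents into term-count vectors over a shared vocabulary and compares them with cosine similarity. The function names and sample sentences are hypothetical, and raw term counts are an assumption; weighted schemes are equally possible.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_vectors(docs):
    """Return (vocabulary, vectors): one term-count vector per document."""
    bags = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, vectors

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "the dog is chasing the boy",
    "the boy is chasing the dog",
    "stock prices fell sharply",
]
vocabulary, vectors = to_vectors(docs)
print(vocabulary)
print(vectors[0] == vectors[1])         # True: word order is lost
print(cosine(vectors[0], vectors[2]))   # 0.0: no shared terms
```

The first two sentences mean different things but map to the same vector, which is one reason text is harder to mine than structured records.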

Text Classification (1)
Motivation: automated document classification is an important text mining task because, with the tremendous number of on-line documents (Web pages, e-mails, corporate intranets, etc.), it is tedious yet essential to organize such documents into classes automatically, to facilitate document retrieval and subsequent analysis.
Definition: text classification (TC) assigns a document to one of a predefined set of categories (classes).
The set of documents is divided into a training set (with pre-classified documents) and a test set (with documents of unknown category).
TC is similar to classification in data mining, with some differences:
- In data mining the data is structured (records with attribute-value pairs)
- In text mining the data is unstructured or semi-structured (title, etc.)
- We can represent documents by sets of words (vector model), which structures the data, but the number of attributes (words) is usually very large

Text Classification (2)
Text classification and IR:
- In text classification there are many predefined categories (classes), and the prediction of all classes is important
- IR involves just two "classes" of documents: relevant and irrelevant to the user query
- Unlike in IR, in text classification a document can typically belong to multiple categories
Classification process:
- Data preprocessing
- Definition of training and test sets
- Creation of the classification model using the selected classification algorithm
- Classification model validation
- Classification of new/unknown text documents
Applications:
- News article classification
- Automatic filtering
- Webpage classification
- …

Text Mining: Classification Definition
Given: a collection of labeled records (training set). Each record contains a set of features (attributes) and the true class (label).
Find: a model for the class as a function of the values of the features.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
[Figure: a categorization system assigning documents to categories such as Sports, Business, Education, Science, …]
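The train/test procedure defined above can be sketched with a small classifier. The slides do not name an algorithm, so this example swaps in multinomial Naive Bayes with add-one smoothing as one common choice; the tiny training and test sets, the labels, and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(training_set):
    """Build a multinomial Naive Bayes model from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in training_set:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training_set = [
    ("the team won the match", "Sports"),
    ("the player scored a goal", "Sports"),
    ("the company reported higher profits", "Business"),
    ("shares of the company fell", "Business"),
]
test_set = [
    ("the team scored a late goal", "Sports"),
    ("profits of the company fell", "Business"),
]

model = train(training_set)  # training set builds the model
correct = sum(classify(model, text) == label for text, label in test_set)
accuracy = correct / len(test_set)  # test set validates it
print(accuracy)  # 1.0
```

With so few documents the accuracy figure is not meaningful; the point is only the separation between the data used to build the model and the data used to validate it.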

Text Classification: An Example
[Figure labels: Training Set (text, class); Learn Classifier; Model; Test Set]

Document Clustering
Motivation:
- Automatically group related documents based on their contents
- No predetermined training sets or taxonomies
- Generate a taxonomy at runtime
Clustering process:
- Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
- Hierarchical clustering: compute similarities, applying clustering algorithms
- Model-based clustering (neural network approach): clusters are represented by "exemplars" (e.g., SOM)
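As an illustration of the preprocessing and hierarchical clustering steps, the sketch below removes stop words and then merges the most similar clusters bottom-up. Several choices are assumptions made for the example, not taken from the slides: the stop list, Jaccard overlap of term sets as the similarity, single-link merging, and the `threshold` at which merging stops.

```python
import re

# Hypothetical stop list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on"}

def preprocess(text):
    """Tokenize and remove stop words; returns the set of remaining terms."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}

def jaccard(a, b):
    """Fraction of terms that two term sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_similarity(c1, c2, term_sets):
    """Single-link: similarity of the closest pair of documents."""
    return max(jaccard(term_sets[i], term_sets[j]) for i in c1 for j in c2)

def agglomerate(docs, threshold=0.2):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    term_sets = [preprocess(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        score, x, y = max(
            (cluster_similarity(a, b, term_sets), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if score < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

docs = [
    "the dog is chasing a boy",
    "a boy and a dog on the playground",
    "stress is associated with migraines",
    "magnesium and stress and migraines",
]
print(agglomerate(docs))  # [[0, 1], [2, 3]]
```

No labels or taxonomy are supplied; the two groups emerge from the documents' shared terms alone.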

Text Clustering
Given: a set of documents and a similarity measure among documents.
Find: clusters such that:
- Documents in one cluster are more similar to one another
- Documents in separate clusters are less similar to one another
Goal: finding a correct set of documents.
Similarity measures:
- Euclidean distance, if attributes are continuous
- Other problem-specific measures, e.g., how many words are common in these documents
[Figure labels: Documents source; Clustering System; Similarity measure; Doc]
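Both kinds of measure mentioned above are short to write down. The following is a minimal sketch with hypothetical function names; the vectors and sentences are made-up inputs, the latter reusing two of the biomedical titles quoted earlier.

```python
import math
import re

def euclidean_distance(u, v):
    """Distance between two equal-length numeric vectors (continuous attributes)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def word_set(doc):
    """Distinct lowercase words in a document."""
    return set(re.findall(r"[a-z]+", doc.lower()))

def common_words(doc_a, doc_b):
    """Problem-specific measure: number of distinct words the documents share."""
    return len(word_set(doc_a) & word_set(doc_b))

print(euclidean_distance([1, 0, 2], [1, 2, 2]))  # 2.0
print(common_words("stress can lead to loss of magnesium",
                   "magnesium is a natural calcium channel blocker"))  # 1
```

Note the opposite directions: a smaller Euclidean distance means more similar documents, while a larger common-word count does.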

Commercial Text Mining Systems
- ClearForest
- Megaputer
- SAS: Enterprise Miner
- SPSS: Clementine
- Oracle: ConText
- IBM: Intelligent Miner for Text

Open-Source Text Mining Tools
- Kea (Keyphrase Extraction Algorithm): an algorithm for extracting keyphrases from text documents.
- Mallet: a collection of tools in Java for statistical NLP, text classification, clustering, and information extraction.
- LingPipe: a Java tool for information extraction and data mining (entity extraction, part-of-speech tagging, clustering, classification, etc.).
- GATE: one of the leading toolkits for text mining and information extraction. It has a nice GUI.
- NLTK: the Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging, parsing, and more.

Product: Intelligent Miner for Text (IMT)

- Voice, hearing, gestures (speaking, listening, communicating)
- Phonology (rules for sounds)
- Pragmatics (rules for language use in context)
- Lexicon (words, regular and irregular forms)
- Morphology (rules for forming complex words)
- Syntax (rules for forming phrases and sentences)
- Semantics (meaning expressed through language)
- Brain, mind (thoughts, beliefs, desires)