1 CIS 467: Data Mining. Department of Computer Information Systems, Faculty of Information Technology, Yarmouk University – Jordan. Instructors: Dr. Qasem Al-Radaideh, Dr. Samer Samara.

2 Text Mining. Main source: several sources from the Internet.

3 Mining Text Data: An Introduction. Data mining / knowledge discovery applies to structured data, multimedia, free text, and hypertext. Structured data: HomeLoan ( Loanee: Frank Rizzo; Lender: MWF; Agency: Lake View; Amount: $200,000; Term: 15 years ). Multimedia: Loans($200K,[map],...). Free text: Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial. Hypertext: Frank Rizzo bought this home from Lake View Real Estate in 1992. ...

4 Text Mining Definition and Motivation. Motivation: approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation), that is, about 90% unstructured or semi-structured information versus 10% structured numerical or coded information. Definition: text mining is the process of compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers and to discover relationships between related facts that span wide domains of inquiry. Sources of textual information: email, news articles, web pages, patent portfolios, books, customer communications, contracts, technical documents, insurance claims, scientific articles, plus add your own!

5 Text Databases and IR. Text databases (document databases): large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, Web pages, and library databases. The data stored is usually semi-structured. Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data. Information retrieval: a field developed in parallel with database systems. Information is organized into (a large number of) documents. The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents. Text mining (or information extraction): extract from the text what the document means.

6 Search vs. Discovery
                           Search (goal-oriented)   Discover (opportunistic)
Structured Data            Data Retrieval           Data Mining
Unstructured Data (Text)   Information Retrieval    Text Mining

7 Text Mining Applications. Marketing: discover distinct groups of potential buyers according to a user's text-based profile, e.g., Amazon. Industry: identify groups of competitors' web pages, e.g., competing products and their prices. Job seeking: identify parameters in searching for jobs, e.g., www.flipdog.com. Biomedical data: extract pieces of evidence from article titles in the biomedical literature, e.g., “stress is associated with migraines”, “stress can lead to loss of magnesium”, “calcium channel blockers prevent some migraines”, “magnesium is a natural calcium channel blocker”.

8 Text Mining Process. Text preprocessing: syntactic/semantic text analysis, tokenization and text clean-up, text tagging. Feature generation: bag of words. Feature selection: simple counting, statistics. Text/data mining: classification (supervised learning), clustering (unsupervised learning). Analyzing results.
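
The preprocessing step on this slide can be illustrated with a minimal sketch. The stop-word list and the suffix-stripping rule below are simplified assumptions made up for this example, not a standard stop list or a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is", "are"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

tokens = preprocess("The dogs are chasing a boy on the playground.")
print(tokens)
```

The crude rule turns "chasing" into "chas", which shows why practical systems rely on a proper stemmer or lemmatizer instead of plain suffix stripping.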

9 Text Mining Process. [Figure: the three-step text mining process]

10 Text Mining Terminology: unstructured or semistructured data; corpus (and corpora); terms; concepts; stemming; stop words (and include words); synonyms (and polysemes); tokenizing; term dictionary; word frequency; part-of-speech (POS) tagging; morphology; term-by-document matrix (TDM); occurrence matrix; singular value decomposition (SVD); latent semantic indexing (LSI). Source: Turban et al. (2011), Decision Support and Business Intelligence Systems.
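
To make the term-by-document matrix concrete, here is a small sketch that builds one from three of the biomedical titles quoted on slide 7. The function name `term_document_matrix` and the whitespace tokenization are assumptions chosen for brevity; no stop-word removal or stemming is applied.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

docs = [
    "stress is associated with migraines",
    "stress can lead to loss of magnesium",
    "magnesium is a natural calcium channel blocker",
]
terms, tdm = term_document_matrix(docs)
for term, row in zip(terms, tdm):
    print(f"{term:12s} {row}")
```

Each row is a term and each column a document; a technique such as SVD would then operate on this matrix.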

11 Natural Language (NL) Processing and Text Mining. [Figure: unstructured text (a text corpus in natural language) passes through natural language processing (grammatical parsing, text data preparation) to become structured text (text DB, regular expressions, indices, term-document matrices), which text mining turns into analyzed structured text.] Linguistics studies natural language: the words and the rules that we use to form meaningful utterances (expressions). Computer programs for NL processing use grammatical rules (parsing NL text) to mimic human communication and convert NL into structured text for further analysis.

12 Bag-of-Tokens: Example. Documents: “Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …” Feature extraction turns documents into token sets: nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1, …
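
A bag of tokens can be sketched in a few lines. The example below counts only the excerpt quoted on the slide, so its counts are smaller than the slide's figures, which appear to come from the full speech; the regular-expression tokenizer is an assumption.

```python
import re
from collections import Counter

# Only the excerpt shown on the slide; the elided remainder is not reproduced.
text = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation, or"
)

# Lowercase, keep alphabetic runs, and count occurrences; word order is discarded.
bag = Counter(re.findall(r"[a-z]+", text.lower()))

for token in ("nation", "civil", "war", "men", "liberty"):
    print(token, bag[token])
```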

13 Natural Language Processing: Illustrative Example. Sentence: “A dog is chasing a boy on the playground.” Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun. Syntactic analysis (parsing): noun phrase, complex verb, noun phrase, prep phrase, verb phrase, sentence. Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). Inference: Scared(x) if Chasing(_,x,_), so Scared(b1). Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back… (Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
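
The inference step can be sketched as a single rule applied to the facts from semantic analysis. The tuple encoding of facts and the helper name `apply_scared_rule` are assumptions made for this illustration; a real system would use a logic engine.

```python
# Facts produced by semantic analysis, encoded as (predicate, arguments) tuples.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}

def apply_scared_rule(facts):
    """Rule from the slide: Scared(x) if Chasing(_, x, _)."""
    derived = {("Scared", (args[1],)) for pred, args in facts if pred == "Chasing"}
    return facts | derived

facts = apply_scared_rule(facts)
print(("Scared", ("b1",)) in facts)
```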

14 Text Mining Tasks. Text mining can address the same basic tasks as data mining (text classification, text clustering, text summarization, document association, etc.). The difference is that text is more difficult to mine than structured data. E.g., in document clustering, we can transform the documents into vectors using text processing techniques and the vector model, and then apply a clustering algorithm to the documents.

15 Text Classification (1). Motivation: automated document classification is an important text mining task because, with the existence of a tremendous number of on-line documents (Web pages, e-mails, corporate intranets, etc.), it is tedious yet essential to be able to automatically organize such documents into classes to facilitate document retrieval and subsequent analysis. Definition: text classification (TC) assigns a document to one of a predefined set of categories (classes). The set of documents is divided into a training set (with pre-classified docs) and a test set (with unknown-category docs). TC is similar to classification in data mining, with some differences: in data mining the data is structured (records with attribute-value pairs), whereas in text mining the data is unstructured or semi-structured (title, etc.). We can represent docs by sets of words (vector model), which structures the data, but the number of attributes (words) is usually very large.

16 Text Classification (2). Text classification and IR: in text classification there are many predefined categories (classes), and the prediction of every class is important. IR involves just two “classes” of documents: relevant and irrelevant to the user query. Unlike in IR, in text classification a document can typically belong to multiple categories. Classification process: data preprocessing; definition of training and test sets; creation of the classification model using the selected classification algorithm; classification model validation; classification of new/unknown text documents. Applications: news article classification, automatic email filtering, webpage classification, …

17 Text Mining: Classification Definition. Given: a collection of labeled records (training set). Each record contains a set of features (attributes) and the true class (label). Find: a model for the class as a function of the values of the features. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. [Figure: a categorization system assigns documents to categories such as Sports, Business, Education, Science, …]
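
The definition above can be sketched with a small classifier. The slide does not name an algorithm; multinomial naive Bayes with add-one smoothing is swapped in here as one common choice, and the documents are made-up toy data (only the category names come from the slide).

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy data, split by hand into a training set and a test set.
training_set = [
    ("the team won the match", "Sports"),
    ("the striker scored a late goal", "Sports"),
    ("shares fell as the market closed", "Business"),
    ("the company reported record profit", "Business"),
]
test_set = [
    ("the team scored a goal", "Sports"),
    ("the market reported profit", "Business"),
]

model = train_naive_bayes(training_set)  # build the model on the training set
correct = sum(classify(model, text) == label for text, label in test_set)
accuracy = correct / len(test_set)       # validate it on the held-out test set
print(f"accuracy on test set: {accuracy:.2f}")
```

Keeping the test documents out of training is what makes the measured accuracy an estimate of performance on previously unseen records.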

18 Text Classification: An Example. [Figure: a classifier learns a model from a training set of texts with class labels; the model is then applied to the test set.]

19 Document Clustering. Motivation: automatically group related documents based on their contents; no predetermined training sets or taxonomies; generate a taxonomy at runtime. Clustering process: data preprocessing (remove stop words, stem, feature extraction, lexical analysis, etc.); hierarchical clustering (compute similarities, applying clustering algorithms); model-based clustering (neural network approach), where clusters are represented by “exemplars” (e.g., SOM).
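
As a rough illustration of the hierarchical option, the sketch below merges documents bottom-up. Single-link merging, Jaccard similarity over token sets, and the stopping threshold are all assumptions chosen for this example; the slide does not prescribe them, and no stop-word removal is applied.

```python
def jaccard(a, b):
    """Similarity of two token sets: shared tokens over all tokens."""
    return len(a & b) / len(a | b)

def agglomerative(docs, threshold):
    """Single-link clustering: merge the two most similar clusters
    until no pair reaches the similarity threshold."""
    token_sets = [set(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per doc
    while len(clusters) > 1:
        best = (0.0, None, None)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = max(
                    jaccard(token_sets[i], token_sets[j])
                    for i in clusters[x]
                    for j in clusters[y]
                )
                if sim > best[0]:
                    best = (sim, x, y)
        sim, x, y = best
        if x is None or sim < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

docs = [
    "stress is associated with migraines",
    "stress can lead to loss of magnesium",
    "calcium channel blockers prevent some migraines",
    "magnesium is a natural calcium channel blocker",
]
print(agglomerative(docs, threshold=0.15))
```

The result lists document indices per cluster. Lowering the threshold lets more merges happen, which is how the hierarchy grows toward a single root.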

20 Text Clustering. Given: a set of documents and a similarity measure among documents. Find: clusters such that documents in one cluster are more similar to one another and documents in separate clusters are less similar to one another. Goal: finding a correct set of documents. Similarity measures: Euclidean distance if attributes are continuous; other problem-specific measures, e.g., how many words are common in these documents. [Figure: documents from a source pass through a similarity measure into a clustering system.]
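
Both kinds of measure mentioned on this slide fit in a few lines. The function names and the whitespace tokenization are assumptions for this sketch; the two example sentences are titles quoted on slide 7.

```python
import math

def euclidean_distance(u, v):
    """Distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def shared_word_count(doc_a, doc_b):
    """Problem-specific measure: number of distinct words the documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

print(euclidean_distance([1, 0, 2], [1, 2, 2]))
print(shared_word_count(
    "stress can lead to loss of magnesium",
    "magnesium is a natural calcium channel blocker",
))
```

Euclidean distance is a dissimilarity (smaller means closer), whereas the shared-word count is a similarity (larger means closer), so a clustering algorithm has to treat the two in opposite directions.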

21 Commercial Text Mining Systems: ClearForest; Megaputer; SAS Enterprise Miner; SPSS Clementine; Oracle ConText; IBM Intelligent Miner for Text.

22 Open-Source Text Mining Tools. Kea (Keyphrase Extraction Algorithm): an algorithm for extracting keyphrases from text documents. Mallet: a collection of tools in Java for statistical NLP, text classification, clustering, and IE. LingPipe: a Java tool for information extraction and data mining (entity extraction, part-of-speech tagging, clustering, classification, etc.). GATE: one of the leading toolkits for text mining and information extraction. It has a nice GUI. NLTK: the Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging, parsing, and more.

23 Product: Intelligent Miner for Text (IMT)

25 Voice, hearing, gestures (speaking, listening, communicating). Phonology (rules for sounds). Pragmatics (rules for language use in context). Lexicon (words, regular and irregular forms). Morphology (rules for forming complex words). Syntax (rules for forming phrases, sentences). Semantics (meaning expressed through language). Brain, mind (thoughts, beliefs, desires).

