Data Mining: 8. Text Mining Romi Satria Wahono WA/SMS: +6281586220090 1.

Slides:



Advertisements
Similar presentations
Software Engineering: Research Romi Satria Wahono
Advertisements

Data Mining: Penelitian Data Mining
Data Mining: Metode dan Algoritma
BPMN Fundamentals Romi Satria Wahono WA/SMS:
Chapter 5: Introduction to Information Retrieval
Computer Security Lab Concordia Institute for Information Systems Engineering Concordia University Montreal, Canada A Novel Approach of Mining Write-Prints.
Data Mining: 5. Penelitian Data Mining Romi Satria Wahono WA/SMS:
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Title Course opinion mining methodology for knowledge discovery, based on web social media Authors Sotirios Kontogiannis Ioannis Kazanidis Stavros Valsamidis.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
SAK 5609 DATA MINING Prof. Madya Dr. Md. Nasir bin Sulaiman
The College of Saint Rose CIS 460 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st.
WebMiningResearch ASurvey Web Mining Research: A Survey By Raymond Kosala & Hendrik Blockeel, Katholieke Universitat Leuven, July 2000 Presented 4/18/2002.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
Recommender systems Ram Akella November 26 th 2008.
BPMN Fundamentals: 4. BPMN Refactoring Romi Satria Wahono WA:
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Data Mining: 3. Persiapan Data
Introduction to Data Mining Engineering Group in ACL.
Knowledge Science & Engineering Institute, Beijing Normal University, Analyzing Transcripts of Online Asynchronous.
TOGAF 9 Fundamental: 2. Basic Concepts
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
Intelligent Systems Lecture 23 Introduction to Intelligent Data Analysis (IDA). Example of system for Data Analyzing based on neural networks.
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
CLassification TESTING Testing classifier accuracy
Funded by: European Commission – 6th Framework Project Reference: IST WP 2: Learning Web-service Domain Ontologies Miha Grčar Jožef Stefan.
Chapter 1 Introduction to Data Mining
Data Mining Applied to Document Imaging Jeff Rekoske.
Defining Text Mining Preprocessing Transforming unstructured data stored in document collections into a more explicitly structured intermediate format.
Introduction to Web Mining Spring What is data mining? Data mining is extraction of useful patterns from data sources, e.g., databases, texts, web,
Introduction to Text and Web Mining. I. Text Mining is part of our lives.
Ontology-Based Information Extraction: Current Approaches.
Knowledge Management: 2. Foundations Romi Satria Wahono WA/SMS:
BPMN Fundamentals: 2. BPMN Basic Concepts Romi Satria Wahono WA:
Text Feature Extraction. Text Classification Text classification has many applications –Spam detection –Automated tagging of streams of news articles,
Data Mining By Dave Maung.
Chapter 6: Information Retrieval and Web Search
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
Knowledge Management: 3. Solutions Romi Satria Wahono WA/SMS:
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
Theory of Computation 5. Pushdown Automata
​ Text Analytics ​ Teradata & Sabanci University ​ April, 2015.
TOGAF 9 Fundamental: 3. TOGAF ADM
Exploring in the Weblog Space by Detecting Informative and Affective Articles Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu Shanghai Jiao-Tong University.
Information Retrieval Chapter 2 by Rajendra Akerkar, Pawan Lingras Presented by: Xxxxxx.
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
Data Mining: 3. Persiapan Data Romi Satria Wahono WA/SMS:
BPMN Fundamentals Romi Satria Wahono WA/SMS:
DATA MINING TECHNIQUES (DECISION TREES ) Presented by: Shweta Ghate MIT College OF Engineering.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Introduction.  Instructor: Cengiz Örencik   Course materials:  myweb.sabanciuniv.edu/cengizo/courses.
BPMN Fundamentals: 5. BPMN Guide and Examples
DATA MINING and VISUALIZATION Instructor: Dr. Matthew Iklé, Adams State University Remote Instructor: Dr. Hong Liu, Embry-Riddle Aeronautical University.
Data mining in web applications
Information Retrieval in Practice
A Simple Approach for Author Profiling in MapReduce
Course Outline 1. Pengantar Data Mining 2. Proses Data Mining
BPMN Fundamentals: 4. BPMN Refactoring
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
A Straightforward Author Profiling Approach in MapReduce
Data Mining: Concepts and Techniques Course Outline
iSRD Spam Review Detection with Imbalanced Data Distributions
Course Introduction CSC 576: Data Mining.
Chapter 5: Information Retrieval and Web Search
Data Warehousing Data Mining Privacy
Presentation transcript:

Data Mining: 8. Text Mining Romi Satria Wahono WA/SMS:

2 Romi Satria Wahono SD Sompok Semarang (1987) SMPN 8 Semarang (1990) SMA Taruna Nusantara Magelang (1993) B.Eng, M.Eng and Ph.D in Software Engineering from Saitama University Japan ( ) Universiti Teknikal Malaysia Melaka (2014) Research Interests: Software Engineering, Machine Learning Founder dan Koordinator IlmuKomputer.Com Peneliti LIPI ( ) Founder dan CEO PT Brainmatics Cipta Informatika

8. Text Mining 7. Algoritma Estimasi dan Forecasting 6. Algoritma Asosiasi 5. Algoritma Klastering 4. Algoritma Klasifikasi 3. Persiapan Data 2. Proses Data Mining 1. Pengantar Data Mining 3 Course Outline

8. Text Mining 7.1 Text Mining Concepts 7.2 Text Clustering 7.3 Text Classification 4

7.1 Text Mining Concepts 5

The fundamental step in text mining involves converting text into semi-structured data Once you convert the unstructured text into semi-structured data, there is nothing to stop you from applying any of the analytics techniques to classify, cluster, and predict The unstructured text needs to be converted into a semi-structured dataset so that you can find patterns and even better, train models to detect patterns in new and unseen text 6 How Text Mining Works

7 Text Processing

1. Himpunan Data (Pemahaman dan Pengolahan Data) 2. Metode Data Mining (Pilih Metode Sesuai Karakter Data) 3. Pengetahuan (Pola/Model/Rumus/ Tree/Rule/Cluster) 4. Evaluation (Akurasi, AUC, RMSE, Lift Ratio,…) 8 Proses Data Mining DATA PRE-PROCESSING Data Cleaning Data Integration Data Reduction Data Transformation Text Processing Estimation Prediction Classification Clustering Association

Words are separated by a special character: a blank space Each word is called a token The process of discretizing words within a document is called tokenization For our purpose here, each sentence can be considered a separate document, although what is considered an individual document may depend upon the context For now, a document here is simply a sequential collection of tokens 9 Word, Token and Tokenization

We can impose some form of structure on this raw data by creating a matrix, where: the columns consist of all the tokens found in the two documents the cells of the matrix are the counts of the number of times a token appears Each token is now an attribute in standard data mining parlance and each document is an example 10 Matrix of Terms

Basically, unstructured raw data is now transformed into a format that is recognized, not only by the human users as a data table, but more importantly by all the machine learning algorithms which require such tables for training This table is called a document vector or term document matrix (TDM) and is the cornerstone of the preprocessing required for text mining 11 Term Document Matrix (TDM)

We could have also chosen to use the TF–IDF scores for each term to create the document vector N is the number of documents that we are trying to mine N k is the number of documents that contain the keyword, k 12 TF–IDF

In the two sample text documents was the occurrence of common words such as “a,” “this,” “and,” and other similar terms Clearly in larger documents we would expect a larger number of such terms that do not really convey specific meaning Most grammatical necessities such as articles, conjunctions, prepositions, and pronouns may need to be filtered before we perform additional analysis Such terms are called stopwords and usually include most articles, conjunctions, pronouns, and prepositions Stopword filtering is usually the second step that follows immediately after tokenization Notice that our document vector has a significantly reduced size after applying standard English stopword filtering 13 Stopwords

Lakukan googling dengan keyword: stopwords bahasa Indonesia Download stopword bahasa Indonesia dan gunakan di Rapidminer 14 Stopwords Bahasa Indonesia

Words such as “recognized,” “recognizable,” or “recognition” in different usages, but contextually they may all imply the same meaning, for example: “Einstein is a well-recognized name in physics” “The physicist went by the easily recognizable name of Einstein” “Few other physicists have the kind of name recognition that Einstein has” The so-called root of all these highlighted words is “recognize” By reducing terms in a document to their basic stems, we can simplify the conversion of unstructured text to structured data because we now only take into account the occurrence of the root terms This process is called stemming. The most common stemming technique for text mining in English is the Porter method (Porter, 1980) 15 Stemming

16 A Typical Sequence of Preprocessing Steps to Use in Text Mining

There are families of words in the spoken and written language that typically go together The word “Good” is usually followed by either “Morning,” “Afternoon,” “Evening,” “Night,” or in Australia, “Day” Grouping such terms, called n-grams, and analyzing them statistically can present new insights Search engines use word n-gram models for a variety of applications, such as: Automatic translation, identifying speech patterns, checking misspelling, entity detection, information extraction, among many different use cases 17 N-Grams

18 Rapidminer Process of Text Mining

7.2 Text Clustering 19

Lakukan eksperimen mengikuti buku Matthew North (Data Mining for the Masses) Chapter 12 (Text Mining), p Datasets: Federalist Papers Pahami alur text mining yang dilakukan dan sesuaikan dengan konsep yang sudah dipelajari 20 Latihan

Gillian is a historian and archivist, and she has recently curated an exhibit on the Federalist Papers, the essays that were written and published in the late 1700’s The essays were published anonymously under the author name ‘Publius’, and no one really knew at the time if ‘Publius’ was one individual or many Years later, after Alexander Hamilton died in the year 1804, some notes were discovered that revealed that he (Hamilton), James Madison and John Jay had been the authors of the papers The notes indicated specific authors for some papers, but not for others. Specifically, John Jay was revealed to be the author for papers 3, 4 and 5; Madison for paper 14; and Hamilton for paper 17. Paper 18 had no author named, but there was evidence that Hamilton and Madison worked on that one together Gillian would like to analyze paper 18’s content in the context of the other papers with known authors, to see if she can generate some evidence that the suspected collaboration between Hamilton and Madison is in fact a likely scenario Having studied all of the Federalist Papers and other writings by the three statesmen who wrote them, Gillian feels confident that paper 18 is a collaboration that John Jay did not contribute to—his vocabulary and grammatical structure was quite different from those of Hamilton and Madison Business Understanding

Gillian is a historian and archivist, and she has recently curated an exhibit on the Federalist Papers, the essays that were written and published in the late 1700’s The essays were published anonymously under the author name ‘Publius’, and no one really knew at the time if ‘Publius’ was one individual or many Years later, after Alexander Hamilton died in the year 1804, some notes were discovered that revealed that he (Hamilton), James Madison and John Jay had been the authors of the papers The notes indicated specific authors for some papers, but not for others. Specifically, John Jay was revealed to be the author for papers 3, 4 and 5; Madison for paper 14; and Hamilton for paper 17. Paper 18 had no author named, but there was evidence that Hamilton and Madison worked on that one together Gillian would like to analyze paper 18’s content in the context of the other papers with known authors, to see if she can generate some evidence that the suspected collaboration between Hamilton and Madison is in fact a likely scenario Having studied all of the Federalist Papers and other writings by the three statesmen who wrote them, Gillian feels confident that paper 18 is a collaboration that John Jay did not contribute to—his vocabulary and grammatical structure was quite different from those of Hamilton and Madison Data Understanding

23

Lakukan eksperimen mengikuti buku Vijay Kotu (Predictive Analytics and Data Mining) Chapter 9 (Text Mining), Case Study 1: Keyword Clustering, p Datasets: Gunakan stopword Bahasa Indonesia (download dari Internet) 24 Latihan

25

26

27

7.3 Text Classification 28

Lakukan eksperimen mengikuti buku Vijay Kotu (Predictive Analytics and Data Mining) Chapter 9 (Text Mining), Case Study 2: Predicting the Gender of Blog Authors, p Datasets: blog-gender-dataset.xslx Split Data: 50% data training dan 50% data testing Gunakan algoritma Naïve Bayes Apply model yang dihasilkan untuk data testing Ukur performance nya 29 Latihan

Dengan berbagai konsep dan teknik yang anda kuasai, lakukan text classification pada dataset polarity data - small Ambil 1 artikel di dalam folder pos, uji apakah artikel tersebut termasuk sentiment negative atau positive 30 Latihan

31

Dengan berbagai konsep dan teknik yang anda kuasai, lakukan text classification pada dataset polarity data Terapkan beberapa metode feature selection, baik filter maupun wrapper Lakukan komparasi terhadap berbagai algoritma klasifikasi, dan pilih yang terbaik 32 Latihan

33

34

35

36

1.Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques Third Edition, Elsevier, Ian H. Witten, Frank Eibe, Mark A. Hall, Data mining: Practical Machine Learning Tools and Techniques 3rd Edition, Elsevier, Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining Use Cases and Business Analytics Applications, CRC Press Taylor & Francis Group, Daniel T. Larose, Discovering Knowledge in Data: an Introduction to Data Mining, John Wiley & Sons, Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., MIT Press, Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook Second Edition, Springer, Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, Referensi