Introduction to Text Mining Hongning Wang

Slides:



Advertisements
Similar presentations
Data Mining and the Web Susan Dumais Microsoft Research KDD97 Panel - Aug 17, 1997.
Advertisements

Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Text mining Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
Introduction to Text Mining
Big Data Management and Analytics Introduction Spring 2015 Dr. Latifur Khan 1.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, Introduction to IR Research ChengXiang Zhai Department of Computer.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
An Overview of Text Mining Rebecca Hwa 4/25/2002 References M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37 th Annual Meeting of.
Information Retrieval in Practice
Introduction to CL Session 1: 7/08/2011. What is computational linguistics? Processing natural language text by computers  for practical applications.
1 BrainWave Biosolutions Limited Accelerating Life Science Research through Technology.
Basic IR Concepts & Techniques ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
Introduction to Data Science Kamal Al Nasr, Matthew Hayes and Jean-Claude Pedjeu Computer Science and Mathematical Sciences College of Engineering Tennessee.
Big Data Research Progress Chao Jan 22, Big Data Lab Big MIT – – 23 nodes.
CS6501 Information Retrieval Course Policy Hongning Wang
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Overview of IR Research ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Introduction to Information Retrieval Hongning Wang
CSC 9010 Spring Paula Matuszek A Brief Overview of Watson.
CS523 INFORMATION RETRIEVAL COURSE INTRODUCTION YÜCEL SAYGIN SABANCI UNIVERSITY.
CS598CXZ (CS510) Advanced Topics in Information Retrieval (Fall 2014) Instructor: ChengXiang (“Cheng”) Zhai 1 Teaching Assistants: Xueqing Liu, Yinan Zhang.
Web Data Management Dr. Daniel Deutch. Web Data The web has revolutionized our world Data is everywhere Constitutes a great potential But also a lot of.
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number.
Introduction to Information Retrieval
Decision Support Systems
1 CSI 5180: Topics in AI: Natural Language Processing, A Statistical Approach Instructor: Nathalie Japkowicz Objectives of.
Big Data Analytics Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value.
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
Overview of Information Retrieval (CS598-CXZ Advanced Topics in IR Presentation) Jan. 18, 2005 ChengXiang Zhai Department of Computer Science University.
Mining real world data Web data. World Wide Web Hypertext documents –Text –Links Web –billions of documents –authored by millions of diverse people –edited.
How Do We Find Information?. Key Questions  What are we looking for?  How do we find it?  Why is it difficult? “A prudent question is one-half of wisdom”
What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 龙星计划课程 : 信息检索 Course Summary ChengXiang Zhai ( 翟成祥 ) Department of.
Text Clustering Hongning Wang
PhD Dissertation Defense Scaling Up Machine Learning Algorithms to Handle Big Data BY KHALIFEH ALJADDA ADVISOR: PROFESSOR JOHN A. MILLER DEC-2014 Computer.
1 An Introduction to Computational Linguistics Mohammad Bahrani.
IR. SI 650/EECS 549 Information Retrieval People search the Web daily Search engines –Google –Bing –Baidu –Yandex Information Retrieval is about search.
CSCE 5073 Section 001: Data Mining Spring Overview Class hour 12:30 – 1:45pm, Tuesday & Thur, JBHT 239 Office hour 2:00 – 4:00pm, Tuesday & Thur,
Social Information Processing March 26-28, 2008 AAAI Spring Symposium Stanford University
Guided By Ms. Shikha Pachouly Assistant Professor Computer Engineering Department 2/29/2016.
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Machine Learning. Definition Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational.
Aakarsh Malhotra ( ) Gandharv Kapoor( )
Introducing Precictive Analytics
Term Project Proposal By J. H. Wang Apr. 7, 2017.
CS510 Advanced Topics in Information Retrieval (Fall 2017)
Taking a Tour of Text Analytics
Machine Learning overview Chapter 18, 21
Introduction to Information Retrieval
Introduction to IR Research
中国计算机学会学科前沿讲习班:信息检索 Course Overview
CS598CXZ (CS510) Advanced Topics in Information Retrieval (Fall 2016)
Course Summary (Lecture for CS410 Intro Text Info Systems)
Mining the Data Charu C. Aggarwal, ChengXiang Zhai
Overview of IR Research
CS510 (Fall 2018) Advanced Topics in Information Retrieval
Data Warehousing and Data Mining
CSE 635 Multimedia Information Retrieval
INNOvation in TRAINING BUSINESS ANALYSTS HAO HElEN Zhang UniVERSITY of ARIZONA
Introduction to Information Retrieval
Course Summary ChengXiang “Cheng” Zhai Department of Computer Science
Web Mining Department of Computer Science and Engg.
Charles Tappert Seidenberg School of CSIS, Pace University
Big DATA.
Welcome! Knowledge Discovery and Data Mining
CSCE 4143 Section 001: Data Mining Spring 2019.
Introduction to Artificial Intelligence
Presentation transcript:

Introduction to Text Mining Hongning Wang

What is “Text Mining”? “Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text.” - wikipedia “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” - Hearst, 1999 Text Mining2

Two different definitions of mining Goal-oriented (effectiveness driven) – Any process that generates useful results that are non- obvious is called “mining”. – Keywords: “useful” + “non-obvious” – Data isn’t necessarily massive Method-oriented (efficiency driven) – Any process that involves extracting information from massive data is called “mining” – Keywords: “massive” + “pattern” – Patterns aren’t necessarily useful Text Mining

Knowledge discovery from text data IBM’s Watson wins at Jeopardy! Text Mining4

An overview of Watson Text Mining5

What is inside Watson? “Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage including the full text of Wikipedia” – PC World “The sources of information for Watson include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. Watson also used databases, taxonomies, and ontologies. Specifically, DBPedia, WordNet, and Yago were used.” – AI Magazine Text Mining6

What is inside Watson? DeepQA system – “Watson's main innovation was not in the creation of a new algorithm for this operation but rather its ability to quickly execute hundreds of proven language analysis algorithms simultaneously to find the correct answer.” – New York Times – The DeepQA Research Team The DeepQA Research Team Text Mining7

Text mining around us Sentiment analysis Text Mining8

Text mining around us Sentiment analysis Text Mining9

Text mining around us Document summarization Text Mining10

Text mining around us Document summarization Text Mining11

Text mining around us Movie recommendation Text Mining12

Text mining around us Restaurant/hotel recommendation Text Mining13

Text mining around us News recommendation Text Mining14

Text mining around us Text analytics in financial services Text Mining15

Text mining around us Text analytics in healthcare Text Mining16

How to perform text mining? As computer scientists, we view it as – Text Mining = Data Mining + Text Data Applied machine learning Natural language processing Information retrieval s Blogs News articles Web pages Tweets Scientific literature Software documentations Text Mining17

Text mining v.s. NLP, IR, DM… How does it relate to data mining in general? How does it relate to computational linguistics? How does it relate to information retrieval? Finding PatternsFinding “Nuggets” NovelNon-Novel Non-textual data General data-mining Exploratory data analysis Database queries Textual data Computational Linguistics Information retrieval Text Mining Text Mining18

Text mining in general Text Mining19 AccessMining Organization Filter information Discover knowledge Add Structure/Annotations Serve for IR applications Based on NLP/ML techniques Sub-area of DM research

Challenges in text mining Data collection is “free text” – Data is not well-organized Semi-structured or unstructured – Natural language text contains ambiguities on many levels Lexical, syntactic, semantic, and pragmatic – Learning techniques for processing text typically need annotated training examples Expensive to acquire at scale What to mine? Text Mining

Text mining problems we will solve Document categorization – Adding structures to the text corpus Text Mining21

Text mining problems we will solve Text clustering – Identifying structures in the text corpus Text Mining22

Text mining problems we will solve Topic modeling – Identifying structures in the text corpus Text Mining23

Text mining problems we will solve Social media and network analysis – Exploring additional structure in the text corpus Text Mining24

We will also briefly cover Natural language processing pipeline – Tokenization “Studying text mining is fun!” -> “studying” + “text” + “mining” + “is” + “fun” + “!” – Part-of-speech tagging “Studying text mining is fun!” -> – Dependency parsing “Studying text mining is fun!” -> Text Mining25

We will also briefly cover Machine learning techniques – Supervised methods Naïve Bayes, k Nearest Neighbors, Logistic Regression – Unsupervised methods K-Means, hierarchical clustering, topic models – Semi-supervised methods Expectation Maximization Text Mining26

Text mining in the era of Big Data Huge in size – Google processes 5.13B queries/day (2013) – Twitter receives 340M tweets/day (2012) – Facebook has 2.5 PB of user data + 15 TB/day (4/2009) – eBay has 6.5 PB of user data + 50 TB/day (5/2009) 80% data is unstructured (IBM, 2010) 640K ought to be enough for anybody. Text Mining27

Scalability is crucial Large scale text processing techniques – MapReduce framework Text Mining28

State-of-the-art solutions Apache Spark (spark.apache.org)spark.apache.org – In-memory MapReduce Specialized for machine learning algorithms – Speed 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Text Mining29

State-of-the-art solutions Apache Spark (spark.apache.org)spark.apache.org – In-memory MapReduce Specialized for machine learning algorithms – Generality Combine SQL, streaming, and complex analytics Text Mining30

State-of-the-art solutions GraphLab (graphlab.com)graphlab.com – Graph-based, high performance, distributed computation framework Text Mining31

State-of-the-art solutions GraphLab (graphlab.com)graphlab.com – Specialized for sparse data with local dependencies for iterative algorithms Text Mining32

Text mining in the era of Big Data 33 Knowledge Discovery Decision Support Text data Data Generation Modeling Human-generated data Behavior data Knowledge service system Human: big data producer and consumer As data producer Challenges: 1.Unstructured data 2.Rich semantic As knowledge consumer Challenges: 1.Implicit feedback 2.Diverse and dynamic Text Mining

Text books Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze, Cambridge University Press, Speech and Language Processing. Daniel Jurafsky and James H. Martin, Pearson Education, Mining Text Data. Charu C. Aggarwal and ChengXiang Zhai, Springer, Text Mining34

Text Mining What to read? Library & Info Science Machine Learning Pattern Recognition Web Applications, Bioinformatics… Statistics Optimization Applications Information Retrieval SIGIR, WWW, WSDM, CIKM ICML, NIPS, UAI NLP ACL, EMNLP, COLING Data Mining KDD, ICDM, SDM Find more on course website for resource Text Mining35 Algorithms

Welcome to the class of “Text Mining”! Text Mining36