Text Mining & Natural Language Processing


Text Mining & Natural Language Processing Ali Hürriyetoglu, Piet Daas THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Outline
- Introduction
- Background
- Basic steps
- Use cases
- Machine learning for text mining

Introduction

What can you do with text mining?
- Named entity recognition
- Sentiment analysis
- Topic detection
- Information extraction
- Trend detection
- Clustering similar documents
- Automatic summarisation

Ingredients of text mining
Text analytics is a function of:
- The amount and type of text you have
- The task you want to achieve
- The precision and recall you want to get
- The time you can spend

Text types
- Semi-structured language use: addresses, phone numbers, named entities, etc.
- Standard text: news articles, books, etc.
- User-generated text: social media, comments

Background

Text
Text is a rich combination of symbols that form a structure with a context-dependent interpretation.
- Symbols: character, word, punctuation, digit, emoticon
- Structure: tokens, links, user names, hashtags, noun, verb, named entity, emoticon, phrases, codes, etc.
- Context: writer, genre, platform, social environment, time, geographic location, etc.
- Interpretation: sense, meaning, …

Symbols
- Letters: A B Ç X
- Digits: 1 5 3 2
- Punctuation: . , ! ?
- Emoticons:
- Special characters: ^ # &

Structure
- Tokens: any space-separated symbol sequence (for European languages)
- Numbers: 6, 123, …
- Web-specific tokens: user names, hashtags, URLs, …
- Abbreviations: vs., etc., ...
- Syntactic interpretation: noun, verb, adjective, ...
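A minimal sketch of the token types listed above, assuming whitespace tokenization is good enough for the text at hand. The category names, the regular expressions, and the sample sentence are illustrative assumptions, not part of the original slides.

```python
import re

# Hypothetical token categories, checked in order; patterns are illustrative only.
TOKEN_TYPES = [
    ("url", re.compile(r"^https?://\S+$")),
    ("user_name", re.compile(r"^@\w+$")),
    ("hashtag", re.compile(r"^#\w+$")),
    ("number", re.compile(r"^\d+([.,]\d+)*$")),
    ("abbreviation", re.compile(r"^([A-Za-z]+\.)+$")),
]

def classify(token):
    """Return the first matching category, or 'word' if none matches."""
    for label, pattern in TOKEN_TYPES:
        if pattern.match(token):
            return label
    return "word"

def tokenize(text):
    """Split on whitespace and attach a category to every token."""
    return [(tok, classify(tok)) for tok in text.split()]

sample = "@anna paid 123 euros vs. 6 last year #prices http://example.org"
tokens = tokenize(sample)
for token, label in tokens:
    print(f"{token:20} {label}")
```

Note that the abbreviation pattern cannot tell "vs." apart from a sentence-final word such as "year."; resolving that needs sentence splitting or an abbreviation list.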

Context
Anything about how a token is used may have a significant effect on its interpretation:
- The person who uses it
- The aim of the phrase
- Time and place of the language use
- Preceding and following expressions
- ...

Interpretation
Tokens and phrases may have one or more interpretations.
- Ambiguity: lexical meaning may differ
- Named entities: the same entity name may refer to different real-world entities
- Genre: orders, compliments, statements, instructions, etc.
- Usernames: interpreted differently on different platforms

Basic steps

Basic steps and tools
You need some combination of:
- Language identification
- Sentence splitting
- Tokenization
- Lemmatization
- Anaphora resolution
- Regular expressions
- POS tagging
- Named entity recognition
- Parsing methodology, Pyparsing
- Language resources: stop words, a sentiment lexicon, multi-word expressions, ontology, etc.
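Three of these steps (sentence splitting, tokenization, and stop-word filtering) can be approximated with the standard library alone. The sketch below is a rough illustration under simplifying assumptions: the splitting rule ignores abbreviations, and the stop-word list is a hypothetical five-minute list, not a real language resource.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "of", "and", "in", "it"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercase the sentence and keep runs of word characters."""
    return re.findall(r"\w+", sentence.lower())

def content_words(sentence):
    """Tokens that are not in the stop-word list."""
    return [t for t in tokenize(sentence) if t not in STOP_WORDS]

text = "Text mining is fun. It finds nuggets in mountains of data!"
sentences = split_sentences(text)
for sentence in sentences:
    print(content_words(sentence))
```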

Use cases

Named entities
Problem: You want to know which named entities occur in a text. You do not have much time or resources. An approximate result is sufficient.
Solution: Find and count all proper-cased token sequences:
    ([A-Z][a-z]+(\s[A-Z][a-z]+)+)
Most frequent matches:
    ('Sherlock Holmes', 90), ('United States', 71), ('New York', 54), ('New England', 46), ('Baker Street', 29), …
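One possible way to run this pattern in Python is sketched below. The sample text is made up, and the inner group is written as non-capturing so that the whole match is counted; everything else follows the regular expression on the slide.

```python
import re
from collections import Counter

# Same pattern as on the slide, with a non-capturing inner group.
PROPER_CASED = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

def count_proper_cased(text):
    """Count every sequence of two or more capitalised words."""
    return Counter(m.group(0) for m in PROPER_CASED.finditer(text))

# Made-up sample text for illustration.
sample = ("Sherlock Holmes lived in Baker Street. "
          "Sherlock Holmes never visited New York.")
counts = count_proper_cased(sample)
print(counts.most_common())
```

The result is approximate by design: a capitalised sentence-initial word directly followed by a name is merged into the match, and single-word names are missed.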

Street names
Problem: You have a set of crime reports. You wonder which street names are mentioned most often.
Solution: Write a more specific regular expression:
    [A-Z][a-z]+ [sS]treet
Most frequent matches:
    ('Baker Street', 29), ('Leadenhall Street', 5), ('Fresno Street', 2), ('Fenchurch Street', 2), ('Bow Street', 2), ('Oxford Street', 2), …
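The same counting approach works for the street pattern. In this sketch the reports are invented examples, and normalising "street" to "Street" before counting is an added assumption so that both spellings fall into one bucket.

```python
import re
from collections import Counter

STREET = re.compile(r"[A-Z][a-z]+ [sS]treet")

def count_streets(reports):
    """Count street names across reports, merging 'street' and 'Street'."""
    counts = Counter()
    for report in reports:
        counts.update(name.title() for name in STREET.findall(report))
    return counts

# Invented reports for illustration.
reports = [
    "A burglary was reported in Baker Street.",
    "Two thefts near Oxford street and Baker Street.",
    "Nothing happened on the main street.",
]
street_counts = count_streets(reports)
print(street_counts.most_common())
```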

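Detect economic indicators
Problem: You want to detect and track price changes. You want to be precise. You know what you are looking for and can spend some time specifying it.
Solution: Parse text with Pyparsing*
    from pyparsing import Literal, Word, alphas, oneOf
    # Words that signal a change, and words that name an economic quantity
    action = oneOf(["lower", "increase", "decrease"], caseless=True)
    econ = oneOf(["prices", "expense", "cost", "price"], caseless=True)
    item = Word(alphas)  # the thing whose price changes
    # Pattern 1: action + item + econ, e.g. "lower fuel prices"
    economy_grammar = action("action") + item("item") + econ
    # Pattern 2: econ + "of" + item + action, e.g. "price of oil increase"
    economy_grammar2 = econ + Literal("of") + item + action
*For R, use the tm package.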

Sentiment Analysis
Problem: You want to understand how people feel about a certain issue or entity.
Solution 1: Create or use an available sentiment lexicon, and count the occurrences of its entries in the text.
Solution 2: Perform a detailed syntactic and semantic analysis.
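Solution 1 can be sketched in a few lines. The two word lists below are hypothetical stand-ins for a real sentiment lexicon, and the score (positive hits minus negative hits) is one simple choice among many.

```python
import re

# Hypothetical mini-lexicon; a real sentiment lexicon has thousands of entries.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "hate", "terrible"}

def sentiment_score(text):
    """Number of positive lexicon hits minus number of negative hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(t in POSITIVE for t in tokens)
    negative_hits = sum(t in NEGATIVE for t in tokens)
    return positive_hits - negative_hits

print(sentiment_score("I love this great phone"))
print(sentiment_score("Terrible battery, bad screen, good camera"))
print(sentiment_score("This is not good"))
```

The last call shows the main weakness of pure counting: "not good" still scores as positive, which is the kind of case the syntactic and semantic analysis of Solution 2 is meant to handle.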

Wordclouds
Problem: You have text and want quick insight into what it mostly contains.
Solution: Word cloud, streamgraph, t-SNE, …
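A word cloud is essentially a picture of word frequencies, so the underlying numbers can be inspected before any drawing happens. The sketch below only computes those frequencies; the stop-word list and the sample text are assumptions, and the rendering itself is left to a plotting or word cloud library.

```python
import re
from collections import Counter

# Hypothetical stop-word list; without one, function words dominate the cloud.
STOP_WORDS = {"the", "and", "of", "in", "to", "a", "we"}

def word_frequencies(text, top_n=5):
    """Most frequent non-stop-words, i.e. the words a cloud would draw largest."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(top_n)

# Made-up sample text for illustration.
sample = "The people and the states: people govern states, people vote"
top_words = word_frequencies(sample, top_n=2)
print(top_words)
```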

https://github.com/amueller/word_cloud/blob/master/examples/constitution.png

Track co-evolution of language use https://blog.twitter.com/2010/the-2010-world-cup-a-global-conversation

Topic modelling
Problem: You need a detailed analysis of the topics in a text collection (corpus).
Solution: Topic modelling

http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html

Machine learning

Machine Learning
You can attempt to solve almost any text mining task with machine learning approaches. The outcome will depend on:
- Feature extraction and selection
- Amount of labeled data in the case of supervised learning
- Time you have to analyze the output in unsupervised learning
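To make the supervised case concrete, here is a small sketch that uses bag-of-words features and a multinomial naive Bayes classifier with add-one smoothing. The slides do not name a specific algorithm, so this choice, the labels, and the four training sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def features(text):
    """Bag-of-words feature extraction: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def train(labeled_texts):
    """Collect per-label word counts, label frequencies and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        word_counts[label].update(features(text))
        label_counts[label] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for word, n in features(text).items():
            if word in vocabulary:  # words never seen in training are skipped
                p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += n * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data; real tasks need far more examples.
training = [
    ("prices increase again", "economy"),
    ("cost of living rises", "economy"),
    ("great match and a late goal", "sport"),
    ("the team won the cup", "sport"),
]
model = train(training)
print(predict(model, "the team scored a goal"))
print(predict(model, "cost increase"))
```

With only four training sentences the model is a toy; the point is that the quality of the features and the amount of labeled data drive the result.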

Thanks for listening! Any questions or comments?

Exercises
6) Search for key terms on Twitter and collect n tweets (n = 200)
7) Determine the most frequent hashtags, links and mentions
8) Create a wordcloud of these tweets
9) Topic detection from tweets (either a user's tweets or a key-term search result)
10) Sentiment analysis: create your own list of 10 positive and 10 negative words and calculate a count-based score
11) Look for an online classifier (for the language of your tweets), get an access key and test it (watch the rate limit), e.g. MonkeyLearn
12) Study emoticons as an example for basic emotions
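As a possible starting point for exercise 7, the sketch below counts hashtags, mentions and links with regular expressions. It assumes the tweets have already been collected into a list of strings; the three tweets and the example.org URL are invented, and the patterns are simplifications of what Twitter actually allows.

```python
import re
from collections import Counter

# Simplified patterns; real hashtag and URL rules are more involved.
PATTERNS = {
    "hashtags": re.compile(r"#\w+"),
    "mentions": re.compile(r"@\w+"),
    "links": re.compile(r"https?://\S+"),
}

def most_frequent(tweets, top_n=3):
    """Return the top_n hashtags, mentions and links over all tweets."""
    result = {}
    for name, pattern in PATTERNS.items():
        counts = Counter()
        for tweet in tweets:
            found = pattern.findall(tweet)
            if name != "links":  # hashtags and mentions are case-insensitive
                found = [item.lower() for item in found]
            counts.update(found)
        result[name] = counts.most_common(top_n)
    return result

# Invented tweets for illustration.
tweets = [
    "Great talk on #textmining by @piet https://example.org/slides",
    "#TextMining and #nlp are fun, thanks @piet",
    "New #nlp course https://example.org/slides",
]
result = most_frequent(tweets)
print(result)
```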

Additional exercises
Additional tasks:
13) Detect place names, person names, organisation names, numbers and dates; recognise geolocation and temporal characteristics; find similar tweets
14) Apply the t-distributed stochastic neighbour embedding (t-SNE) visualization technique to tweets