Text Mining & Natural Language Processing

Presentation on theme: "Text Mining & Natural Language Processing"— Presentation transcript:

1 Text Mining & Natural Language Processing
Ali Hürriyetoglu, Piet Daas THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

2 Outline Introduction Background Basic steps Use cases
Machine learning for text mining

3 Introduction

4 What can you do with text mining?
Named entity recognition Sentiment analysis Topic detection Information extraction Trend detection Clustering similar documents Automatic summarisation

5 Ingredients of text mining
Text analytics is a function of: The amount and type of text you have The task you want to achieve The precision and recall you want to get The time you can spend

6 Text types Semi-structured language use: addresses, phone numbers, named entities, etc. Standard text: news articles, books, etc. User-generated text: social media, comments

7 Background

8 Text Text is a rich combination of symbols that lead to a structure which has a context dependent interpretation. Symbols: character, word, punctuation, digit, emoticon Structure: tokens, links, user names, hashtags, noun, verb, named entity, emoticon, phrases, codes, etc. Context: writer, genre, platform, social environment, time, geographic location, etc. Interpretation: sense, meaning, …

9 Symbols Letters: A B Ç X Digits: 1 5 3 2 Punctuation: . , ! ?
Emoticons:   Special characters: ^ # &

10 Structure Tokens: Any space separated symbol sequence (for European languages). Numbers: 6, 123, …, Web specific tokens: user names, hashtags, URLs, … Abbreviations: vs., etc., ... Syntactic interpretation: noun, verb, adjective, ...
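The token definition above can be sketched in a few lines of Python. This is only an illustration: the example sentence and the pattern `TOKEN_RE` are assumptions, not part of the slides, and abbreviations such as "vs." are deliberately not handled.

```python
import re

# Hypothetical example text (not from the slides).
text = "@anna loves #NLP, see https://example.org vs. the old way!"

# Naive tokenization: any space-separated symbol sequence is a token.
space_tokens = text.split()

# Slightly richer sketch: keep URLs, user names and hashtags whole,
# then split the rest into words and single punctuation marks.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+|[^\w\s]")
regex_tokens = TOKEN_RE.findall(text)

print(space_tokens)
print(regex_tokens)
```

Whitespace splitting leaves punctuation glued to words ("#NLP,"), while the pattern-based version separates it but breaks "vs." into two tokens, which is why abbreviations usually need their own rule or a lookup list.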

11 Context Anything about the use of a token may have a significant effect:
The person who uses it The aim of the phrase Time and place of the language use Preceding and following expressions ...

12 Interpretation Tokens and phrases may have one or more interpretations. Ambiguity: lexical meaning may differ Named entities: the same entity name may refer to different real-world entities Genre: orders, compliments, statements, instructions, etc. Usernames: will be interpreted differently on different platforms

13 Basic steps

14 Basic steps and tools You need some combination of:
Language identification Sentence splitting Tokenization Lemmatization Anaphora resolution Regular expressions POS (Part Of Speech) tagging Named entity recognition Parsing methodology, Pyparsing Language resources: stop words, a sentiment lexicon, multi-word expressions, ontology, etc.

15 Regular expressions
A regular expression (regex or regexp for short) is a special text string for describing a search pattern.
Examples:
Find q in the input string: “q”
Find q at the start of the input string: “^q”
Find q at the end of the input string: “q$”
Find words with q or Q in them: “[qQ]”
Find a single digit in a string: “[0-9]”
Find words that start with ton: “\bton\w*”
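The example patterns can be tried out with Python's standard `re` module. The sample string below is made up for illustration. Note that “[qQ]” on its own matches a single character; to return the whole word containing it, this sketch assumes the extended pattern `\w*[qQ]\w*`.

```python
import re

# Hypothetical sample string (not from the slides).
sample = "quantum tonnage Qatar stone 7 tons"

has_q = re.search(r"q", sample) is not None           # q anywhere
starts_with_q = re.search(r"^q", sample) is not None  # q at the start
ends_with_q = re.search(r"q$", sample) is not None    # q at the end
q_chars = re.findall(r"[qQ]", sample)                 # every q or Q character
q_words = re.findall(r"\w*[qQ]\w*", sample)           # whole words containing q or Q
digits = re.findall(r"[0-9]", sample)                 # single digits
ton_words = re.findall(r"\bton\w*", sample)           # words starting with "ton"

print(has_q, starts_with_q, ends_with_q)
print(q_chars, q_words, digits, ton_words)
```

The word boundary `\b` is what keeps "stone" out of the last result even though it contains "ton".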

16 Regular expressions (2)
Regular expressions can become very difficult to interpret.
Examples:
Match mobile phone numbers: “(((\\+31|0|0031)6){1}[1-9]{1}[0-9]{7})”
Select text inside XML brackets: “</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>”
Fortunately there are web pages available to assist you. For testing: Quick sheet:
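As a rough sketch, the mobile phone pattern can be written more readably in Python. The doubled backslashes on the slide appear to be string-literal escaping; in a Python raw string a single backslash is enough. The simplified pattern `MOBILE_RE` and the sample text are assumptions for illustration; the pattern is meant to accept the same numbers (a +31, 0031 or 0 prefix, then 6, then eight digits of which the first is non-zero).

```python
import re

# Simplified equivalent of the slide's pattern, with non-capturing groups
# and the redundant {1} quantifiers dropped.
MOBILE_RE = re.compile(r"(?:\+31|0031|0)6[1-9][0-9]{7}")

# Hypothetical sample text; the numbers are invented.
text = "Call 0612345678 or +31612345678, not 0201234567."

numbers = MOBILE_RE.findall(text)
print(numbers)
```

The landline-style number is skipped because its prefix is not followed by a 6.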

17 Use cases

18 Named entities
Problem: You want to know which named entities occur in a text. You do not have much time or many resources. An approximate result is sufficient for you.
Solution: Find and count all proper-cased token sequences: ([A-Z][a-z]+(\s[A-Z][a-z]+)+)
('Sherlock Holmes', 90), ('United States', 71), ('New York', 54), ('New England', 46), ('Baker Street', 29),
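A minimal sketch of this find-and-count approach, using `re` and `collections.Counter`. The short sample text is invented for illustration and is not the corpus behind the counts shown on the slide.

```python
import re
from collections import Counter

# The slide's pattern, with the inner group made non-capturing so that
# findall() returns the whole matched sequence.
PROPER_SEQ = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

# Hypothetical sample text (not the corpus used on the slide).
text = ("Sherlock Holmes lived in Baker Street. Mr. Watson met Sherlock Holmes "
        "in New York before returning to Baker Street.")

counts = Counter(PROPER_SEQ.findall(text))
print(counts.most_common())
```

The result is approximate by design: single-word names such as "Watson" are missed, and a capitalised sentence-initial word followed by a name (for example "The United States") would be counted as part of the entity.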

19 Street names
Problem: You have a set of crime reports. You wonder which street names are mentioned most often.
Solution: Write a more specific regular expression: [A-Z][a-z]+ [sS]treet
('Baker Street', 29), ('Leadenhall Street', 5), ('Fresno Street', 2), ('Fenchurch Street', 2), ('Bow Street', 2), ('Oxford Street', 2),

20 Detect economic indicators
Problem: You want to detect and track price changes. You want to be precise. You know what you are looking for and can spend some time specifying it.
Solution: Parse the text with Pyparsing:
from pyparsing import Literal, Word, alphas, oneOf
action = oneOf(["lower","increase","decrease"], caseless=True)
econ = oneOf(["prices","expense","cost","price"], caseless=True)
item = Word(alphas)
economy_grammar = action("action")+item("item")+econ
economy_grammar2 = econ + Literal("of") + item + action
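To show what the two grammars are meant to capture without requiring Pyparsing to be installed, here is a standard-library approximation using regular expressions. It is not the Pyparsing solution itself: the function `find_price_changes`, the word-boundary handling and the sample sentences are assumptions made for this sketch.

```python
import re

ACTIONS = ["lower", "increase", "decrease"]
ECON = ["prices", "expense", "cost", "price"]
action = "|".join(ACTIONS)
econ = "|".join(ECON)

# Approximates: action + item + econ, e.g. "increase fuel prices"
GRAMMAR_1 = re.compile(
    rf"\b(?P<action>{action})\s+(?P<item>[A-Za-z]+)\s+(?P<econ>{econ})\b",
    re.IGNORECASE,
)
# Approximates: econ + "of" + item + action, e.g. "cost of bread increase"
GRAMMAR_2 = re.compile(
    rf"\b(?P<econ>{econ})\s+of\s+(?P<item>[A-Za-z]+)\s+(?P<action>{action})\b",
    re.IGNORECASE,
)

def find_price_changes(text):
    """Return (action, item, econ) triples found by either pattern."""
    hits = []
    for pattern in (GRAMMAR_1, GRAMMAR_2):
        for m in pattern.finditer(text):
            hits.append((m.group("action").lower(),
                         m.group("item").lower(),
                         m.group("econ").lower()))
    return hits

print(find_price_changes("Suppliers Increase fuel prices again."))
print(find_price_changes("The cost of bread increase was small."))
```

A grammar-based parser becomes more attractive than patterns like these once the phrases need nesting, optional parts or named sub-results that are reused across rules.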

21 Sentiment Analysis Problem: You want to understand how people feel about a certain issue or entity. Solution 1: Create or use an available sentiment lexicon. Count the number of occurrences of the lexicon entries. Solution 2: Detailed syntactic and semantic analysis.
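Solution 1 can be sketched as a simple count-based score. The tiny word lists and the function `lexicon_score` are hypothetical; a real lexicon would be far larger and language specific.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "sad"}

def lexicon_score(text):
    """Count positive minus negative lexicon entries in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg

print(lexicon_score("I love this great phone"))
print(lexicon_score("Terrible battery, I hate it"))
print(lexicon_score("not bad"))
```

The last call scores "not bad" as negative, which illustrates why negation and context push towards the more detailed analysis of Solution 2.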

22 Wordclouds Problem: You have text and want a quick insight into what it mostly contains. Solution: Word cloud, streamgraph, t-SNE, …

23

24 Track co-evolution of language use

25 Topic modelling Problem: You need a detailed analysis of the topics in a text collection (corpus). Solution: Topic modelling

26 http://alexperrier. github

27 Machine learning

28 Machine Learning You can attempt to solve almost any text mining task with machine learning approaches. The outcome will depend on: Feature extraction and selection Amount of labeled data in supervised learning Time you have to analyze the output in unsupervised learning

29 Thanks for listening! Any questions or comments?

30 Exercises
6) Search for key terms on Twitter and collect n tweets (n = 200)
7) Determine the most frequent hashtags, links and mentions
8) Create a word cloud of these tweets
9) Topic detection from tweets (either a user's tweets or a key-term search result)
10) Sentiment analysis: create your own list of 10 positive and 10 negative words and calculate a count-based score
11) Look for an online classifier (for the language of your tweets), get an access key and test it (watch the rate limit)
12) Study emoticons as an example of basic emotions
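As a possible starting point for exercise 7, the sketch below counts hashtags, mentions and links in a list of tweet texts. Collecting the tweets is not shown, since it depends on API access; the sample tweets, user names and the helper `top` are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical tweets standing in for a collected sample.
tweets = [
    "Loving #nlp and #textmining with @user_a https://example.org/a",
    "More #NLP tips from @user_a",
    "@user_b shares https://example.org/a #bigdata",
]

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
LINK = re.compile(r"https?://\S+")

def top(pattern, texts, n=3):
    """Return the n most frequent matches of pattern, ignoring case."""
    counts = Counter(m.lower() for t in texts for m in pattern.findall(t))
    return counts.most_common(n)

print(top(HASHTAG, tweets))
print(top(MENTION, tweets))
print(top(LINK, tweets))
```

Lower-casing merges "#nlp" and "#NLP" into one entry; whether that is desirable for links as well is a choice to revisit for real data.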

31 Additional exercises
Additional tasks:
13) Detect place names, person names, organisation names, numbers and dates; determine geolocation/temporal characteristics; find similar tweets
14) Apply the t-distributed stochastic neighbour embedding (t-SNE) visualization technique to tweets

