Big Data Sources – Web, Social media and Text Analytics


1 Big Data Sources – Web, Social media and Text Analytics
Piet Daas, Olav ten Bosch, Ali Hürriyetoglu, Dick Windmeijer
The contractor is acting under a framework contract concluded with the Commission

2 ESTP Big Data training course nr. 3
Overview
Hands on (learning by doing)
Learn how to:
  Collect ‘data’ – from Web pages and Social media
  Process ‘data’
  Analyse ‘data’
Learn how to extract information from textual data
  Text mining, text analytics, Natural Language Processing …

3 Overview
Day 1
  Introduction
  Social media and official statistics
  Exercise: Create ‘keys’ for Twitter API access
  Exercise: Connect to Twitter API
  Exercise: Get user, profile and tweets (in your own language)

4 Overview (2)
Day 2
  Web scraping explained
  Exercise: Use web robots
  Web scraping tips and tricks
  Exercise: Learn how to collect data from websites
  Feedback
Day 3
  Text mining and topic identification of tweets
  Exercise: Analyse tweets: identify topics
  Sentiment analysis
  Exercise: Analyse tweets: sentiment & more
  Natural Language Processing
  Demonstration
  Exercise: Extra time for more advanced analysis

5 Overview (3)
Day 4
  Text mining of web pages
  Exercise: Analyse document: content
  Exercise: Analyse web sites: content & topics
  Overview of the course & dealing with private data
  Exercise: Time to redo exercises/extra work
  Feedback
  Wrapping up, removing data

6 Why analyse text?
Texts are a source of information not commonly used in official statistics
Potential applications, all performed automatically:
  Classify answers to open questions
  Code descriptions of jobs, education and products
  Identify the activity code of companies from web site text
  Identify products in detail from descriptions on web sites
  Classify cause of death from medical reports
  Sentiment analysis of messages
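To make the last application concrete, here is a minimal sketch of lexicon-based sentiment scoring, written with the Python standard library only. The two word lists and the function names are hypothetical and chosen purely for illustration; they are not the course material, and real studies rely on much larger, language-specific lexicons or on trained models.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "sad", "hate", "poor"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    tokens = tokenize(text)
    positives = sum(1 for token in tokens if token in POSITIVE)
    negatives = sum(1 for token in tokens if token in NEGATIVE)
    return positives - negatives


def sentiment_label(text):
    """Map the score to a coarse label: positive, negative or neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Counting words this way ignores negation ("not good") and sarcasm, which is one reason dedicated libraries are normally preferred in practice.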

7 Why analyse text? (2)
It is therefore important to:
  Learn how to extract information from textual data
    This training course focuses on this topic
    The goal is to learn the basics through a hands-on approach
    It is a starting point for more advanced studies
    Key steps are: collection, processing and analysis
  Gain insight into methods and approaches that can be applied to extract information from texts

8 Examples of interesting books
Manning and Schütze (1999) Foundations of Statistical Natural Language Processing, MIT Press
Feldman and Sanger (2007) The Text Mining Handbook, Cambridge Univ. Press
Kao and Poteet (2007) Natural Language Processing and Text Mining, Springer
Manning, Raghavan and Schütze (2008) Introduction to Information Retrieval, Cambridge Univ. Press
Weiss, Indurkhya and Zhang (2010) Fundamentals of Predictive Text Mining, Springer
Aggarwal and Zhai (2012) Mining Text Data, Springer
Miner, Elder, Fast, Hill, Nisbet and Delen (2012) Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, Elsevier

9 Practical tips
Use our laptops
  Dual boot Windows / Linux
Need to collect your own data!
  Connect to WiFi (CBS-Public)
  Web robots: via browser plugin (Windows)
  Twitter data: either in R or in Python (Linux)
  Python Notebooks will be distributed

10 R-packages for text analytics
tm: Text Mining Package
  A framework for text mining applications within R
NLP: Natural Language Processing Infrastructure
  Basic classes and methods for Natural Language Processing
SnowballC: Snowball ‘stemmers’ …
  An R interface to the C libstemmer library … Currently supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.
stringr: Wrappers for Common String Operations
  A consistent, simple and easy to use set of wrappers for string operations.
wordcloud: Word Clouds
  For pretty word clouds
RColorBrewer: ColorBrewer Palettes
  Provides color schemes for maps (and other graphics)
twitteR: R Based Twitter Client
  Provides an interface to the Twitter web API
More info:

11 Text analytics libraries for Python
NLTK: Natural Language Toolkit
  Collection of NLP tools
TextBlob
  Built on top of NLTK, especially useful for beginners
spaCy
  Fast NLP implementation
Gensim
  For topic modeling and similarity detection
Pattern
  Web mining module for Python and more
Pyparsing
  For parsing text
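The libraries above all build on the same basic processing steps: tokenising text, removing stop words and counting terms. The sketch below shows those steps with the Python standard library only, so it runs without installing anything. The stop-word list and the function names are hypothetical and deliberately tiny; the listed libraries ship far more complete versions.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on"}


def preprocess(text):
    """Lower-case, tokenise on letters and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def top_terms(documents, n=3):
    """Count terms over all documents and return the n most frequent."""
    counts = Counter()
    for document in documents:
        counts.update(preprocess(document))
    return counts.most_common(n)


if __name__ == "__main__":
    docs = [
        "The price of bread is rising",
        "Bread and milk prices are rising in the city",
    ]
    print(top_terms(docs, n=2))
```

Note that "price" and "prices" are counted as different terms here; a stemmer or lemmatiser, as provided by the libraries above, would merge them.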

12 Essential step for Twitter studies

13 Create keys for Twitter API access
Make sure you have a Twitter account
  If not, go to
Log in and visit
Fill in a name, description, web site and agree
Copy all keys and tokens (all four), paste them in a text file and save this!! (don’t share them)
You will need them during this course!!
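One possible way to work with the saved keys without pasting them into notebooks is sketched below. It assumes, purely for illustration, that the four values were saved as a JSON file with the field names shown; the file layout, the field names and the helper functions are hypothetical and not prescribed by the course or by Twitter.

```python
import json

# Assumed field names for the four values; adapt them to your own file.
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def load_credentials(path):
    """Read the JSON file and check that all four values are present."""
    with open(path, encoding="utf-8") as handle:
        credentials = json.load(handle)
    missing = [name for name in REQUIRED_KEYS if not credentials.get(name)]
    if missing:
        raise ValueError("Missing credentials: " + ", ".join(missing))
    return {name: credentials[name] for name in REQUIRED_KEYS}


def mask(value, visible=4):
    """Hide all but the last few characters so a key can be shown safely."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

Keeping the file outside any shared folder or version-controlled directory, and printing only masked values, helps to follow the "don't share them" advice above.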
