Big Data Sources – Web, Social media and Text Analytics


Piet Daas, Olav ten Bosch, Ali Hürriyetoglu, Dick Windmeijer
The contractor is acting under a framework contract concluded with the Commission.

ESTP Big Data training course nr. 3

Overview
- Hands on (learning by doing)
- Learn how to:
  - Collect ‘data’ from Web pages and Social media
  - Process ‘data’
  - Analyse ‘data’
- Learn how to extract information from textual data: text mining, text analytics, Natural Language Processing, …

Overview: Day 1
- Introduction
- Social media and official statistics
- Exercise: Create ‘keys’ for Twitter API access
- Exercise: Connect to Twitter API
- Exercise: Get user, profile and tweets (in your own language)

Overview (2)

Day 2
- Web scraping explained
- Exercise: Use web robots
- Web scraping tips and tricks
- Exercise: Learn how to collect data from websites
- Feedback

Day 3
- Text mining and topic identification of tweets
- Exercise: Analyse tweets: identify topics
- Sentiment analysis
- Exercise: Analyse tweets: sentiment & more
- Natural Language Processing
- Demonstration
- Exercise: Extra time for more advanced analysis

Overview (3)

Day 4
- Text mining of web pages
- Exercise: Analyse document: content
- Exercise: Analyse web sites: content & topics
- Overview of the course & dealing with private data
- Exercise: Time to redo exercises/extra work
- Feedback
- Wrapping up, removing data

Why analyse text?
- Texts are a source of information not commonly used in official statistics
- Potential applications include automatically:
  - Classifying answers to open questions
  - Coding descriptions of jobs, education and products
  - Identifying the activity code of companies from web site text
  - Identifying products in detail from descriptions on web sites
  - Classifying cause of death from medical reports
  - Analysing the sentiment of messages
  - …
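To make one of these applications concrete, the sketch below codes free-text job descriptions with plain keyword matching. It only illustrates the idea and is not a method taken from the course: the category names, the keyword lists and the function `code_description` are all hypothetical, and a real coding system would need a much richer approach.

```python
import re

# Hypothetical code list: category -> keywords (illustrative only).
KEYWORDS = {
    "health": {"nurse", "doctor", "dentist"},
    "education": {"teacher", "lecturer", "tutor"},
    "construction": {"carpenter", "bricklayer", "plumber"},
}


def code_description(text):
    """Return the category whose keywords occur most often, or 'unknown'."""
    # Lowercase the text and keep runs of ASCII letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count keyword hits per category.
    scores = {
        category: sum(token in words for token in tokens)
        for category, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


if __name__ == "__main__":
    answers = [
        "Primary school teacher",
        "I work as a nurse in a hospital",
        "Freelancer",
    ]
    for answer in answers:
        print(answer, "->", code_description(answer))
```

Descriptions that match no keyword come back as "unknown", so they can be set aside for manual coding.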

Why analyse text? (2)
- It is therefore important to learn how to extract information from textual data
- This training course focuses on this topic
  - The goal is to learn the basics through a hands-on approach
  - It is a starting point for more advanced studies
  - The key steps are collection, processing and analysis
- Gain insight into methods and approaches that can be applied to extract information from texts
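The three key steps can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the course's own code: `collect` returns hard-coded example sentences in place of real web or Twitter data, the stop-word list is a tiny made-up sample, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real work needs a fuller,
# language-specific list.
STOPWORDS = {"the", "a", "is", "and", "of", "in", "to", "are"}


def collect():
    """Collection step: a stand-in that returns hard-coded example texts."""
    return [
        "The price of bread is rising.",
        "Bread and milk are cheap in the market.",
    ]


def process(text):
    """Processing step: lowercase, split into word tokens, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def analyse(docs):
    """Analysis step: count how often each token occurs over all documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


if __name__ == "__main__":
    counts = analyse(process(text) for text in collect())
    print(counts.most_common(3))
```

Keeping the steps in separate functions means the stand-in `collect` can later be swapped for a real data source without touching the processing and analysis code.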

Examples of interesting books
- Manning and Schütze (1999) Foundations of Statistical Natural Language Processing, MIT Press.
- Feldman and Sanger (2007) The Text Mining Handbook, Cambridge Univ. Press.
- Kao and Poteet (2007) Natural Language Processing and Text Mining, Springer.
- Manning, Raghavan and Schütze (2008) Introduction to Information Retrieval, Cambridge Univ. Press.
- Weiss, Indurkhya and Zhang (2010) Fundamentals of Predictive Text Mining, Springer.
- Aggarwal and Zhai (2012) Mining Text Data, Springer.
- Miner, Elder, Fast, Hill, Nisbet and Delen (2012) Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, Elsevier.

Practical tips
- Use our laptops
- Dual boot Windows / Linux
- Need to collect your own data!
- Connect to WiFi (CBS-Public)
- Web robots: via browser plugin (Windows)
- Twitter data: either in R or in Python (Linux)
- Python Notebooks will be distributed

R-packages for text analytics
- tm: Text Mining Package. A framework for text mining applications within R.
- NLP: Natural Language Processing Infrastructure. Basic classes and methods for Natural Language Processing.
- SnowballC: Snowball ‘stemmers’ … An R interface to the C libstemmer library … Currently supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.
- stringr: Wrappers for Common String Operations. A consistent, simple and easy to use set of wrappers for string operations.
- wordcloud: Word Clouds. For pretty word clouds.
- RColorBrewer: ColorBrewer Palettes. Provides color schemes for maps (and other graphics).
- twitteR: R Based Twitter Client. Provides an interface to the Twitter web API.

More info: https://cran.r-project.org/package=<name_package>

Text analytics libraries for Python
- NLTK: Natural Language Toolkit. Collection of NLP tools.
- TextBlob: Built on top of NLTK, especially useful for beginners.
- spaCy: Fast NLP implementation.
- Gensim: For topic modeling and similarity detection.
- Pattern: Web mining module for Python and more.
- Pyparsing: For parsing text.

Essential step for Twitter studies

Create keys for Twitter API access
1. Make sure you have a Twitter account. If not, go to https://twitter.com/signup
2. Log in and visit https://apps.twitter.com/app/new
3. Fill in a name, description, web site and agree
4. Copy all keys and tokens (all four), paste them in a text file and save this!! (Don’t share them.)

You will need them during this course!!
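One way to use the saved text file later is to read the keys from it at run time, so they never appear in a script or notebook you might share. The sketch below assumes a file with one `name = value` pair per line. The file name `twitter_keys.txt` and the four field names are assumptions chosen for illustration, not something the course prescribes.

```python
from pathlib import Path

# Assumed names for the four credentials; adjust them to match your file.
REQUIRED = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def parse_keys(text):
    """Parse 'name = value' lines; skip blank lines and '#' comment lines."""
    keys = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first '=' only, so values may contain '='.
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'name = value', got: {line!r}")
        keys[name.strip()] = value.strip()
    # Fail early if any of the four credentials is absent or empty.
    missing = [name for name in REQUIRED if not keys.get(name)]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return keys


def load_keys(path="twitter_keys.txt"):
    """Read the key file (hypothetical default name) and parse it."""
    return parse_keys(Path(path).read_text(encoding="utf-8"))
```

Calling `load_keys()` returns a dictionary holding the four values. An incomplete file raises a `ValueError` as soon as it is loaded.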
