From Words to Pictures: Text Analysis and Visualization. Nicholas Diakopoulos, Computational Journalism Lab – College of Journalism, University of Maryland.


What’s Different about Text?
Text: a sequence of written or spoken words. Frequencies / rates, context, semantics.
Other data types: Tables; Geometric (2D or 3D); Networks (graphs); Trees (hierarchies); Temporal.
Image Credit: T. Munzner. Visualization Analysis & Design.

Counts + Comparison

Counts Over Time + Semantics

Counts + Maps

Diakopoulos et al. Diamonds in the rough: Social media visual analytics for journalistic inquiry. VAST 2010.

Networks. Networks of Names: Visual Exploration and Semi-Automatic Tagging of Social Networks from Newspaper Articles. EuroVis 2014.

Wordles

N. Diakopoulos, et al. Compare Clouds: Visualizing Text Corpora to Compare Media Frames. Proc. IUI Workshop on Visual Text Analytics

Word Tree

News Views T. Gao, J. Hullman, E. Adar, B. Hecht, N. Diakopoulos. NewsViews: An Automated Pipeline for Creating Custom Geovisualizations for News. Proc. Conference on Human Factors in Computing Systems (CHI). May, 2014.

Timeline Curator

Processing Text How do we go from a blob of text to something we can actually work with? What can we count? What tools can we use?

Text Processing Pipeline
Initial Text: We are fifteen years into this new century.
Lowercase: we are fifteen years into this new century.
Tokenize: we | are | fifteen | years | into | this | new | century | .
Stem: we | are | fifteen | year | into | thi | new | centuri | .
Stop Word Removal: we | are | fifteen | year | into | new | centuri
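The pipeline on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: `toy_stem` is a hypothetical two-rule suffix stripper standing in for a real stemmer such as Porter's, and `STOP_WORDS` is a deliberately tiny, assumed stop list chosen so that the output matches the slide.

```python
import re

# Assumed, deliberately tiny stop list; real stop lists are much longer.
STOP_WORDS = {"this", "the", "a", "an", "of"}


def lowercase(text):
    return text.lower()


def tokenize(text):
    # Words become tokens; each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: only two suffix rules.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    if len(token) > 3 and token.endswith("y") and any(c in "aeiou" for c in token[:-1]):
        return token[:-1] + "i"
    return token


# Stop words are stemmed too, because removal happens after stemming.
STOP_STEMS = {toy_stem(word) for word in STOP_WORDS}


def remove_stop_words(tokens):
    # Drop stop words and punctuation-only tokens.
    return [t for t in tokens if t not in STOP_STEMS and t.isalnum()]


def run_pipeline(text):
    tokens = tokenize(lowercase(text))
    stems = [toy_stem(t) for t in tokens]
    return remove_stop_words(stems)


print(run_pipeline("We are fifteen years into this new century."))
```

Because stemming runs before stop word removal, the stop list has to be stemmed as well; otherwise "thi" would never match "this".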

Pipeline Pointers
Lowercasing: Usually it’s OK, but sometimes capitals matter, e.g. in people’s titles.
Tokenization: If tokenizing sentences, you need to be careful with things like “Mr. Speaker, Mr. Vice President”.
Stemming: Is language specific. May need reverse-stemming to be presentable back to the user.
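The sentence tokenization pitfall can be shown with a small sketch. It assumes a hypothetical, incomplete abbreviation list (`ABBREVIATIONS`); a production sentence splitter would need a far larger list or a trained model.

```python
import re

# Assumed, incomplete list of abbreviations that end in a period.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof"}

_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text):
    # Split after every '.', '!' or '?' that is followed by whitespace.
    return [chunk for chunk in _BOUNDARY.split(text.strip()) if chunk]


def careful_split(text):
    # Re-join chunks whose last word is a known abbreviation.
    sentences, pending = [], []
    for chunk in naive_split(text):
        pending.append(chunk)
        last_word = chunk.split()[-1].rstrip(".!?").lower()
        if last_word not in ABBREVIATIONS:
            sentences.append(" ".join(pending))
            pending = []
    if pending:
        sentences.append(" ".join(pending))
    return sentences


sample = "Mr. Speaker, Mr. Vice President. We are fifteen years into this new century."
print(naive_split(sample))    # breaks after each "Mr."
print(careful_split(sample))  # keeps the titles attached
```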

Counting Stuff (AntConc): Unigrams; Bigrams; N-Grams; Collocations; Regular Expressions; “keyness”; Classes
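As a rough illustration of what counting unigrams, bigrams, and general n-grams means, here is a minimal sketch using only the Python standard library. The helper names `ngrams` and `count_ngrams` are made up for this example, and the sketch does not reproduce what AntConc does internally.

```python
from collections import Counter


def ngrams(tokens, n):
    # All runs of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokens, n):
    return Counter(ngrams(tokens, n))


tokens = "we are fifteen years into this new century".split()
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)

print(bigrams.most_common(3))
```

On a real corpus the same counters would be built over many documents, and the most frequent bigrams are a simple starting point for spotting collocations.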

Linguistic Resources
Linguistic Inquiry and Word Count (LIWC). Dictionaries for: affective words (pos emotions, neg emotions); perceptual processes (see, hear, feel); biological processes (health, sex); work; leisure; death; religion; family & friends.
General Inquirer. Dictionaries for: pleasure; pain; arousal; virtue; vice; economics; legal; military; political; etc.
BUT, dictionaries aren’t adapted to the domain, to slang, or to informal language.
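Dictionary-based counting can be sketched as follows. The word lists in `CATEGORIES` are invented for illustration and are not the actual LIWC or General Inquirer dictionaries.

```python
from collections import Counter

# Hypothetical mini-dictionaries, NOT the real LIWC / General Inquirer lists.
CATEGORIES = {
    "pos_emotion": {"happy", "love", "great"},
    "neg_emotion": {"sad", "hate", "awful"},
    "family": {"mother", "father", "sister", "brother"},
}


def category_counts(tokens):
    # Count how many tokens fall into each dictionary category.
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORIES.items():
            if token in words:
                counts[category] += 1
    return counts


print(category_counts("i love my mother but i hate mondays".split()))
# Slang is missed: "sick" as praise matches no category here.
print(category_counts("this movie is sick".split()))
```

The second call returns no matches, which is the limitation noted on the slide: a fixed dictionary does not know that "sick" can be praise in informal language.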

Advanced Analysis: Part of Speech Tagging
Count interesting things like superlatives, comparatives, prepositions, pronouns.
Identify how a word is used, e.g. is “combat” a noun or a verb?
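Once text has been tagged, the counting itself is simple. The sketch below assumes Penn Treebank tags and uses a small hand-tagged sample in place of a real tagger's output; `TAG_GROUPS`, `pos_profile`, and `tags_for_word` are names made up for this example.

```python
from collections import Counter

# Penn Treebank tags grouped into the categories named on the slide.
# Note: "IN" also covers subordinating conjunctions in that tag set.
TAG_GROUPS = {
    "comparative": {"JJR", "RBR"},
    "superlative": {"JJS", "RBS"},
    "preposition": {"IN"},
    "pronoun": {"PRP", "PRP$"},
}

# Hand-tagged sample (an assumption standing in for a tagger's output).
tagged = [
    ("they", "PRP"), ("combat", "VBP"), ("the", "DT"), ("worst", "JJS"),
    ("disease", "NN"), ("troops", "NNS"), ("in", "IN"), ("combat", "NN"),
]


def pos_profile(tagged_tokens):
    # Count tokens per tag group.
    counts = Counter()
    for _, tag in tagged_tokens:
        for group, tags in TAG_GROUPS.items():
            if tag in tags:
                counts[group] += 1
    return counts


def tags_for_word(tagged_tokens, word):
    # How often a given word appears under each tag.
    return Counter(tag for token, tag in tagged_tokens if token == word)


print(pos_profile(tagged))
print(tags_for_word(tagged, "combat"))
```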

Advanced Analysis: Named Entity Extraction
Identify people, places, organizations as such.
Tough because of ambiguity in text: “Athens”, GA or Greece?
New names are always coming into existence, so dictionary lookup doesn’t extend well.
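Both difficulties show up even in a toy dictionary lookup. The `GAZETTEER` below is a hypothetical two-entry place list, and the sketch is not how a real named entity extractor works; it only illustrates ambiguity and missing names.

```python
# Hypothetical, tiny gazetteer mapping a surface form to candidate places.
GAZETTEER = {
    "athens": ["Athens, Georgia, USA", "Athens, Greece"],
    "maryland": ["Maryland, USA"],
}


def lookup_places(tokens):
    # Return every token that matches the gazetteer, with its candidates.
    matches = {}
    for token in tokens:
        candidates = GAZETTEER.get(token.strip(".,").lower())
        if candidates:
            matches[token] = candidates
    return matches


found = lookup_places("The team flew from Maryland to Athens via Newtownville.".split())
for name, candidates in found.items():
    status = "ambiguous" if len(candidates) > 1 else "resolved"
    print(name, status, candidates)
```

"Athens" comes back with two candidates and the invented place name "Newtownville." is not found at all, so a lookup alone needs context to disambiguate and constant updating to keep up with new names.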

Alchemy API

Putting it Together: Let’s look at SageMathCloud and some Python: a300-30a07b99db8f/ To get the IPython Notebook:

Questions?
Computational Journalism Lab, College of Journalism, University of Maryland
Contact: Nick Diakopoulos. Web:
We are hiring fellows to work on computational journalism projects. Please find me to discuss more.