Introduction to Textual Analysis

Introduction to Textual Analysis Mikal Eckstrom and Gabi Kirilloff Digital Humanities Bootcamp 2016

What we are covering: what textual analysis is; how it strengthens the humanities; its application in the classroom and to your research; terminology; various online tools: http://textalyser.net/ http://docs.voyant-tools.org/tools/links/ https://books.google.com/ngrams/

Text Analysis: “It’s not that we no longer read books, but we now have new ways of studying them in their natural habitat.” (Matthew Jockers, 2013) “But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” (Noam Chomsky, 1969) Mikal

What is text analysis? Studying one or more texts with computational methods in order to draw new meaning from an existing written work or set of works. Mikal

Text as Science: We often have a hypothesis, even as close readers. We have conclusions; even our own worst paper has conclusions. Now, with text analysis or data mining, we, like scientists, have data. Like scientists, digital humanists also seek to discover new evidence and meaning in texts, whatever the scale of the corpus. Mikal

Terminology: Sentence: a unit of written language. Utterance: a unit of spoken language. Word form: the inflected form as it actually appears in the corpus. Lemma: an abstract form shared by word forms that have the same stem, part of speech, and word sense; it stands for the whole class of words with that stem. Types: the number of distinct words in a corpus (vocabulary size). Tokens: the total number of words. Mikal
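The difference between types and tokens is easy to check on a small sample. The following is a minimal sketch, not a full tokenizer: the sample sentence and the `tokenize` helper are made up for illustration, and it counts word forms rather than lemmas, so "sat" and "sit" would be two separate types.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


# Hypothetical sample text, chosen only to show repeated words.
sample = "The cat sat on the mat. The dog sat too."

tokens = tokenize(sample)          # every running word, repeats included
type_counts = Counter(tokens)      # one entry per distinct word form

token_count = len(tokens)          # tokens: total number of words
type_count = len(type_counts)      # types: number of distinct words

print("tokens:", token_count)
print("types:", type_count)
print("most frequent:", type_counts.most_common(2))
```

Counting lemmas instead of word forms would need a lemmatizer, which this sketch deliberately leaves out.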

What Text Analysis Enables: What you can do: categorize and cluster documents; compare and contrast vocabulary; examine syntactical relationships; recognize entities. This can allow you to: examine differences based on metadata; examine voice and style; produce geographic maps and other helpful visualizations. Gabi

Clustering and Examining Similarity: context words; high-frequency words; punctuation; sentence length. Gabi
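One way to picture how features like these feed a similarity measure is to turn each text into a short list of numbers and compare the lists. The sketch below is an illustration under stated assumptions, not the method of any tool named in this talk: the `FUNCTION_WORDS` list, the feature choices, and the use of cosine similarity are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumed, very short list of high-frequency words; real studies use far longer lists.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in"]


def style_features(text):
    """Return a feature list: function-word rates, punctuation rate, mean sentence length."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    n_words = max(len(words), 1)
    features = [counts[w] / n_words for w in FUNCTION_WORDS]
    # Commas, semicolons, and colons per word.
    features.append(sum(text.count(p) for p in ",;:") / n_words)
    # Mean sentence length in words.
    features.append(len(words) / max(len(sentences), 1))
    return features


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature lists (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


text_a = "The cat sat. The dog ran, and the bird flew."
text_b = "In the morning, a fog of doubt came to the town."
print(cosine_similarity(style_features(text_a), style_features(text_b)))
```

Because mean sentence length is a much larger number than the word rates, it dominates the comparison here; in practice the features would usually be rescaled to comparable ranges before clustering.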

Exploring Syntactical Relationships “He quickly ran up the old steps to the castle.” Gabi

Word Clouding | Text Analysis [Word clouds for three groups: American Indian Male, Jewish Male, Jewish Female] Mikal

Data Collection: Getting good data is trickier than you think. You need: a large corpus; metadata; clean text. Where to find data: HathiTrust, Internet Archive, Project Gutenberg, Women Writers Project. Gabi and Mikal

Martha Ballard’s Diary http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/ Mikal

Textalyser http://textalyser.net/ Mikal

Voyant http://voyant-tools.org/ Gabi

WordSeer http://wordseer.berkeley.edu/ Gabi

Stanford Tools NER: http://nlp.stanford.edu:8080/ner/ DParse: http://nlp.stanford.edu:8080/parser/ Gabi

N-Grams https://books.google.com/ngrams/ Mikal

Human Word Prediction: Clearly, at least some of us have the ability to predict future words in an utterance. How? Domain knowledge: red house vs. red hat. Syntactic knowledge: the…<adj|noun>. Lexical knowledge: baked <steak vs. cake>. Mikal

Useful Applications for N-Grams: Why do we want to predict a word, given some preceding words? To rank the likelihood of sequences containing various alternative hypotheses, e.g. for automatic speech recognition (ASR): “Theatre owners say popcorn/unicorn sales have doubled...” To assess the likelihood or goodness of a sentence, e.g. for text generation or machine translation: “The doctor recommended a cat scan.” / “El doctor recomendó una exploración del gato.” (literally, “an examination of the cat”). Mikal
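The popcorn/unicorn choice can be sketched with bigram counts. This is a toy illustration, not how a real speech recognizer is built: the three-sentence corpus below is invented for the example, and the score used is the plain maximum-likelihood estimate P(next word | candidate) = count(candidate, next) / count(candidate).

```python
from collections import Counter

# Invented toy corpus; a real model would be trained on millions of words.
corpus = (
    "theatre owners say popcorn sales have doubled . "
    "popcorn sales rose again . "
    "the unicorn is a mythical animal ."
).split()

unigrams = Counter(corpus)                    # how often each word occurs
bigrams = Counter(zip(corpus, corpus[1:]))    # how often each adjacent pair occurs


def conditional_probability(word, previous):
    """Maximum-likelihood estimate of P(word | previous) from the toy counts."""
    if unigrams[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / unigrams[previous]


# Which candidate is more plausibly followed by "sales"?
scores = {c: conditional_probability("sales", c) for c in ("popcorn", "unicorn")}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

A fuller model would also score how well each candidate follows "say" and would multiply the probabilities along the whole sentence; this sketch only compares one bigram per candidate.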

Coding (and why you might want to consider it): Custom questions may call for custom methods. Understanding the options available to you can make it easier to envision new research questions. R: a statistical language; works with plain text and XML; very easy to create complex visualizations. Python. Gabi and Mikal

Limitations and Constraints: “flattening” data and obscuring information; corpus selection bias; imperfect datasets. Gabi

Summary: Text analysis can help us derive new meaning from texts and visually understand the relationships between various texts, tokens, and data sets. N-gram probabilities can be used to estimate the likelihood of a word occurring given the preceding N-1 words, and of a sentence occurring at all. Smoothing techniques deal with the problem of words unseen in the corpus.
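The last two points can be made concrete with a small bigram model (N = 2, so the context is the one preceding word). The sketch below uses add-one (Laplace) smoothing, which is only one of several smoothing techniques and is chosen here for simplicity; the two training sentences and the `<s>` / `</s>` boundary markers are assumptions made for the example.

```python
import math
from collections import Counter

# Invented training sentences; <s> and </s> mark sentence start and end.
training_sentences = [
    ["the", "doctor", "recommended", "a", "scan"],
    ["the", "doctor", "saw", "a", "cat"],
]

unigrams = Counter()
bigrams = Counter()
for sentence in training_sentences:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocabulary_size = len(unigrams)


def smoothed_probability(word, previous):
    """Add-one estimate of P(word | previous); never zero, even for unseen pairs."""
    return (bigrams[(previous, word)] + 1) / (unigrams[previous] + vocabulary_size)


def sentence_log_probability(words):
    """Sum of log bigram probabilities over the sentence, boundaries included."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(
        math.log(smoothed_probability(word, previous))
        for previous, word in zip(tokens, tokens[1:])
    )


seen = sentence_log_probability(["the", "doctor", "saw", "a", "cat"])
scrambled = sentence_log_probability(["cat", "a", "saw", "doctor", "the"])
print(seen, scrambled)
```

Without the added one, any sentence containing a single unseen word pair would get probability zero; with it, the scrambled sentence is merely far less likely than the one seen in training. Log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long sentences.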

Resources: Stanford Lit Lab Pamphlets: http://litlab.stanford.edu/LiteraryLabPamphlet4.pdf Ted Underwood: http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/ Lincoln Mullen: http://lincolnmullen.com/projects/dh-r/

Example Exercise Split into groups of 3 or 4 people and take 10 minutes to use Voyant to explore your text. Report to the group at least 1 interesting finding.