Introduction to Textual Analysis

Presentation transcript:

Introduction to Textual Analysis Mikal Eckstrom and Gabi Kirilloff Digital Humanities Bootcamp 2016

What we are covering: What is textual analysis; How it strengthens the humanities; Its application in the classroom and to your research; Terminology; Various online methods: http://textalyser.net/ http://docs.voyant-tools.org/tools/links/ https://books.google.com/ngrams/

Text Analysis: “It’s not that we no longer read books, but we now have new ways of studying them in their natural habitat.” (Matthew Jockers, 2013) “But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” (Noam Chomsky, 1969) Mikal

What is text analysis? Analyzing one or more texts with computational methods in order to construct new meaning from an already existing written work or set of works. Mikal

Text as Science: We often have a hypothesis, even as close readers. We have conclusions; even our own worst paper has conclusions. Now, with text analysis or data mining, we, like scientists, have data. Like scientists, digital humanists also seek to discover new evidence and meaning from texts, whatever the scale of the corpus. Mikal

Terminology: Sentence: unit of written language. Utterance: unit of spoken language. Word form: the inflected form as it actually appears in the corpus. Lemma: an abstract form shared by word forms that have the same stem, part of speech, and word sense; it stands for the class of words with the same stem. Types: the number of distinct words in a corpus (vocabulary size). Tokens: the total number of running words in a corpus. Mikal
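The type/token distinction above can be made concrete with a few lines of Python. This is a minimal sketch, not part of the original slides: the `tokenize` rule (lowercase, letters and apostrophes only) and the sample sentence are assumptions chosen for illustration, and real projects usually need a more careful tokenizer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes (a crude, assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_counts(text):
    """Return (number of types, number of tokens, per-word counts) for a text."""
    tokens = tokenize(text)
    counts = Counter(tokens)          # one entry per distinct word form
    return len(counts), len(tokens), counts

# Invented sample text, for illustration only.
sample = "The cat sat on the mat. The dog sat too."
n_types, n_tokens, counts = type_token_counts(sample)
print(n_types, n_tokens)              # 7 types, 10 tokens
print(n_types / n_tokens)             # type-token ratio, a rough measure of lexical variety
```

Note that this counts word forms, not lemmas: "sat" and "sit" would be separate types unless the text is lemmatized first.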

What Text Analysis Enables. What you can do: categorize and cluster documents; compare and contrast vocabulary; examine syntactical relationships; recognize named entities. This can allow you to: examine differences based on metadata; examine voice and style; create geographic maps and other helpful visualizations. Gabi

Clustering and Examining Similarity: Context words; High-frequency words; Punctuation; Sentence length. Gabi
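As a rough illustration of how features like these can be turned into a similarity score, the sketch below builds a small feature vector per text (relative frequency of a few high-frequency words, punctuation rate, average sentence length) and compares two vectors with cosine similarity. The word list, the punctuation set, and the function names are assumptions made for this example, not tools from the presentation.

```python
import math
import re
from collections import Counter

# Assumed short list of high-frequency words; real studies use much longer lists.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in"]

def style_features(text):
    """Build a simple stylistic feature vector for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    total = len(words) or 1
    features = [counts[w] / total for w in FUNCTION_WORDS]      # high-frequency word rates
    features.append(sum(text.count(p) for p in ",;:") / total)  # punctuation marks per word
    features.append(len(words) / (len(sentences) or 1))         # average sentence length
    return features

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Invented sample texts, for illustration only.
text_a = "The cat sat, and the dog ran. It was late."
text_b = "In the end, the ship sailed to the north; the crew slept."
print(cosine_similarity(style_features(text_a), style_features(text_b)))
```

Because average sentence length is on a much larger scale than the word rates, it dominates the score here; in practice the features would be standardized before clustering.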

Exploring Syntactical Relationships “He quickly ran up the old steps to the castle.” Gabi

Word Clouding | Text Analysis [Figure: word clouds for three groups: American Indian Male, Jewish Male, Jewish Female] Mikal

Data Collection: Getting good data is trickier than you think: large corpus, metadata, clean text. Where to find data: HathiTrust, Internet Archive, Project Gutenberg, Women Writers Project. Gabi and Mikal

Martha Ballard’s Diary http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/ Mikal

Textalyser http://textalyser.net/ Mikal

Voyant http://voyant-tools.org/ Gabi

WordSeer http://wordseer.berkeley.edu/ Gabi

Stanford Tools NER: http://nlp.stanford.edu:8080/ner/ DParse: http://nlp.stanford.edu:8080/parser/ Gabi

N-Grams https://books.google.com/ngrams/ Mikal

Human Word Prediction: Clearly, at least some of us have the ability to predict future words in an utterance. How? Domain knowledge: red house vs. red hat. Syntactic knowledge: the…<adj|noun>. Lexical knowledge: baked <steak vs. cake>. Mikal

Useful Applications for N-Grams: Why do we want to predict a word, given some preceding words? To rank the likelihood of sequences containing various alternative hypotheses, e.g. for automatic speech recognition (ASR): “Theatre owners say popcorn/unicorn sales have doubled...” To assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation: “The doctor recommended a cat scan.” / “El doctor recomendó una exploración del gato.” Mikal
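The popcorn/unicorn example can be sketched with bigram counts: given two candidate words a recognizer might have heard, prefer the one that more often precedes the next word in a corpus. The three-sentence corpus and the helper names below are invented for illustration; a real system would estimate counts from a large corpus and combine them with acoustic evidence.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "theatre owners say popcorn sales have doubled",
    "popcorn sales rose again this year",
    "the unicorn is a mythical animal",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))   # adjacent word pairs

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def rank_candidates(candidates, next_word):
    """Order candidate words by how likely next_word is to follow each of them."""
    return sorted(candidates, key=lambda w: bigram_prob(w, next_word), reverse=True)

print(rank_candidates(["unicorn", "popcorn"], "sales"))   # ['popcorn', 'unicorn']
```

With unsmoothed counts like these, any pair never seen in the corpus gets probability zero, which is the problem smoothing techniques address.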

Coding (and why you might want to consider it): Custom questions may call for custom methods. Understanding the options available to you can make it easier to envision new research questions. R: statistical language; works with plain text and XML; very easy to create complex visualizations. Python. Gabi and Mikal

Limitations and Constraints: “Flattening” data and obscuring information; corpus selection bias; imperfect datasets. Gabi

Summary: Text analysis can allow us to derive new meaning from text and to visually understand the relationships between various texts, tokens, and data sets. N-gram probabilities can be used to estimate the likelihood of a word occurring in a context (the preceding N-1 words) and of a sentence occurring at all. Smoothing techniques deal with the problem of words unseen in the corpus.
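To show how smoothing keeps an unseen word from driving a sentence's probability to zero, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing, which is only one of several smoothing techniques. The two-sentence corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions for this example, and treating unseen words as if they were part of the vocabulary is a simplification.

```python
import math
from collections import Counter

# Toy training corpus, invented for illustration only.
corpus = ["the doctor recommended a scan", "the doctor recommended rest"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]   # mark sentence boundaries
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab_size = len(unigrams)

def smoothed_prob(prev, word):
    """Add-one estimate of P(word | prev): every bigram count is raised by 1."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_logprob(sentence):
    """Log-probability of a sentence as the sum of its smoothed bigram log-probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(smoothed_prob(p, w)) for p, w in zip(words, words[1:]))

seen = sentence_logprob("the doctor recommended rest")
unseen = sentence_logprob("the doctor recommended a cat scan")   # "cat" never occurs in the corpus
print(seen, unseen)   # both finite; the sentence with the unseen word scores lower
```

Log-probabilities are summed rather than multiplying raw probabilities so that long sentences do not underflow to zero.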

Resources: Stanford Lit Lab Pamphlets: http://litlab.stanford.edu/LiteraryLabPamphlet4.pdf Ted Underwood: http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/ Lincoln Mullen: http://lincolnmullen.com/projects/dh-r/

Example Exercise Split into groups of 3 or 4 people and take 10 minutes to use Voyant to explore your text. Report to the group at least 1 interesting finding.