Introduction to Text Analysis

Learning Objectives Upon completing this assignment, students will:
- Become familiar with basic concepts of natural language processing (NLP) and text analysis
- Learn some of the commonly used techniques for preparing digital text for analysis

Natural Language Processing Natural Language Processing (NLP) is a broad field encompassing the automated interpretation of human language. Within NLP, text analysis is the process of extracting meaning from digital text. This original text data could be in the form of documents, webpages, news articles, email, social media posts, etc.

Why Use Text Analysis? Text analysis is powerful because it can help extract useful information from large quantities of text data. Text analysis minimizes the human effort required to consume digital text, and results in quantitative knowledge gathered from the digital text.

Why Use Text Analysis? (cont.) For example, if you were the director of customer satisfaction for a hotel chain, online customer feedback would be important to you, but you probably could not efficiently read the thousands of reviews hotel guests leave on various websites.

Why Use Text Analysis? (cont.) Using one of the tools of text analysis, sentiment analysis, you can define lists of “positive” and “negative” words/phrases and then automatically count the appearances of these words/phrases in the reviews.
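A minimal sketch of this word-list approach follows; the word lists and the sample review are invented for illustration, not a real sentiment lexicon.

```python
# Toy sentiment scoring: count matches against hand-built word lists.
# POSITIVE and NEGATIVE are hypothetical examples, not a published lexicon.
POSITIVE = {"great", "clean", "friendly", "comfortable"}
NEGATIVE = {"dirty", "rude", "noisy", "broken"}

def sentiment_counts(review):
    # Split on whitespace and lowercase so word-list lookups match.
    words = review.lower().split()
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos, neg

print(sentiment_counts("Great location and friendly staff but noisy rooms"))
# (2, 1)
```

A real system would also tokenize punctuation away and handle negation, as discussed later in these slides.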

Why Use Text Analysis? (cont.) You can monitor the word frequency data from the reviews over time to see whether positive or negative customer sentiment changes significantly.

Preparing the Text Data Before using more advanced tools such as sentiment analysis, it is useful to learn about the steps required to prepare the data for analysis. Even if you are using a vendor’s off-the-shelf product for your text analysis needs, understanding the underlying techniques lets you consider the assumptions that may have been made during the preparation of the data.

Segmenting Text into Words The exact steps for cleaning and preparing raw text data will depend upon (1) the source of the text data, (2) the purpose of the analysis, and (3) your programming tools. Processing large amounts of raw text for analysis always entails chopping it into smaller pieces. Text is a linear sequence of characters/symbols and whitespace. A “token” is a series of characters treated as a meaningful unit; typically these are words.

Tokenization Tokenization transforms the text into a series of word tokens. At this stage, certain characters, such as some punctuation, are typically also eliminated. For example, these two sentences would be turned into the following list of tokens (note that the tokens have also been lowercased): ‘tokenization’, ‘transforms’, ‘the’, ‘text’, ‘into’, ‘a’, ‘series’, ‘of’, ‘word’, ‘tokens’, ‘at’, ‘this’, ‘stage’, ‘certain’, ‘characters’, ‘such’, ‘as’, ‘some’, ‘punctuation’, ‘are’, ‘typically’, ‘also’, ‘eliminated’ The list of tokens becomes input for further processing.
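One simple way to sketch this step is with a regular expression that lowercases the text and keeps only alphabetic runs as tokens (real tokenizers handle contractions, numbers, and hyphenation more carefully):

```python
import re

text = ("Tokenization transforms the text into a series of word tokens. "
        "At this stage, certain characters -- such as some punctuation -- "
        "are typically also eliminated.")

# Lowercase the text, then treat each run of letters as one token;
# punctuation and whitespace are discarded in the process.
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)
```

Running this reproduces the 23-token list shown above, from ‘tokenization’ through ‘eliminated’.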

Part of Speech (POS) Tagging Tagging the tokens as nouns, verbs, adjectives, etc., is important for the accuracy of further steps of language processing. POS-tagging takes a sequence of words and attaches a part of speech to each word using both the word’s definition and relationship with adjacent words.
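Real POS taggers (for example, those in NLTK or spaCy) use statistical models and the context of adjacent words; the toy lookup tagger below only illustrates the basic idea of attaching a tag to each token, and its lexicon is invented for this sketch.

```python
# Toy dictionary-based POS tagger (illustration only). A real tagger also
# uses context to disambiguate words like "stay" (noun vs. verb).
LEXICON = {
    "the": "DET", "a": "DET",
    "guest": "NOUN", "guests": "NOUN", "room": "NOUN",
    "enjoyed": "VERB",
    "clean": "ADJ", "friendly": "ADJ",
}

def tag(tokens):
    # Look each token up; unknown words fall back to NOUN, a common default.
    return [(t, LEXICON.get(t, "NOUN")) for t in tokens]

print(tag(["the", "guests", "enjoyed", "the", "clean", "room"]))
```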

Out of Vocabulary (OOV) When words are not in the system’s dictionary, they are considered “out of vocabulary,” or OOV. Depending on the dictionary and rules your system is using, proper nouns, professional jargon, compound words, abbreviations, and slang may be OOV words. Keep in mind that your system may be designed to eliminate OOV words while “cleaning” the text, even though retaining these non-standard words may be important to your final analysis.
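Detecting OOV words amounts to checking tokens against the system's dictionary; the tiny vocabulary below is hypothetical.

```python
# Hypothetical, tiny system dictionary; real dictionaries have many
# thousands of entries.
VOCAB = {"the", "hotel", "was", "great"}

tokens = ["the", "hotel", "wifi", "was", "meh"]

# Any token absent from the dictionary is out of vocabulary.
oov = [t for t in tokens if t not in VOCAB]
print(oov)  # ['wifi', 'meh']
```

Note that here the jargon term "wifi" and the slang "meh" are flagged, even though both could matter to a hotel-review analysis.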

Lemmatization Lemmatization is the process of reducing words to their base form, or dictionary form, known as the lemma. The verb “to eat” may appear in the text as “eat,” “ate,” “eats,” or “eating.” The base form, “eat,” is the lemma. Generally, lemmatization means removing inflectional endings; however, reducing a word to its lemma may also mean substituting a synonym based on context.

Lemmatization (cont.) For example, review the following sequence of tokens before and after lemmatization: Before lemmatization: ‘a’, ‘background’, ‘in’, ‘statistics’, ‘is’, ‘expected’ After lemmatization: ‘a’, ‘background’, ‘in’, ‘statistic’, ‘be’, ‘expect’
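A minimal dictionary-based lemmatizer can reproduce the example above; real lemmatizers, such as NLTK's WordNetLemmatizer, use a full lexicon plus each word's part of speech, whereas the mapping here is hand-built for this sketch.

```python
# Hand-built lemma map (illustration only); tokens not in the map are
# assumed to already be in base form and pass through unchanged.
LEMMAS = {
    "is": "be", "was": "be", "are": "be",
    "expected": "expect",
    "ate": "eat", "eats": "eat", "eating": "eat",
    "statistics": "statistic",
}

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

before = ["a", "background", "in", "statistics", "is", "expected"]
print(lemmatize(before))
# ['a', 'background', 'in', 'statistic', 'be', 'expect']
```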

Removing Stop Words The next step is typically to filter out the most commonly used words in the English language. These are words used to construct meaningful sentences but are not very meaningful in themselves. These words are called stop words. There is no universal list; a typical stop word list has approximately 125-150 words.
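Filtering against a stop word list is a simple set-membership test; the list below is a small excerpt for illustration, not a full 125-150 word list.

```python
# Excerpt of a stop word list (real lists have roughly 125-150 entries).
STOPWORDS = {"the", "a", "an", "in", "of", "to", "is", "and", "at"}

tokens = ["a", "background", "in", "statistics", "is", "expected"]

# Keep only tokens that are not stop words.
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)  # ['background', 'statistics', 'expected']
```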

Counting Words Taking the filtered list and running a word frequency process determines the number of times each token occurs in the digital text being analyzed. Word frequency, a measure of the relative weight given to a word in the corpus analyzed, is typically a precursor to further analysis, such as sentiment analysis, readability analysis, and topic modeling.
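In Python, `collections.Counter` is one straightforward way to compute a word frequency distribution over a filtered token list (the tokens below are invented for illustration):

```python
from collections import Counter

# Hypothetical filtered tokens from a batch of hotel reviews.
tokens = ["room", "clean", "staff", "friendly", "room", "quiet", "room"]

# Counter maps each token to the number of times it occurs.
freq = Counter(tokens)
print(freq.most_common(1))  # [('room', 3)]
```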

Adjacent Words Collocation refers to words commonly appearing near each other. An example is a bigram, which is two adjacent tokens in a sequence of tokens. Returning to the hotel review example, guests often indicate whether they would “stay again” or “not stay” (again). Counting the occurrences of those two bigrams could be relevant as indications of positive or negative sentiment. unigram = 1 token, bigram = 2 adjacent tokens, trigram = 3 adjacent tokens, n-gram = n adjacent tokens
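Extracting n-grams is a sliding-window operation over the token list; a small sketch using an invented hotel-review token sequence:

```python
# Hypothetical token sequence from a hotel review.
tokens = ["would", "not", "stay", "at", "this", "hotel", "again"]

def ngrams(tokens, n):
    # Slide a window of length n across the token list;
    # a list of m tokens yields m - n + 1 n-grams.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(tokens, 2)
print(("not", "stay") in bigrams)  # True
```

The same function returns unigrams with `n=1`, trigrams with `n=3`, and so on.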

Notes on Text Preparation It is worth being aware of the text preparation steps in order to make sure nothing important to your final analysis has inadvertently been “cleaned” away. For example, we noted in the hotel reviews that the bigrams “would stay” and (would) “not stay” could be important indicators of sentiment; however, the word “not” is usually included in stop word lists. Expertise in the text being analyzed is key to achieving better results.

Link to Assignment Click the Link to Platform button or use the following link to directly access the text processing tool: https://wrds-classroom.wharton.upenn.edu/nlp/ This interactive tool uses the entire text of Jane Austen’s Pride and Prejudice as the source of the text data.

Assignment In the tool, the frequency distribution has been visualized as a word cloud. In word clouds, the size of each word indicates its frequency. Select different options in the interactive tool to see what happens to the word cloud as you choose among different text processing methods. A data table beneath the word cloud contains the complete word lists.

Conclusion Text analysis uses automated processes to extract meaning from digital text. Tokenization, part-of-speech tagging, lemmatization, and the elimination of stop words are all techniques used to prepare raw digital text data for analysis. Counting the frequency of specific words in the text can provide helpful information about the text, and is a precursor to further analysis.