Corpus processing tools


Corpus processing tools Punjaporn Pojanapunya February 26, 2018 punjaporn.poj@mail.kmutt.ac.th

Corpora & Corpus tools
BNC, COCA, MICASE, DIY corpus
AntConc, KeyBNC, WordSmith, Wmatrix
TagAnt, CLAWS, USAS
Google Books Ngram Viewer
AntWordProfiler, Range
Word frequency lists, concordances, collocations, clusters, keywords

Part I: Corpora
BNC, COCA, MICASE, DIY

A corpus
“A corpus can be defined as a collection of texts assumed to be representative of a given language put together so that it can be used for linguistic analysis. Usually the assumption is that the language stored in a corpus is naturally-occurring, that is gathered according to explicit design criteria, with a specific purpose in mind, and with a claim to represent natural chunks of language selected according to specific typology.” Tognini-Bonelli (2001:2)
“A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety.” McEnery, Xiao and Tono (2006:5)

Some well-known existing corpora
The British National Corpus (BNC): http://corpus.byu.edu/bnc/
Corpus of Contemporary American English (COCA): http://corpus.byu.edu/coca/
The International Corpus of English (ICE): http://www.ucl.ac.uk/english-usage/projects/ice.htm
Michigan Corpus of Academic Spoken English (MICASE): http://quod.lib.umich.edu/m/micase/

https://corpus.byu.edu/

Part II: Corpus analysis tools
BNC, COCA, MICASE, DIY corpus
AntConc, KeyBNC, WordSmith, Wmatrix
TagAnt, CLAWS, Wmatrix semantic tagger
Google Books Ngram Viewer
AntWordProfiler, Range (AWL, GSLs)

Corpus analysis tools
Existing corpora: online (corpus + tools); CD, DVD (corpus)
DIY corpora
Photo: https://www.corpusdata.org/intro.asp

Part II: Corpus analysis tools
COCA

Corpus of Contemporary American English (COCA)
560 million words of American English
20 million words per year, 1990-2017
Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts
Useful for studying variation across genres and over time
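The size figures on this slide are consistent with one another, which a few lines of arithmetic can confirm. The sketch below uses only the numbers stated above; the variable names are illustrative.

```python
# Figures taken from the slide: 20 million words per year, 1990-2017, five genres.
WORDS_PER_YEAR = 20_000_000
FIRST_YEAR, LAST_YEAR = 1990, 2017
GENRES = ["spoken", "fiction", "popular magazines", "newspapers", "academic"]

years = LAST_YEAR - FIRST_YEAR + 1                       # both end years are included
total_words = WORDS_PER_YEAR * years                     # size of the whole corpus
words_per_genre_per_year = WORDS_PER_YEAR // len(GENRES) # equal split across genres

print(years, total_words, words_per_genre_per_year)
```

Because each genre contributes the same number of words per year, raw frequencies can be compared across genres and years without first normalizing them.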

Searching COCA https://corpus.byu.edu/coca/

Searching COCA
Allows users to compare searches by:
Genre: spoken, fiction, popular magazines, newspapers, and academic, or sub-genres, e.g. movie scripts, sports magazines, newspaper editorials, or scientific journals
Time: 1990 to 2017

chart

collocates

Searching COCA
Allows users to:
carry out semantically-based queries of the corpus
compare and contrast the collocates of two related words, e.g. occur/happen, learn/study, men/women
determine the difference in meaning or use between words
find the frequency and distribution of synonyms and compare different genres

compare

Part II: Corpus analysis tools
AntConc, KeyBNC, WordSmith, Wmatrix
Word frequency lists, concordances, collocations, clusters, keywords

Analyzing a DIY corpus
A corpus of a specific genre of text
A corpus software tool: a tool that allows users to search for target language features in a corpus
AntConc, KeyBNC
Simple Concordance Program
WordSmith Tools
Wmatrix
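To make "search for target language features" concrete, here is a minimal keyword-in-context (KWIC) sketch. It is not how AntConc or WordSmith are implemented; the function name `kwic` and the `width` parameter are hypothetical, and the sample sentence is the one used on the tagging slides.

```python
import re

def kwic(text, node, width=30):
    """Return one keyword-in-context line per whole-word match of `node`."""
    lines = []
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]   # context before the node
        right = text[m.end():m.end() + width]               # context after the node
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = ("This article presents a case study of phonological types of internal "
          "evaluation in the personal oral narrative of one non-native speaker of English.")

for line in kwic(sample, "of", width=25):
    print(line)
```

Right-aligning the left context keeps the node word in one column, which is what makes recurring patterns on either side easy to spot.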

Keywords: "words which are significantly more frequent in one corpus than another" (Hunston, 2002, p. 68).
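"Significantly more frequent" needs a test statistic. The slides do not say which one the tools use; the sketch below assumes log-likelihood, a statistic commonly used for keyness, with 3.84 as the critical value for p < 0.05. The function names and the toy token lists are illustrative, not real corpus data.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """a, b: frequency of a word in the target / reference corpus.
    c, d: total tokens in the target / reference corpus."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target_tokens, reference_tokens, threshold=3.84):
    """Words relatively more frequent in the target, with LL at or above the threshold."""
    tgt, ref = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    found = []
    for word, a in tgt.items():
        b = ref[word]
        if a / c > b / d:                       # positive keywords only
            ll = log_likelihood(a, b, c, d)
            if ll >= threshold:
                found.append((word, round(ll, 2)))
    return sorted(found, key=lambda pair: -pair[1])

# Toy data (hypothetical): a tiny "target" and "reference" corpus.
target = ["corpus"] * 8 + ["the"] * 10 + ["learner"] * 2
reference = ["the"] * 10 + ["cell"] * 10
print(keywords(target, reference))
```

In this toy example "the" is equally frequent in both lists and so is not a keyword, while "learner" is too rare to reach the threshold.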

AL vs. Micro
Words which are significantly more frequent in Applied Linguistics than in Microbiology

Key-BNC
A target corpus vs. BNC
http://crs2.kmutt.ac.th/Key-BNC/

AL vs. BNC
Words which are significantly more frequent in Applied Linguistics than in general English

Part II: Corpus analysis tools
TagAnt, CLAWS, USAS

Raw vs. annotated corpora
POS taggers: TagAnt, CLAWS
Semantic tagger: USAS

Penn Treebank POS Tags
Alphabetical list of part-of-speech tags used in the Penn Treebank Project (used by, e.g., TagAnt)

Horizontal
This_DT article_NN presents_VVZ a_DT case_NN study_NN of_IN phonological_JJ types_NNS of_IN internal_JJ evaluation_NN in_IN the_DT personal_JJ oral_JJ narrative_NN of_IN one_CD non-native_JJ speaker_NN of_IN English_NP ._SENT

Vertical (word, tag, lemma)
This          DT    this
article       NN    article
presents      VVZ   present
a             DT    a
case          NN    case
study         NN    study
of            IN    of
phonological  JJ    phonological
types         NNS   type
of            IN    of
internal      JJ    internal
evaluation    NN    evaluation
in            IN    in
the           DT    the
personal      JJ    personal
oral          JJ    oral
narrative     NN    narrative
of            IN    of
one           CD    one
non-native    JJ    non-native
speaker       NN    speaker
of            IN    of
English       NP    English
.             SENT  .
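The two layouts carry the same word and tag information, so one can be derived from the other. Below is a minimal sketch that turns horizontal `word_TAG` output into vertical word/tag rows; the function name is hypothetical, and the lemma column is left out because it is not present in the horizontal format.

```python
def horizontal_to_vertical(tagged):
    """Split each word_TAG token on its last underscore and return (word, tag) pairs."""
    rows = []
    for token in tagged.split():
        word, _, tag = token.rpartition("_")   # last underscore, so words may contain "_"
        rows.append((word, tag))
    return rows

# First tokens and final token of the tagged sentence shown on the slide.
horizontal = "This_DT article_NN presents_VVZ a_DT case_NN study_NN ._SENT"

for word, tag in horizontal_to_vertical(horizontal):
    print(f"{word}\t{tag}")
```

The vertical layout is convenient for spreadsheets and for counting tags, while the horizontal layout keeps the running text readable in a concordancer.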

UCREL CLAWS7 Tagset

This_DT0 article_NN1 presents_VVZ a_AT0 case_NN1 study_NN1 of_PRF phonological_AJ0 types_NN2 of_PRF internal_AJ0 evaluation_NN1 in_PRP the_AT0 personal_AJ0 oral_AJ0 narrative_NN1 of_PRF one_CRD non-native_AJ0 speaker_NN1 of_PRF English_SENT ._PUN

Semantic tagset http://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf

This_M6 article_Q4.2 presents_A9- a_Z5 case_A4.1[i1.2.1 study_A4.1[i1.2.2 of_Z5 phonological_Z99 types_A4.1 of_Z5 internal_M6 evaluation_A5.1 in_Z5 the_Z5 personal_S5- oral_B1 narrative_Q2.1 of_Z5 one_N1 non-native_Q3/S2mf[i2.2.1 speaker_Q3/S2mf[i2.2.2 of_Z5 English_Z1mf ._PUNC

Part II: Corpus analysis tools
Google Books Ngram Viewer

Google Books Ngram Viewer

Part II: Corpus analysis tools
AntWordProfiler, Range (AWL, GSLs)

Part III: Analyzing corpora

Task 1: Exploring COCA
Search COCA by:
Words, e.g. difference, bounce
Phrases, e.g. root and branch, go + adjective
Lemmas, e.g. all forms of ‘interest’
Wildcards, e.g. un*ly, im*ment
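A wildcard query such as un*ly can be thought of as a pattern over a word list. The sketch below assumes that `*` stands for any run of zero or more word characters; how COCA evaluates wildcards internally is not stated in the slides, and the function names and word list are illustrative.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled, case-insensitive regex."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\w*".join(parts), flags=re.IGNORECASE)

def match_wildcard(pattern, words):
    """Return the words that match the wildcard pattern in full."""
    rx = wildcard_to_regex(pattern)
    return [w for w in words if rx.fullmatch(w)]

# Hypothetical word list standing in for a corpus vocabulary.
words = ["unlikely", "unfortunately", "under", "improvement",
         "impeachment", "important", "usually"]

print(match_wildcard("un*ly", words))
print(match_wildcard("im*ment", words))
```

Using `fullmatch` ensures the pattern covers the whole word, so "under" is not returned for un*ly even though it starts with "un".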

Task 1: Exploring COCA
Search:
draw + noun
would like + to + verb
See:
A list of all search results
A chart display
The frequency in the 5 registers
The frequency in subregisters

Task 1: Exploring COCA
Search for collocates:
Nouns near ‘discuss’
Adjectives near ‘speak’
Nouns after ‘look into’
Words starting with ‘com*’ near ‘task’
See how words and phrases vary across speech and different types of texts.
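"Near" in a collocate search means within a window of words around the node. The sketch below counts every word within a fixed span on either side of the node; the span value, function name, and sample sentence are assumptions for illustration. Restricting results to nouns or adjectives would additionally require a POS-tagged corpus, which this sketch does not use.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    counts = Counter()
    lowered = [t.lower() for t in tokens]
    node = node.lower()
    for i, tok in enumerate(lowered):
        if tok == node:
            left = lowered[max(0, i - span):i]
            right = lowered[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Hypothetical sample text, already split into tokens.
tokens = "we discuss the results and then discuss the limitations of the task".split()
print(collocates(tokens, "discuss", span=2).most_common(3))
```

A wider span finds looser associations, while a span of one to the right approximates searches such as nouns after ‘look into’.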

Task 1: Exploring COCA
Carry out semantically-based queries
Compare the search results of:
Nouns with occur/happen
Adjectives with men/women
Verbs with actor/actress

Task 2: Analyzing a specialized corpus
Corpus: A corpus of research article abstracts (.txt)
Corpus tool: AntConc (http://www.laurenceanthony.net/software/antconc/)
Use different features in AntConc and save output:
Word List
Concordance
Concordance plot
File View
Clusters
Collocates
Keywords
Question: What do you know about a corpus?
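Two of the features listed above, Word List and Clusters, amount to counting single words and contiguous word sequences. The following is a rough sketch of that idea, not AntConc's own implementation; the tokenizer, function names, and sample sentence are illustrative, and in practice the text would be read from the .txt files of the corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic words, allowing internal hyphens or apostrophes."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def word_list(tokens):
    """Frequency-ranked word list."""
    return Counter(tokens).most_common()

def clusters(tokens, n=2):
    """Frequency-ranked contiguous n-word sequences (n-grams)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common()

# Hypothetical sample standing in for an abstract.
text = "The results show that the results of the study support the results of earlier work."
tokens = tokenize(text)

print(word_list(tokens)[:3])
print(clusters(tokens, n=2)[:2])
print(clusters(tokens, n=3)[:1])
```

Saving these ranked lists to a file mirrors the "save output" step of the task and makes it easy to compare results across tools.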

Corpora & Corpus tools
COCA, DIY corpus
AntConc, KeyBNC
TagAnt, CLAWS, USAS
Google Books Ngram Viewer
AntWordProfiler
Word frequency lists, concordances, collocations, clusters, keywords

References:
Adolphs, S. (2006). Introducing electronic text analysis: A practical guide for language and literacy studies. New York: Routledge.
Bowker, L., & Pearson, J. (2002). Working with specialized language: A practical guide to using corpora. London/New York: Routledge.
Hunston, S. (2002). Corpora in applied linguistics. Ernst Klett Sprachen.
Scott, M., & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education. Amsterdam/Philadelphia: John Benjamins Publishing.
Tognini-Bonelli, E. (2001). Corpus linguistics at work (Vol. 6). John Benjamins Publishing.