Modern Approaches to Corpus Linguistics
Dominique LONGRÉE, LASLA – Université de Liège and FUSL (Bruxelles)


1. Automatic taggers as heuristic tools
2. Multilevel approaches: the “motives”
3. What do they have in common?

1. Automatic taggers as heuristic tools
- A LASLA research project: testing various automatic recognition programs, known as taggers.
- Biber, 1993, Illouz, 1999, etc.: the quality of the output can vary significantly
  - from one type of text to another
  - from one tagger to another.
- Questions:
  - Are the results better when a tagger trained on one author or on a given text is applied to another text by the same author, or within the same discourse?
  - What can we deduce from those results regarding
    - the tagger, or
    - the homogeneity of corpora?

1. Automatic taggers as heuristic tools
- The test-texts:
  - book 3 of The Gallic Wars by Caesar (BGall3, 3673 tokens)
  - The Conspiracy of Catilina by Sallust (SalCat., 10688 tokens)
  - book 3 of The History of Alexander the Great by Quintus Curtius (QC3, 7261 tokens)
  - The First Oration Against Catilina by Cicero (CicCat1, 3333 tokens)
  - poem 66 of Catullus (Catu66, 586 tokens)
- Varying the nature of the training and evaluation corpora, in order to identify and measure the factors of variation:
  - style of the work
  - style of the author
  - diachrony
  - literary genre
  - type of discourse
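
The protocol of varying the training and evaluation corpora can be sketched as a cross-evaluation matrix: train on each text, evaluate on every text, and compare the scores. The slides do not say which taggers were tested, so the sketch below assumes a deliberately simple most-frequent-tag baseline. The function names, the tag labels, and the two tiny corpora are invented for illustration and are not LASLA data.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_tokens):
    """Learn the most frequent tag for each word form (baseline tagger)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Unknown words fall back to the most frequent tag of the training text.
    default = Counter(tag for _, tag in tagged_tokens).most_common(1)[0][0]
    return lexicon, default


def accuracy(model, tagged_tokens):
    """Share of tokens whose predicted tag matches the reference tag."""
    lexicon, default = model
    correct = sum(
        1 for word, tag in tagged_tokens
        if lexicon.get(word.lower(), default) == tag
    )
    return correct / len(tagged_tokens)


def cross_evaluate(corpora):
    """Return {(training_text, evaluation_text): accuracy} for every pair."""
    models = {name: train_unigram_tagger(toks) for name, toks in corpora.items()}
    return {
        (train, test): accuracy(models[train], corpora[test])
        for train in corpora
        for test in corpora
    }


# Hypothetical toy corpora (hand-made, not real annotated texts).
toy_corpora = {
    "textA": [("Caesar", "N"), ("venit", "V"), ("et", "C"), ("vidit", "V")],
    "textB": [("Caesar", "N"), ("cum", "P"), ("legione", "N"), ("venit", "V")],
}

scores = cross_evaluate(toy_corpora)
for (train, test), score in sorted(scores.items()):
    print(f"trained on {train}, evaluated on {test}: {score:.2f}")
```

Read as a heuristic, a low off-diagonal score suggests that the two texts differ along one of the factors listed above, although a real study would need far larger samples than this toy data.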

1. Automatic taggers as heuristic tools
- In theoretical terms: taggers appear to have some value as heuristic instruments.
- For instance, they highlight
  - the homogeneity of the historical style over and above diachronic development
  - the gap between narration and discourse (speeches)
  - the gap between the styles of Caesar and Cicero
  - a smaller gap between Catullus and Cicero, or between Catullus and Quintus Curtius/Tacitus, than the gap between Catullus and Caesar, etc.

2. Multilevel approaches: the “motives”
- Some indicators intuitively catalogued in Latin narrative prose
  - sequences of verb tenses
  - lexical elements: repente, subito ‘suddenly’, ‘abruptly’
  - syntactical structures / ‘linking clichés’:
    - Quibus rebus cognitis ‘Those things being known’
    - Quod ubi animaduertit ‘When he had noticed that’
- Limits
  - no real analysis of them as indicators of text structure
  - no study of their interaction
  - little use of them for characterising text genre and style
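
As a rough illustration of treating such indicators as positioned events in a text rather than as a bare list, the sketch below locates single words and multi-word ‘linking clichés’ in a token sequence and returns them in textual order. The function name, the exact-match strategy (no lemmatisation, no u/v normalisation), and the example sentence are assumptions made for this sketch. The sentence is invented, not a quotation.

```python
def find_indicators(tokens, indicators):
    """Return (position, label) pairs for every indicator occurrence, in text order.

    `indicators` maps a label to a list of lowercase tokens; a single word
    is simply a one-token phrase.
    """
    lowered = [t.lower() for t in tokens]
    hits = []
    for label, phrase in indicators.items():
        n = len(phrase)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == phrase:
                hits.append((i, label))
    return sorted(hits)


# Indicators taken from the slide; matching is on exact surface forms only.
indicators = {
    "repente": ["repente"],
    "subito": ["subito"],
    "quibus rebus cognitis": ["quibus", "rebus", "cognitis"],
    "quod ubi animaduertit": ["quod", "ubi", "animaduertit"],
}

# Invented example sentence, used only to exercise the function.
tokens = "Quibus rebus cognitis Caesar subito castra mouit".split()
hits = find_indicators(tokens, indicators)
print(hits)
```

Keeping the positions makes it possible to study how indicators follow or accompany one another, which is the kind of interaction the slide lists as missing from earlier work.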

2. Multilevel approaches: the “motives”
- The Discourse Modes and Bases approach (Kroon, 2007, 2009; Adema, 2007, 2008)
  - a priori definition of typical features for each discourse mode
  - in order to evaluate text homogeneity
- The LASLA and BCL approach
  - to develop endogenous exploratory methods
  - to take text linearity into account
  - to specify functional convergences between several indicators
- Methods
  - calling upon mathematical models (neighborhoods, bursts)
  - combining
    - a small-scale qualitative approach
    - a large-scale quantitative analysis
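
The slides name bursts as one of the mathematical models but do not define them. A minimal sketch, assuming a burst is simply a run of occurrences whose successive gaps stay below a chosen threshold, could look like this. The function name, the `max_gap` and `min_size` parameters, and the sample positions are hypothetical, and the statistical models used by LASLA and BCL may well differ.

```python
def find_bursts(positions, max_gap, min_size=2):
    """Group sorted token positions into bursts.

    A burst is a run of at least `min_size` occurrences in which each
    occurrence is at most `max_gap` tokens after the previous one.
    Returns (first_position, last_position, occurrence_count) tuples.
    """
    bursts = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the current run and start a new one.
            if len(current) >= min_size:
                bursts.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_size:
        bursts.append((current[0], current[-1], len(current)))
    return bursts


# Hypothetical positions of one indicator (e.g. a verb tense) in a text.
positions = [3, 5, 6, 40, 90, 92]
bursts = find_bursts(positions, max_gap=5)
print(bursts)
```

Running the same function over several indicators and comparing where their bursts overlap is one simple way to look for the functional convergences mentioned above.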

3. What do these approaches have in common?
- They take texts and discourses into account in both their dimensions:
  - the multilevel nature of texts and of languages, from phonetics to pragmatics
  - the fact that texts and discourses
    - are organized according to linearity
    - can be considered as topological entities.