Genre in a Frequency Dictionary Adam Kilgarriff & Carole Tiberius.

Slides:



Advertisements
Similar presentations
Feed Corpus : An Ever Growing Up to Date Corpus Akshay Minocha, Siva Reddy, Adam Kilgarriff Lexical Computing Ltd.
Advertisements

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Huffman code and ID3 Prof. Sin-Min Lee Department of Computer Science.
Have you ever wondered why some people work harder than others? Well, after you finish this project, I want to know if you think hard work helps people.
Introduction: A discourse perspective on grammar
Constructing and Evaluating Web Corpora: ukWaC Adriano Ferraresi University of Bologna Aston University Postgraduate Conference.
Talking about your homework News story? –What made you choose…? One of your words? –What made you choose…? (Give your vocabulary books to another student.
Data-Driven South Asian Language Learning SALRC Pedagogy Workshop June 8, 2005 J. Scott Payne Penn State University
Welcome to National 5 French
1 Vocab Assessment & Corpora and Concordancing Major vocabulary assessment tools Major corpora and concordancers.
Feed Corpus : An Ever Growing Up to Date Corpus Akshay Minocha, Siva Reddy, Adam Kilgarriff Lexical Computing Ltd.
The Practical Part by: Mr. Amr Samir The Points To Be Discussed 1- Difficulties and solutions in writing reports. 2- Why teach writing ? 3- What.
Place Value Third and Fourth Grade. Third Grade Number and Operations Base Ten (Common Core) 1. Use place value understanding to round whole numbers to.
Memory Strategy – Using Mental Images
Parts of a Newspaper.
Simple Maths for Keywords Adam Kilgarriff Lexical Computing Ltd.
Designing Web Pages Getting to know HTML... What is HTML? Hyper Text Markup Language HTML is the major language of the Internet’s World Wide Web Web.
IKAN Assessment This is a knowledge test so the questions come then go quite quickly. The questions start easy then get harder. Try to answer as many questions.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
The Vocabulary Coverage in American Television Programs A Corpus-Based Study NA3C 0006 Christina 周惠娟 1.
1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
Translation Studies 8. Research methods in Translation Studies Krisztina Károly, Spring, 2006 Sources: Károly, 2002; Klaudy, 2003.
O S S L T March 27th EQAO ONTARIO SECONDARY SCHOOL LITERACY TEST.
Representatıvness, balance and samplıng ın a corpus Lınguistıcs.
Iowa Department of Education ::: 2006 ::: Principle 1 ::: PPT/Transparency :::R1-1 Principles Children need to interact with books Children need to retell.
A hybrid method for Mining Concepts from text CSCE 566 semester project.
Researching language with computers Paul Thompson.
1-3. Answers will vary. 1 and 2. Answers will vary. However, nutritionists recommend eating a balanced diet of foods from all four groups pictured.
ACADEMIC DISCOURSE B. Mitsikopoulou INTERPRETATION OF DATA: Analysing different types of graphs: Bar Graphs and Histograms Line Graphs Pie Charts Tables.
Why We Need Corpora and the Sketch Engine Adam Kilgarriff Lexical Computing Ltd, UK Universities of Leeds and Sussex.
Incident Threading for News Passages (CIKM 09) Speaker: Yi-lin,Hsu Advisor: Dr. Koh, Jia-ling. Date:2010/06/14.
The BNC Design Model Adam Kilgarriff, Sue Atkins, Michael Rundell The Lexicography MasterClass
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
1 Elements of a Short Story. Outline Short Story Definition Theme & Setting Characters and Point of View Characterization Plot.
UCREL: from LOB to REVERE Paul Rayson. November 1999CSEG awayday Paul Rayson2 A brief history of UCREL In ten minutes, I will present a brief history.
Exploring Text: Zipf’s Law and Heaps’ Law. (a) (b) (a) Distribution of sorted word frequencies (Zipf’s law) (b) Distribution of size of the vocabulary.
Data structure design Jordi Cortadella Department of Computer Science.
Sequence Models With slides by me, Joshua Goodman, Fei Xia.
Handling Data: AVERAGES Learning Objective: To be able to calculate the mode, median, range and Mean KEY WORDS: MEDIAN, MODE, RANGE, MEAN and AVERAGES.
The Sketch Engine as Infrastructure for Large Scale Text Collections for Humanities Research Adam Kilgarriff Lexical Computing Ltd. & Univ of Leeds, UK.
Literature can be divided into 2 groups:
Corpus approaches to discourse
Subcorpus configuration Adam Kilgarriff. Feb 2010Kilgarriff: IWSG: Subcorpora2 “you can’t get away from genre” Bonnie Weber, Keynote Lecture ICON (Indian.
Paul Mundy Writing cases Stories that illustrate a project or problem.
Exploring Text: Zipf’s Law and Heaps’ Law. (a) (b) (a) Distribution of sorted word frequencies (Zipf’s law) (b) Distribution of size of the vocabulary.
Using Game Reviews to Recommend Games Michael Meidl, Steven Lytinen DePaul University School of Computing, Chicago IL Kevin Raison Chatsubo Labs, Seattle.
Handling Data: AVERAGES
7 th Grade OCR Reading. Bachelors Degree - Muhlenberg College Masters Degree - The College of Saint Elizabeth Certificates Special Education K-12 Elementary.
Learning a Monolingual Language Model from a Multilingual Text Database Rayid Ghani & Rosie Jones School of Computer Science Carnegie Mellon University.
Statistical Properties of Text
Corpus Linguistics MOHAMMAD ALIPOUR ISLAMIC AZAD UNIVERSITY, AHVAZ BRANCH.
Part-of-Speech Tagging with Limited Training Corpora Robert Staubs Period 1.
Version 2.0 Copyright © AQA and its licensors. All rights reserved. Unit 1 LITB1 ASPECTS OF NARRATIVE.
Behind every site is a mix of special languages that your web browser understands The main way of describing any website is HTML HTML stands for Hyper.
Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd., UK Supported by EU Project PRESEMT.
Using Corpora in TEFL By Terri Yueh. WhyWhy Work With Corpora? Why  From Vocabulary to Corpus  Choosing a Corpus Choosing a Corpus  Examples of Word.
What is a Corpus? What is not a corpus?  the Web  collection of citations  a text Definition of a corpus “A corpus is a collection of pieces of language.
Displaying Your Ideas on the Internet How to let everyone admire your project
Year 6 SATs Information Meeting 25 th February 2016.
Language Model for Machine Translation Jang, HaYoung.
Use of Concordancers A corpus (plural corpora) – a large collection of texts, written or spoken, stored on a computer. A concordancer – a computer programme.
AIM: How can an author’s style create genre? DO NOW: Two Truths and a Lie I live on the beach, but I hate the beach. I had a cat that lived until 20 years.
Arabic Text Categorization Based on Arabic Wikipedia

ENG 437 Education for Service/tutorialrank.com
Applied Linguistics Chapter Four: Corpus Linguistics
Unit 10 The Web Book Test.
Unit 3 Lesson 13 Anchor Text: “Schools.” by Margaret C. Hall
From Unstructured Text to StructureD Data
Presentation transcript:

Genre in a Frequency Dictionary Adam Kilgarriff & Carole Tiberius

Outline Three problems Our solutions

Routledge Frequency Dictionaries Ten languages/volumes so far Series editors: Mark Davies, Paul Rayson “5000 most frequently used words” Genre/text type? – Some marking – Like traditional dictionaries

The corpus linguist’s dilemma We know that – Everything depends on text type Usually – Ignore – Pretend our corpus is representative (or we even knew what it meant) Frequency dictionary – Specially painful

Poetic interlude As many texts as stars in the sky As many domains as constellations As many genres as stories to tell about them Represent them? Shucks

A tiny step in the direction of respecting the importance of genre Instead of just one list One list per genre

The Whelks Problem Rare word but a book about whelks uses it hundreds of times Solution document frequency

The genius of Brown Fixed sample size – 500 x 2000-word samples – Makes the maths easy Frequencies directly comparable Document frequency works – No need to compensate for different sample length Contra Sinclair, Hanks – Different goals Brown: very widely used, replicated

A Frequency Dictionary of Dutch In the Routledge series Publication later this year

Dutch Written and Spoken the Netherlands and Flanders pinpas betaalkaart gij, ge jij, je

The Corpus Fiction – 25 books per year, Newspapers – From SONAR corpus, Spoken – From Corpus Gesproken Mederlands Web – From SONAR corpus, includes blogs, discussion lists, e-magazines, press releases, websites and wikipedia

Corpus preparation Tagging Lemmatisation Slice corpora into 2000-word samples

How many lists? One list per genre – But – overlap? Core: 4 genres: General:

Which words to include Which list(s) to put them in Throughout document frequency, implemented as percentage of samples that the word occurs in

Inclusion Include if average across four genres > words

Core Vocabulary words that are used across all kinds of language implemented as – Words with frequency > x in all genres 4.5 mark gives 943 core-vocab words in core-vocab only; not in other lists

WordFictionNewsSpokenWeb Ham20543 Egg Cheese Which list(s)? The problem

Our solution WordFictionNewsSpokenWeb Ham20543 Egg Cheese Which list(s)? The problem WordFictionNewsSpokenWebLists Ham20543Fiction Egg201843Fiction, News Cheese General

Algorithm Minimum > 45.5 The complication is that some words will occur in two, three or four of the lists generated in this way, and for such cases we have to decide whether they go in: just one list more than one list the general list. Our strategy is to say there should be some cases of each, as follows: if highest frequency is at least double the next highest, list in that genre only if two are high and two are low, that is, the first- and second-highest, and both more than double the other two, list in both the top two else list in general.

Algorithm Min > 4.5? – Core-vocab – Else If highest-score > 2 x second-highest-score – Highest-score-genre Else if second-highest-score > 2 x third-highest-score – Highest-score-genre and second-highest-score-genre Else – General

The ‘genre’ lists

Observations Fiction – Broadest vocabulary, longest list Spoken – Smallest, shortest Spoken and web: much overlap Fiction and news: some overlap

In sum Everything depends on genre Not easy to handle well in any dictionary Specially hard in a frequency dictionary It helps to use – Fixed sample size – Document frequencies (as percentages) A modest attempt to pay genre due respect – Routledge Frequency Dictionary of Dutch, 2013

Poetic interlude As many texts as stars in the sky As many domains as constellations As many genres as stories to tell about them Represent them? Shucks