Published by Hillary Montgomery. Modified over 9 years ago.
1
Comparing Corpora using Frequency Profiling
Paul Rayson and Roger Garside
UCREL research group, Computing Department, Lancaster University, UK
www.comp.lancs.ac.uk/ucrel/
2
Comparing Corpora
• Brown versus LOB (Hofland & Johansson, 1982)
• Comparison at word form or annotation level
• Information retrieval and extraction applications
3
Two main types
• Type 1:
  – sample corpus v. larger ‘standard’ normative corpus
• Type 2:
  – two (roughly) equal sized corpora
4
Main issues of concern
• representativeness (balance)
• homogeneity within the corpora
• comparability of the corpora
• reliability of statistical tests
5
Statistics
• Chi-squared unreliable
• Mann-Whitney (Kilgarriff 1996)
• Log-likelihood (Dunning 1993)
6
Method
O1 = a    O2 = b
N1 = c    N2 = d
E1 = c*(a+b) / (c+d)
E2 = d*(a+b) / (c+d)
LL = 2*((a*log(a/E1)) + (b*log(b/E2)))
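The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that a and b are the observed frequencies of one item in corpus 1 and corpus 2, that c and d are the total sizes of the two corpora, and that log is the natural logarithm. The function name and the handling of zero counts (a zero observed frequency contributes nothing to the sum) are assumptions added here.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood for one item, following the slide's formula.

    a, b: observed frequencies of the item in corpus 1 and corpus 2 (assumed meaning)
    c, d: total number of items in corpus 1 and corpus 2 (assumed meaning)
    """
    # Expected frequencies if the item were spread evenly across both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    # Assumed convention: a zero observed count contributes 0 (0 * log 0 taken as 0).
    term1 = a * math.log(a / e1) if a > 0 else 0.0
    term2 = b * math.log(b / e2) if b > 0 else 0.0
    return 2 * (term1 + term2)
```

An item with the same relative frequency in both corpora scores 0, and the score grows as the two relative frequencies diverge. The value alone does not say which corpus overuses the item, so the relative frequencies a/c and b/d need to be compared as well.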
7
Application (REVERE)
• Systems engineering application
• User interview transcripts, standards documents, user manuals
• POS tagged with CLAWS
• Semantic analysis
• Wmatrix retrieval tool
  – Frequency profiling and KWIC
8
Air traffic control
• Ethnographic studies at ATC centre
  – Verbatim transcripts of observations and interviews with controllers
  – Unstructured reports
  – 103 pages
9
Key semantic categories

Log-likelihood  Semantic tag  Word sense (and examples from the text)
3366            S7.1          power, organising (‘controller’, ‘chief’)
2578            M5            flying (‘plane’, ‘flight’, ‘airport’)
988             O2            general objects (‘strip’, ‘holder’, ‘rack’)
643             O3            electrical equipment (‘radar’, ‘blip’)
535             Y1            science and technology (‘PH’)
449             W3            geographical terms (‘Pole Hill’, ‘Dish Sea’)
432             Q1.2          paper documents and writing (‘writing’, ‘written’, ‘notes’)
372             N3.7          measurement (‘length’, ‘height’, ‘distance’, ‘levels’, ‘1000ft’)
318             L1            life and living things (‘live’)
310             A10           indicating actions (‘pointing’, ‘indicating’, ‘display’)
306             X4.2          mental objects (‘systems’, ‘approach’, ‘mode’, ‘tactical’, ‘procedure’)
290             A4.1          kinds, groups (‘sector’, ‘sectors’)
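A ranked list like the one above could be produced roughly as follows. This sketch is an illustration under stated assumptions, not the Wmatrix implementation: it takes two lists of items (word forms or tags), counts them, scores each item with the log-likelihood formula from the Method slide, and sorts by descending score. The names `ll` and `frequency_profile` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def ll(a, b, c, d):
    """Log-likelihood for counts a, b in corpora of size c, d (zero counts contribute 0)."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    t1 = a * math.log(a / e1) if a > 0 else 0.0
    t2 = b * math.log(b / e2) if b > 0 else 0.0
    return 2 * (t1 + t2)


def frequency_profile(items1, items2):
    """Return (LL, item, freq1, freq2) rows, highest LL first."""
    f1, f2 = Counter(items1), Counter(items2)
    c, d = len(items1), len(items2)
    rows = [(ll(f1[x], f2[x], c, d), x, f1[x], f2[x]) for x in f1.keys() | f2.keys()]
    # Sort by descending LL; break ties alphabetically so the order is deterministic.
    rows.sort(key=lambda r: (-r[0], r[1]))
    return rows


if __name__ == "__main__":
    # Hypothetical toy data, invented for illustration only.
    sample = "controller strip radar controller flight strip controller".split()
    reference = "the flight was late and the plane was full".split()
    for score, item, a, b in frequency_profile(sample, reference)[:5]:
        print(f"{score:8.2f}  {item:<12} {a:>3} {b:>3}")
```

Items at the top of the list are candidates for the key items of the sample, which still need checking against the text, for example with a concordance.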
10
Conclusions
• Method of comparing corpora using frequency profiling
• Discovery of key items
• Human verification of hypotheses
• Applications in study of social differentiation in the use of English vocabulary, profiling of learner English and IE in SE domain