Published by Wilfrid Shields. Modified over 9 years ago.
1
Introduction to Corpora@Stanford
Florian Jaeger, tiflo@stanford.edu
For the Methods class, December 3rd, 2003
2
Some basic questions
Where are our corpora? Where is the software?
– Is there a list of all the stuff we have?
– How can I access the software?
Where do I start? What information is available where?
– Are there tutorials for the available software?
What kind of corpus work is supported at Stanford?
– Corpora are only for those computational folks … ;-)
And the most important question:
3
Why bother at all …
Because we are often wrong with our (ad hoc) intuitions
– linguistic methodology is …
– well, let’s not go there.
While corpora have a lot of drawbacks (no negative evidence, genre-specific, etc.), they offer a lot of opportunities.
To illustrate my point, a little case study …
4
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002
Claim: “The interpretation of bare plurals does not, actually, consist of any subset of (well-defined) singulars.”
– 0.5 apples/apple
– 1.0 apples/apple
– 1.5 apples/apple
– zero apples/apple
5
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002
Hagit Borer’s judgments:
– 0.5 apples/*apple
– 1.0 apples/*apple
– 1.5 apples/*apple
– zero apples/*apple
6
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002
Google’s count:
– 0.5 apples (120)/*apple (179)
– 1.0 apples (42)/*apple (23,600)
– 1.5 apples (59)/*apple (362)
– zero apples (194)/*apple (124)
This also makes some of the problems clear, so let’s take pears.
7
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002
Google’s count:
– 0.1 pears (32)/*pear (118)
– 0.5 pears (37)/*pear (50)
– 0.7 pears (9)/*pear (14)
– 1.0 pears (14)/*pear (24,000)
– 1 pears (14)/?pear (7,480)
– One pears (1,130)/?pear (3,060)
– 1.5 pears (28)/*pear (316)
– zero pears (3)/*pear (0)
Conclusion:
– It is amazing how many programs or computer products use fruit names.
– The original judgments seem questionable.
BUT: can we trust Google?
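To compare the quantifiers at a glance, the raw hit counts on the pears slide can be turned into plural shares. The following is only a rough sketch: the counts are copied from the slide, the names PEAR_COUNTS and plural_share are made up for this illustration, and the shares inherit every problem of the raw counts (product names, unrelated hits, and so on).

```python
# Hit counts copied from the pears slide: quantifier -> (plural hits, singular hits).
PEAR_COUNTS = {
    "0.1": (32, 118),
    "0.5": (37, 50),
    "0.7": (9, 14),
    "1.0": (14, 24000),
    "1": (14, 7480),
    "One": (1130, 3060),
    "1.5": (28, 316),
    "zero": (3, 0),
}


def plural_share(plural_hits, singular_hits):
    """Return the fraction of hits that use the plural form, or None if there are no hits."""
    total = plural_hits + singular_hits
    if total == 0:
        return None
    return plural_hits / total


if __name__ == "__main__":
    for quantifier, (plural_hits, singular_hits) in PEAR_COUNTS.items():
        share = plural_share(plural_hits, singular_hits)
        print(f"{quantifier:>5} pears vs. pear: plural share = {share:.1%}")
```

A share based on a handful of hits (e.g. "zero pears" with 3 hits in total) says very little, so the totals should always be read alongside the percentages.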
8
… GSearch Tutorial Grep Tutorial Tgrep Tutorial CQP Tutorial In addition to the indicated structure, all pages offer links to external pages, including corpora, software, tutorials, demos, etc. Local Support E-list & Corpus TA
9
Looking for a corpus
There are several sites on the web that can help you to find out if what you are looking for exists:
– Databases like David Lee’s site (see also our Top 10 list)
– The LDC database
– Our list of corpora (next page)
Email lists, see our site under ‘Support’
– Local: corpora@csli.stanford.edu
– Global: MAJORDOMO@UIB.NO
10
Types of corpora
Different languages
Different media (speech, video, text)
Different levels of annotation
– No annotation
– Transcribed speech or video
– Sociological annotation (gender of speaker, average age of audience, dialect of speaker, etc.)
– Discourse and textual information (publication date, number of discourse participants, discussion panel vs. novel, etc.)
– Linguistic annotation (phonemes, prosody, syntax, morpho-syntax, lexemes, phonological segments & syllables, etc.)
11
Looking for a specific corpus
List of available corpora
– If the corpus is on AFS
– If the corpus is on the Corpus Computer
– If the corpus is on CD
– If the corpus is on the WWW
– If the corpus has special license conditions
– If we don’t have the corpus
13
Tools & software
General
Where to start:
– Local online tutorials (see also external references and manuals)
– The corpus TA
– corpora@csli.stanford.edu
Little helpers
14
A brief look at some tools
BNC Web
– Problem: Superiority “who the hell …”
– Problem: Distribution of “… is like …” – age dependent?
General information
Age (easy export to e.g. Excel)
Crosstabs
TGrep2 and Tgrep
– Tutorial
– Examples:
tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. NP)'
tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. PP-DTV)'
tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ NP)'
tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ PP-DTV)'
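If the queries are to be run from a script rather than typed at the prompt, a small wrapper like the following could be used. This is a sketch under assumptions: it presumes tgrep2 is installed and on the PATH, it uses only the -c and -l options shown in the examples, and the function names build_tgrep2_command and run_tgrep2 are invented for this illustration.

```python
import shlex
import subprocess


def build_tgrep2_command(corpus_path, pattern):
    """Build the argument list for a tgrep2 call with the -c and -l options from the slide."""
    return ["tgrep2", "-c", corpus_path, "-l", pattern]


def run_tgrep2(corpus_path, pattern):
    """Run tgrep2 and return its standard output (assumes tgrep2 is on the PATH)."""
    command = build_tgrep2_command(corpus_path, pattern)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call run_tgrep2() where the corpus file is available.
    command = build_tgrep2_command("wsj_mrg.t2c.gz", "VP < (NP $. NP)")
    print(shlex.join(command))
```

Because the pattern is passed as a single list element, no shell is involved and characters such as <, $ and parentheses need no extra quoting.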
15
Note: Tgrep is right-headed
Without parentheses, every relation in a chain is relative to the first node of the chain.
The following pattern matches an S that has a child A and another child C, where the A has a child B:
– S < (A < B) < C
However, this pattern means that S has a child A and that A has children B and C:
– S < ((A < B) < C)
It is equivalent to this:
– S < (A < B < C)
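The difference between the two groupings can be checked on two toy trees. The sketch below is not TGrep2 itself: it hand-codes the two patterns as Python functions over a made-up tree representation (a label plus a list of children), and it only tests the root node instead of searching the whole tree. All function names are hypothetical.

```python
def node(label, *children):
    """Toy tree node: a (label, children) pair."""
    return (label, list(children))


def has_child(tree, predicate):
    """True if some immediate child of tree satisfies predicate."""
    return any(predicate(child) for child in tree[1])


def labelled(label):
    """Predicate that checks a node's label."""
    return lambda tree: tree[0] == label


def a_with_b(tree):
    """A < B"""
    return tree[0] == "A" and has_child(tree, labelled("B"))


def a_with_b_and_c(tree):
    """A < B < C (both relations are relative to A)"""
    return a_with_b(tree) and has_child(tree, labelled("C"))


def pattern_one(tree):
    """S < (A < B) < C: C is a child of S."""
    return tree[0] == "S" and has_child(tree, a_with_b) and has_child(tree, labelled("C"))


def pattern_two(tree):
    """S < (A < B < C): C is a child of A."""
    return tree[0] == "S" and has_child(tree, a_with_b_and_c)


tree_one = node("S", node("A", node("B")), node("C"))   # C is a sister of A
tree_two = node("S", node("A", node("B"), node("C")))   # C is a sister of B

if __name__ == "__main__":
    for name, tree in (("tree_one", tree_one), ("tree_two", tree_two)):
        print(name, "pattern one:", pattern_one(tree), "pattern two:", pattern_two(tree))
```

Each toy tree satisfies exactly one of the two patterns, which is the point of the note: moving the parentheses changes which node C has to be a child of.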
16
Some more TGrep2 syntax
A < B      A is the parent of (immediately dominates) B.
A > B      A is the child of B.
A <N B     B is the Nth child of A (the first child is <1).
A >N B     A is the Nth child of B (the first child is >1).
A <, B     Synonymous with A <1 B.
A >, B     Synonymous with A >1 B.
A <-N B    B is the Nth-to-last child of A (the last child is <-1).
A >-N B    A is the Nth-to-last child of B (the last child is >-1).
A <- B     B is the last child of A (synonymous with A <-1 B).
A >- B     A is the last child of B (synonymous with A >-1 B).
A <` B     B is the last child of A (also synonymous with A <-1 B).
A >` B     A is the last child of B (also synonymous with A >-1 B).
A <: B     B is the only child of A.
A >: B     A is the only child of B.
A << B     A dominates B (A is an ancestor of B).
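To make the child-position operators concrete, here is a small sketch that mimics <N, <-N, <: and << on a toy tree. It is an illustration only, not how TGrep2 is implemented: the tree representation and the function names nth_child, only_child and dominates are assumptions made for this example.

```python
def node(label, *children):
    """Toy tree node: a (label, children) pair."""
    return (label, list(children))


def nth_child(parent, n):
    """Child picked out by <N (n > 0, counted from the left) or <-N (n < 0, counted from the right)."""
    children = parent[1]
    if n == 0 or abs(n) > len(children):
        return None
    return children[n - 1] if n > 0 else children[n]


def only_child(parent):
    """Child picked out by <:, or None unless the parent has exactly one child."""
    children = parent[1]
    return children[0] if len(children) == 1 else None


def dominates(ancestor, label):
    """A << B: True if some proper descendant of ancestor carries the given label."""
    return any(child[0] == label or dominates(child, label) for child in ancestor[1])


# Toy VP, roughly "gave her books".
vp = node("VP", node("VBD", node("gave")), node("NP", node("her")), node("NP", node("books")))

if __name__ == "__main__":
    print("VP <1  :", nth_child(vp, 1)[0])
    print("VP <-1 :", nth_child(vp, -1)[0])
    print("VBD <: :", only_child(nth_child(vp, 1))[0])
    print("VP << books:", dominates(vp, "books"))
```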
17
Some more TGrep2 syntax
A >> B     A is dominated by B (A is a descendant of B).
A <<, B    B is a left-most descendant of A.
A >>, B    A is a left-most descendant of B.
A <<` B    B is a right-most descendant of A.
A >>` B    A is a right-most descendant of B.
A <<: B    There is a single path of descent from A and B is on it.
A >>: B    There is a single path of descent from B and A is on it.
A . B      A immediately precedes B.
A , B      A immediately follows B.
A .. B     A precedes B.
A ,, B     A follows B.
A $ B      A is a sister of B (and A ≠ B).
A $. B     A is a sister of and immediately precedes B.
A $, B     A is a sister of and immediately follows B.
A $.. B    A is a sister of and precedes B.
A $,, B    A is a sister of and follows B.
A = B      The node matched by A is also matched by B.
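The sister operators can be mimicked on a toy tree as follows. Again this is a sketch rather than TGrep2's own logic: it covers only $, $., $, and $.., it compares positions among the children of one given parent, and the tree representation and function names are invented for the example. The general precedence operators (. and ..) are left out.

```python
def node(label, *children):
    """Toy tree node: a (label, children) pair."""
    return (label, list(children))


def child_positions(parent, label):
    """Indices of the children of parent that carry the given label."""
    return [i for i, child in enumerate(parent[1]) if child[0] == label]


def sisters(parent, a, b):
    """A $ B: a child labelled a and a different child labelled b share this parent."""
    return any(i != j for i in child_positions(parent, a) for j in child_positions(parent, b))


def sister_immediately_precedes(parent, a, b):
    """A $. B: the child labelled a sits directly to the left of a child labelled b."""
    return any(j == i + 1 for i in child_positions(parent, a) for j in child_positions(parent, b))


def sister_immediately_follows(parent, a, b):
    """A $, B: the mirror image of A $. B."""
    return sister_immediately_precedes(parent, b, a)


def sister_precedes(parent, a, b):
    """A $.. B: the child labelled a sits somewhere to the left of a child labelled b."""
    return any(i < j for i in child_positions(parent, a) for j in child_positions(parent, b))


# Toy VP with a verb, an object NP and a dative PP.
vp = node("VP", node("VBD"), node("NP"), node("PP-DTV"))

if __name__ == "__main__":
    print("NP $. PP-DTV  :", sister_immediately_precedes(vp, "NP", "PP-DTV"))
    print("VBD $. PP-DTV :", sister_immediately_precedes(vp, "VBD", "PP-DTV"))
    print("VBD $.. PP-DTV:", sister_precedes(vp, "VBD", "PP-DTV"))
    print("NP $ NP       :", sisters(vp, "NP", "NP"))
```

The last call is False because the toy VP has only one NP child: a node is never its own sister, which is what the "A ≠ B" remark in the table expresses.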
18
The alternative with windows
TigerSearch 2.1; screen shots:
– Grammar search
– Collocation search
19
The end, my friends
Want to help?
– The website can always use additions (short blurbs about software, your opinion about the user-friendliness of a certain web interface, etc.)
Tschuessi!