Guy Aston Compiling a corpus of transcribed speech.

1 Guy Aston Compiling a corpus of transcribed speech

2 Anyqs
- A corpus for classroom use in training interpreters
- Transcribed spontaneous speech (hard to come by)
- Understandable without detailed contextual information (standard format)
- Contemporary
- Quite a lot (currently 1.4M words)
- Easy to encode in TEI and to index with XAIRA

3 No way is this publicly available
- The BBC site contains transcripts of all Any Questions programmes in the last 3 years, which you can download freely for personal non-commercial use.
- But/and you cannot adapt, alter or create a derivative work except for your own personal, non-commercial use.

4 What the BBC's original looks like …
PRESENTER: Jonathan Dimbleby
PANELLISTS: Lord Falconer Malcolm Rifkind Anne McElvoy Chris Huhne
FROM: Medical Women's Federation, Central London
DIMBLEBY Welcome to London where we are on the edge of Regent's Park at the Royal College of Obstetricians and Gynaecologists..... On our panel: the former Lord Chancellor Charlie Falconer.... And Anne McElvoy, executive editor and columnist at the Evening Standard. [CLAPPING] Our first question please.
HICKS Tom Hicks. Should Ian Blair resign?

5 Marking it up in XML…
In the Header
- Programme details
- Date
- Participants and roles
- Setting
In the Text
- Topic boundaries (new question)
- Utterance boundaries and their speakers
- Sentence boundaries (based on punctuation in transcript)
- Non-verbal events (clapping, laughter, coughs)
- POS tagging – CLAWS7
- Alignment with audio – maybe some day ???
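As a rough illustration of the text-level markup listed above, the sketch below turns transcript lines into utterance elements with a speaker attribute and inline non-verbal events. It is not the procedure used to build the corpus. It assumes that each turn sits on its own line and starts with an all-caps surname, and that events appear in square brackets; the `event` element name and its `desc` attribute are illustrative choices, as is the function name `transcript_to_xml`.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a turn starts with an all-caps surname, then whitespace, then speech.
# Header lines such as "PRESENTER: ..." do not match because of the colon.
TURN = re.compile(r"^([A-Z][A-Z']+)\s+(.*)$")
# Assumption: non-verbal events are written in capitals inside square brackets.
EVENT = re.compile(r"\[([A-Z ]+)\]")


def transcript_to_xml(lines):
    """Build a <text> element holding one <u who="..."> per speaker turn."""
    text = ET.Element("text")
    for line in lines:
        match = TURN.match(line.strip())
        if not match:
            continue  # header lines are ignored in this sketch
        speaker, speech = match.groups()
        u = ET.SubElement(text, "u", who=speaker.lower())
        # re.split with one capturing group yields [text, event, text, ...]
        parts = EVENT.split(speech)
        u.text = parts[0].strip()
        for i in range(1, len(parts), 2):
            event = ET.SubElement(u, "event", desc=parts[i].lower())
            event.tail = parts[i + 1].strip()
    return text


sample = [
    "PRESENTER: Jonathan Dimbleby",
    "DIMBLEBY Welcome to London. [CLAPPING] Our first question please.",
    "HICKS Tom Hicks. Should Ian Blair resign?",
]
print(ET.tostring(transcript_to_xml(sample), encoding="unicode"))
```

Topic boundaries, sentence splitting and POS tagging are left out here; they would need further passes over the same tree.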

6 Overall document structure Any questions [Date] [Profile] [Text]

7 Profile <person name="surname" sex="f | m | u" role="presenter | questioner | panellist | audience" background="Con | Lab | Lib | journalist | academic |..."> fullname...

8 Text Welcome to London … … Tom Hicks. Should Ian Blair resign ? … …

9 The magic lines in the corpus header person u person

10 Meaning you can find occurrences for speakers with a certain
- sex
- role
- background
Try it!
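Outside XAIRA, the same idea can be approximated in a few lines. The sketch below is a minimal illustration, not XAIRA's mechanism. It assumes each utterance carries a `who` attribute matching the `name` attribute of a `person` element; the wrapper elements, the two speakers and their utterances are invented for the example, and `utterances_by` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the person attributes follow the profile slide.
SAMPLE = """<doc>
  <profile>
    <person name="smith" sex="m" role="presenter" background="journalist">John Smith</person>
    <person name="jones" sex="f" role="panellist" background="Lab">Jane Jones</person>
  </profile>
  <text>
    <u who="smith">Our first question please.</u>
    <u who="jones">I entirely agree.</u>
  </text>
</doc>"""


def utterances_by(root, **wanted):
    """Return the text of utterances whose speaker matches every given attribute."""
    people = {p.get("name"): p.attrib for p in root.iter("person")}
    hits = []
    for u in root.iter("u"):
        speaker = people.get(u.get("who"), {})
        if all(speaker.get(key) == value for key, value in wanted.items()):
            hits.append("".join(u.itertext()).strip())
    return hits


root = ET.fromstring(SAMPLE)
print(utterances_by(root, sex="f"))
print(utterances_by(root, role="presenter", background="journalist"))
```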

11 Things to do with it (1): emphasis
Agreement (most frequent adverb collocates 1L)
- Agree (871): entirely / actually / rather / completely / absolutely / broadly
- Disagree (122): profoundly / fundamentally / strongly / completely
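A 1L collocate is the word immediately to the left of the node word. The sketch below counts such words over plain text, as a simplified stand-in for what a corpus tool does. It assumes a crude lowercase tokenisation, does not restrict the result to adverbs (that would need the POS tags), and ignores sentence boundaries; `left_collocates` is a hypothetical helper and the sample sentence is invented.

```python
import re
from collections import Counter


def left_collocates(text, node):
    """Count the words that occur immediately before `node` (position 1L)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(
        previous for previous, word in zip(tokens, tokens[1:]) if word == node
    )


sample = "I entirely agree. I profoundly disagree, but I entirely agree."
print(left_collocates(sample, "agree").most_common())
print(left_collocates(sample, "disagree").most_common())
```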

12 Things to do with it (2): subjunctives in speech
- It were (215)
- As it were (173)
- If it were (32)
- I wish it were (3)

13 Things to do with it (3): As it were
A particularly Any Questions feature? A particularly male one?
Any Questions
- Male speakers: 164 (151 / Mwords)
- Female speakers: 9 (30 / Mwords)
BNC spoken
- Male speakers: 291 (0.6 / 1000 sentences)
- Female speakers: 68 (0.2 / 1000 sentences)
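The per-million-word rates can be checked against the word totals given on the Words slide (male 1087585, female 309285). The sketch below assumes those totals are the denominators, which the slides do not state explicitly. Under that assumption the male rate comes out at about 151, and the female rate at about 29 rather than the 30 shown, so the slide may rest on a slightly different word count or rounding.

```python
def per_million(hits, words):
    """Normalise a raw frequency to occurrences per million words."""
    return hits / words * 1_000_000


# Raw counts of "as it were" and word totals by speaker sex, as given in the slides.
male_rate = per_million(164, 1_087_585)
female_rate = per_million(9, 309_285)

print(f"male:   {male_rate:.1f} per million words")
print(f"female: {female_rate:.1f} per million words")
```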

14
- deeply: alarmed, concerned, depressing, disillusioned, distressing, offended, regrettable, sceptical, shocking, unfair, upset, worrying
- profoundly: disagree, wrong

15 Let alone  24 occurrences

16 Things to do with it (4): Preferred lexis of patriotism?
occurrences/1000

                  Lab   Con   Lib   (Ukip)
UK                 61    15    30      1
United Kingdom     21    23    15      0
Britain           104   139    56      8

17 Thank you!  for any answers on how to get permission …

18 Utterances / Sentences
Role
  Lab           2141 / 6845
  Con           1787 / 6309
  Lib           1096 / 4098
  Presenter     7936 / 13318
  Questioner    1180 / 2241
  Other         3144 / 11535
Sex
  Male          1086994
  Female        303986
  Unknown       39 / 70
Total           17284 / 44346

19 Words – 1397872
Role/Background
  Lab            267019 (19.1%)
  Con            249101 (17.8%)
  Lib            163156 (11.7%)
  Other panel    407118 (29.1%)
  Total panel    1085394 (77.6%)
  Presenters     270484 (19.3%)
  Questioners    41950 (3.0%)
  Audience       44 (0.0%)
Sex
  Male           1087585 (77.8%)
  Female         309285 (22.1%)
  Unknown        1002 (0.1%)
