1
Homing in on the Text-Initial Cluster
Mike Scott
School of English, University of Liverpool
Aston Corpus Symposium, Friday May 4th 2007
This presentation is at www.lexically.net/downloads/corpus_linguistics
2
Starting Questions
1. Are clusters like “Once upon a time” and “lived happily ever after” oddities in marking text position?
2. Or do many n-grams characterise the beginnings, middles or ends of certain kinds of text?
3. If so, are there any common patterns in text-initial clusters?
3
Context
Textual Priming Project, University of Liverpool
Michael Hoey
Michaela Mahlberg
Matthew O’Donnell
Mike Scott
4
Textual Priming Project: Aims
to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position
to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts
to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis
from O’Donnell et al 2007
5
Hard News Corpus
“Home News” sections of the Guardian and Observer, 1998 to 2004
115,654 articles divided thus:
  headline & lead
  1st sentence of 1st paragraph (TISC)
  all other sentences
TISC contains 3.2 million tokens
The rest: 51.2 million tokens
About 470 words per article
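The "about 470 words per article" figure can be checked against the token counts on this slide. A quick sketch, assuming the figure is simply the two token counts added together and divided by the number of articles (the variable names are illustrative, not from the project):

```python
# Figures as given on the slide.
tisc_tokens = 3_200_000      # first sentence of first paragraph (TISC)
rest_tokens = 51_200_000     # everything else
articles = 115_654

# Assumption: "words per article" = all tokens / number of articles.
words_per_article = (tisc_tokens + rest_tokens) / articles
tisc_words_per_article = tisc_tokens / articles

print(f"about {words_per_article:.0f} words per article")
print(f"about {tisc_words_per_article:.0f} of them in the TISC sentence")
```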
6
Research Questions
Using the hard news corpus,
1. How many 3-5 word clusters are found to be key in TISC sections?
2. How many are positively and how many are negatively key?
3. What recurrent patterns can be found in the two types of key cluster?
7
Methods (1)
1. Format the corpus in XML and separate out all TISC sections (done by Matt O’Donnell)
2. Use WordSmith’s WordList tool to compute wordlist indexes of
   1. all the text
   2. all the TISC sections
3. Using WordList, compute 3-5 word clusters for each index, save as .lst
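Step 3 can be illustrated outside WordSmith. The sketch below is not WordList's implementation (which works from an index and has its own settings for sentence boundaries and minimum frequency); it only shows what counting 3-5 word clusters means. The function name `count_clusters` and the sample sentence are made up for the example.

```python
from collections import Counter

def count_clusters(tokens, min_n=3, max_n=5):
    """Count every contiguous run of min_n..max_n words in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Invented sample text, upper-cased to match the cluster lists in the slides.
sentence = "for the first time the home secretary was last night"
clusters = count_clusters(sentence.upper().split())

for cluster, freq in sorted(clusters.items()):
    print(freq, cluster)
```

In practice the same counting would be run once over all the text and once over the TISC sections only, giving the two cluster lists that are compared later.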
8
Top clusters, all sections
GUARDIAN CO UK
ONE OF THE
A HREF HTTP, WWW GUARDIAN CO and similar web links
THE PRIME MINISTER
THE END OF
AS WELL AS
THE NUMBER OF
THERE IS A
SOME OF THE
THERE IS NO
9
Top clusters, TISC
ONE OF THE
ACCORDING TO A
LAST NIGHT AFTER
FOR THE FIRST
THE FIRST TIME
IS TO BE
FOR THE FIRST TIME
THE MURDER OF
ARE TO BE
THE DEATH OF
OF THE MOST
THE HOME SECRETARY
WAS LAST NIGHT
IT EMERGED YESTERDAY
AS PART OF
AN ATTEMPT TO
THE UNITED STATES
THE NUMBER OF
ONE OF THE MOST
ACCORDING TO THE
10
Methods (2)
4. Use KeyWords tool to compute KWs for the TISC 3-5 word clusters using all the text as a reference corpus
5. Identify patterns in the KW clusters
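Step 4 compares each cluster's frequency in TISC with its frequency in the reference corpus. A minimal sketch of a log-likelihood keyness score follows, using a common two-term formulation; WordSmith's exact computation may differ. The corpus sizes are the token counts from the Hard News Corpus slide, while the cluster frequencies are invented purely for illustration.

```python
import math

def log_likelihood(a, b, c, d):
    """Signed log-likelihood keyness score.

    a: cluster frequency in the study corpus (here TISC)
    b: cluster frequency in the reference corpus
    c: size of the study corpus, d: size of the reference corpus
    Positive = relatively more frequent in the study corpus.
    """
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    ll = 0.0
    for observed, expected in ((a, expected_a), (b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    ll *= 2
    return ll if a / c >= b / d else -ll

TISC_SIZE = 3_200_000    # from the slides
ALL_TEXT = 54_400_000    # 3.2 million + 51.2 million, from the slides

# Hypothetical frequencies, not taken from the corpus.
print(log_likelihood(500, 600, TISC_SIZE, ALL_TEXT))    # positively key
print(log_likelihood(10, 5000, TISC_SIZE, ALL_TEXT))    # negatively key
```

The sign convention (positive when the cluster is relatively more frequent in TISC) mirrors the positively/negatively key distinction used in the rest of the talk.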
11
TISC key clusters
ACCORDING TO A
LAST NIGHT AFTER
IT EMERGED YESTERDAY
WAS LAST NIGHT
ARE TO BE
THE MURDER OF
LAST NIGHT WHEN
THE GOVERNMENT YESTERDAY
LAST NIGHT AS
IS TO BE
WERE LAST NIGHT
YESTERDAY AFTER A
TONY BLAIR YESTERDAY
COURT HEARD YESTERDAY
WAS TOLD YESTERDAY
WAS JAILED FOR
THE DEATH OF
YEAR OLD BOY
YESTERDAY WHEN THE
WITH THE MURDER OF
12
Numbers of Key Clusters
13
RQs 1 & 2: Numbers of KW clusters
using a p value of 0.0000001, a minimum frequency of 3 and the log likelihood statistic,
8,132 key clusters altogether (in 3.2 million words of text)
of which 7,631 were positively key and 501 negatively key
though there is repetition as these are 3-5 word n-grams
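The cut-off can be made concrete. The sketch below assumes the usual convention of reading a log-likelihood score as a chi-square value with one degree of freedom; it is not a description of WordSmith's internals. `p_value` and `is_key` are illustrative names.

```python
import math

P_THRESHOLD = 0.0000001   # p value from the slide
MIN_FREQ = 3              # minimum frequency from the slide

def p_value(ll):
    """Upper-tail chi-square probability for 1 degree of freedom.

    Assumption: the (possibly signed) log-likelihood score is treated
    as a chi-square statistic, so only its magnitude matters here.
    """
    return math.erfc(math.sqrt(abs(ll) / 2))

def is_key(ll, freq):
    """A cluster is key if it is frequent enough and significant enough."""
    return freq >= MIN_FREQ and p_value(ll) < P_THRESHOLD

# The counts reported on the slide add up as stated.
positively_key, negatively_key = 7_631, 501
total = positively_key + negatively_key
print(total)
```

Under this assumption a cluster needs a log-likelihood score somewhere between 27 and 30 in magnitude to pass the 0.0000001 threshold, far stricter than the score of about 3.84 that corresponds to p = 0.05.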
14
Repetition
YESTERDAY FOUND GUILTY
YESTERDAY FOUND GUILTY OF
YESTERDAY FROM A
YESTERDAY FROM THE
YESTERDAY GAVE A
YESTERDAY GAVE HIS
YESTERDAY GAVE THE
YESTERDAY GIVEN A
YESTERDAY GIVEN THE
YESTERDAY GIVEN THE GO
YESTERDAY GIVEN THE GO AHEAD
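One way to gauge how much of this repetition comes from shorter clusters sitting inside longer ones is to keep only the clusters that are not contained in any longer cluster on the list. The slides do not say that any such filtering was applied to the reported counts; the sketch below (with the made-up helper names `is_contained` and `maximal_clusters`) only illustrates the overlap on the examples from this slide.

```python
def is_contained(short, long):
    """True if the words of `short` occur contiguously inside the longer `long`."""
    short_words, long_words = short.split(), long.split()
    n = len(short_words)
    return n < len(long_words) and any(
        long_words[i:i + n] == short_words
        for i in range(len(long_words) - n + 1)
    )

def maximal_clusters(clusters):
    """Drop every cluster that is contained in a longer cluster in the list."""
    return [c for c in clusters
            if not any(is_contained(c, other) for other in clusters)]

repeated = [
    "YESTERDAY FOUND GUILTY", "YESTERDAY FOUND GUILTY OF",
    "YESTERDAY FROM A", "YESTERDAY FROM THE",
    "YESTERDAY GAVE A", "YESTERDAY GAVE HIS", "YESTERDAY GAVE THE",
    "YESTERDAY GIVEN A", "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO", "YESTERDAY GIVEN THE GO AHEAD",
]
kept = maximal_clusters(repeated)
print(len(repeated), "clusters reduce to", len(kept))
```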
15
Negatively key:
A LOT OF
A SPOKESMAN FOR
THERE IS NO
HE SAID THE
SAID IT WAS
THERE IS A
THIS IS A
THE FACT THAT
AS WELL AS
IT WOULD BE
SPOKESMAN FOR THE
PER CENT OF
WE HAVE TO
SAID THAT THE
BUT IT IS
AT A TIME
A SPOKESMAN FOR THE
SAID HE WAS
IT IS NOT
THERE WAS NO
16
RQ 1: Numbers of KW clusters
Is 8 thousand a large number of distinct key text-initial clusters?
In the same amount of text there are 84 thousand 3-5 word clusters of frequency at least 5 altogether…
about one in 10 is associated with text-initial position at the 0.0000001 level of significance
17
RQ 1, continued
… is 1 in 10 a large number to be key?
In the case of SISC (sentences from paragraphs containing only one sentence), we get 507 thousand clusters, of which 2,192 are key (1,747 positively and 445 negatively),
which is about 1 in 230
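The two proportions follow directly from the counts quoted in the slides. A quick check (the "84 thousand" and "507 thousand" totals are rounded in the source, so the ratios are approximate):

```python
# Counts as quoted in the slides.
tisc_clusters, tisc_key = 84_000, 8_132
sisc_clusters, sisc_key = 507_000, 2_192

tisc_ratio = tisc_clusters / tisc_key   # clusters per key cluster in TISC
sisc_ratio = sisc_clusters / sisc_key   # clusters per key cluster in SISC

print(f"TISC: about 1 in {tisc_ratio:.0f}")
print(f"SISC: about 1 in {sisc_ratio:.0f}")
```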
18
PATTERNS
19
RQ 3: patterns
recency: in the top 200, seventy express time, generally using yesterday or last night
20
Recency clusters
COURT HEARD YESTERDAY
TONY BLAIR YESTERDAY
YESTERDAY AFTER A
WERE LAST NIGHT
LAST NIGHT AS
THE GOVERNMENT YESTERDAY
LAST NIGHT WHEN
WAS LAST NIGHT
IT EMERGED YESTERDAY
LAST NIGHT AFTER
YESTERDAY IN A
IT EMERGED LAST NIGHT
A COURT HEARD YESTERDAY
YESTERDAY WHEN A
YESTERDAY AFTER THE
EMERGED LAST NIGHT
LAST NIGHT TO
YESTERDAY AS THE
YESTERDAY WHEN THE
WAS TOLD YESTERDAY
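Picking out recency clusters from a key-cluster list can be approximated by searching for the two time expressions named above. This is only a sketch: the talk does not say how the seventy time-expressing clusters were identified, and the function name `expresses_recency` is invented here.

```python
# Time expressions mentioned in the slides; other expressions are not covered.
TIME_EXPRESSIONS = ("YESTERDAY", "LAST NIGHT")

def expresses_recency(cluster):
    """True if the cluster contains one of the time expressions as whole words."""
    padded = f" {cluster} "
    return any(f" {expr} " in padded for expr in TIME_EXPRESSIONS)

# A few clusters taken from the slides.
sample = [
    "COURT HEARD YESTERDAY",
    "LAST NIGHT AFTER",
    "THE MURDER OF",
    "ARE TO BE",
    "EMERGED LAST NIGHT",
]
recent = [c for c in sample if expresses_recency(c)]
print(recent)
```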
21
Superlatives
ONE OF BRITAIN'S MOST
ONE OF THE MOST
OF THE WORLD'S
THE FIRST TIME
OF BRITAIN'S MOST
FOR THE FIRST
FOR THE FIRST TIME
22
Research, Report etc.
ACCORDING TO A REPORT
A COURT HEARD (YESTERDAY)
ACCORDING TO RESEARCH
TO A SURVEY
IT EMERGED LAST NIGHT
IT WAS ANNOUNCED YESTERDAY
IT WAS REVEALED YESTERDAY
A REPORT PUBLISHED
ACCORDING TO A STUDY
TO RESEARCH PUBLISHED
23
Attention-grabbers
IT EMERGED THAT
OBSERVER CAN REVEAL
THE OBSERVER CAN REVEAL
24
Indefinite articles positively key…
A BABY GIRL
A BAN ON
A BEACH IN
A BID TO
A BITTER ROW
A BLACK MAN
A BLISTERING ATTACK ON
A JURY WAS TOLD YESTERDAY
A LABOUR MP
A LANDMARK RULING
A LAST DITCH ATTEMPT TO
A LAST MINUTE
A LEADING BRITISH
A LEADING SCIENTIST
A LEGAL BATTLE
A LEGAL CHALLENGE
25
Indefinite articles negatively key
A KIND OF
A COUPLE OF
A GREAT DEAL
A KIND OF
A LOT MORE
26
IT + reporting verb – positively key
IT WAS ANNOUNCED LAST NIGHT
IT WAS CLAIMED LAST NIGHT
IT WAS CONFIRMED LAST NIGHT
IT IS REVEALED TODAY
27
IT otherwise negatively key:
IT IS A
IT IS ABOUT
IT IS EXPECTED
IT IS GOING
IT IS ONLY
IT IS POSSIBLE
IT SEEMS TO
28
SAID YESTERDAY – positively key
SAID YESTERDAY AFTER
SAID YESTERDAY THAT HE
SAID YESTERDAY THEY HAD
29
SAID without time – negatively key
SAID AT THE
SAID HE HAD
SAID HE WOULD
SAID THE GOVERNMENT
SAID THERE WAS NO
30
Conclusions
The “once upon a time” syndrome seems to be much more common than might be thought.
In text-initial sections of 115 thousand hard news stories (3.2 m. words), out of 84 thousand 3-5 word clusters, about 8 thousand (roughly 1 in 10) had text-initial significance
whereas in non text-initial sections only 1 in 230 was key
31
Other patterns
recency
superlatives
research, report
attention-grabbers
indefinite articles
IT + reporting verb; SAID + time
32
References
O’Donnell, Matthew, Mike Scott, Michaela Mahlberg & Michael Hoey (forthcoming) ‘When the text counts’: Exploring the Implications of ‘text’ as unit in corpus linguistics. Paper presented at PALC, Łódź, April 2007.