Assignments
Basic idea is to choose a topic of your own, or to take a study found in the literature
Report is in two parts
–Description of the problem and review of the relevant literature (not just the study you are going to replicate, but related work too)
–Description and discussion of your own results
First part ( words) due on Friday 25 April
Second part ( words) due on Friday 9 May
No overlap allowed with LELA30122 projects
–Though you are free to use that list of topics for inspiration
–See LELA30122 WebCT page, “project report”
Church et al.
K. Church, W. Gale, P. Hanks, D. Hindle (1991) “Using Statistics in Lexical Analysis”, in U. Zernik (ed.) Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon. Hillsdale, NJ: Lawrence Erlbaum, pp.
Background
Corpora were becoming more widespread and bigger
Computers were becoming more powerful
But the tools for handling them were still relatively primitive
Use of corpora for lexicology
Written for the First International Workshop on Lexical Acquisition, Detroit 1989
In fact there was no “Second IWLA”
But this paper (and others in the collection) became much cited and well known
The problem
Assuming a lexicographer has at their disposal a reference corpus of considerable size, …
A typical concordance listing only works well with
–words with just two or three major sense divisions
–preferably well distinct
–and generating only a pageful of hits
Even then, the information you are interested in may not be in the immediate vicinity
The solution
Information Retrieval faces a comparable problem (overwhelming data), and suggests a solution:
1. Choose an appropriate statistic to highlight information “hidden” in the corpus
2. Preprocess the corpus to highlight properties of interest
3. Select an appropriate unit of text to constrain the information extracted
Mutual Information
MI: a measure of similarity
Compares the joint probability of observing two words together with the probabilities of observing them independently (chance): I(x;y) = log2 [ P(x,y) / (P(x)P(y)) ]
If there is a genuine association, I(x;y) >> 0
If no association, P(x,y) ≈ P(x)P(y), so I(x;y) ≈ 0
If complementary distribution, I(x;y) << 0
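As a rough illustration (my own sketch, not code from the paper), the MI score over adjacent word pairs can be estimated like this; the corpus is assumed to be a plain list of tokens, and the function name is an invention for the example:

    from collections import Counter
    from math import log2

    def mutual_information(tokens, x, y):
        """Pointwise MI, I(x;y) = log2(P(x,y) / (P(x)P(y))), estimated from
        unigram counts and counts of adjacent bigrams (x immediately before y)."""
        n = len(tokens)
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        p_xy = bigrams[(x, y)] / n
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        if min(p_xy, p_x, p_y) == 0:
            return float("-inf")  # unseen word or pair: MI not estimable
        return log2(p_xy / (p_x * p_y))

    # e.g. mutual_information(open("corpus.txt").read().split(), "strong", "support")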
Top ten scoring pairs of strong y and powerful y
Data from AP corpus, N = 44.3m words
Mutual Information
Can be used to demonstrate a strong association
Counts can be based on immediate neighbourhood, as in the previous slide, or on co-occurrence within a window (to left or right or both), or within the same sentence, paragraph, etc.
MI shows strongly associated word pairs, but cannot show the difference between, eg, strong and powerful
t-test
A measure of dissimilarity
How do we explain the relative strength of collocations such as
–strong tea ~ powerful tea
–powerful car ~ strong car
The less usual combination is either rejected, or has a marked contrastive meaning
Use the example of {strong|powerful} support because tea is rather infrequent in the AP corpus
{strong|powerful} support
MI can’t help: very difficult to get a value of I(powerful;support) << 0 because of the size of the corpus
–Say x and y both occur about 10 times per 1m words in a corpus
–P(x) = P(y) = 10/1,000,000 = 10^-5, and chance P(x)P(y) = 10^-10
–I(powerful;support) << 0 would mean P(x,y) << P(x)P(y)
–ie much less than 1 in 10,000,000,000
–Hard to say with confidence
Rephrase the question
Can’t ask “what doesn’t collocate with powerful?”
Also, can’t show that powerful support is less likely than chance: in fact it isn’t
–I(powerful;support) = 1.74
–about 3 times greater than chance (2^1.74 ≈ 3.3)!
Instead, try to compare which words are more likely to appear after strong than after powerful
Show that strong support is relatively more likely than powerful support
t-test
Null hypothesis (H0)
–H0 says that there is no real difference between the two scores (any observed difference is due to chance)
H0 can be rejected if
–the difference is at least 1.65 standard deviations
–95% confidence
–ie the difference is real
t-test
Comparison of powerful support with chance is not significant: t = 0.99 (less than 1 sd!)
But if we compare powerful support with strong support, t = –13
Strongly suggests there is a difference
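A minimal sketch of the comparative score, assuming the usual approximation in this line of work: the difference between the two bigram counts divided by the square root of their sum (each count treated as roughly Poisson). The corpus representation and function name are my own, not the paper's:

    from collections import Counter
    from math import sqrt

    def compare_collocations(tokens, w1, w2, w):
        """t-score for the difference between the collocations (w1 w) and (w2 w):
        difference of the bigram counts over the square root of their sum."""
        bigrams = Counter(zip(tokens, tokens[1:]))
        f1, f2 = bigrams[(w1, w)], bigrams[(w2, w)]
        if f1 + f2 == 0:
            return 0.0
        return (f1 - f2) / sqrt(f1 + f2)

    # e.g. compare_collocations(tokens, "strong", "powerful", "support")
    # a large positive t suggests w really is more likely after w1 than after w2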
MI and t-score show different things
How is this useful?
Helps lexicographers recognize significant patterns
Especially useful for learners’ dictionaries, to make explicit the difference in distribution between near synonyms
eg what is the difference between a strong nation and a powerful nation?
–Strong as in strong defense, strong economy, strong growth
–Powerful as in powerful posts, powerful figure, powerful presidency
Taking advantage of POS tags
Looking at context in terms of POS rather than lexical items may be more informative
For example, how can we distinguish to as an infinitive marker from to as a preposition?
Look at the words which immediately precede to
–able to, began to, … vs back to, according to, …
The t-score can show that they have different distributions
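As an illustrative sketch only (the tag labels, corpus format, and function name are assumptions, not the paper's; adjust the tags to your own tagset), one way to rank preceding words by how strongly they favour one reading of to over the other:

    from collections import Counter
    from math import sqrt

    def preceding_word_scores(tagged, target="to", tag_a="TO", tag_b="IN"):
        """tagged: list of (word, tag) pairs. For each word immediately preceding
        `target`, compute a t-like score of the count difference between the
        tag_a (infinitive marker) and tag_b (preposition) readings."""
        prev_a, prev_b = Counter(), Counter()
        for (w_prev, _), (w, tag) in zip(tagged, tagged[1:]):
            if w.lower() != target:
                continue
            if tag == tag_a:
                prev_a[w_prev.lower()] += 1
            elif tag == tag_b:
                prev_b[w_prev.lower()] += 1
        scores = {w: (prev_a[w] - prev_b[w]) / sqrt(prev_a[w] + prev_b[w])
                  for w in set(prev_a) | set(prev_b)}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # positive scores (able, began, …) favour the infinitive marker;
    # negative scores (back, according, …) favour the preposition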
Similar investigation with the subordinate conjunction that (fact that, say that, that the, that he) and the demonstrative pronoun that (that of, that is, in that, to that)
Look at both the preceding and the following word
The distribution is so distinctive that this process can help us to spot tagging errors
[Table: words w ranked by t-score, comparing the contexts w that/cs (subordinate conjunction) and w that/dt (demonstrative pronoun); columns t, w that/cs, w that/dt; eg so/cs …, of/in …]
If your corpus is parsed
Looking for word sequences can be limiting
More useful if you can extract things like the subjects and objects of verbs
(Can be done to some extent by specifying POS tags within a window, but that’s very noisy)
Assuming you can easily extract, eg, Ss, Vs, and Os (subjects, verbs, objects) …
What kinds of things do boats do?
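As a present-day stand-in for the parsed-corpus idea (this is my own sketch, not the paper's tooling), a dependency parser such as spaCy can pull out the verbs whose subject is boat; it assumes spaCy and its en_core_web_sm model are installed:

    import spacy
    from collections import Counter

    # requires: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def verbs_with_subject(texts, noun="boat"):
        """Count the verbs whose grammatical subject is `noun` ('what do boats do?')."""
        verbs = Counter()
        for doc in nlp.pipe(texts):
            for tok in doc:
                if tok.dep_ == "nsubj" and tok.lemma_.lower() == noun:
                    verbs[tok.head.lemma_] += 1
        return verbs.most_common(20)

    # e.g. verbs_with_subject(["The boat sank near the shore.", "Boats carry cargo."])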
What is an appropriate unit of text?
Mostly we have looked at neighbouring words, or words within a defined context
Bigger discourse units can also provide useful information
eg taking the entire text as the unit:
–How do stories that mention food differ from stories that mention water?
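A minimal sketch of the document-as-unit idea (my own formulation, not the paper's code): count the vocabulary of stories containing one cue word versus the other, then rank words by a t-like score of the difference:

    from collections import Counter
    from math import sqrt

    def contrast_stories(stories, word_a="food", word_b="water"):
        """stories: list of raw-text documents. Count each word type once per story,
        separately for stories mentioning word_a and stories mentioning word_b,
        then rank words by (f_a - f_b) / sqrt(f_a + f_b)."""
        counts_a, counts_b = Counter(), Counter()
        for story in stories:
            types = set(story.lower().split())
            if word_a in types:
                counts_a.update(types)
            if word_b in types:
                counts_b.update(types)
        scores = {w: (counts_a[w] - counts_b[w]) / sqrt(counts_a[w] + counts_b[w])
                  for w in set(counts_a) | set(counts_b)}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # top of the list: words typical of food stories; bottom: typical of water stories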
More subtle distinctions can be brought out in this way
What’s the difference between a boat and a ship?
Notice how the immediately neighbouring words won’t necessarily tell much of a story
But the words found in stories that mention boats/ships help to characterize the difference in distribution, and give a clue as to the difference in meaning
Notice that the human lexicographer still has to interpret the data
Word-sense disambiguation
The article also shows how you can distinguish two senses of bank
–Identify words which occur in the same text as bank and river on the one hand, and bank and money on the other
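A very rough sketch of the idea (the cue-word lists below are illustrative assumptions, not the lists derived in the paper): label a story's occurrences of bank by which kind of vocabulary dominates the story as a whole:

    def disambiguate_bank(story,
                          river_cues=frozenset({"river", "water", "boat", "shore", "fisherman"}),
                          money_cues=frozenset({"money", "funds", "cash", "loan", "deposits"})):
        """Return a coarse sense label for 'bank' in this story, based on which
        set of cue words occurs more often in the story as a whole."""
        types = set(story.lower().split())
        river_hits = len(types & river_cues)
        money_hits = len(types & money_cues)
        if river_hits == money_hits:
            return "bank (undecided)"
        return "bank (river)" if river_hits > money_hits else "bank (money)"

    # e.g. disambiguate_bank("the fisherman moored his boat on the river bank")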
bank (river) vs bank (money)
Words (ranked by t-score) from stories containing both bank and river: river, River, water, feet, miles, near, boat, south, fisherman, along, border, area, village, drinking, across, east, century, missing
Words (ranked by t-score) from stories containing both bank and money: money, Bank, funds, billion, Washington, Federal, cash, interest, financial, Corp, loans, loan, amount, fund, William, company, account, deposits
Bank vs bank
Words (ranked by t-score) associated with Bank (with initial capital): Gaza, Palestinian, Israeli, Strip, Palestinians, Israel, Bank, occupied, Arab, territories
Words (ranked by t-score) associated with bank (lower case): bank, money, federal, company, accounts, central, cash, business, loans, robbery