Slide 1: Special Topics in Computer Science. Advanced Topics in Information Retrieval.
Lecture 9: Natural Language Processing and IR: Tagging, WSD, and Anaphora Resolution.
Alexander Gelbukh, www.Gelbukh.com

Slide 2: Previous Chapter: Conclusions
Reducing synonyms can help IR: better matching. Ontologies (e.g. WordNet) are used.
Morphology is a variant of synonymy widely used in IR systems.
Precise analysis: dictionary-based analyzers. Quick-and-dirty analysis: stemmers (rule-based stemmers such as the Porter stemmer, and statistical stemmers).

Slide 3: Previous Chapter: Research Topics
Construction and application of ontologies.
Building of morphological dictionaries.
Treatment of unknown words with morphological analyzers.
Development of better stemmers. Statistical stemmers?

Slide 4: Contents
Tagging: for each word, determine its POS (part of speech: noun, ...) and grammatical characteristics.
WSD (word sense disambiguation): for each word, determine which homonym is used.
Anaphora resolution: for a pronoun (it, ...), determine what it refers to.

Slide 5: Tagging: The Problem
Ambiguity of parts of speech: "rice flies like sand"
  = insects living in rice consider sand good? ("flies" as a noun, "like" as a verb)
  = rice can fly similarly to sand? ("flies" as a verb, "like" as a preposition)
  = insects from a container with rice ...?
Compare: "We can fly like sand", "We think flies like sand".
Ambiguity of grammatical characteristics: "He has read the book", "He will read the book", ..., "He read the book".
A very frequent phenomenon, at nearly every word!

Slide 6: Tagger...
A program that looks at the context and decides what the part of speech (and other characteristics) of each word is.
Input: "He will read the book".
Morphological analysis gives each word its set of possible tags: He (?), will (?), read (?), the (?), book (?).
Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past, Vpp = verb past participle, Vinf = verb infinitive, ...

Slide 7: ...Tagger
Input of the tagger: "He will read the book", each word with its possible tags.
Task: choose one tag per word!
Output: "He will read the book" with a single tag on each word.
How do we do it? Some combinations are not possible: after "will" (Va), "read" must be Vinf, not Vpa.
This looks simple, but imagine that "He" is also ambiguous... combinatorial explosion.
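
To see what a ready-made tagger produces on this sentence, here is a minimal sketch using NLTK's off-the-shelf English tagger. The library, its Penn Treebank tag set, and the download calls are assumptions for illustration, not part of the lecture.

```python
# A peek at what an off-the-shelf tagger does with this sentence, using NLTK
# (assumes nltk is installed and its models downloaded, e.g.
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')).
import nltk

tokens = nltk.word_tokenize("He will read the book")
print(nltk.pos_tag(tokens))
# Penn Treebank tags: [('He', 'PRP'), ('will', 'MD'), ('read', 'VB'),
#                      ('the', 'DT'), ('book', 'NN')]
```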

Slide 8: Applications
Used for word sense disambiguation: "Oil well in Mexico is used" vs. "Oil is used well in Mexico" ("well" as a noun vs. an adverb).
For stemming and lemmatization: important for matching in information retrieval.
Greatly speeds up syntactic analysis. Tagging is local: no need to process the whole sentence to find that a certain tag is incorrect.

Slide 9: How: Parsing?
We could find all the syntactic structures; only the correct tag variants will enter a syntactic structure: "will" + Vinf forms a syntactic unit, "will" + Vpa does not.
Problems: computationally expensive, and what to do with genuine ambiguities ("rice flies like sand")? Depends on what you need.

Slide 10: Statistical Tagger
Example: the TnT tagger, based on a Hidden Markov Model (HMM).
Idea: some words are more probable after some other words. Find these probabilities; guess a word if you know the nearby ones.
Problem: letter strings denote meanings. "x is more probable after y" holds for the meanings (tags), not the strings, so we must guess what we cannot see: the meanings.
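
The TnT tagger has a reimplementation in NLTK; below is a minimal training sketch. The use of the bundled Penn Treebank sample and the 90/10 split are illustrative assumptions, not the setup described in the lecture.

```python
# Training NLTK's TnT reimplementation on the bundled Penn Treebank sample
# (requires nltk and nltk.download('treebank')).
from nltk.corpus import treebank
from nltk.tag import tnt

tagged_sents = list(treebank.tagged_sents())   # sentences of (word, tag) pairs
split = int(0.9 * len(tagged_sents))           # hold out 10% for later testing

tagger = tnt.TnT()
tagger.train(tagged_sents[:split])             # collect n-gram tag statistics

print(tagger.tag("He will read the book".split()))
```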

Slide 11: Hidden Markov Model: Idea
A system changes its state (what a person thinks): random... but not completely (how?).
In each state, it emits an output (what the person says when thinking something): random... but it somehow depends on what he thinks.
We know the sequence of produced outputs (the text: we can see it!).
Guess what the underlying states were. They are hidden: we cannot see them.

Slide 12: Hidden Markov Model: Hypotheses
A finite set of states q_1 ... q_N (invisible): POS and grammatical characteristics, given by the language.
A finite set of observations v_1 ... v_M: the strings we see in the corpus.
A random sequence of states x_i: the POS of the words in the text.
Probabilities of state transitions P(x_{i+1} | x_i): language rules and use.
Probabilities of observations P(v_k | x_i): the words expressing the meanings, e.g. Vinf: ask, V3: asks.

Slide 13: Hidden Markov Model: Problem
The same observation can correspond to different meanings: Vinf: read, Vpp: read.
Looking at what we can see, guess what we cannot; this is why the model is called "hidden".
Given a sequence of observations o_i (the text: a sequence of letter strings; the training set), guess the sequence of states x_i (the POS of each word).
Our hypotheses on the x_i depend on each other: a highly combinatorial task.

Slide 14: Hidden Markov Model: Solutions
We need to find the parameters of the model, P(x_{i+1} | x_i) and P(v_k | x_i), in the optimal way: maximizing the probability of generating this specific output.
Optimization methods from Operations Research are used.
More details? Not so simple...
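
Once the parameters are known, decoding (finding the most probable hidden tag sequence for a given sentence) is usually done with the Viterbi dynamic-programming algorithm. Below is a toy sketch: the tag names extend the slide's inventory with "Pron" and "Det", and all probabilities are invented for illustration, not learned from a corpus.

```python
# A toy Viterbi decoder for the HMM described above: given transition
# probabilities P(x_{i+1} | x_i) and emission probabilities P(v_k | x_i),
# find the most probable hidden tag sequence for an observed sentence.
import math

states = ["Pron", "Va", "Vinf", "Vpa", "Det", "Ns"]
trans = {   # P(next tag | tag); "<s>" is the start-of-sentence state
    "<s>":  {"Pron": 0.6, "Det": 0.4},
    "Pron": {"Va": 0.6, "Vpa": 0.3, "Vinf": 0.1},
    "Va":   {"Vinf": 0.9, "Ns": 0.1},
    "Vinf": {"Det": 0.8, "Ns": 0.2},
    "Vpa":  {"Det": 0.8, "Ns": 0.2},
    "Det":  {"Ns": 1.0},
    "Ns":   {},
}
emit = {    # P(word | tag)
    "Pron": {"he": 1.0},
    "Va":   {"will": 0.8},
    "Vinf": {"read": 0.5},
    "Vpa":  {"read": 0.5},
    "Det":  {"the": 0.9},
    "Ns":   {"book": 0.3, "will": 0.1},
}
LOW = 1e-12  # floor for unseen transitions/emissions

def viterbi(words):
    # best[i][tag] = (log-prob of the best path ending in tag at position i,
    #                 tag at position i-1 on that path)
    best = [{t: (math.log(trans["<s>"].get(t, LOW) * emit[t].get(words[0], LOW)),
                 None) for t in states}]
    for word in words[1:]:
        prev_col, col = best[-1], {}
        for t in states:
            col[t] = max(
                (score + math.log(trans[p].get(t, LOW) * emit[t].get(word, LOW)), p)
                for p, (score, _) in prev_col.items()
            )
        best.append(col)
    # backtrack from the most probable final tag
    path = [max(best[-1], key=lambda t: best[-1][t][0])]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    return list(reversed(path))

print(viterbi("he will read the book".split()))
# ['Pron', 'Va', 'Vinf', 'Det', 'Ns']  -- "read" decoded as Vinf after "will"
```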

Slide 15: Brill Tagger (rule-based)
Eric Brill.
Makes an initial assumption about the POS tags in the text, then uses context-dependent rewriting rules to correct some tags, applying them iteratively.
Learns the rules from a training corpus.
The rules are in human-understandable form: you can correct them manually to improve the tagger, unlike HMM parameters, which are not understandable.
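
A toy sketch of the Brill idea follows: start from each word's most frequent tag, then patch it with readable context rules. The initial tag table, the single rule, and the "Pron"/"Det" tag names are invented examples, not Brill's learned rule set.

```python
# Toy Brill-style tagging: a unigram baseline plus context-dependent
# rewriting rules applied over the sentence.
initial_tags = {"he": "Pron", "will": "Va", "read": "Vpa", "the": "Det", "book": "Ns"}

# (trigger tag, condition on the previous tag, corrected tag)
rules = [
    ("Vpa", lambda prev: prev == "Va", "Vinf"),   # past form after an auxiliary -> infinitive
]

def brill_tag(words):
    tags = [initial_tags.get(w, "Ns") for w in words]   # most-frequent-tag baseline
    for i in range(1, len(tags)):
        for trigger, condition, fix in rules:
            if tags[i] == trigger and condition(tags[i - 1]):
                tags[i] = fix                            # rewrite the tag in context
    return list(zip(words, tags))

print(brill_tag("he will read the book".split()))
# [('he', 'Pron'), ('will', 'Va'), ('read', 'Vinf'), ('the', 'Det'), ('book', 'Ns')]
```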

Slide 16: Word Sense Disambiguation
Query: "international bank in Seoul".
"Bank" has many senses: financial institution, river shore, place to store something, ...
"Hotel located at the beautiful bank of the Han river." Is it relevant for the query?
The POS is the same in both senses, so a tagger will not distinguish them.

Slide 17: Applications
Translation, where the sense determines the target word (e.g. "Great Governor of the Court", "10 thousand won"): international bank = banco internacional, river bank = orilla del río.
Information retrieval:
Document retrieval: is it really useful? The document carries the same information either way.
Passage retrieval: can prove very useful!
Semantic analysis.

Slide 18: Representation of Word Senses
1. Explanations (semantic dictionaries): bank_1 is an institution to keep money; bank_2 is a sloping edge of a river.
2. Synsets and an ontology: WordNet (HowNet for Chinese).
Synonyms: {bank, shore}. In WordNet terminology this is a synset, e.g. #12345, corresponding to all the ways to call a concept.
Relationships: #12345 IS_PART_OF #67890 {river, stream}; #987 IS_A #654 {institution, organization}.
WordNet also has glosses.
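
For readers who want to inspect synsets, glosses, and relations directly, NLTK exposes WordNet. A small sketch, assuming nltk and its wordnet data are installed; the exact synset names and glosses depend on the WordNet version.

```python
# Browsing WordNet synsets, glosses, and relations with NLTK
# (requires nltk and nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for s in wn.synsets("bank")[:3]:
    print(s.name(), "-", s.definition())   # synset identifier and its gloss

riverbank = wn.synset("bank.n.01")         # the "sloping land" sense in WordNet 3.0
print(riverbank.lemma_names())             # all the ways to call this concept
print(riverbank.hypernyms())               # IS_A relation: what a bank is a kind of
```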

Slide 19: Task
Given a text (probably POS-tagged), tag each word with its synset number (#123) or dictionary sense number (bank_1).
Input: "Mary keeps the money in a bank. The Han river's bank is beautiful."
Output: "Mary keeps the money in a bank_1. The Han river's bank_2 is beautiful."

Slide 20: Lesk Algorithm
Michael Lesk. Uses an explanatory dictionary: bank_1 is an institution to keep money; bank_2 is a sloping edge of a river.
"Mary keeps her money (savings) in a bank."
Choose the sense whose definition has more words in common with the immediate context.
Improvements (Pedersen; Gelbukh & Sidorov): use synonyms when there are no direct matches; use synonyms of synonyms, ...
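
A bare-bones sketch of the basic Lesk overlap count follows. The two-sense toy dictionary and the stop-word list are invented for illustration; NLTK also ships a ready-made nltk.wsd.lesk based on WordNet glosses.

```python
# Minimal Lesk: pick the sense whose gloss shares the most content words
# with the word's immediate context.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "is", "her", "on"}

glosses = {
    "bank_1": "an institution to keep money",
    "bank_2": "a sloping edge of a river",
}

def lesk(context_sentence, senses=glosses):
    context = set(context_sentence.lower().split()) - STOPWORDS
    def overlap(sense):
        return len(context & (set(senses[sense].split()) - STOPWORDS))
    return max(senses, key=overlap)

print(lesk("Mary keeps her money savings in a bank"))     # bank_1 (shares "money")
print(lesk("the hotel is on the bank of the Han river"))  # bank_2 (shares "river")
```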

Slide 21: Other Word Relatedness Measures
Lexical chains in WordNet: the length of the path in the graph of relationships.
Mutual information: frequent co-occurrences.
Collocations (Bolshakov & Gelbukh): "keep in a bank" indicates bank_1, "bank of a river" indicates bank_2; a very large dictionary of such combinations.
Number of words in common between explanations; recursively, common words or related words (Gelbukh & Sidorov).
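
The path-length measure over the WordNet relation graph is available out of the box in NLTK. A small sketch; the synset choices are illustrative and the exact scores depend on the WordNet version.

```python
# WordNet path similarity: scores lie in (0, 1], closer concepts score higher.
from nltk.corpus import wordnet as wn

shore = wn.synset("shore.n.01")
river = wn.synset("river.n.01")
institution = wn.synset("institution.n.01")

print(shore.path_similarity(river))        # e.g. a relatively high score
print(shore.path_similarity(institution))  # e.g. a lower score
```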

Slide 22: Other Methods
Hidden Markov Models. Logical reasoning.

Slide 23: Yarowsky's Principles
David Yarowsky: one sense per text! One sense per collocation.
"I keep my money in the bank_1. This is an international bank_1 with a great capital. The bank_2 is located near the Han river."
Three votes for "institution", one for "shore": institution wins, so "The bank_1 is located near the Han river."
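
A toy sketch of the "one sense per text" vote described on this slide; the sense labels and the per-occurrence guesses are invented for illustration.

```python
# One sense per discourse: confident occurrences of a word in one text vote,
# and the winning sense is imposed on all occurrences, including undecided ones.
from collections import Counter

# sense guessed for each occurrence of "bank" in one text (None = undecided)
occurrence_senses = ["bank_1", "bank_1", "bank_2", None, "bank_1"]

votes = Counter(s for s in occurrence_senses if s is not None)
winner, _ = votes.most_common(1)[0]             # 'bank_1' wins 3 to 1
resolved = [winner for _ in occurrence_senses]  # one sense for the whole text
print(resolved)
```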

Slide 24: Anaphora Resolution
Mainly pronouns. Also co-reference: when do two words refer to the same thing?
"John took the cake from the table and ate it." vs. "John took the cake from the table and washed it."
Translation into Spanish: "la" (feminine, the table) vs. "lo" (masculine, the cake).
Methods: dictionaries, different sources of evidence, logical reasoning.

Slide 25: Applications
Translation.
Information retrieval: can improve frequency counts (?).
Passage retrieval: can be very important.

Slide 26: Mitkov's Knowledge-Poor Method
Ruslan Mitkov. A rule-based and statistics-based approach.
Uses simple information on POS and general word classes; combines different sources of evidence.
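
A rough sketch in the spirit of such a knowledge-poor, scoring-based approach: score the noun-phrase candidates for a pronoun with shallow indicators and pick the best. The indicators, weights, and example values below are illustrative, not Mitkov's actual antecedent indicators.

```python
# Toy antecedent scoring: combine a few shallow sources of evidence.
def score_candidate(cand, pronoun):
    score = 0
    if cand["number"] == pronoun["number"]:
        score += 2                 # number agreement ("it" wants a singular NP)
    if cand["is_subject"]:
        score += 1                 # subjects are preferred antecedents
    score -= cand["distance"]      # penalize candidates far from the pronoun
    return score

pronoun = {"text": "it", "number": "sg"}
candidates = [
    {"text": "John",      "number": "sg", "is_subject": True,  "distance": 3},
    {"text": "the cake",  "number": "sg", "is_subject": False, "distance": 2},
    {"text": "the table", "number": "sg", "is_subject": False, "distance": 1},
]

best = max(candidates, key=lambda c: score_candidate(c, pronoun))
print(best["text"])                # 'the table' under these toy weights
```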

Slide 27: Hidden Anaphora
"John bought a house. The kitchen is big." = that house's kitchen.
"John was eating. The food was delicious." = that eating's food.
"John was buried. The widow was mad with grief." = that burying's death's widow.
Resolved by intersection of the scenarios of the concepts (Gelbukh & Sidorov): a house has a kitchen; burying results from a death, and a widow results from a death.

Slide 28: Evaluation
Senseval and TREC international competitions (a Korean track is available).
Human-annotated corpora: very expensive, and inter-annotator agreement is often low! A program cannot do what humans cannot do.
Apply the program and compare with the corpus: accuracy. Sometimes the program cannot tag a word: precision and recall.
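
A minimal sketch of this evaluation scheme: compare the program's output with the annotated corpus, counting accuracy over all words, and precision/recall when the program may leave some words untagged. The sense labels and the tiny "corpus" are invented.

```python
# Accuracy, precision, and recall against a hand-annotated gold standard.
def evaluate(gold, predicted):
    answered = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in answered if g == p)
    precision = correct / len(answered) if answered else 0.0   # of the answers given
    recall = correct / len(gold) if gold else 0.0              # of all words to tag
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    return precision, recall, accuracy

gold      = ["bank_1", "bank_2", "bank_1", "bank_1"]
predicted = ["bank_1", "bank_1", None,      "bank_1"]   # one error, one word untagged
print(evaluate(gold, predicted))   # (0.666..., 0.5, 0.5)
```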

Slide 29: Research Topics
Too many to list: new methods, lexical resources (dictionaries), ... = computational linguistics.

Slide 30: Conclusions
Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning.
Useful in translation, information retrieval, and text understanding.
Dictionary-based methods are good but expensive; statistical methods are cheap and sometimes imperfect... but not always (if very large corpora are available).

Slide 31: Thank you! Till May 31? June 1? 6 pm.

