Multiword Expressions. Presented by: Bhuban Seth (09305005), Somya Gupta (10305011), Advait Mohan Raut (09305923), Victor Chakraborty (09305903). Under the guidance of Prof. Pushpak Bhattacharya.



Contents
Introduction
Motivation
Linguistic Levels
Types of MWEs
Approaches to identify MWEs
Limitations
Conclusion
References

Introduction
Put the sweater on
Put the sweater on the table
Put the light on
Roughly defined as: idiosyncratic interpretations that cross word boundaries (or spaces)

Examples
His grandfather kicked the bucket.
This job is a piece of cake.
Put the sweater on.
He is the dark horse of the match.
Google translations of the above sentences render the idioms literally:
अपने दादा बाल्टी लात मारी (roughly, "his grandfather kicked a bucket")
इस काम के केक का एक टुकड़ा है (roughly, "this work is a piece of [actual] cake")
स्वेटर पर रखो (roughly, "place [it] on the sweater")
वह मैच के अंधेरे घोड़ा है (roughly, "he is the dark-coloured horse of the match")

Motivation
The number of multiword expressions is "of the same order of magnitude as the number of single words" in a speaker's lexicon (Jackendoff 1997).
41% of the entries in WordNet 1.7 are multiword (Fellbaum 1999).
Resolution is needed in:
Machine Translation (see the poor Google Translate output above)
Information Retrieval
Tagging, Parsing, Question Answering systems, WSD

Linguistic Levels
Lexicology: in short, ad hoc
Morphology and Syntax: put on weight, put the sweater on
Semantics: spill the beans
Pragmatics: kick the bucket vs. kick the bucket filled with water

How to Handle These?
Variation in flexibility
Syntactic idiomaticity

Types (Sag et al., 2002)

Types - Examples
Fixed: in short, ad hoc, Palo Alto, Alta Vista
Compound nominals: congressman, car park, part of speech
Proper names: Deccan Chargers, Delhi Daredevils
Non-decomposable idioms: kick the bucket
Decomposable idioms: spill the beans, let the cat out
Verb-particle constructions: take off, put on
Light-verb constructions: give a demo, take a shower
Institutionalized phrases: black and white, traffic light, telephone booth

Approaches

Knowledge-Based Approach
1) Words-with-spaces (fixed expressions): a stemmer may be used to detect MWEs, but it fails. Why? "Kicks the bucket" → MWE, but "kick the buckets" → not an MWE. Princeton WordNet shares this flaw.
2) Circumscribed constructions: consecutive nouns are most probably an MWE.
3) Inflecting head (semi-fixed expressions), e.g. part of speech → parts of speech.
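The "consecutive nouns" heuristic from point 2 can be sketched as follows. The hand-tagged sentence and Penn Treebank-style tags below are assumed for illustration; in practice a POS tagger would supply them.

```python
def noun_compound_candidates(tagged):
    """Return runs of two or more consecutive nouns as candidate MWEs."""
    candidates, run = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):   # NN, NNS, NNP, ... in Penn Treebank tags
            run.append(word)
        else:
            if len(run) >= 2:
                candidates.append(" ".join(run))
            run = []
    if len(run) >= 2:              # flush a run that ends the sentence
        candidates.append(" ".join(run))
    return candidates

# Hand-annotated toy sentence, for illustration only.
tagged = [("the", "DT"), ("car", "NN"), ("park", "NN"),
          ("near", "IN"), ("the", "DT"), ("telephone", "NN"),
          ("booth", "NN"), ("is", "VBZ"), ("full", "JJ")]
print(noun_compound_candidates(tagged))   # ['car park', 'telephone booth']
```

Note this is only a candidate generator: it proposes noun-noun sequences, which a later (statistical) step must confirm or reject.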

Statistical Approaches
Co-occurrence properties
Substitutability
Distributional Similarity
Semantic Similarity

Co-occurrence properties
Example: black and white
Scan a corpus and estimate probabilities of bigrams and trigrams.
P(X|Y) = P(YX) / P(Y)
If P(X|Y) is high, there is a chance that the word sequence "YX" is an MWE.
Demerit: frequent but fully compositional sequences also score high, e.g. "I am" is common yet not an MWE.
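A minimal sketch of this frequency-based idea; the toy corpus is invented for illustration:

```python
from collections import Counter

# Toy corpus, tokenized by whitespace (invented for illustration).
corpus = ("black and white photos . black and white film . "
          "the black cat and the white dog .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(y, x):
    """Estimate P(X=x | Y=y): probability that x immediately follows y."""
    return bigrams[(y, x)] / unigrams[y]

print(p_next("black", "and"))   # 2/3: "black and" in 2 of 3 uses of "black"
print(p_next("the", "black"))   # 0.5
```

As the demerit above notes, a high conditional probability alone is not enough: "the black" scores 0.5 here without being an MWE.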

Point-wise Mutual Information (PMI)
PMI(X,Y) = log { P(X,Y) / (P(X) · P(Y)) }
The PMI of a word pair (X,Y) measures the strength of their collocation.
Other methods such as Student's t-test and Pearson's chi-square test can also be used.
Demerit: need to differentiate between systematic and chance co-occurrence.
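The PMI formula above can be estimated directly from unigram and bigram counts; the toy corpus below is invented for illustration:

```python
import math
from collections import Counter

# Toy corpus, tokenized by whitespace (invented for illustration).
corpus = ("he kicked the bucket . she filled the bucket . "
          "kicked the ball . the bucket leaked .").split()

N = len(corpus)
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))

def pmi(x, y):
    """PMI(x, y) = log2( P(x,y) / (P(x) P(y)) ), estimated from counts."""
    p_xy = bi[(x, y)] / (N - 1)          # bigram probability
    p_x, p_y = uni[x] / N, uni[y] / N    # unigram probabilities
    return math.log2(p_xy / (p_x * p_y))

print(pmi("the", "bucket"))   # positive: they co-occur more than chance
```

A positive PMI says the pair co-occurs more often than independence would predict; on corpora this size the estimate is of course very noisy, which is exactly the sparse-data demerit the slide mentions.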

Pearson's chi-square test
Unlike the t-test, it does not rely on an assumption of normally distributed word frequencies.
Null hypothesis: the words are independent of each other.
The higher the value of the chi-square statistic, the stronger the association between the words.
Demerit: for small data collections the chi-square approximation does not hold, so a large corpus is required.
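The statistic for one candidate bigram (w1, w2) is computed from the standard 2×2 contingency table of bigram counts; the counts below are invented for illustration:

```python
# Cells of the 2x2 table:
#   o11 = count(w1 w2)    o12 = count(w1, not w2)
#   o21 = count(not w1, w2)   o22 = count(not w1, not w2)
def chi_square(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2x2 contingency table
    (closed form, equivalent to summing (O-E)^2 / E over the cells)."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Invented counts for a strongly associated pair: well above the
# 3.84 critical value at p = 0.05, so independence is rejected.
print(chi_square(20, 80, 30, 9870))
```

The critical value 3.84 (chi-square, 1 degree of freedom, p = 0.05) is the usual decision threshold for this table.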

Substitutability
The ability to replace parts of a lexical item with alternatives (similar or opposite words, depending on the task and approach).
For a true MWE, the substituted phrase usually stops being one: it loses the idiomatic meaning or is rarely attested.
Can therefore be used to filter out candidates that substitute freely, i.e. likely non-MWEs.
Src: Kim, 2008
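One way to operationalize this is to compare corpus frequencies before and after substitution. The frequency table, synonym pairs, and threshold below are all hypothetical, invented for illustration:

```python
# Hypothetical corpus frequencies of candidate phrases and their
# synonym-substituted variants (invented for illustration).
phrase_freq = {
    "kick the bucket": 320, "kick the pail": 1,
    "fast food": 5400, "quick food": 12,
    "red car": 800, "crimson car": 90,
}

def substitutable(phrase, variant, threshold=0.1):
    """True if the synonym variant keeps at least `threshold` of the
    original phrase's frequency, i.e. the phrase substitutes freely
    and is therefore likely compositional (a non-MWE)."""
    return phrase_freq.get(variant, 0) / phrase_freq[phrase] >= threshold

print(substitutable("kick the bucket", "kick the pail"))  # False -> MWE-like
print(substitutable("red car", "crimson car"))            # True -> compositional
```

In a real system the synonyms would come from a lexical resource such as WordNet and the frequencies from a large corpus, as the resources slide below lists.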

Distributional Similarity
A method to estimate semantic similarity from context: when two words are similar, their context words tend to be similar too.
Src: Kim, 2008
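A minimal sketch of this idea: represent each word by a bag of the words in a small window around it, then compare words by cosine similarity of those context vectors. The toy corpus and window size are invented for illustration:

```python
import math
from collections import Counter

# Toy corpus, tokenized by whitespace (invented for illustration).
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat chased the dog .").split()

def context_vector(word, window=2):
    """Counts of words occurring within +/-window tokens of `word`."""
    ctx = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    ctx[corpus[j]] += 1
    return ctx

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

# "cat" and "dog" appear in near-identical contexts, so similarity is high.
print(cosine(context_vector("cat"), context_vector("dog")))
```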

Semantic Similarity
Similar noun compounds (NCs) tend to share the same semantic relations.
Src: Kim, 2008

Method Src: Kim, 2008

MWE Resources
Corpora: British National Corpus (BNC), Brown Corpus
Lexical Resources: WordNet; Moby's Thesaurus (about 30K root words, 2.5M synonyms and related words)
Tools: WordNet::Similarity (gives a measure of semantic similarity between two given words)

Limitations of Current Approaches
Many NLP approaches treat MWEs with the words-with-spaces method.
Many approaches get commonly attested MWE usages right, sometimes using ad hoc methods such as preprocessing.
However, most approaches handle variation badly, fail to generalize, and result in NLP systems that are difficult to maintain and extend.

Conclusion
MWEs can be classified into lexicalized phrases (fixed, semi-fixed, and syntactically flexible) and institutionalized phrases.
MWE analysis is as important to NLP as other tasks such as MT or WSD.
A hybrid approach is probably the best method so far for extracting MWEs from a corpus.

References
Kim, S. N. (2008). Statistical Modeling of Multiword Expressions.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing).
Calzolari, N., et al. (2002). Towards Best Practice for Multiword Expressions in Computational Lexicons. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC).

Thank You Questions???