DSC 2008 – 26-27 June 2008, Thessaloniki, Greece Automatic Acquisition of Synonyms Using the Web as a Corpus Svetlin Nakov, Sofia University "St. Kliment.

Presentation transcript:

DSC 2008 – 26-27 June 2008, Thessaloniki, Greece
Automatic Acquisition of Synonyms Using the Web as a Corpus
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
3rd Annual South East European Doctoral Student Conference (DSC2008): Infusing Knowledge and Research in South East Europe

Introduction
  We want to automatically extract all pairs of synonyms in a given text.
  Our goal: design an algorithm that can distinguish between synonyms and non-synonyms.
  Our approach: measure semantic similarity using the Web as a corpus.
  Synonyms are expected to have higher semantic similarity than non-synonyms.

The Paper in One Slide
Measuring semantic similarity:
  Analyze the local contexts of the words.
  Use the Web as a corpus.
  Similar contexts imply similar words.
  TF.IDF weighting and reverse context lookup.
Evaluation:
  94 words (Russian fine arts terminology).
  50 synonym pairs to be found.
  11pt average precision: 63.16%.

Contextual Web Similarity
What is local context?
  A few words before and after the target word.
  The words in the local context of a given word are semantically related to it.
  Stop words (prepositions, pronouns, conjunctions, etc.) must be excluded because they appear in all contexts.
  A sufficiently large corpus is needed.
Example: "Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers."
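The local-context idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `local_context`, the tiny `STOP_WORDS` set, and the choice to drop stop words before measuring the window are all assumptions. The window of 3 words matches the context size reported for the SIM algorithm later in the talk, and no lemmatization is applied here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "by", "for", "from", "of", "our", "the", "to"}


def local_context(snippets, target, window=3, stop_words=STOP_WORDS):
    """Count the non-stop words found within `window` positions of `target`."""
    counts = Counter()
    for snippet in snippets:
        tokens = re.findall(r"\w+", snippet.lower())
        # Assumption: stop words are removed before the window is measured.
        tokens = [t for t in tokens if t not in stop_words]
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(w for w in neighbours if w != target)
    return counts


# The two example sentences from the slide, treated as separate snippets.
snippets = [
    "Same day delivery of fresh flowers, roses, and unique gift baskets "
    "from our online boutique.",
    "Flower delivery online by local florists for birthday flowers.",
]
context = local_context(snippets, "flowers")
```

Here `context` maps each neighbouring word (for example "fresh" or "roses") to the number of times it occurred near "flowers"; a real run would aggregate many snippets per word.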

Contextual Web Similarity: the Web as a corpus
  The Web can be used as a corpus from which to extract the local context of a given word.
  The Web is the largest available corpus and contains large corpora in any language.
  Searching for a word in Google returns a limited number of text snippets.
  Each snippet shows the target word together with its local context: a few words before and after it.
  The target language can be specified.

Contextual Web Similarity: the Web as a corpus
Example: Google query for "flower"
  Flowers, Plants, Gift Baskets FLOWERS.COM - Your Florist... Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by FLOWERS.COM, Your Florist of Choice for over 30 years.
  Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses... Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.
  Flowers, plants, roses, & gifts. Flowers delivery with fewer... Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.

Contextual Web Similarity: measuring semantic similarity
  For two given words, their local contexts are extracted from the Web; each context is a set of words with their frequencies.
  Semantic similarity is measured as the similarity between these local contexts.
  Local contexts are represented as frequency vectors over a fixed set of words.
  The cosine between the frequency vectors in Euclidean space is calculated.
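As a rough sketch of the cosine step, the frequency vectors can be stored as sparse dictionaries (context word to count), so that words absent from a context simply count as zero. The function name `cosine` and the toy vectors below are illustrative assumptions, not data from the talk.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two sparse frequency vectors (dicts)."""
    dot = sum(freq * v2.get(word, 0) for word, freq in v1.items())
    norm1 = math.sqrt(sum(f * f for f in v1.values()))
    norm2 = math.sqrt(sum(f * f for f in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # An empty context is treated as similar to nothing.
    return dot / (norm1 * norm2)


# Hypothetical toy contexts; real vectors would span thousands of words.
flower = {"fresh": 2, "order": 1, "rose": 2}
computer = {"order": 2, "pc": 1}
similarity = cosine(flower, computer)
```

With non-negative counts the result lies between 0 (no shared context words) and 1 (contexts with identical proportions).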

Contextual Web Similarity: example of context word frequencies

word: flower
  word        count
  fresh       217
  order       204
  rose        183
  delivery    165
  gift        124
  welcome     98
  red         87
  ...

word: computer
  word        count
  Internet    291
  PC          286
  technology  252
  order       185
  new         174
  Web         159
  site        ...

Contextual Web Similarity: example of frequency vectors
Similarity = cosine(v1, v2)

  #     word       v1: flower  v2: computer
  0     alias      3           7
  1     alligator  2           0
  2     amateur    0           8
  3     apple      ...         ...
  ...
        zap        0           3
  5000  zoo        6           0

TF.IDF Weighting
TF.IDF (term frequency times inverse document frequency):
  A statistical measure used in information retrieval.
  Shows how important a word is for a given document within a set of documents.
  Increases proportionally to the number of occurrences of the word in the document.
  Decreases as the number of documents containing the word grows.
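One common TF.IDF variant multiplies the raw count by log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below applies that variant to context vectors, treating each target word's context as one "document". Both the exact formula and that mapping are assumptions, since the slides do not state which variant was used; `tf_idf` and the sample data are hypothetical.

```python
import math


def tf_idf(contexts):
    """Re-weight context counts as tf * log(N / df).

    contexts: dict mapping target word -> {context word: raw count}.
    """
    n_docs = len(contexts)
    # Document frequency: in how many contexts does each term occur?
    df = {}
    for ctx in contexts.values():
        for term in ctx:
            df[term] = df.get(term, 0) + 1
    return {
        word: {term: tf * math.log(n_docs / df[term]) for term, tf in ctx.items()}
        for word, ctx in contexts.items()
    }


# Hypothetical counts: "order" occurs in every context, so its weight drops to 0.
contexts = {
    "flower": {"fresh": 3, "order": 2},
    "computer": {"pc": 4, "order": 2},
}
weighted = tf_idf(contexts)
```

Terms that appear in every context, such as "order" here, receive weight zero, which is the intended effect for words that carry no discriminating information.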

Reverse Context Lookup
The local context extracted from the Web can contain arbitrary parasitic words such as "online", "home", "search", "click", etc.
  Such Internet terms appear on almost any Web page.
  They are unlikely to be associated with the target word.
Example (for the word "flowers"): "send flowers online", "flowers here", "order flowers here".
  Will the word "flowers" appear in the local context of "send", "online" and "here"?

Reverse Context Lookup
If two words are semantically related, then each of them should appear in the local context of the other.
Let #(x, y) be the number of occurrences of x in the local context of y.
For any word w and a word w_c from its local context, we define their strength of semantic association p(w, w_c) as:
  p(w, w_c) = min{ #(w, w_c), #(w_c, w) }
We use p(w, w_c) as the vector coordinates.
We introduce a minimal occurrence threshold (e.g. 5) to filter out words that appear just by chance.
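The definition of p(w, w_c) translates directly into code. In this sketch `contexts[y][x]` stores #(x, y), the number of occurrences of x in the local context of y. Applying the threshold to p(w, w_c) itself is an assumption, as the slide does not say exactly which count is thresholded; the function name and the sample counts are hypothetical.

```python
def reverse_context_vector(word, contexts, threshold=5):
    """Build the vector of p(word, wc) = min(#(word, wc), #(wc, word)).

    contexts[y][x] holds #(x, y): how often x occurs in the context of y.
    Coordinates below `threshold` are dropped (assumed filtering rule).
    """
    vector = {}
    for wc, forward in contexts.get(word, {}).items():  # forward = #(wc, word)
        backward = contexts.get(wc, {}).get(word, 0)     # backward = #(word, wc)
        strength = min(forward, backward)
        if strength >= threshold:
            vector[wc] = strength
    return vector


# Hypothetical counts: "online" is frequent near "flowers", but "flowers"
# is rare near "online", so the reverse lookup removes it.
contexts = {
    "flowers": {"roses": 12, "online": 30, "gift": 4},
    "roses": {"flowers": 9},
    "online": {"flowers": 2},
    "gift": {"flowers": 6},
}
vector = reverse_context_vector("flowers", contexts)
```

Taking the minimum makes the association symmetric, so a generic Web term survives only if the target word is also typical of that term's own context.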

Data Set
We use a list of 94 Russian words:
  Terms extracted from texts on the subject of fine arts.
  Limited to nouns only.
The data set:
  There are 50 synonym pairs among these words.
  We expect our algorithms to find them.
абрис, адгезия, алмаз, алтарь, амулет, асфальт, беломорит, битум, бородки, ваятель, вермильон, ..., шлифовка, штихель, экспрессивность, экспрессия, эстетизм, эстетство

Experiments
We tested several modifications of our contextual Web similarity algorithm:
  Basic algorithm (without modifications)
  TF.IDF weighting
  Reverse context lookup with different frequency thresholds

Experiments
  RAND – random ordering of all the pairs.
  SIM – the basic algorithm for extracting semantic similarity from the Web:
    Context size of 3 words.
    Without analyzing the reverse context.
    With lemmatization.
  SIM+TFIDF – the SIM algorithm modified with TF.IDF weighting.
  REV2, REV3, REV4, REV5, REV6, REV7 – the SIM algorithm plus reverse context lookup with frequency thresholds of 2, 3, 4, 5, 6 and 7.

Resources Used
We used the following resources:
  Google Web search engine: extracted the first results for Russian words.
  Russian lemma dictionary: wordforms and lemmata.
  A list of 507 Russian stop words.

Evaluation
  Our algorithms rank all pairs of words by their semantic similarity.
  We expect the 50 synonym pairs to be at the top of the result list.
  We count how many synonyms are found in the top N results (e.g. top 5, top 10, etc.).
  We measure precision and recall.
  We use 11pt average precision to evaluate the results.

SIM Algorithm – Results
Precision and recall obtained by the SIM algorithm:

  n  Word 1          Word 2      Semantic Similarity  Synonyms  Precision @ n  Recall @ n
  1  выжигание       пирография  ...                  yes       100.00%        2%
  2  тонирование     тонировка   ...                  yes       100.00%        4%
  3  гематит         кровавик    ...                  yes       100.00%        6%
  4  подрамок        подрамник   ...                  yes       100.00%        8%
  5  оливин          перидот     ...                  yes       100.00%        10%
  6  полирование     шлифование  ...                  no        83.33%         10%
  7  полировка       шлифовка    ...                  no        71.43%         10%
  8  амулет          талисман    ...                  yes       75.00%         12%
  9  пластификаторы  мягчители   ...                  yes       77.78%         14%
  ...
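The precision and recall figures in the SIM results can be reproduced from the yes/no column alone, and the same points feed the 11pt average precision. The sketch below uses the standard interpolated definition (at each recall level 0.0, 0.1, ..., 1.0, take the highest precision reached at that recall or beyond, or 0 if the level is never reached); the slides do not spell out the exact variant, so treat this as an assumption. The function names are hypothetical.

```python
def precision_recall(labels, total_relevant):
    """Return (precision, recall) after each position of a ranked list."""
    points = []
    hits = 0
    for n, is_synonym in enumerate(labels, start=1):
        hits += bool(is_synonym)
        points.append((hits / n, hits / total_relevant))
    return points


def eleven_point_average_precision(labels, total_relevant):
    """Interpolated average precision over recall levels 0.0, 0.1, ..., 1.0."""
    points = precision_recall(labels, total_relevant)
    total = 0.0
    for k in range(11):
        level = k / 10
        # Small tolerance guards against floating-point rounding of recall.
        reached = [p for p, r in points if r >= level - 1e-12]
        total += max(reached, default=0.0)
    return total / 11


# The yes/no labels of the first nine SIM results; 50 synonym pairs in total.
sim_top9 = [True, True, True, True, True, False, False, True, True]
points = precision_recall(sim_top9, total_relevant=50)
precision_at_6, recall_at_6 = points[5]
```

For the sixth result this gives a precision of 5/6 (83.33%) and a recall of 5/50 (10%). Computing the 11pt score itself requires the full ranked list of pairs, which the transcript shows only in part.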

Comparison of the Algorithms
[Table: number of synonyms in the top results for Max, RAND, SIM, SIM+TFIDF and the REV variants]

Comparison of the Algorithms (11pt Average Precision)
[Bar chart: 11pt average precision for RAND, SIM, SIM+TFIDF and REV2 ... REV7, on an axis from 0% to 70%. Surviving data labels: 1.15%, 58.98%, 63.16%, n/a]

Results (Precision-Recall Graph)
[Figure: recall-precision graphs of the evaluated algorithms]

Discussion
Our approach is original because it:
  Measures semantic similarity automatically.
  Uses the Web as a corpus.
  Does not rely on any preexisting corpora.
  Does not require semantic resources such as WordNet and EuroWordNet.
  Works for any language (tested for Bulgarian and Russian).
  Uses reverse context lookup and TF.IDF weighting, which significantly improve quality.

Discussion
Good accuracy, but still far from 100%.
Known problems of the proposed algorithms:
  Semantically related words are not always synonyms: red – blue, wood – pine, apple – computer.
  Similar contexts do not always mean similar words (distributional hypothesis).
  The Web as a corpus introduces noise.
  Google returns only the first results.

Discussion
Known problems of the proposed algorithms:
  Google ranks news portals, travel agencies and retail sites higher than books, articles and forum messages.
  The local context always contains noise.
  The algorithms work with single words and do not capture phrases.

Conclusion and Future Work
Conclusion:
  Our algorithms can distinguish between synonyms and non-synonyms.
  Accuracy should be improved.
Future work:
  Additional techniques to distinguish between synonyms and semantically related words.
  Improve the semantic similarity measure.


Questions?
Automatic Acquisition of Synonyms Using the Web as a Corpus