Mining the Web to Create Minority Language Corpora
Rayid Ghani (Accenture Technology Labs - Research), Rosie Jones (Carnegie Mellon University), Dunja Mladenic (J. Stefan Institute, Slovenia)

Who Needs a Language-Specific Corpus?
- Language Technology Applications
  - Language Modeling
  - Speech Recognition
  - Machine Translation
- Linguistic and Socio-Linguistic Studies
- Multilingual Retrieval

What Corpora are Available?
- Explicit, marked-up corpora: Linguistic Data Consortium [Liberman and Cieri 1998]
- Search engines as implicit language-specific corpora: European languages, Chinese, and Japanese
  - Excite: 12 languages
  - Google: 25 languages
  - AltaVista: 25 languages
  - Lycos: 25 languages

BUT what about Slovenian? Or Tagalog? Or Tatar? You’re just out of luck!

The Human Solution
- Start from Yahoo -> Slovenia… and crawl
- Search on the web: look at documents, modify the query, analyze documents, modify the query, …
- Repetitive, time-consuming, and requires reasonable familiarity with the language

Task
Given:
- 1 document in the target language
- 1 other document (negative example)
- Access to a web search engine
Create a corpus of the target language quickly, with no human effort.

Algorithm
[Diagram: Seed Docs feed a Query Generator, which queries the WWW; retrieved pages pass through a Language Filter.]

[Diagram of the learning loop: Initial Docs -> Word Statistics -> Build Query -> Web -> Filter -> Relevant / Non-Relevant sets, which feed back into the word statistics.]

Query Generation
Examine the current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to the non-relevant ones.
A query consists of m inclusion terms and n exclusion terms, e.g. +intelligence +web -military
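Assembling such a query from selected terms is mechanical; a minimal sketch (function name and argument names are illustrative, not from the paper):

```python
def build_query(include, exclude):
    """Format m inclusion and n exclusion terms in the +term/-term
    query syntax accepted by search engines such as AltaVista."""
    return " ".join(["+" + t for t in include] +
                    ["-" + t for t in exclude])

# build_query(["intelligence", "web"], ["military"])
# -> "+intelligence +web -military"
```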

Query Term Selection Methods
- Uniform (UN): select k words randomly from the current vocabulary
- Term-Frequency (TF): select the top k words ranked by frequency
- Probabilistic TF (PTF): select k words with probability proportional to their frequency

Query Term Selection Methods
- RTFIDF: top k words according to their rtfidf scores
- Odds-Ratio (OR): top k words according to their odds-ratio scores
- Probabilistic OR (POR): select k words with probability proportional to their odds-ratio scores
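The Odds-Ratio score compares how much more likely a word is to appear in a relevant document than in a non-relevant one. The slides do not specify smoothing, so the add-one smoothing below is an assumption, as are all names in this sketch:

```python
import math
from collections import Counter

def odds_ratio_scores(relevant_docs, nonrelevant_docs):
    """Score each vocabulary word by the log odds of appearing in a
    relevant versus a non-relevant document.  Add-one smoothing keeps
    the logarithm and division defined for unseen words (an assumed
    detail; the paper does not state its smoothing)."""
    rel = Counter(w for d in relevant_docs for w in set(d.split()))
    non = Counter(w for d in nonrelevant_docs for w in set(d.split()))
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for w in set(rel) | set(non):
        p1 = (rel[w] + 1) / (n_rel + 2)   # P(w | relevant)
        p2 = (non[w] + 1) / (n_non + 2)   # P(w | non-relevant)
        scores[w] = math.log(p1 * (1 - p2) / ((1 - p1) * p2))
    return scores
```

The OR method then takes the k highest-scoring words as inclusion terms; the POR variant samples words with probability proportional to these scores instead of taking the top k deterministically.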

3. Language Filter
- TextCat (van Noord's implementation), trained on a handful of documents
- Manually evaluated by sampling 100 Slovenian documents; found to be 99% accurate
- Contains models for 60 languages
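TextCat implements Cavnar and Trenkle's ranked character n-gram method: each language is a ranked profile of its most frequent n-grams, and a document is assigned to the language whose profile it is closest to under an "out-of-place" rank distance. A toy reimplementation of that idea (not van Noord's code; profile size and penalty are simplifications):

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Ranked character n-gram profile: ngram -> rank (0 = most frequent)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: r for r, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language
    profile pay a maximum penalty."""
    penalty = len(lang_profile) or 1
    return sum(abs(r - lang_profile.get(g, penalty))
               for g, r in doc_profile.items())

def classify(text, lang_profiles):
    """Return the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

In practice the real filter is trained on whole documents per language; the point of the sketch is only the ranked-profile comparison.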

Evaluation
Goal: collect as many relevant documents as possible while minimizing the cost.
Cost:
- Number of total documents retrieved from the Web
- Number of distinct queries issued to the search engine
Evaluation measures:
- Percentage of retrieved documents that are relevant
- Number of relevant documents retrieved per unique query
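Both measures are simple ratios; spelled out (function names are mine, not the paper's):

```python
def precision_at(n, retrieved_is_relevant):
    """Fraction of the first n retrieved documents that are in the
    target language (`retrieved_is_relevant` is a list of booleans
    in retrieval order)."""
    return sum(retrieved_is_relevant[:n]) / n

def relevant_per_query(n_relevant, n_unique_queries):
    """Relevant documents retrieved per distinct query issued."""
    return n_relevant / n_unique_queries
```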

Experimental Setup
- Language: Slovenian
- Initial documents: 1 web page in Slovenian, 1 in English
- Search engine: AltaVista

Results

Results – Precision at 3000
[Chart: percentage of target-language documents after 3000 documents retrieved.]

Results – Docs Per Query

Results - Summary
- In terms of documents: for query lengths 1-3, Odds-Ratio works best
- In terms of queries: Odds-Ratio is consistently better than the other methods
- Long queries are usually very precise but do not retrieve many documents (low recall)

Results - Num of Docs Retrieved

Results – Num of Queries

Further Experiments
- Comparison to AltaVista's "More Like This": better performance than AltaVista's feature
- Keywords: similar results when initializing with keywords instead of documents
- Other languages: similar results with Croatian, Czech, and Tagalog

Comparison with AltaVista's "More Like This" Feature
Our query-generation mechanism, with 5 inclusion and 5 exclusion terms selected by Odds-Ratio scoring, outperforms AltaVista's "Find Similar Pages" feature.

Effect of Initial Documents
No visible differences until over 1,000 documents have been retrieved.
[Table: seed document type (formal news vs. informal) by length and vocabulary size.]

Initializing with Keywords
- Obtaining entire documents in a language may not always be possible
- Used six different sets of 10 keywords, obtained from Slovenian speakers, as seed "documents", with Odds-Ratio at query length 3
- Each keyword set resulted in performance comparable to using entire documents

Other Languages
Tried the same approach with:
- Croatian
- Tagalog
- Czech
Odds-Ratio outperformed the other term-selection methods for all these languages.

Other Languages

Method   Language    Target Docs at 1000 Total Docs
TF-3     Slovenian   178
PTF-3    Slovenian   646
OR-3     Slovenian   835
TF-3     Croatian    39
PTF-3    Croatian    410
OR-3     Croatian    677
TF-3     Czech       385
PTF-3    Czech       451
OR-3     Czech       743
TF-3     Tagalog     440
PTF-3    Tagalog     359
OR-3     Tagalog     664

Other Languages – Sample Queries
Typical queries that most commonly found positive documents in our experiments.

Conclusions
- Successfully built corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using web search engines
- Not sensitive to the initial "seed" documents
- System and corpora are/will be available at

Conclusions
- Automatic query generation is a cheap way to collect language corpora
- Odds-Ratio term selection works well
- Mostly independent of "seed" documents
- Can be "seeded" with a handful of keywords in a language

Ideas for Future Work
- Explore other term-selection methods
- From language-specific corpora to topic-specific corpora, as an alternative to focused spidering
- Finding documents matching a user profile: a personal agent

Fixed Query Parameters
- Fix query lengths and vary term-selection methods
- Fix term-selection methods and vary query lengths
Results (Ghani et al., SIGIR 2001):
- Odds-Ratio works well overall
- Long queries are precise but have low recall

Algorithm
1. Initialization
2. Generate query terms from relevant and non-relevant documents
3. Retrieve documents using the query from the search engine
4. Use the language filter to add each new document to the relevant or non-relevant set of documents
5. Update frequencies and scores
6. Return to step 2
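Putting the steps together, a toy end-to-end sketch with a mock search engine (the real system queries AltaVista; all names here are illustrative, and term selection uses plain term frequency, i.e. the paper's TF method rather than its preferred Odds-Ratio):

```python
from collections import Counter

def run_corpus_builder(seed_rel, seed_nonrel, search, is_target,
                       n_iters=10, k=3):
    """Toy version of the paper's loop.  `search(query)` stands in for
    the web search engine and returns one document (a string) or None;
    `is_target(doc)` stands in for the language filter."""
    relevant, nonrelevant = [seed_rel], [seed_nonrel]   # 1. initialize
    seen = set()
    for _ in range(n_iters):
        # 2. Generate query terms from the current document sets.
        rel_tf = Counter(w for d in relevant for w in d.split())
        non_tf = Counter(w for d in nonrelevant for w in d.split())
        include = [w for w, _ in rel_tf.most_common(k)]
        exclude = [w for w, _ in non_tf.most_common(1) if w not in include]
        query = " ".join(["+" + w for w in include] +
                         ["-" + w for w in exclude])
        doc = search(query)                             # 3. retrieve
        if doc is None or doc in seen:
            continue
        seen.add(doc)
        # 4. The language filter routes the document into the right set;
        # 5. frequencies are recomputed at the top of the next iteration.
        (relevant if is_target(doc) else nonrelevant).append(doc)
    return relevant, nonrelevant
```

A usage example: with a mock `search` over a two-document "web" (the mock honors inclusion terms only) and a word-list language filter, the builder pulls the Slovenian page into the relevant set while the English page stays out.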

1. Initialize
- Given documents in the target and non-target languages
- Calculate various statistics over the words in each set