© 2008 The MITRE Corporation. All rights reserved. Approved for Public Release; Distribution Unlimited. Case #08-1697. Alexander Yeh MITRE Corp. October.

Potential Query Log Sets
Alexander Yeh, MITRE Corp., October 2008

Possible Issues with a "Query Log" Corpus
- Resembles queries of real interest to somebody
- Has some 'geo' aspect
- Multi-lingual
  - MITRE in-house has limitations on languages
- Permission to use and distribute (even after the evaluation)

More Recent Suggestions (While at Workshop)
- Local search queries from various Wikipedias
  - Multi-lingual
  - Privacy? Probably not as bad as other search logs (more like encyclopedia lookup)
  - Permission?
  - Long enough to be interesting from a "geo" standpoint?

More Recent Suggestions (Continued)
- Treat GikiP topics as queries, e.g. GP4: "Which Swiss cantons border Germany?"
  - Multi-lingual, have permission, no privacy problem
  - Combine with GikiP 2009 for publicity purposes
  - But few in number (15 in 2008 pilot)
  - Realistic enough?
- Use logs generated by an evaluation (like iCLEF)
  - Multi-lingual, permissions & privacy dealt with
  - But realistic enough?
  - Has "geo" aspect?

More Recent Suggestions (Concluded)
- Timway search logs from Hong Kong
  - Chinese, English, usually 1 language in a query
  - Used in some studies, but usual permission & privacy issues
  - Also, finding annotator(s) may be an issue:
    - Chinese probably in Cantonese (versus "official" Mandarin dialect); not too bad in written form
    - Probably traditional characters (not mainland China's simplified characters)

Potential Query Log Data Sets - 1
- Tumba! (Diana Santos, Nuno Cardoso and others)
  - Available, large amount, a lot not released before
  - In Portuguese: need to hire and train somebody who can annotate Portuguese

Potential Query Log Data Sets - 2
- Workshop on Web Search Click Data 2009 (WSCD 2009) - scd09/
  - MSN search query log
  - Large amount, relatively new (and so not seen as much)
  - Pursuing getting permission (asking Nick Craswell)
    - Cancelled query parsing task in CLEF
    - Current status: cannot release data outside of Microsoft

Potential Query Log Data Sets - 3
- Query parsing task in CLEF
- Query log of 800K English queries (unlabeled), 100 queries of labeled training data and 500 queries of test data
  - Presumably this log is still available for use in a new query parsing task
  - Use the same set, but generate new training and test data
  - One disadvantage: the CLEF community is already familiar with this data set
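The "same set, new training and test data" idea could look roughly like the sketch below. It is only an illustration: the function name `draw_train_test`, the in-memory list of query strings, and the fixed seed are assumptions, and the 100/500 sizes simply mirror the numbers quoted above. The drawn queries would still need to be labeled by an annotator.

```python
import random

def draw_train_test(queries, n_train=100, n_test=500, seed=0):
    """Draw disjoint training and test samples from an unlabeled query log.

    `queries` is a list of query strings (hypothetical input format).
    Duplicate queries are collapsed first so the two samples cannot overlap.
    """
    unique = sorted(set(queries))          # sorted for reproducibility
    rng = random.Random(seed)
    sample = rng.sample(unique, n_train + n_test)
    return sample[:n_train], sample[n_train:]
```

Changing the seed gives a different pair of samples from the same log, which is one way to avoid reusing the original labeled queries.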

Can Easily Obtain the Following Query Log Data Sets, But …
- Can easily obtain a number of data-sets, but
  - They are old, and so may have been already seen by the CLEF community
  - Problems getting permissions to use these
    - Anticipate problems, or
    - Been asked not to use

Query Log Data Sets that are Easy to Obtain
- KDD Cup 2005: Ying Li, a co-chair, asked us not to use
- AlltheWeb_2001.gz, AlltheWeb_2002.gz, AltaVista_2002.zip: Jim Jansen: the data sharing agreement has expired
- Excite_1997_small.zip, Excite_1997_large.zip, Excite_1999.zip, Excite_2001.gz: from Jim Jansen. Need Excite's permission?

Query Log Data Sets that are Easy to Obtain (Concluded)
- AOL query log: from
  - Was made available to the public for a while
  - Created a controversy about privacy
- But all these data sets will have similar privacy issues

A Way to 'Use' these Data-Sets (John Burger):
- Use the existing logs as 'inspiration' for a made-up log corpus
  - May have been done by others, like NIST
  - Will not need permission
  - Will not have been seen before
  - Can ensure no privacy disclosures
  - But will take time to produce the made-up data

Privacy Concerns
- Though most well known with the AOL query logs, all these data sets may contain private data
  - One way to 'remove': use the existing logs as 'inspiration' for a made-up log corpus (mentioned above)
  - A fast, incomplete way to remove private data: remove the query timestamps and links indicating which queries came from the same site and randomize the order of the queries
    - A lot of the 'disclosures' come from grouping the queries to a common source
    - But the removed information is now not available to a query parser
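A minimal sketch of the fast, incomplete approach described on this slide, assuming each log record is a dict with hypothetical `query`, `timestamp` and `source_id` fields (real logs will use other field names and formats). It keeps only the query text and shuffles the order, so queries can no longer be grouped by source or by time.

```python
import random

def strip_and_shuffle(records, seed=None):
    """Return only the query strings, in random order.

    Timestamps and source identifiers are dropped by never copying them,
    so nothing links two output queries to a common origin.
    """
    queries = [record["query"] for record in records]
    rng = random.Random(seed)   # seed is optional; pass one for repeatable runs
    rng.shuffle(queries)
    return queries
```

As the slide notes, this does nothing about private data inside the query text itself, and the dropped fields are no longer available to a query parser.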

Privacy Concerns (Concluded)
- A slower, more complete way to remove private data: review the data (perhaps as it is annotated) and flag any queries with private data
  - Either substitute the flagged data with fictional information or remove the queries with flags from the data sets
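The flag-then-substitute-or-remove step might be sketched as follows. The names are illustrative assumptions: `flagged` is a set of positions a human reviewer marked as containing private data, and `substitutes` maps a flagged position to a reviewer-written fictional replacement. The review itself is manual and is not modeled here.

```python
def scrub_flagged(queries, flagged, substitutes=None):
    """Apply reviewer decisions to a list of query strings.

    Unflagged queries pass through unchanged. A flagged query is replaced
    when a fictional substitute was supplied, and removed otherwise.
    """
    substitutes = substitutes or {}
    cleaned = []
    for index, query in enumerate(queries):
        if index not in flagged:
            cleaned.append(query)
        elif index in substitutes:
            cleaned.append(substitutes[index])
        # flagged with no substitute: the query is dropped
    return cleaned
```

Substitution keeps the corpus size and query shape intact, while removal is simpler but shrinks the data set.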