Crawling and Aligning Scholarly Presentations and Documents from the Web
By SARAVANAN.S, 09/09/2011
Under the guidance of A/P Min-Yen Kan
10/23/2015


Motivation
– More articles → more users
– Searching for documents is difficult
– Aim: find pairs of presentations and documents automatically

System Architecture
– Search engine wrapper: a "file type" operator (PDF, PPT or PS) is added to the user query before it is sent to Google. Yee Fan's Search Engine Wrapper is used, Google subsystem only.
– Re-ranking: applied to Google's top results.
– Output (3-way):
  1. Exact URL
  2. Message for non-free files
  3. No result
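The query-building step can be sketched roughly as below. This is a minimal illustration, not the wrapper used in the talk: the function name `build_query` and the exact query layout are assumptions; only the idea of appending a file-type operator for PDF, PPT or PS comes from the slide.

```python
# Hypothetical helper: append a Google-style file-type operator to a user query.
ALLOWED_FILE_TYPES = {"pdf", "ppt", "ps"}  # the three types named on the slide


def build_query(user_query, file_type="pdf"):
    """Return the user query with a `filetype:` operator appended."""
    normalized = file_type.lower()
    if normalized not in ALLOWED_FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type!r}")
    return f"{user_query.strip()} filetype:{normalized}"


if __name__ == "__main__":
    print(build_query("Finding related pages in the world wide web", "PDF"))
```

Running the same title once per file type would give one candidate list for documents (PDF, PS) and one for presentations (PPT).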

Methodology (1): Re-Ranking
– Similarity is computed between the query title and each Google result's title, snippet, and URL, and the results are re-ranked by that score.
– Similarity measures used: Jaccard coefficient and Bilingual Evaluation Understudy (BLEU).
– A threshold keeps documents with low similarity scores out of consideration.
– Pipeline: Google's top results → similarity score computation → results re-ranked by similarity score.
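A minimal sketch of the re-ranking pipeline, under stated assumptions: each search result is represented as a dict with `title`, `snippet` and `url` keys, the similarity measure is passed in as a function, and `query_coverage` is only a placeholder measure for the demo (it is not one of the measures from the talk).

```python
def rerank(query_title, results, similarity, threshold=0.5, field="title"):
    """Score each result against the query, drop low scores, sort best first."""
    scored = []
    for result in results:
        score = similarity(query_title, result[field])
        if score >= threshold:  # the threshold filters out weak matches
            scored.append((score, result))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [result for _, result in scored]


def query_coverage(query, target):
    """Placeholder measure: fraction of query words that occur in the target."""
    query_words = set(query.lower().split())
    target_words = set(target.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & target_words) / len(query_words)


if __name__ == "__main__":
    hits = [{"title": "a x"}, {"title": "b c"}, {"title": "a b c"}]
    for hit in rerank("a b c", hits, query_coverage, threshold=0.5):
        print(hit["title"])
```

Passing `field="snippet"` or `field="url"` would score the other two targets mentioned on the slide.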

Jaccard Measure
– The Jaccard measure is used to compute similarity between the query title and Google's result title, snippet, and URL.
– Simple word-by-word matching.
– Problem: snippets have more words than titles, so the union in Jaccard grows while the intersection stays the same.
– Example:
  Sentence 1: Finding related pages in the world wide web.
  Sentence 2: Finding Related pages using the Link structure of the WWW.
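The word-level Jaccard measure on the two example sentences can be worked through as follows. The tokenization (lowercasing and splitting on non-alphanumeric characters) is an assumption for this sketch; the slide does not say how words were split.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| over word sets."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


if __name__ == "__main__":
    s1 = "Finding related pages in the world wide web."
    s2 = "Finding Related pages using the Link structure of the WWW."
    print(round(jaccard(s1, s2), 4))
```

With this tokenization the two sentences share 4 words (finding, related, pages, the) out of 13 distinct words, giving 4/13 ≈ 0.31. Note that "world wide web" and "WWW" do not match at the word level, and a longer snippet would enlarge the union further without adding to the intersection.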

BLEU metric
Why BLEU?
– n-gram similarity of words.
– Helps in assessing the sequential order of the words when finding similarity between two word sequences.
– Sequential order of words matters with snippets → query terms may appear in a random position.
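A simplified BLEU-style score can illustrate why n-grams capture word order. This is a sketch under assumptions: it uses one reference, n-grams up to length 2, clipped n-gram precision, a geometric mean, and the standard brevity penalty; the talk does not state which BLEU configuration or implementation settings were used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions (n = 1..max_n) times brevity penalty."""
    if not candidate:
        return 0.0
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    reference = "finding related pages".split()
    print(bleu("finding related pages".split(), reference))  # same order
    print(bleu("pages related finding".split(), reference))  # same words, reversed
```

The reversed candidate has the same word set as the reference, so a set-based measure such as Jaccard would rate it a perfect match, while its bigram precision is zero here.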

Rules
Special rules are used for better matching:
– Rule 1: Removing special symbols (On/Off)
– Rule 2: Stop-word removal (On/Off)
– Rule 3: URL filter by .edu (On/Off)
– Rule 4: Stemming with the Porter stemming algorithm (On/Off)
All these rules are used with both methodologies.
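The four switchable rules could be wired up along these lines. Everything named here is illustrative: the stop-word list is a tiny sample, `preprocess` and `is_edu_url` are hypothetical helpers, and stemming is left as an optional callable so that a Porter stemmer (for example `nltk.stem.PorterStemmer().stem`, if NLTK is installed) can be plugged in.

```python
import re
from urllib.parse import urlparse

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})


def preprocess(text, remove_symbols=True, remove_stop_words=True, stemmer=None):
    """Apply Rules 1, 2 and 4; each rule can be switched on or off."""
    text = text.lower()
    if remove_symbols:  # Rule 1: replace special symbols with spaces
        text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stop_words:  # Rule 2: drop stop words
        tokens = [token for token in tokens if token not in STOP_WORDS]
    if stemmer is not None:  # Rule 4: stem each token with the supplied callable
        tokens = [stemmer(token) for token in tokens]
    return tokens


def is_edu_url(url):
    """Rule 3: keep only URLs whose host name ends in .edu."""
    host = urlparse(url).hostname or ""
    return host.endswith(".edu")


if __name__ == "__main__":
    print(preprocess("Finding related pages in the World-Wide Web!"))
    print(is_edu_url("http://www.cs.cmu.edu/paper.pdf"))
```

The token lists returned by `preprocess` can be fed to either similarity measure, which matches the slide's note that the rules apply to both.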

Methodology
– MIME types: to differentiate free PDFs from subscription-only ones, the MIME type is used; the check returns the content type of the URL.
– Dataset collection: queries from computer science, medical science, architecture, and mathematics.
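One way to read the MIME-type check is sketched below. It assumes that a freely downloadable PDF is served with `Content-Type: application/pdf`, whereas a subscription link tends to return something else, such as an HTML landing page; that interpretation, and the helper names, are assumptions rather than details from the talk.

```python
import urllib.request


def is_pdf_content_type(content_type):
    """True if a Content-Type header value denotes a PDF document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "application/pdf"


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header (needs network)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


if __name__ == "__main__":
    print(is_pdf_content_type("application/pdf"))
    print(is_pdf_content_type("text/html; charset=utf-8"))
```

Some servers reject HEAD requests or report a generic type, so a real crawler would likely need a fallback such as a ranged GET.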

Experiment
Experiments on:
– Jaccard measure (all special rules tested On/Off).
– BLEU measure (all special rules tested On/Off).
– Query set with about 50 queries.
– Threshold varied from 0.1 to 1.0 for all experiments.
– The highest recall at a high threshold is considered.
Experiment results:
– Jaccard similarity.
– BLEU similarity.
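The threshold sweep and the reported metrics can be sketched as follows. The setup is assumed for illustration: `scores` maps each candidate document to its similarity score and `relevant` is the set of correct documents; how precision and recall were aggregated over the roughly 50 queries is not stated in the talk.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-score) for one retrieved set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def sweep_thresholds(scores, relevant):
    """Evaluate thresholds 0.1, 0.2, ..., 1.0 and return one row per threshold."""
    rows = []
    for step in range(1, 11):
        threshold = step / 10
        retrieved = [doc for doc, score in scores.items() if score >= threshold]
        rows.append((threshold, *precision_recall_f(retrieved, relevant)))
    return rows


if __name__ == "__main__":
    demo_scores = {"doc_a": 0.9, "doc_b": 0.4, "doc_c": 0.2}  # made-up scores
    for row in sweep_thresholds(demo_scores, {"doc_a"}):
        print(row)
```

Picking the row with the largest F-score from such a sweep corresponds to the "best F-score achieved" marking in the result tables.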

Experiment result of Jaccard
[Table: Rule 1 to 4 (On/Off) | Google target (Title) | Threshold | Precision | Recall | F-score; the row with the best F-score is marked.]

Experiment result of BLEU
[Table: Rule 1 to 4 (On/Off) | Google target (Title) | Threshold | Precision | Recall | F-score; the row with the best F-score is marked.]

Related Work
Base references:
– SlideSeer: a digital library of aligned document and presentation pairs. [Kan, JCDL’07]
– Learning to Rank for Information Retrieval. [Liu et al., WWW’09]
– Kairos: Proactive Harvesting of Research Paper Metadata from Scientific Conference Web Sites. [Hänse, ICADL’09]
Approaches to similarity computation:
– BLEU: a Method for Automatic Evaluation of Machine Translation. [Papineni et al., ACL July’02]
– BLEU algorithm for evaluation machine translations implementation. [Payson et al.]

Conclusion
Matching documents based on similarity score.
– Jaccard measure: Jaccard similarity computed over the query title and the document title, with the special-symbol removal rule on, retrieves the best articles.
  Threshold: F-score:
– BLEU metric: BLEU similarity computed over the query title and the document title, with the special-symbol removal rule on, retrieves the best articles.
  Threshold: F-score:

Thank you
Comments are welcome