SlideSeer: A DL of aligned document and presentation pairs Min-Yen Kan WING (Web IR / NLP Group) National University of Singapore.

Presentation transcript:

SlideSeer: A DL of aligned document and presentation pairs
Min-Yen Kan
WING (Web IR / NLP Group), National University of Singapore

Min-Yen Kan, Digital Libraries; Web IR / NLP, NUS; 20 June; JCDL: Session E

Scholarly Digital Libraries: what do we use them for?
  – Find articles to print, read offline
  – Browse, select research work
  – Assess authors, publication venues, research groups
Papers (documents) don’t store all of the information about a discovery:
  – Datasets
  – Tools
  – Implementation details / conditions
They also don’t help a person learn the research:
  – Textbooks
  – Slide presentations (we’ll focus on this)

Qualities of slide presentations
Good slide sets complement a document. They often:
  – focus and highlight findings in the document
  – create a bridge into the document itself
  – are a visual and oral summary of a document
How can we leverage slides in a digital library?
“PowerPoint is presenter-oriented, not content-oriented or audience-oriented…”
The remedy?: “Visual reasoning usually works more effectively when the relevant evidence is shown adjacent in space within the eyespan.” (Tufte, 2006)
What about poor slides?
Four score and seven years ago

Documents and presentations as duals
Present identical or highly overlapping materials
  – Document: for archival and reference purposes
  – Presentation: for introducing and summarizing the work
As the two can be seen as duals, we should allow them to be viewed together.
  – Would like random access of the presentation and document pair
Answer: find pairs of documents and presentations.

A model: MIT’s OpenCourseWare
  – Slides in context
  – Audio of lecture
  – Simplified transcript of lecture
A better answer: add fine-grained alignment.

Talk Outline
  – Motivation
  – Architecture
      1. Resource Discovery
      2. Alignment
      3. User Interface
  – Demo
  – Status and Conclusions
[Architecture diagram, split into Offline and Online parts. Components: Resource discovery, Search Engine, Converters (pdftohtml, cz-ppt2txt, cz-ppt2gif, convert), Data Store, Aligner, Web Server, Javascript-enabled browser. Labels: sv, dv, pv, ssv, search. Stages: 1. Resource Discovery, 2. Alignment, 3. User Interface]

1. Resource Discovery
Algorithm:
  – Obtain suitable document metadata
  – Web search to find candidate presentations
  – Post process to useable form

1. Resource Discovery – Obtaining Metadata
Start with CiteSeer (thanks to IST: CL Giles, I Councill)
  – 750K records with parsed header metadata
  – Complete with .pdf documents
Enhancement: Merge DBLP snapshot (Aug 2006; 1.2M docs) with CiteSeer
  – Large scale record linkage task, O(nm) complexity unacceptable
  – Indexed DBLP into Lucene, use each CS record to retrieve DBLP variants, resulting in O(n) complexity
  – Result size: 1.5M
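The indexing idea on this slide can be sketched in a few lines. This is a minimal illustration, assuming a plain in-memory inverted index stands in for Lucene; the function names, the `min_shared` cutoff, and the toy titles are all hypothetical and are not taken from the SlideSeer code. Each CiteSeer title triggers one index lookup instead of a scan over every DBLP record.

```python
from collections import defaultdict

def tokens(title):
    """Lowercase a title and split it into a set of alphanumeric word tokens."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return set(cleaned.split())

def build_index(records):
    """Map each token to the ids of the records whose title contains it."""
    index = defaultdict(set)
    for rec_id, title in records.items():
        for tok in tokens(title):
            index[tok].add(rec_id)
    return index

def candidates(index, title, min_shared=2):
    """Return ids of indexed records sharing at least min_shared tokens with title."""
    counts = defaultdict(int)
    for tok in tokens(title):
        for rec_id in index.get(tok, ()):
            counts[rec_id] += 1
    return {rec_id for rec_id, c in counts.items() if c >= min_shared}

# Toy records (made up for illustration only).
dblp = {
    "d1": "Aligning Presentation Slides to Scholarly Documents",
    "d2": "A Survey of Web Crawling Strategies",
}
citeseer = {"c1": "aligning presentation slides to scholarly documents."}

index = build_index(dblp)  # built once over the DBLP side
links = {cs_id: candidates(index, title) for cs_id, title in citeseer.items()}
print(links)  # {'c1': {'d1'}}
```

The candidates returned by the lookup would still need a finer similarity check before two records are merged.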

1. Resource Discovery – Finding presentations
Google API on title, author to find corresponding presentation
Use simple Jaccard similarity threshold to decide matches
  – threshold λ3 for title+author similarity
[Diagram labels: CiteSeer + DBLP merge; DBLP Lucene Index; Web (filetype: ppt); Presentations; thresholds λ1, λ2, λ3]
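A Jaccard threshold test of this kind might look as follows. This is a sketch under stated assumptions: whitespace tokenization, a hypothetical threshold of 0.5 (the slide does not give the value of λ3), and made-up record and hit strings.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_match(record, hit_text, threshold=0.5):
    """Accept a search hit if its text overlaps enough with title + author.

    The threshold plays the role of lambda_3; 0.5 is an illustrative value.
    """
    rec_tokens = (record["title"] + " " + record["author"]).lower().split()
    hit_tokens = hit_text.lower().split()
    return jaccard(rec_tokens, hit_tokens) >= threshold

# Made-up metadata record and two made-up search hits.
record = {"title": "Aligning slides to papers", "author": "J. Doe"}
good_hit = "Aligning slides to papers J. Doe Example University"
bad_hit = "Introduction to search engines"

print(is_match(record, good_hit))  # True  (6 shared tokens out of 8)
print(is_match(record, bad_hit))   # False (1 shared token out of 9)
```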

1. Resource Discovery – Conversion
Via pdftohtml
  – text
  – formatted text
Via czppt2gif/convert
  – png
  – text
Final results: ~85% precision, recall difficult to calculate (~80%)
  – 11K pairs after processing 200K of 1.5M records
Many caveats:
  – only .pdf and .ppt formats currently handled
  – conversion fails often, pdf conversion difficult
  – current work: use OCR to redo text extraction

2. Alignment – Problem formulation
Q: What are we aligning?
A: Text of slides to document text
  – Use paragraphs to delimit text units in documents
  – Use document headers to delimit sections
Q: What type of alignment is necessary?
A: Depends. Presentation or document centered view?
  – Presentation: 1 slide aligned to 0 or more paragraphs
  – Document: 1 section aligned to 0 or more slides
Q: What’s the approach?
A: Two stages:
  – Basic similarity measure to calculate a similarity matrix
  – Alignment schemes to establish alignment mapping
[Figure: similarity matrix, slides (1..s) by text units (1..p); callout: “Concentrate on this”]

2. Alignment – Related Work
1. Narration to presentation alignment
  – Usually naturally synchronous: monotonic alignment
2. Multilingual text alignment
  – Used in Machine Translation (MT)
  – Polynomial complexity (~O(n^3)) but heuristics tend to work well
3. Slide/abstract to document alignment
  – Use Hidden Markov Model (HMM) for alignment
  – Doesn’t handle missing materials well
Desiderata:
  – Should take context into account
  – But shouldn’t enforce monotonicity
  – Nil (zero) alignments needed, when materials don’t overlap

2. Alignment – Similarity Measures
Take text units, cut into tokens. Then calculate similarity using:
1. Cosine
  – Standard IR metric
  – TF×IDF for token weight
  – Calculate slide, paragraph vector similarity using cosine
2. Jaccard
  – unigram tokens
  – bigram
  – unigram + bigram
  – Use IDF weighting for tokens
For both schemes, use IDF weighting from WebBase corpus
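Both measures can be sketched compactly. The code below is an illustration only: the `idf` table is a tiny made-up dictionary standing in for IDF values derived from the WebBase corpus, unseen tokens fall back to an assumed `default_idf`, and the exact weighting used in SlideSeer may differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine_tfidf(a_tokens, b_tokens, idf, default_idf=1.0):
    """Cosine similarity between TF x IDF vectors of two token lists."""
    def vec(tokens):
        return {t: tf * idf.get(t, default_idf) for t, tf in Counter(tokens).items()}
    va, vb = vec(a_tokens), vec(b_tokens)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def weighted_jaccard(a_items, b_items, idf, default_idf=1.0):
    """IDF-weighted Jaccard: weight of shared items over weight of all items."""
    a, b = set(a_items), set(b_items)
    def weight(items):
        return sum(idf.get(t, default_idf) for t in items)
    union = weight(a | b)
    return weight(a & b) / union if union else 0.0

# Made-up IDF values and text units.
idf = {"hidden": 3.0, "markov": 3.0, "model": 2.0, "alignment": 2.5,
       "we": 0.1, "use": 0.5, "a": 0.1}
slide = "hidden markov model alignment".split()
para = "we use a hidden markov model".split()

print(cosine_tfidf(slide, para, idf))                          # unigram cosine
print(weighted_jaccard(slide, para, idf))                      # unigram Jaccard
print(weighted_jaccard(ngrams(slide, 2), ngrams(para, 2), {})) # bigram Jaccard
```

The unigram + bigram variant would pass the concatenation of `ngrams(tokens, 1)` and `ngrams(tokens, 2)` to `weighted_jaccard`.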

2. Alignment – Schemes
Using matrix of similarity, align using:
1. Max Similarity
  – Baseline
  – Can’t do nil alignment
2. Edit Distance
  – Efficient dynamic programming
  – But outputs only monotonic alignments
3. Local Jump Model
  – Variation on #2 to allow local backward jumps
  – Backward jumps within 5% of text units
  – Still doesn’t handle reordered sections
4. Hidden Markov Model
  – Word-based
  – Attempts to find origin of s in p
  – Only handles overlapping information
[Figures: word w_j of slide s_i shown with its context window (slides s_{i-5} … s_{i+5}, words w_{j-5} … w_{j+5}); paragraphs p1 … p6 with the ordering p1 > p2 > p3 > p4 > p5 > p6]
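The contrast between the first two schemes can be sketched on a toy similarity matrix. This is a simplified illustration, not the talk's implementation: `monotonic_alignment` is a dynamic program in the spirit of the edit-distance scheme (one text unit per slide, never moving backwards), and the matrix values are made up.

```python
def max_similarity(sim):
    """Baseline: align each slide to its single most similar text unit."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]

def monotonic_alignment(sim):
    """Pick one text unit per slide, non-decreasing, maximizing total similarity."""
    s, p = len(sim), len(sim[0])
    best = [[0.0] * p for _ in range(s)]   # best[i][j]: best total ending at (i, j)
    back = [[0] * p for _ in range(s)]     # back[i][j]: text unit chosen for slide i-1
    for i in range(s):
        run_best, run_arg = float("-inf"), 0  # best of previous row over columns <= j
        for j in range(p):
            if i > 0 and best[i - 1][j] > run_best:
                run_best, run_arg = best[i - 1][j], j
            best[i][j] = sim[i][j] + (run_best if i > 0 else 0.0)
            back[i][j] = run_arg
    # Trace back from the best final cell.
    j = max(range(p), key=best[s - 1].__getitem__)
    path = [j]
    for i in range(s - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

# Rows are slides, columns are text units (made-up similarities).
sim = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.8, 0.1],
    [0.7, 0.6, 0.1, 0.2],  # the third slide refers back to the first text unit
]
print(max_similarity(sim))       # [0, 2, 0]
print(monotonic_alignment(sim))  # [0, 2, 3]
```

The third slide shows the trade-off: the baseline follows the backward jump, while the monotonic scheme is forced onto a weaker later text unit.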

2. Alignment – Span Extension
As Maximum Similarity does quite well, let’s extend the algorithm
Idea: post-process to extend from points to spans
  – Retrieve top n (n=10) most similar paragraphs
  – Try all (n choose 2) possible spans for alignment
  – alignment_score(x, y) = span_sim × ln(span_length)
  – Slightly favor longer spans
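The span search can be sketched as below. Two things are assumptions here, since the slide does not define them: `span_sim` is taken to be the mean slide-to-paragraph similarity over the span, and `span_length` is the number of paragraphs from x to y inclusive. The similarity values are made up, and n is set to 3 instead of 10 to keep the example small.

```python
import math
from itertools import combinations

def best_span(slide_sims, span_sim, n=10):
    """Return the span (x, y) maximizing span_sim(x, y) * ln(span_length).

    slide_sims[j] is the similarity of one slide to paragraph j; the span
    endpoints are drawn from the n most similar paragraphs.
    """
    top = sorted(range(len(slide_sims)), key=lambda j: slide_sims[j], reverse=True)[:n]
    best, best_score = None, float("-inf")
    for x, y in combinations(sorted(top), 2):  # all (n choose 2) endpoint pairs
        length = y - x + 1
        score = span_sim(x, y) * math.log(length)
        if score > best_score:
            best, best_score = (x, y), score
    return best, best_score

# Made-up similarities of one slide to seven paragraphs.
sims = [0.0, 0.9, 0.8, 0.0, 0.0, 0.0, 0.05]

def mean_sim(x, y):
    """Assumed span similarity: average similarity of the paragraphs in [x, y]."""
    return sum(sims[x:y + 1]) / (y - x + 1)

span, score = best_span(sims, mean_sim, n=3)
print(span)  # (1, 2)
```

Because ln(1) = 0, a single-paragraph span would score zero under this formula, so the sketch only considers spans with two distinct endpoints.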

2. Alignment – Alignment Correction
Neighboring alignments can help to correct a spurious one
(a) monotonic alignment → ok
(b) s_i jumps back from s_{i-1}, but then proceeds monotonically → probably ok, minor penalty
(c) s_i jumps back, but s_{i+1} jumps back forward → looks more like an error, major penalty applied
Final alignment score: alignment_score × (1-penalty)
[Figure: panels (a), (b), (c), each plotting slides s_{i-1}, s_i, s_{i+1} against paragraphs p1 … pn]
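One possible reading of the three cases is sketched below. The penalty values (0.1 and 0.5) are hypothetical, as the slide gives none, and the rule separating (b) from (c), namely whether the next slide returns to or past the previous slide's position, is an interpretation of the figure rather than the talk's stated definition.

```python
def penalty(prev_p, cur_p, next_p, minor=0.1, major=0.5):
    """Penalty for slide s_i given the paragraphs aligned to s_{i-1}, s_i, s_{i+1}."""
    if cur_p >= prev_p:
        return 0.0    # (a) monotonic: no penalty
    if next_p >= prev_p:
        return major  # (c) s_i jumps back alone, s_{i+1} jumps forward again
    return minor      # (b) jump back, then the sequence continues from there

def corrected_score(alignment_score, prev_p, cur_p, next_p):
    """Final alignment score: alignment_score x (1 - penalty)."""
    return alignment_score * (1 - penalty(prev_p, cur_p, next_p))

print(penalty(3, 4, 5))     # 0.0  case (a)
print(penalty(10, 2, 3))    # 0.1  case (b)
print(penalty(10, 2, 11))   # 0.5  case (c)
print(corrected_score(0.8, 10, 2, 11))  # 0.4
```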

2. Alignment – Nil classifier
But not all text units should be aligned
Use machine learning (SVM) to learn a binary classifier
Features:
1. Similarity score
2. Number of words on slide
  – Few words can indicate figures, pictures with less preference for alignment
3. Words on slide
  – Cue phrases: “outline”, “questions”, “thanks”
4. Alignment path
  – Jumping alignments (e.g., outline slides)
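The four feature groups could be turned into a feature vector roughly as follows. This sketch covers feature extraction only; the feature names, the punctuation stripping, and the use of the jump distance as the alignment-path feature are assumptions, and the resulting vectors would still have to be passed to an SVM implementation for training.

```python
CUE_PHRASES = ("outline", "questions", "thanks")  # cue words listed on the slide

def nil_features(slide_text, best_similarity, prev_target, target):
    """Build a feature dict for deciding whether a slide should stay unaligned.

    best_similarity: similarity of the slide to its best text unit
    prev_target, target: text units aligned to the previous and current slide
    """
    words = [w.strip(".,:;!?").lower() for w in slide_text.split()]
    words = [w for w in words if w]
    return {
        "similarity": best_similarity,
        "num_words": len(words),
        "has_cue_phrase": int(any(cue in words for cue in CUE_PHRASES)),
        "jump": abs(target - prev_target),
    }

# A closing slide: few words, a cue phrase, low similarity, and a long jump.
features = nil_features("Thanks! Questions?", 0.05, prev_target=40, target=3)
print(features)
# {'similarity': 0.05, 'num_words': 2, 'has_cue_phrase': 1, 'jump': 37}
```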

2. Alignment – Evaluation Dataset
Manually compiled alignment dataset by author and fellow researcher
  – Gold standard: annotate all acceptable spans, or nil
  – 20 presentation and document pairs from databases
  – Dataset is freely downloadable

Average number of slides in presentation          37.6
Average number of paragraphs in document         277.3
Average number of nil (zero) alignments            6.6 (17.4%)
Average number of span alignments (s, x-y)         8.8 (23.4%)
Average number of point alignments (s, x)         22.2 (59.2%)
Total                                             37.6 (100%)

2. Alignment – Evaluation
Use Weighted Jaccard accuracy as metric
  – Fractional accuracy for partially correct answers
  – Give false positives (extra spurious alignments) less weight

Alignment Method                                              Weighted Jaccard Accuracy
1. Max Similarity (cosine)                                    33.4%
2. Edit Distance (cosine)                                     28.8%
3. Local Jump (cosine)                                        25.1%
4. Jing HMM                                                   28.8%
5. Max Sim + spanning (Jaccard bigram)                        39.9%
6. Max Sim + spanning + nil classification (Jaccard bigram)   41.2%

40%? Why is it so difficult?
  – Noise in conversion process. Other studies have used clean data.
  – Others have used soft accuracy (any overlap is correct)
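The slide does not give the formula for Weighted Jaccard accuracy, so the following is only one plausible construction, not the metric actually used in the evaluation. It scores a predicted span against a gold span per slide, gives partial credit for overlap, and discounts false positives through a hypothetical `fp_weight`; with `fp_weight=1.0` it reduces to plain Jaccard.

```python
def weighted_jaccard_accuracy(predicted, gold, fp_weight=0.5):
    """Overlap-based accuracy for one slide.

    predicted, gold: sets of paragraph indices (empty set means nil alignment).
    False positives (predicted but not gold) count only fp_weight each.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # correctly predicted nil alignment
    hits = len(predicted & gold)
    false_pos = len(predicted - gold)
    return hits / (len(gold) + fp_weight * false_pos)

print(weighted_jaccard_accuracy({3, 4, 5}, {4, 5, 6}, fp_weight=1.0))  # 0.5
print(weighted_jaccard_accuracy({3, 4, 5}, {4, 5, 6}))  # higher, spurious 3 is discounted
print(weighted_jaccard_accuracy(set(), set()))          # 1.0
```

A per-document score would then be the average over slides.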

3. User Interface – Rationale
How might fine-grained aligned pairs be utilized in a large DL?
Coordinated Views
  – Learning / Comprehension
  – Summarization
  – Offline Viewing
Collection Interface
  – Comparing pairs
  – Searching for suitable materials

3. UI – Coordinated Views
[Diagram labels: Document View, Slide View, Slideshow View, Full Document View, Print View, Gallery View; groupings: Slide centric, Document centric]

SlideSeer Prototype Demo
Production environment differs from demo

3. UI – Collection Interface
Searching
  – Lucene indexing of the static print view
  – Show title along with the set of results
Spider-friendly
  – Main content loaded dynamically by Javascript, not spiderable
  – Currently use print view (as it is static) for spiderable interface
URLs
  – Most material in the form
  – Implies hierarchy of papers
  – Constructed URLs to promote browsing access
Simple keyboard shortcuts
  – For expert user navigation

Conclusion
Alignment of documents to presentations
Simple approach works well thus far
  – Tweaks to get more mileage out of simple approach
  – Span alignment, nil alignment modifications
  – But certainly more models to try!
  – 40% best performance, certainly much room to improve
Deployment status
  – In Alpha (development)
  – Beta hopefully in mid 2008
  – Usability testing underway
Interested in digital anthologies? Join our mailing list (web: dAnth)
  – Current: text extraction project for ACL Anthology

Other slides

Future Work
Planning to hook up current work in progress
  – 2 stage CRF/SVM re-ranking citation segmentation algorithm
  – Automatic keyphrase extraction program
  – Automatic synthetic image classification
  – Automatic de-duplication module
Partnering with Simone Teufel (Cambridge U.) to do argumentative zoning of documents
  – What is a citation used for?

Poor slides
Often represent a biased view of the full results
  – Cherry picking evidence to support claims
  – Imply that evidence is independent (when it is statistically correlated)
  – May summarize other findings inaccurately (secondary or tertiary sources)