Citation Provenance: FYP/Research Update. WING Meeting, 28 Sept 2012. Heng Low Wee.



Previous Update
- Motivation: reading experience; looking up a cited paper interrupts reading.
- Goal: predict the type of citation and the location of the cited information.
- Problem analysis: general vs. specific citations.
- The citing context serves as the query; fragments of the cited paper are the 'documents' to be matched.

Previous Update (Continued)
- Corpus: ACL Anthology Reference Corpus, processed with ParsCit to extract citing contexts and fragments of the cited papers.
- Approach: first feature considered was cosine similarity.
- Annotations.

Outline
- Previous Update
- Features Added
- Annotating Data
- Initial Testing
- Analysis
- What's Next?

Features Added
- Citation density: the number of inline citations divided by the number of lines in the context.
  - Intuition: high density hints at a general citation (Dong & Schafer, 2011).
- Difference in publishing year.
  - Intuition: a large difference suggests the citation is to older, fundamental work, with little specific discussion of it, and is thus a general citation.
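The two features above can be sketched as follows. This is an illustrative sketch, not the project's actual code; the citation-matching regex is an assumption about how inline citations like "(Dong & Schafer, 2011)" might be detected.

```python
import re

# Hypothetical pattern for inline citations such as "(Dong & Schafer, 2011)":
# an opening paren, a capitalized name, then a 4-digit year before the close.
CITATION_PATTERN = re.compile(r"\([A-Z][^()]*\d{4}\)")

def citation_density(context_lines):
    """No. of inline citations divided by no. of lines in the context."""
    n_citations = sum(len(CITATION_PATTERN.findall(line))
                      for line in context_lines)
    return n_citations / len(context_lines)

def year_difference(citing_year, cited_year):
    """Large values suggest citing older, fundamental work (general citation)."""
    return citing_year - cited_year
```

A high density (many citations packed into few lines) would then support predicting "general".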

Features Added
- Location of the inline citation: the section in which the inline citation appears.
  - Intuition: a citation located in the Introduction suggests a general citation (Dong & Schafer, 2011).
- Title overlap and author overlap: Jaccard overlap between the citing and cited papers' titles and author lists.
  - Intuition: similar titles suggest closely related work that refers to the cited paper for specific contributions; shared authors likewise hint at closely related work.
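A minimal sketch of the overlap features, assuming titles and author lists are compared as lowercase token sets (the tokenization choice is an assumption):

```python
def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def title_overlap(citing_title, cited_title):
    # Tokenize on whitespace after lowercasing (a simplifying assumption).
    return jaccard(citing_title.lower().split(), cited_title.lower().split())
```

The same function would serve for author overlap, applied to the two author lists.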

Features Added
- Average TF-IDF weight of terms in the citing contexts and in the fragments of the cited paper.
  - Intuition: specific citations refer to 'high-valued' terms in the cited paper.
- Cosine similarity.
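The cosine-similarity feature can be sketched in pure Python as below. For brevity this uses raw term frequencies rather than the TF-IDF weights the slides describe; it only illustrates the shape of the computation between a citing context and a cited-paper fragment.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

In the actual feature, each term count would be replaced by its TF-IDF weight before taking the dot product.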

Annotating Data
- Previous scheme: annotate a plain-text file using labels plus a line-number range.
  - Annotating by line range makes it difficult to determine whether a prediction matches an annotation, because the ranges are not discrete units.
- The annotation task is very challenging.
- Four annotation labels: General (0), Specific-Yes (1), Specific-No (2), Undetermined (3).
- For each citing context in the citing paper and each text block in the cited paper, annotate the pair with a label.

Annotating Data
[Diagram: each citing context in the citing paper is paired with text blocks L1 ... Lj ... Ln of the cited paper.]

Annotating Data
- Currently: 6,632 annotated records.
- ~62% General, ~3% Specific-Yes, ~34% Specific-No, ~0.6% Undetermined.
- Undetermined data points are removed; Specific-No data points are regarded as General.
- The task is thus reduced to binary classification.
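The label reduction described above can be sketched as: drop Undetermined (3) records and fold Specific-No (2) into General (0), leaving a binary task.

```python
# Label codes from the annotation scheme on the previous slide.
GENERAL, SPECIFIC_YES, SPECIFIC_NO, UNDETERMINED = 0, 1, 2, 3

def to_binary(labels):
    """Drop Undetermined; map Specific-No to General; keep 0/1 labels."""
    return [GENERAL if y == SPECIFIC_NO else y
            for y in labels
            if y != UNDETERMINED]
```

After this reduction, roughly 97% of the remaining records are General, which foreshadows the class-imbalance problem discussed in the Analysis slide.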

Initial Testing
- 90% train; 10% test; SVC; 1 iteration.
- Labels: 0 = General, 1 = Specific-Yes.
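The evaluation setup on this slide can be sketched with scikit-learn (an assumption; the slides do not say which SVC implementation was used). The feature values below are random stand-ins, not the project's data.

```python
import random
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

random.seed(0)
# Hypothetical 4-dim feature vectors, e.g.
# [citation_density, year_diff, title_overlap, cosine_sim].
X = [[random.random() for _ in range(4)] for _ in range(200)]
y = [random.randint(0, 1) for _ in range(200)]  # 0 = General, 1 = Specific-Yes

# 90% train / 10% test, a single run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)
clf = SVC().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

Given the ~97%/3% class skew after the label reduction, a plain SVC will tend to predict the majority class; setting `class_weight="balanced"` is one standard mitigation.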

Analysis
- The classifier is unable to predict any 'Specific-Yes' instances.
- There are too few 'Specific-Yes' instances in the data.
- The current feature set is unable to distinguish General from Specific citations.

What's Next
- Investigate further where and how specific citations are made.
- Design features that better distinguish general from specific citations.