1
Matching References to Headers in PDF Papers
Tan Yee Fan
19 December 2007
WING Group Meeting
2
Task
Corpus: the ACL Anthology contains a collection of PDF papers
Task: for each paper P, which papers are cited by P?
Gold standard data obtained from Dragomir, e.g., P00-1002 ==> P98-2144
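The gold standard can be read as a mapping from each citing paper to the papers it cites. A minimal sketch of loading it, assuming a text file with one "P00-1002 ==> P98-2144" line per link (the file format is inferred from the example above):

```python
from collections import defaultdict

def load_gold_mapping(path):
    """Parse lines like 'P00-1002 ==> P98-2144' into a citing -> {cited} mapping."""
    cites = defaultdict(set)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            citing, cited = (part.strip() for part in line.split("==>"))
            cites[citing].add(cited)
    return cites
```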
3
Header and References
Header of paper (HeaderParse): paper title, author names, etc.
Reference section (ParsCit): paper title, author names, publication venue, etc.
Each header and each reference is seen as a record: title, authors, venue
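A minimal sketch of the record abstraction used throughout; the field names follow the slide, while the class itself is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Record:
    """A header or reference reduced to the three fields used for matching."""
    title: str
    authors: str
    venue: str

    def all_fields(self) -> str:
        # All fields concatenated into a single string (used for indexing, next slide).
        return " ".join([self.title, self.authors, self.venue])
```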
4
System Overview
Pipeline: header records are indexed with Lucene; each reference record is issued as a query, and the returned headers are passed to the matching algorithm
Indexing: all fields concatenated into a single string, token matching
Querying: OR matching (the default in Lucene)
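A minimal stand-in for the Lucene retrieval step, assuming the Record sketch above. Real Lucene OR queries rank candidates with TF-IDF scoring; this sketch only ranks header records by the number of shared tokens:

```python
import re

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve_headers(reference, header_records, top_k=10):
    """Rank header records by token overlap with the reference (OR-style matching)."""
    query = tokens(reference.all_fields())
    scored = []
    for header in header_records:
        overlap = len(query & tokens(header.all_fields()))
        if overlap > 0:  # OR matching: any shared token keeps the header as a candidate
            scored.append((overlap, header))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [header for _, header in scored[:top_k]]
```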
5
Record Matching
Each header-reference pair forms an instance. From the TITLE, AUTHOR, and VENUE fields of the two records, the features are TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN, TITLE_SIM, AUTHOR_SIM, and VENUE_SIM; the label to predict is MATCH/MISMATCH.
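A sketch of the feature vector for one header-reference pair. It assumes the MIN_LEN/MAX_LEN features are token counts of the shorter and longer field and leaves the SIM features as a pluggable similarity function, since the slide does not spell out these definitions:

```python
def field_features(header_field, reference_field, similarity):
    """MIN_LEN, MAX_LEN, and SIM for one field of a header-reference pair."""
    h_len = len(header_field.split())
    r_len = len(reference_field.split())
    return [min(h_len, r_len), max(h_len, r_len), similarity(header_field, reference_field)]

def pair_features(header, reference, similarity):
    """Nine-dimensional instance: (MIN_LEN, MAX_LEN, SIM) for title, authors, venue."""
    feats = []
    for field in ("title", "authors", "venue"):
        feats += field_features(getattr(header, field), getattr(reference, field), similarity)
    return feats
```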
6
Experiment Setup
Data
Reference records: papers divided into a training set and a test set (50% each)
Header records: the same set of papers used for both training and testing
Learning algorithm: SMO in Weka (an SVM implementation)
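A minimal stand-in for the Weka SMO classifier, using scikit-learn's SVC; the instances are the nine-dimensional feature vectors from the previous sketch, and the helper names are illustrative:

```python
from sklearn.svm import SVC

def train_match_classifier(instances, labels):
    """instances: list of 9-dimensional feature vectors; labels: 1 = MATCH, 0 = MISMATCH."""
    clf = SVC(kernel="linear", probability=True)  # stand-in for SMO in Weka
    clf.fit(instances, labels)
    return clf

def is_match(clf, header, reference, similarity):
    """Classify one header-reference pair."""
    return clf.predict([pair_features(header, reference, similarity)])[0] == 1
```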
7
Bootstrapping the Training Data
Problem: the gold standard specifies mappings at the paper-to-paper level, but not which reference in the citing paper matches which header
Solution (sketched below):
Hand-label a small set of reference-header pairs from 6 papers
Train an SVM on this small bootstrap set
On the training set, if the gold standard specifies P1 -> P2, use this SVM to label the reference-header pairs between P1 and P2
Retrain the SVM using the reference-header pairs from the training and bootstrap sets combined
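A sketch of that bootstrapping step, assuming the train_match_classifier and pair_features helpers from the earlier sketches, a gold mapping from citing paper to cited papers, and per-paper dicts of reference and header records (all names illustrative):

```python
def bootstrap_classifier(bootstrap_instances, bootstrap_labels,
                         gold, references_by_paper, headers_by_paper, similarity):
    """Expand a small hand-labeled set into a larger training set via an initial SVM."""
    seed_clf = train_match_classifier(bootstrap_instances, bootstrap_labels)

    auto_instances, auto_labels = [], []
    for p1, cited in gold.items():
        for p2 in cited:                          # gold standard says P1 -> P2
            header = headers_by_paper[p2]
            for reference in references_by_paper[p1]:
                feats = pair_features(header, reference, similarity)
                auto_instances.append(feats)
                auto_labels.append(int(seed_clf.predict([feats])[0]))

    # Retrain on the bootstrap pairs plus the automatically labeled pairs
    return train_match_classifier(bootstrap_instances + auto_instances,
                                  bootstrap_labels + auto_labels)
```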
8
Experimental Results
Used the ACL subset (2176 PDF papers); skipped 142 reference sections and 202 paper headers
If the classifier judges that a reference in P1 matches the header of P2, then predict P1 -> P2
Results (on paper-to-paper mappings):
P = 0.901, R = 0.696, F = 0.785
P = 0.898, R = 0.767, F = 0.827 (with manually cleaned header records)
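A sketch of how precision, recall, and F1 over the paper-to-paper mappings would be computed, with predicted and gold links both represented as sets of (citing, cited) paper-ID pairs (representation assumed, not stated on the slide):

```python
def paper_level_prf(predicted_links, gold_links):
    """Precision, recall, and F1 over sets of (citing_paper, cited_paper) pairs."""
    true_positives = len(predicted_links & gold_links)
    precision = true_positives / len(predicted_links) if predicted_links else 0.0
    recall = true_positives / len(gold_links) if gold_links else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```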
9
Cost-Utility Framework
Each feature f_i either has a known value or a value that can be acquired; acquiring f_i incurs a cost and yields a utility.
10
Record Matching
For each header-reference pair (instance), the features split into two groups:
[1] Given information: TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN
[2] Information that can be acquired at a cost: TITLE_SIM, AUTHOR_SIM, VENUE_SIM
Training data: assume all feature-values and their acquisition costs are known
Testing data: assume [1] is known, but the feature-values and acquisition costs in [2] are unknown
Costs: set to MIN_LEN * MAX_LEN for the corresponding field
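A sketch of the cost assignment under this reading of the slide: acquiring a field's SIM feature costs MIN_LEN * MAX_LEN for that field, with lengths taken as token counts as in the earlier feature sketch (the exact length definition is assumed):

```python
def acquisition_cost(header_field, reference_field):
    """Cost of acquiring the SIM feature for one field: MIN_LEN * MAX_LEN."""
    h_len = len(header_field.split())
    r_len = len(reference_field.split())
    return min(h_len, r_len) * max(h_len, r_len)
```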
11
Costs and Utilities
Costs: trained 3 models (using M5'), treating cost prediction as regression
Utilities: trained 2^3 = 8 classifiers, one per subset of the 3 acquirable features (each predicts match/mismatch using only the known feature-values)
For a test instance with a missing feature-value F:
Get the confidence of the appropriate classifier without F
Get the expected confidence of the appropriate classifier with F
Utility is the difference between the two confidence scores
Note: similar to Saar-Tsechansky et al.
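A sketch of the utility estimate for one missing feature F, assuming scikit-learn-style classifiers with predict_proba, that F is appended as the last feature, and that the expectation over F's unknown value is approximated by averaging over a set of candidate values (the expectation used in the talk is not specified):

```python
def acquisition_utility(clf_without_f, clf_with_f, known_feats, candidate_values):
    """Utility of acquiring F = expected confidence gain of the match classifier."""
    # Confidence of the classifier trained without F
    conf_without = clf_without_f.predict_proba([known_feats])[0].max()

    # Expected confidence of the classifier trained with F, averaged over candidate values
    confs_with = [clf_with_f.predict_proba([known_feats + [value]])[0].max()
                  for value in candidate_values]
    expected_conf_with = sum(confs_with) / len(confs_with)

    return expected_conf_with - conf_without
```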
12
Results
[Two plots showing performance as the proportion of acquired feature-values increases: one without cleaning of header records, one with manual cleaning of header records]
13
Thank You