Matching References to Headers in PDF Papers Tan Yee Fan 2007 December 19 WING Group Meeting.

Task
Corpus
  - ACL Anthology contains a collection of PDF papers
Task
  - For each paper P, which papers are cited by P?
  - Gold standard data obtained from Dragomir, e.g., P00-1002 ==> P98-2144

Header and References
Header of paper (HeaderParse)
  - Paper title, author names, etc.
Reference section (ParsCit)
  - Paper title, author names, publication venue, etc.
Each header and each reference is seen as a record
  - Title, authors, venue

System Overview
[Diagram labels: Reference record, Lucene index, Header records, Returned headers, Matching algorithm]
Indexing
  - All fields concatenated into a single string, perform token matching
Querying
  - OR matching (default in Lucene)
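The indexing and querying steps can be illustrated with a small in-memory sketch. This is not Lucene and not the system from the talk: the function names, the sample records (H1 to H3) and the ranking by number of shared tokens are all assumptions made for illustration, whereas Lucene applies its own relevance scoring.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def record_to_string(record):
    """Concatenate all fields of a record into a single string."""
    return " ".join(record.get(f, "") for f in ("title", "authors", "venue"))

def build_index(headers):
    """Map each token to the set of header ids whose text contains it."""
    index = defaultdict(set)
    for paper_id, record in headers.items():
        for token in tokenize(record_to_string(record)):
            index[token].add(paper_id)
    return index

def query_or(index, reference):
    """OR matching: return every header sharing at least one token with
    the reference, ordered by number of shared tokens (an assumed,
    simplified ranking)."""
    counts = defaultdict(int)
    for token in set(tokenize(record_to_string(reference))):
        for paper_id in index.get(token, ()):
            counts[paper_id] += 1
    return sorted(counts, key=lambda pid: (-counts[pid], pid))

# Made-up records, for illustration only.
headers = {
    "H1": {"title": "Parsing with tree grammars", "authors": "A. Smith", "venue": "ACL"},
    "H2": {"title": "Statistical machine translation", "authors": "B. Jones", "venue": "ACL"},
    "H3": {"title": "Word sense disambiguation", "authors": "C. Lee", "venue": "COLING"},
}
reference = {"title": "Parsing with Tree Grammars", "authors": "Smith, A.", "venue": "Proc. ACL"}

ranked = query_or(build_index(headers), reference)
print(ranked)
```

Because the query is an OR over tokens, H2 is returned even though it shares only the venue token with the reference, which is why a separate matching step on the returned headers is needed.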

Record Matching
[Diagram: a header record and a reference record, each with TITLE, AUTHOR and VENUE fields, form a header-reference pair (instance)]
Features: TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN, TITLE_SIM, AUTHOR_SIM, VENUE_SIM
Class: MATCH/MISMATCH?
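A minimal sketch of how the nine features of a header-reference pair might be computed. The slides do not say how lengths or similarities are measured, so two assumptions are made here: lengths are character counts, and similarity is token-level Jaccard overlap. The function names and sample records are hypothetical.

```python
import re

FIELDS = ("title", "author", "venue")

def tokens(text):
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-level Jaccard similarity (assumed measure, not from the slides)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def pair_features(header, reference):
    """Build the MIN_LEN, MAX_LEN and SIM features for each field."""
    features = {}
    for field in FIELDS:
        h, r = header.get(field, ""), reference.get(field, "")
        name = field.upper()
        # Lengths are character counts here (an assumption).
        features[name + "_MIN_LEN"] = min(len(h), len(r))
        features[name + "_MAX_LEN"] = max(len(h), len(r))
        features[name + "_SIM"] = jaccard(h, r)
    return features

# Made-up records, for illustration only.
header = {"title": "Tree Grammars", "author": "A. Smith", "venue": "ACL"}
reference = {"title": "Tree grammars", "author": "Smith", "venue": ""}

features = pair_features(header, reference)
for name, value in features.items():
    print(name, value)
```

The resulting dictionary is one instance; the MATCH/MISMATCH label would be attached separately when building training data.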

Experiment Setup
Data
  - Reference records: papers divided into training set and test set (50% each)
  - Header records: same set of papers used for training and testing
Learning algorithm
  - SMO in Weka (an SVM implementation)

Bootstrapping the Training Data
Problem
  - Gold standard data specifies mappings at the paper-to-paper level, but not which reference
Solution
  - Hand-labeled a small set of reference-header pairs from 6 papers
  - Train an SVM on this small bootstrap set
  - On training set, if gold standard specifies P1 -> P2, then use SVM to classify reference-header pairs of P1 and P2
  - Retrain SVM using reference-header pairs combined from training and bootstrap sets
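The data flow of the bootstrapping steps above can be sketched as follows. To keep the example self-contained, the SVM is swapped for a one-dimensional threshold learner over a single similarity score; that learner, the function names and all numbers are assumptions for illustration and are not the setup used in the talk.

```python
def train_threshold(examples):
    """Stand-in for SVM training (an assumption): choose the similarity
    threshold with the fewest errors on (similarity, is_match) examples."""
    candidates = sorted({sim for sim, _ in examples})

    def errors(threshold):
        return sum((sim >= threshold) != label for sim, label in examples)

    best = min(candidates, key=errors)
    return lambda sim: sim >= best

def bootstrap(bootstrap_set, gold_linked_pairs, train=train_threshold):
    """bootstrap_set: hand-labeled (similarity, is_match) examples.
    gold_linked_pairs: similarities of reference-header pairs drawn only
    from papers P1, P2 where the gold standard specifies P1 -> P2."""
    # Step 1: train on the small hand-labeled set.
    classifier = train(bootstrap_set)
    # Step 2: label the reference-header pairs of gold-linked papers.
    auto_labeled = [(sim, classifier(sim)) for sim in gold_linked_pairs]
    # Step 3: retrain on the combined data.
    return train(bootstrap_set + auto_labeled)

# Made-up numbers, for illustration only.
bootstrap_set = [(0.9, True), (0.8, True), (0.2, False), (0.1, False)]
gold_linked_pairs = [0.95, 0.6, 0.05]

final_classifier = bootstrap(bootstrap_set, gold_linked_pairs)
print(final_classifier(0.85), final_classifier(0.5))
```

With a single threshold the retrained model cannot differ much from the first one; the sketch only shows which data feeds which training step.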

Experimental Result
Used the ACL subset (2176 PDF papers)
  - Skipped: 142 reference sections, 202 paper headers
If classifier considers a reference in P1 matches header of P2, then P1 -> P2
Results (on paper-to-paper mappings)
  - P = 0.901, R = 0.696, F = 0.785
  - P = 0.898, R = 0.767, F = 0.827 (with manually cleaned header records)
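The paper-level evaluation can be sketched as below: pair-level decisions are collapsed into P1 -> P2 links, which are then scored against the gold standard. The function names and the toy predictions are assumptions; the F values reported above are consistent with the usual balanced F-measure, F = 2PR / (P + R).

```python
def paper_level_links(pair_predictions):
    """Collapse (citing_paper, cited_paper, is_match) decisions, one per
    reference-header pair, into a set of paper-to-paper links."""
    return {(p1, p2) for p1, p2, is_match in pair_predictions if is_match}

def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(predicted, gold):
    """Score predicted paper-to-paper links against gold links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)

# Made-up predictions and gold links, for illustration only.
pair_predictions = [
    ("P1", "P2", True),   # two references in P1 both match the header of P2
    ("P1", "P2", True),
    ("P1", "P3", False),
    ("P2", "P4", True),
]
gold = {("P1", "P2"), ("P1", "P3")}

predicted = paper_level_links(pair_predictions)
print(precision_recall_f(predicted, gold))

# Consistency check against the figures reported on the slide.
print(round(f_measure(0.901, 0.696), 3), round(f_measure(0.898, 0.767), 3))
```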

Cost-utility Framework
[Diagram labels: feature f_i; known value; value that can be acquired; cost of acquiring f_i; utility of acquiring f_i]

Record Matching
Header-reference pair (instance)
  - [1] Given information: TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN
  - [2] Information that can be acquired at a cost: TITLE_SIM, AUTHOR_SIM, VENUE_SIM
  - Class: MATCH/MISMATCH?
Training data
  - Assume all feature-values and their acquisition costs known
Testing data
  - Assume [1] known, but feature-values and their acquisition costs in [2] unknown
Costs
  - Set to MIN_LEN * MAX_LEN

Costs and Utilities
Costs
  - Trained 3 models (using M5’), treat as regression
Utilities
  - Trained 2^3 = 8 classifiers (each to predict match/mismatch using only known feature-values)
  - For a test instance with a missing feature-value F
      - Get confidence of appropriate classifier without F
      - Get expected confidence of appropriate classifier with F
      - Utility is difference between the two confidence scores
Note
  - Similar to Saar-Tsechansky et al.
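A minimal sketch of the utility computation described above. It enumerates the 2^3 = 8 subsets of acquirable features, one per classifier, and computes the utility of acquiring F as the expected confidence with F minus the confidence without F. How the expectation is estimated is not stated on the slide; averaging over a list of candidate values is an assumption, as are the function names and the toy confidence function.

```python
from itertools import combinations

SIM_FEATURES = ("TITLE_SIM", "AUTHOR_SIM", "VENUE_SIM")

def known_feature_subsets():
    """All subsets of the acquirable features; one classifier per subset."""
    return [
        frozenset(combo)
        for size in range(len(SIM_FEATURES) + 1)
        for combo in combinations(SIM_FEATURES, size)
    ]

def utility(confidence_without, confidence_with, candidate_values):
    """Utility of acquiring a missing feature-value F.

    confidence_without: confidence of the classifier that does not use F.
    confidence_with: function mapping a value of F to the confidence of
        the classifier that does use F.
    candidate_values: possible values of F; the expectation is taken as
        a plain average over them (an assumption).
    """
    expected = sum(confidence_with(v) for v in candidate_values) / len(candidate_values)
    return expected - confidence_without

subsets = known_feature_subsets()
print(len(subsets))

# Toy confidence function: the classifier is most confident when the
# similarity is near 0 or near 1 (made up for illustration).
toy_confidence = lambda sim: max(sim, 1.0 - sim)
gain = utility(0.6, toy_confidence, [0.0, 1.0])
print(gain)
```

A positive utility means the classifier is expected to become more confident once F is acquired; how this is traded off against the predicted acquisition cost is not specified on the slide.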

Results
[Two plots, each labeled "Increasing proportion of feature-values acquired": one without cleaning of header records, one with manual cleaning of header records]

Thank You
