Published by Raymond Shaw. Modified over 9 years ago.

Online Autonomous Citation Management for CiteSeer
CSE598B Course Project
By Huajing Li

Introduction to CiteSeer
Software package developed at NEC-Labs
Domain-independent software for Automatic Citation Indexing (ACI)
Focus is on scholarly publications in electronic format (PS / PDF and variants)
Performs:
– Document Discovery / Retrieval / Parsing
– Automatic Citation Extraction
– Document & Citation Indexing / Search

[Diagram: CiteSeer architecture]
Processing stages: Crawler, Retrieval, Conversion, Parsing & Meta-Data Extraction, Indexing, Web Server
Stores: Meta-Data Database (PDBM_File & Chunk Tables), Document Database (File System), Indexes
Data items: Document URL, Document (PDF/PS), Document (Plain Text), Document Body Text, N Citation Texts
Document Meta-data Set: DID, Title, Authors, etc.
N Citation Meta-data Sets: CID, GID, Title, Authors, etc.

Submitting Documents
The output of a crawl or user submission is the URL of a page linking to the document
These URLs are dumped in the Paper Table
The Paper Table maintains a status for each document:
– Downloaded/undownloaded
– Processed/unprocessed
– Other processing errors (tooshort/noreference/etc.)
CiteSeer regularly scans this table to start the download of new documents
Only documents meeting the typical pattern of scholarly publications are eventually added to the collection

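A minimal sketch of the Paper Table scan described above. The slide only names the status values; the `Status` enum, the `PaperRow` class, the single-status-per-row layout, and the example URLs are assumptions made for illustration, not CiteSeer's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Status names follow the slide; modelling them as one field is an assumption.
    UNDOWNLOADED = "undownloaded"
    DOWNLOADED = "downloaded"
    PROCESSED = "processed"
    TOO_SHORT = "tooshort"
    NO_REFERENCE = "noreference"


@dataclass
class PaperRow:
    url: str                               # URL of the page linking to the document
    status: Status = Status.UNDOWNLOADED   # new submissions start undownloaded


def pending_downloads(paper_table):
    """Return the rows a periodic scan would pick up for download."""
    return [row for row in paper_table if row.status is Status.UNDOWNLOADED]


# Hypothetical table contents.
paper_table = [
    PaperRow("http://example.org/a.html"),
    PaperRow("http://example.org/b.html", Status.PROCESSED),
    PaperRow("http://example.org/c.html", Status.NO_REFERENCE),
]
pending_urls = [row.url for row in pending_downloads(paper_table)]
```

Here only the first row is selected, since the other two have already been processed or rejected.
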
Document Structure Identification
– Title
– Subject (keywords)
– Description (abstract)
– Author names
– Author affiliations
– Author address, email, phone, homepage URL
– Publication date, publication number
– Archive date
– Contributor
– Type
– Format
– Identifier
– Source
– Publisher
– Journal/Conference
– Pages
– Relation: References, Is Referenced By
Field sources marked on the slide: from document header, system info, from citation graph

Citations Grouping
Citations to the same document have a common Group ID
– Each Group ID has a set of keys associated with it, based on citation information
– {authorkey1-titlekey; … authorkey2-titlekey}
For every single word in the author information there is an authorkey
For a given citation, the titlekey is unique and is the concatenation of all title words

Citations Grouping
For a newly discovered citation:
– Extract
  Authors: C. Lee Giles, S. Lawrence
  Title: “Good Paper Title”
– Generate keys
  {giles-goodpapertitle; lee-goodpapertitle; lawrence-goodpapertitle}
– Try to match at least one of them with an existing Group ID key
  If there is a match, add this citation (Citation ID) to the group
  Otherwise create a new Group ID for this citation

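The grouping steps above can be sketched as follows. This is an illustration under stated assumptions, not CiteSeer's code: the names `title_key`, `author_keys`, `citation_keys`, and `GroupIndex` are hypothetical, and dropping single-letter initials is inferred from the slide's example, which lists no keys for "C." or "S.".

```python
import re


def title_key(title):
    """Concatenate all title words, lower-cased, with punctuation removed."""
    return "".join(re.findall(r"[a-z0-9]+", title.lower()))


def author_keys(authors):
    """One key per word in the author names; single-letter initials are
    skipped (an assumption inferred from the slide's example)."""
    words = []
    for name in authors:
        for word in re.findall(r"[a-z]+", name.lower()):
            if len(word) > 1 and word not in words:
                words.append(word)
    return words


def citation_keys(authors, title):
    """Build the set of authorkey-titlekey strings for one citation."""
    tkey = title_key(title)
    return {f"{akey}-{tkey}" for akey in author_keys(authors)}


class GroupIndex:
    """Maps keys to Group IDs and Group IDs to member Citation IDs."""

    def __init__(self):
        self.key_to_gid = {}
        self.members = {}
        self._next_gid = 1

    def add_citation(self, cid, authors, title):
        keys = citation_keys(authors, title)
        # Match at least one key against an existing group.
        gid = next(
            (self.key_to_gid[k] for k in sorted(keys) if k in self.key_to_gid),
            None,
        )
        if gid is None:  # no match: create a new group
            gid = self._next_gid
            self._next_gid += 1
            self.members[gid] = []
        self.members[gid].append(cid)
        for k in keys:
            self.key_to_gid.setdefault(k, gid)
        return gid


keys = citation_keys(["C. Lee Giles", "S. Lawrence"], "Good Paper Title")

index = GroupIndex()
g1 = index.add_citation("c1", ["C. Lee Giles", "S. Lawrence"], "Good Paper Title")
g2 = index.add_citation("c2", ["S. Lawrence"], "Good paper title.")
g3 = index.add_citation("c3", ["A. Author"], "Another Title")
```

With these inputs, `keys` reproduces the three keys from the slide, citations `c1` and `c2` share a group through the `lawrence-goodpapertitle` key, and `c3` starts a new group.
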
Linking Citations to Documents
Citation ID -> Group ID
– We just saw that …
Document ID -> Group ID
– Based on the document’s metadata, generate authorkey-titlekey in the same way and try to match a Group ID key generated from the citations
– Document metadata can be erroneous, so successful mapping often happens AFTER correction by users

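A small sketch of the Document ID -> Group ID step, showing why a mapping can fail until the metadata is corrected. The function names, the dictionary layout, the Group ID value 7, and the misspelled title are all hypothetical; key generation follows the same assumed rules as for citations (lower-cased words, single-letter initials skipped).

```python
import re


def keys_for(authors, title):
    """authorkey-titlekey strings, built the same way for citations and documents."""
    tkey = "".join(re.findall(r"[a-z0-9]+", title.lower()))
    words = {
        w
        for name in authors
        for w in re.findall(r"[a-z]+", name.lower())
        if len(w) > 1
    }
    return {f"{w}-{tkey}" for w in words}


def link_document(doc_meta, key_to_gid):
    """Return the Group ID whose keys match the document's metadata, or None."""
    for key in sorted(keys_for(doc_meta["authors"], doc_meta["title"])):
        if key in key_to_gid:
            return key_to_gid[key]
    return None


# Keys previously generated from citations (hypothetical Group ID 7).
key_to_gid = {"giles-goodpapertitle": 7, "lawrence-goodpapertitle": 7}

# Metadata as extracted (with a typo in the title) and after a user correction.
extracted = {"authors": ["C. Lee Giles"], "title": "Good Papr Title"}
corrected = {"authors": ["C. Lee Giles"], "title": "Good Paper Title"}

gid_before = link_document(extracted, key_to_gid)
gid_after = link_document(corrected, key_to_gid)
```

Because the titlekey is an exact concatenation, a single wrong character in the extracted title prevents any key from matching; the link succeeds only once the title is fixed.
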
Problems of the Current Approach
There is no guarantee that the most similar citation contains the best metadata
Building the citation graph is a time-intensive, offline task
Due to batch clustering, the addition of a single citation requires rebuilding the entire citation graph to include the new instance
The so-called canonical metadata is fixed to the document record

Goals of the New Citation Management System
Provide better document metadata
Reduce the cost of maintenance
Use on-line citation matching such that the citation graph environment can be adjusted immediately based on a single new citation
Provide a fluid framework for building canonical metadata in which all evidence is always considered
Allow the development of flexible APIs into the CiteSeer citation graph system
Maintain data security despite an open, wiki-like approach to user-contributed metadata changes
Provide better citation matching compared to the current system

Prototype Overview
[Diagram: prototype components]
Components: Query, Query Handler, Citation Resolver, Document Metadata Index, Citation Metadata Index, Edge DB (SQL)
Inputs: Document Metadata (XML), Citation Metadata (XML)
Note on the slide: may ultimately be located in a separate service

Edge DB
One simple table containing one edge per row:
– Id: citation handle (equivalent to CID)
– citingDoc: citing document handle
– citedDoc: cited document handle
Row-level locking

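One possible shape for the edge table, sketched with Python's built-in `sqlite3` module so that it runs without a server. The column types, the nullable `citedDoc`, the helper `cited_by`, and the sample handles are assumptions. SQLite locks at the database level, so the row-level locking named on the slide would need a different database engine; it is not demonstrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE edge (
        id        TEXT PRIMARY KEY,  -- citation handle (equivalent to CID)
        citingDoc TEXT NOT NULL,     -- citing document handle
        citedDoc  TEXT               -- cited document handle; NULL if unresolved (assumption)
    )
    """
)

# Hypothetical edges: two documents cite docC, one citation is unresolved.
conn.executemany(
    "INSERT INTO edge (id, citingDoc, citedDoc) VALUES (?, ?, ?)",
    [("c1", "docA", "docC"), ("c2", "docB", "docC"), ("c3", "docA", None)],
)
conn.commit()


def cited_by(conn, doc):
    """Return the handles of documents that cite `doc`."""
    rows = conn.execute(
        "SELECT citingDoc FROM edge WHERE citedDoc = ? ORDER BY citingDoc", (doc,)
    )
    return [r[0] for r in rows]


citing_c = cited_by(conn, "docC")
```

Keeping one edge per row means that adding a single citation is a single insert, with no need to rewrite any other part of the graph.
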
Matching Citations and Docs
Exact string matching across disparate metadata fields is far too optimistic; better matching criteria are needed
Lucene provides two methods out of the box:
– Match based on Levenshtein distance
  Specify an arbitrary distance cut-off per field
  Choose the most similar match out of the returned set
– Similarity-based matching (cutting out the middleman)
  Specify an arbitrary similarity threshold
  Choose the most similar match out of the returned set over the threshold
Criteria to be determined through empirical tests using the prototype system.

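The two matching strategies can be illustrated in plain Python. This is not Lucene's API; it is a self-contained sketch of the underlying idea, with a textbook Levenshtein implementation and a normalised similarity (1 minus distance divided by the longer length) chosen here as an assumption. The function names, cut-off values, and sample titles are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_by_distance(query, candidates, max_distance):
    """Closest candidate whose edit distance is within the cut-off, else None."""
    scored = [(levenshtein(query, c), c) for c in candidates]
    within = [s for s in scored if s[0] <= max_distance]
    return min(within)[1] if within else None


def best_by_similarity(query, candidates, threshold):
    """Most similar candidate at or above the threshold, else None."""
    scored = [(similarity(query, c), c) for c in candidates]
    over = [s for s in scored if s[0] >= threshold]
    return max(over)[1] if over else None


titles = ["good paper title", "another paper title", "unrelated work"]
match_d = best_by_distance("good papr title", titles, max_distance=2)
match_s = best_by_similarity("good papr title", titles, threshold=0.8)
no_match = best_by_distance("zzz", titles, max_distance=2)
```

Unlike the exact authorkey-titlekey lookup, both variants tolerate the one-character typo in the query, while the cut-off or threshold still rejects strings that are merely vaguely similar. In practice each metadata field would get its own cut-off, as the slide suggests.
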