Published by August Cobb. Modified over 9 years ago.
1
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li
2
Outline –Introduction –Matching Citations and Documents –Learning from Observations –Cluster Repair –Evaluation
3
Introduction In research repositories, citations represent important knowledge regarding work contexts. The citation relationships form a data structure generally known as a “citation graph”, where documents are vertices and citations are directed edges from citing to cited documents. Methods for constructing citation graphs: –manual information extraction –autonomous citation indexing (ACI)
4
Introduction Popular ACI systems –The CiteSeer Digital Library (a collection of over 725,000 documents with over 8 million citations) –Google Scholar (433 million document and citation records) Typical ACI process –Extract citations from research papers –Parse subfields to build accurate metadata for each citation –Link citations to documents
5
Introduction Typical problems in the ACI process –Citation parsers are error-prone and often produce noisy results –Errors in the citation text (such as typos) –Identity uncertainty in document matching –For an automatic DL system, the identity of documents is uncertain (the canonical metadata of a document can be incomplete or inaccurate) In such cases, citation metadata can be used to correct the canonical metadata of documents.
6
Our Research Goals –Provide better document metadata –Reduce the cost of maintenance –Allow the development of flexible APIs into the CiteSeer citation graph system –Maintain data security despite an open, wiki-like approach to user-contributed metadata changes –Provide better citation matching compared to the current system
7
Matching Citations and Documents Current offline approach –Citations are grouped according to their extracted metadata –Each citation group is linked to a real document (documents that exist inside the ACI system as well as those not yet collected)
8
Matching Citations and Documents Remember that citations are themselves documents. Treating citations and documents differently brings many unnecessary complications into the system. Citations pointing to a document in the ACI system can be represented by the document’s identity. To represent a cited document that is not yet in the system, we use the notion of a “virtual document”, which takes on the extracted metadata of the citation.
9
Matching Citations and Documents Once the document enters the system, the corresponding virtual record is updated with a pointer to the document file, making it a “real” document record. No citation edge points to an external, unknown resource: all edges are internal to the document database, and “real” and “virtual” documents can be searched in the same index space. We use Lucene to match documents online.
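The virtual-document idea can be illustrated with a small in-memory sketch. This is not the CiteSeer schema: `DocumentRecord`, `DocumentStore`, `add_virtual`, `promote`, and the file path are hypothetical names chosen for illustration, and a plain dictionary stands in for the Lucene index.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DocumentRecord:
    doc_id: int
    title: str
    authors: List[str]
    year: Optional[int] = None
    file_path: Optional[str] = None  # None means the record is still "virtual"
    cites: List[int] = field(default_factory=list)  # edges to other record ids

    @property
    def is_virtual(self) -> bool:
        return self.file_path is None


class DocumentStore:
    """Hypothetical store: real and virtual records share one id space."""

    def __init__(self) -> None:
        self.records: Dict[int, DocumentRecord] = {}
        self._next_id = 1

    def add_virtual(self, title: str, authors: List[str],
                    year: Optional[int] = None) -> DocumentRecord:
        # A virtual record takes on the metadata extracted from a citation.
        rec = DocumentRecord(self._next_id, title, list(authors), year)
        self.records[rec.doc_id] = rec
        self._next_id += 1
        return rec

    def promote(self, doc_id: int, file_path: str) -> DocumentRecord:
        # The document file arrived: the same record becomes "real".
        rec = self.records[doc_id]
        rec.file_path = file_path
        return rec


store = DocumentStore()
citing = store.add_virtual("Citing paper", ["A. Author"], 2006)
store.promote(citing.doc_id, "/repo/citing.pdf")  # assumed example path
cited = store.add_virtual("Cited paper", ["B. Author"], 1999)
citing.cites.append(cited.doc_id)  # the edge stays inside the database
```

Because promotion only fills in `file_path`, existing citation edges to the record remain valid when the cited document is later collected.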
10
Learning from Observations The problem is one of generating beliefs about the identity of a document based on observational evidence. Records may be linked with many information sources –Extracted document metadata –Extracted citations –External records (from DBLP, ACM) –User corrections We focus on metadata elements with small variability in correct representations, such as names, titles, dates, etc.
11
Learning from Observations We use a Bayesian belief network to construct canonical metadata –Decide the canonical value of X from all observations on X. –Each network BEL(X) develops a degree of belief in each possible value of X, and the value with the largest belief score is chosen as canonical. –Given a prior belief vector BEL(x), BEL(x) can be updated with a new observation o(x) using only a local computation.
12
Learning from Observations An example –An observation vector o(x) may be (0, 0, 1, 0), indicating that o(2) (counting from 0) is the observed value for x. –This vector must then be adjusted based on our confidence in the observation. This is achieved using a confidence matrix. –Assigning C=0.7 to o results in an actual message of (0.1, 0.1, 0.7, 0.1) sent to X.
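The example can be reproduced with a short sketch. The even spread of the remaining 0.3 over the unobserved values matches the message (0.1, 0.1, 0.7, 0.1) on the slide; the multiply-and-normalize update in `update_belief` is an assumption (a common form of local Bayesian update), since the slides do not spell out the update rule. Function names are hypothetical.

```python
from typing import List


def observation_message(observed_index: int, n_values: int,
                        confidence: float) -> List[float]:
    """Put `confidence` on the observed value and spread the rest evenly."""
    if n_values < 2:
        return [1.0]
    rest = (1.0 - confidence) / (n_values - 1)
    return [confidence if i == observed_index else rest
            for i in range(n_values)]


def update_belief(prior: List[float], message: List[float]) -> List[float]:
    """Assumed update: element-wise product, then normalize to sum to 1."""
    unnormalized = [p * m for p, m in zip(prior, message)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Observation (0, 0, 1, 0) with C = 0.7, starting from a uniform prior.
message = observation_message(observed_index=2, n_values=4, confidence=0.7)
belief = update_belief([0.25, 0.25, 0.25, 0.25], message)
best = max(range(len(belief)), key=belief.__getitem__)  # index of canonical value
```

Each update touches only the belief vector of the one metadata field being observed, which is what makes the computation local.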
13
Cluster Repair Adjusting metadata dynamically in response to new evidence can lead to inconsistencies in citation groups.
repairCluster(R)
  Find matching citations M for R
  For each citation C in GR
    If C is not contained in M
      Add C to REVOKE
  Set GR = M
  Reset belief vectors
  For each citation C in GR
    If C is not contained in REVOKE
      Update belief vectors using C
  If metadata changes
    repairCluster(R)
14
Cluster Repair Voting privilege –To prevent unbounded iterations, once a citation C1 is removed from GR, it can return to GR but it cannot influence metadata belief vectors for the remainder of the repairCluster iterations. –At the end of a repairCluster call stack, the non-voting citations regain voting privileges.
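A minimal sketch of repairCluster with voting privileges follows. It simplifies heavily: the metadata is a single title string, `find_matches` is a hypothetical prefix matcher standing in for the index query, and `vote` is a majority vote standing in for the belief-vector update. The `revoked` set lives only for one top-level call, so citations regain their vote once the call stack unwinds.

```python
from collections import Counter
from typing import Callable, Optional, Set, Tuple


def repair_cluster(
    metadata: str,
    group: Set[int],
    find_matches: Callable[[str], Set[int]],
    vote: Callable[[Set[int]], str],
    revoked: Optional[Set[int]] = None,
) -> Tuple[str, Set[int]]:
    """Re-match and re-vote until the metadata stops changing."""
    revoked = set() if revoked is None else revoked
    matches = find_matches(metadata)
    revoked |= group - matches   # citations that leave the group lose their vote
    group = set(matches)
    voters = group - revoked     # revoked citations may rejoin but cannot vote
    new_metadata = vote(voters) if voters else metadata
    if new_metadata != metadata:
        return repair_cluster(new_metadata, group, find_matches, vote, revoked)
    return new_metadata, group


# Toy citation titles keyed by citation id (made-up data).
titles = {1: "citation matching", 2: "citation matching",
          3: "citation matching", 4: "citation matchin"}


def find_matches(meta: str) -> Set[int]:
    # Hypothetical matcher: same first 8 characters as the current metadata.
    return {cid for cid, title in titles.items() if title[:8] == meta[:8]}


def vote(voters: Set[int]) -> str:
    # Stand-in for the belief update: the most common title wins.
    return Counter(titles[c] for c in sorted(voters)).most_common(1)[0][0]


metadata, group = repair_cluster("citation matchin", {4}, find_matches, vote)
```

Here the record starts with a misspelled title taken from citation 4; the first pass pulls in citations 1 to 3, the vote corrects the title, and a second pass confirms that nothing changes further.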
15
Evaluation Ten frequently referenced document records were selected from the top of CiteSeer’s most-cited document list, along with all corresponding citations. 9,121 citations were used in the final test set. The data set was run through a noise generation program to deliberately add noise to the citation records. –Randomly insert a word into the title. –Randomly delete a word from the title. –Randomly insert an author name. –Randomly delete an author name. –Randomly misspell a word in the title. –Randomly misspell an author name. –Mistakes in the publication year attribute. Each category of noise has its own parameter controlling the probability with which it occurs, varying from 0 to 1. A noise rate of 0 means the original citation text is used without any intentional modification. A noise rate of 1 means that type of noise is always applied.
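A noise generator of this kind might look like the sketch below. It is not the program used in the evaluation: the rate keys, the inserted tokens ("noise", "N. Oise"), the adjacent-character swap used for misspelling, and the off-by-one year error are all assumptions made for illustration.

```python
import random
from typing import Dict


def misspell(word: str, rng: random.Random) -> str:
    """Swap two adjacent characters; words shorter than 2 are unchanged."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def add_noise(citation: Dict, rates: Dict[str, float],
              rng: random.Random) -> Dict:
    """Return a noisy copy; each noise type fires with its own probability."""
    title = citation["title"].split()
    authors = list(citation["authors"])
    year = citation["year"]

    def hit(kind: str) -> bool:
        # rng.random() is in [0, 1): rate 0 never fires, rate 1 always fires.
        return rng.random() < rates.get(kind, 0.0)

    if hit("insert_title_word"):
        title.insert(rng.randrange(len(title) + 1), "noise")
    if hit("delete_title_word") and title:
        del title[rng.randrange(len(title))]
    if hit("insert_author"):
        authors.insert(rng.randrange(len(authors) + 1), "N. Oise")
    if hit("delete_author") and authors:
        del authors[rng.randrange(len(authors))]
    if hit("misspell_title_word") and title:
        i = rng.randrange(len(title))
        title[i] = misspell(title[i], rng)
    if hit("misspell_author") and authors:
        i = rng.randrange(len(authors))
        authors[i] = misspell(authors[i], rng)
    if hit("year_error"):
        year += rng.choice([-1, 1])
    return {"title": " ".join(title), "authors": authors, "year": year}


clean = {"title": "Refined Online Citation Matching",
         "authors": ["H. Li"], "year": 2006}
noisy = add_noise(clean, {"year_error": 1.0, "delete_author": 1.0},
                  random.Random(0))
```

Passing a seeded `random.Random` keeps a noisy test set reproducible across runs.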
16
Index-Based Citation Clustering Lucene’s fuzzy query is used to match citations to documents. We vary the similarity threshold to observe its effect on precision and recall.
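The threshold experiment can be sketched as follows. Python's `difflib.SequenceMatcher` is swapped in here as a stand-in similarity measure; it is not Lucene's fuzzy query and its scores are not comparable to Lucene's. The citation titles and labels are made-up toy data, and returning 1.0 for precision when nothing is retrieved is a convention chosen for this sketch.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (not Lucene's scoring)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def precision_recall(citations: List[Tuple[str, bool]], document_title: str,
                     threshold: float) -> Tuple[float, float]:
    """citations holds (title, truly_cites_this_document) pairs."""
    retrieved = [truth for title, truth in citations
                 if similarity(title, document_title) >= threshold]
    relevant = sum(1 for _, truth in citations if truth)
    true_pos = sum(1 for truth in retrieved if truth)
    precision = true_pos / len(retrieved) if retrieved else 1.0
    recall = true_pos / relevant if relevant else 1.0
    return precision, recall


citations = [("Citation Matching", True),
             ("citation matchin", True),   # noisy but correct citation
             ("Graph Coloring", False)]

# Sweep the threshold from permissive to strict.
for threshold in (0.0, 0.5, 0.9, 1.0):
    p, r = precision_recall(citations, "Citation Matching", threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

A permissive threshold retrieves everything (full recall, lower precision), while the strictest one keeps only exact matches and misses the noisy citation.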
17
Index-Based Citation Clustering Noise is introduced into the citation data to test the capability of the matching algorithm to handle inaccurate inputs.
18
Metadata Determination and Cluster Repair Confidence in the document metadata was arbitrarily set at 0.8, and confidence in citation data was set at 0.5. The cluster repair algorithm was then used to iteratively query the citation index and repair the document’s metadata until convergence. Only the title, author, and year metadata were tested for accuracy.
19
Metadata Determination and Cluster Repair