CiteGraph: A Citation Network System for MEDLINE Articles and Analysis
Qing Zhang 1,2, Hong Yu 1,3
1 University of Massachusetts Medical School, Worcester, MA, USA
2 University of Wisconsin-Milwaukee, Milwaukee, WI, USA
3 VA Central Massachusetts, Leeds, MA, USA
Outline
- Introduction
- Background
- Method
- Evaluation
- Analysis
CiteGraph, MedInfo 2013
Introduction
- Citation networks are important for
  - Information retrieval
  - Journal Impact Factor, H-index
- Co-authorship networks are important
- Few citation networks are available for research
- We built CiteGraph
Background
- Citation network analysis
  - Power law distribution in citation networks
  - Article ranking: HITS and PageRank
  - Community structure of physics fields
  - Citation network tool for a given legal issue, using a legal document citation network
- Co-authorship network analysis
  - Research collaboration patterns
  - Author authority: Erdős number
- Literature search
  - CiteSeerX, Google Scholar
The CiteGraph Data
Citation Network Example
Challenges
(1) Yu, H and Lee M Accessing Bioscience Images from Abstract Sentences. Bioinformatics. Vol 22 No. 14, pages e547–e556.
(2) Hong Yu and Minsuk Lee. Accessing Bioscience Images from Abstract Sentences. Bioinformatics. Vol 22 No. 14, pages e547–e
(3) Yu H, Lee H Accessing Bioscience Images from Abstract Sentences. Bioinformatics: 22 (14), e547–e556.
Methods
- Mapping between articles
- Mapping articles to the PubMed ID
- Author name disambiguation
Methods
If two of the following three matching tests succeed, we consider the two entities (for example, a citation and an article) to be matched.
- Title matching
  - The set of tokens in one title field is a subset of the tokens in the other, or
  - the number of tokens common to both fields is more than 80% of the size of the larger of the two fields.
- Author list matching
  - The two lists of surnames have a one-to-one mapping.
  - The surnames in one entity (the citation) are fully contained in the surname set of the second (the article).
- Journal name matching
  - Stop words such as "of" are removed.
  - If the number of common initials in the journal titles is greater than 80% of the number of tokens in the longer journal name, the names are considered equivalent.
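The three tests above can be sketched in Python. This is a minimal illustration, not the CiteGraph implementation. The slide does not specify several details, so the following are assumptions: the tokenizer (lowercase alphanumeric runs), the stop-word list, treating the two author-list conditions as alternatives checked in either direction, and reading "two of" as "at least two". All function names are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list; the slide only gives "of" as an example.
STOP_WORDS = {"of", "the", "and", "in", "for", "on", "a", "an"}


def tokens(text):
    """Assumed tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_match(title_a, title_b):
    a, b = set(tokens(title_a)), set(tokens(title_b))
    if not a or not b:
        return False
    # Rule 1: one token set is a subset of the other.
    if a <= b or b <= a:
        return True
    # Rule 2: common tokens exceed 80% of the larger field.
    return len(a & b) > 0.8 * max(len(a), len(b))


def author_match(surnames_a, surnames_b):
    a = {s.lower() for s in surnames_a}
    b = {s.lower() for s in surnames_b}
    if not a or not b:
        return False
    # A one-to-one mapping (equal sets) is a special case of containment.
    return a <= b or b <= a


def journal_match(journal_a, journal_b):
    a = [t for t in tokens(journal_a) if t not in STOP_WORDS]
    b = [t for t in tokens(journal_b) if t not in STOP_WORDS]
    if not a or not b:
        return False
    # Count initials shared by both names (multiset intersection).
    common = sum((Counter(t[0] for t in a) & Counter(t[0] for t in b)).values())
    return common > 0.8 * max(len(a), len(b))


def entities_match(x, y):
    """x and y are dicts with 'title', 'surnames', and 'journal' keys."""
    passed = [
        title_match(x["title"], y["title"]),
        author_match(x["surnames"], y["surnames"]),
        journal_match(x["journal"], y["journal"]),
    ]
    # Assumption: "two of the following" is read as "at least two".
    return sum(passed) >= 2
```

Comparing initials lets an abbreviated journal name such as "J Biomed Inform" match its full form, while the two-of-three vote tolerates one noisy field in a citation string.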
Evaluation Results
- Annotators were invited to annotate the citation mapping and PMID mapping results.
- Each annotator was presented with 20 matching results for each task.
Task | Precision | Recall | F1 | Inter-Annotator Agreement (Kappa)
Citation Mapping
PMID Mapping
The CiteGraph Statistics
- M articles
- 6.35 M citations
- 1.37 M authors
The CiteGraph Statistics
log y = 1.06 - 2.45 * log x (p < 0.05, t-test)
Livak KJ, Schmittgen TD. Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods Dec;25(4):402-8.
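A power-law fit of this form is commonly obtained by ordinary least squares on log-transformed values. The sketch below shows that generic procedure; it is not necessarily how the fit on this slide was produced. The slide does not define x, y, or the logarithm base, so base 10 and the function name `fit_log_log` are assumptions.

```python
import math


def fit_log_log(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x).

    xs and ys must be positive; base 10 is an assumption.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic points that lie exactly on the reported line.
xs = [1, 2, 4, 8, 16]
ys = [10 ** 1.06 * x ** -2.45 for x in xs]
intercept, slope = fit_log_log(xs, ys)
```

A negative slope of this size means y falls by a factor of roughly 10^2.45 for every tenfold increase in x.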
The CiteGraph Statistics
- Largest connected component: 1.27 million authors (92.7%)
- Second largest connected component: 35 authors
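Component sizes like these can be computed from a list of co-author pairs with a union-find structure. The following is a minimal sketch, not the CiteGraph code; the function name `component_sizes` and the pair-list input format are assumptions.

```python
from collections import Counter


def component_sizes(pairs):
    """Return connected-component sizes, largest first.

    pairs: iterable of (author_a, author_b) co-authorship edges.
    Authors with no co-author never appear in a pair and are not counted.
    """
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = Counter(find(a) for a in list(parent))
    return sorted(sizes.values(), reverse=True)


sizes = component_sizes([("a", "b"), ("b", "c"), ("d", "e")])
```

The first element of the result is the size of the largest component and the second element is the runner-up.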
The CiteGraph Statistics
Co-authorship spans range from 1 to 35 years, while 83.7% of author pairs appear together only once.
The CiteGraph Statistics
Measure | Mean | Median | Std | Max | Min
# of Co-authors
Co-authorship Year Span
* The largest component is excluded when calculating the statistics in the table. Its size is 1.27 million (92.7% of authors).
Trends
Conclusion
- We created citation and co-authorship networks from biomedical full-text literature
- Our networks have high accuracy and large scale, and they can benefit biomedical text mining communities
  - Article ranking
  - Research collaboration recommendation
  - Social network analysis
- The network database is available for download upon request
Acknowledgement
- National Institutes of Health 1R01GM to Hong Yu
- A start-up fund from University of Massachusetts Medical School to Hong Yu
- National Center for Advancing Translational Sciences of the National Institutes of Health under award number UL1TR