A Measure of Similarity Between Pairs of Papers
Susan Biancani
Stanford University School of Education
Introduction
- Long-term goal: understand changes in scholarly ideas over time
  - Develop a person-person similarity measure that reflects similarity between bodies of work
- Short-term goal: develop a measure of paper-paper similarity
  - 9 features, including metadata and content
  - Train on pairs drawn from 120 papers, each pair rated by experts on a 1-7 scale
Data
- 66,000 papers written by professors at Stanford, from the ISI database
- Features for each pair of papers:
  - Cosine similarity of abstract tf-idf vectors
  - Cosine similarity of title tf-idf vectors
  - Cosine similarity of LDA vectors (3 versions)
  - Count of common references
  - Count of journals referenced in common
  - Count of authors referenced in common
  - Dummy indicating whether the two papers were published in the same journal
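The text-based and reference-based features above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function names (`tfidf_vectors`, `cosine`, `shared_count`), the whitespace tokenization, and the smoothed idf formula are all assumptions chosen for illustration, and references are treated as sets, so duplicates are ignored.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption): terms found in every document keep a positive weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def shared_count(items_a, items_b):
    """Count of distinct items (references, cited journals, cited authors) in common."""
    return len(set(items_a) & set(items_b))

# Toy abstracts (made up for illustration), tokenized on whitespace.
abstracts = [
    "network structure of citation".split(),
    "citation network analysis".split(),
    "protein folding dynamics".split(),
]
vecs = tfidf_vectors(abstracts)
sim_related = cosine(vecs[0], vecs[1])    # shares "network" and "citation"
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms
common_refs = shared_count(["r1", "r2", "r3"], ["r2", "r3", "r4"])
```

The same `cosine` function would apply to LDA topic vectors if they are stored as topic-to-weight dicts, and `shared_count` covers all three count features by changing what is passed in.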
Gold Standard Data
- 31 papers from 8 professors in Sociology
- 44 papers from 7 professors in Biology
- 45 papers from 7 professors in CS

Rating scale:
Rating  Meaning               Count in Training Corpus
1       Same paper            120
2       Highly related        134
3       Same subfield         394
4       Related subfields     389
5       Same discipline       1661
6       Related disciplines   174
7       Completely unrelated  4385
Training & Validation
- Regression model:
  $\mathrm{rating} = \beta_1\,\mathrm{tfidfAbstract} + \beta_2\,\mathrm{tfidfTitle} + \beta_3\,\mathrm{lda50} + \beta_4\,\mathrm{lda100} + \beta_5\,\mathrm{lda200} + \beta_6\,\mathrm{cites} + \beta_7\,\mathrm{citeJournals} + \beta_8\,\mathrm{citeAuthors} + \beta_9\,\mathrm{sameJournal}$
- Ordinal logistic regression to learn optimal weights for the features
- Ten-fold cross-validation (comparing predicted rating to actual rating)
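As a rough illustration of the two steps above, the sketch below shows how a fitted ordinal logistic model turns a feature vector into a rating, and how ten-fold cross-validation accuracy can be computed. It assumes the common proportional-odds form, P(rating <= k) = sigmoid(cutpoint_k - β·x); the slides do not state which parameterization was used. Fitting is deliberately left to a caller-supplied `fit` function, and all names (`ordinal_probs`, `predict_rating`, `k_fold_accuracy`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def ordinal_probs(x, betas, cutpoints):
    """Class probabilities under an assumed proportional-odds model.

    P(rating <= k) = sigmoid(cutpoints[k-1] - betas . x), with cutpoints increasing.
    Since a high rating means low similarity, similarity features would
    typically receive negative weights under this sign convention.
    """
    score = sum(b * xi for b, xi in zip(betas, x))
    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    return [cumulative[0]] + [cumulative[i] - cumulative[i - 1]
                              for i in range(1, len(cumulative))]

def predict_rating(x, betas, cutpoints):
    """Most probable rating, numbered from 1."""
    probs = ordinal_probs(x, betas, cutpoints)
    return 1 + max(range(len(probs)), key=probs.__getitem__)

def k_fold_accuracy(X, y, fit, predict, k=10, seed=0):
    """Fraction of held-out items whose predicted rating equals the actual rating.

    fit(X_train, y_train) returns a model; predict(model, x) returns a rating.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering every index
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Toy three-class example with one similarity feature (weights are made up).
betas, cutpoints = [-4.0], [-3.0, -1.0]
high_similarity = predict_rating([1.0], betas, cutpoints)
low_similarity = predict_rating([0.0], betas, cutpoints)
```

In practice `fit` would wrap a library estimator of the ordinal model, and `predict` would call something like `predict_rating` with the learned weights and cutpoints.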
Results 1
Columns: Model; Accuracy (6 classes); Accuracy (5 classes); Accuracy (collapsed)
Models:
- rating = tfidfAbstract
- rating = tfidfAbstract + tfidfTitle
- rating = lda
- rating = lda
- rating = lda50 + lda
- rating = lda50 + lda100 + lda
- rating = sameJournal
- rating = cites
- rating = cites + citeJournals
- rating = cites + citeJournals + citeAuthors
- rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda
- rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals
- rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors
- rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors + sameJournal
Results 2
Columns: Model; Accuracy (all classes); Accuracy (collapsed)
Models:
- SOC ONLY: rating = tfidfAbstract + tfidfTitle
- SOC ONLY: rating = lda50 + lda100 + lda
- SOC ONLY: rating = cites + citeJournals
- SOC ONLY: rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors + sameJournal
- BIO ONLY: rating = tfidfAbstract + tfidfTitle
- BIO ONLY: rating = lda50 + lda100 + lda
- BIO ONLY: rating = cites + citeJournals
- BIO ONLY: rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors + sameJournal
- CS ONLY: rating = tfidfAbstract + tfidfTitle
- CS ONLY: rating = lda50 + lda100 + lda
- CS ONLY: rating = cites + citeJournals
- CS ONLY: rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors + sameJournal
Future Directions
- Improve the ratings set
  - Add more disciplines
  - Confirm ratings with more experts
- Develop a person-person distance measure, treating each person as the cluster of their papers
- Apply this measure to the study of paradigm shifts / scientific-intellectual movements
- Explore the role of organizational structure in these movements
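One possible reading of the person-person distance idea is an average-linkage distance between two clusters of papers: the mean paper-paper rating over all cross-person pairs. The sketch below is only an illustration of that reading, not a method proposed in the slides; the function name `person_distance`, the paper ids, and the toy ratings are all hypothetical.

```python
def person_distance(papers_a, papers_b, paper_distance):
    """Average-linkage distance between two people, each given as a list of paper ids.

    paper_distance(p, q) is any paper-paper measure, for example a predicted
    rating on the 1-7 scale, where higher means less related.
    """
    if not papers_a or not papers_b:
        raise ValueError("each person needs at least one paper")
    total = sum(paper_distance(p, q) for p in papers_a for q in papers_b)
    return total / (len(papers_a) * len(papers_b))

# Toy predicted ratings between papers of person A and person B (made up).
toy_ratings = {
    frozenset(("a1", "b1")): 3,
    frozenset(("a1", "b2")): 5,
    frozenset(("a2", "b1")): 4,
    frozenset(("a2", "b2")): 6,
}

def lookup(p, q):
    """Symmetric lookup into the toy rating table."""
    return toy_ratings[frozenset((p, q))]

d = person_distance(["a1", "a2"], ["b1", "b2"], lookup)  # (3 + 5 + 4 + 6) / 4
```

Other linkage choices (minimum, maximum, or centroid-based) would fit the same interface by changing how the pairwise values are aggregated.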