1
A Measure of Similarity Between Pairs of Papers
Susan Biancani
Stanford University School of Education
2
Introduction
Long-term goal: Understand changes in scholarly ideas over time
- Develop a person-person similarity measure, to reflect similarity in bodies of work
Short-term goal: Develop a measure of paper-paper similarity
- 9 features, including metadata and content
- Train on 120 papers, rated by experts on a 1-7 scale
3
Data
- 66,000 papers written by professors at Stanford, from the ISI database
- Features for each pair of papers:
  - Cosine similarity of abstract tf-idf vectors; cosine similarity of title tf-idf vectors
  - Cosine similarity of LDA vectors (3 versions)
  - Count of common references
  - Count of journals referenced in common
  - Count of authors referenced in common
  - Dummy indicating whether the two papers were published in the same journal or not
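The text and citation features listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which tf-idf weighting or tokenization was used, so the function names (`tfidf_vectors`, `cosine`, `common_count`) and the plain term-frequency times log-inverse-document-frequency weighting are hypothetical choices, not the author's implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document.

    Assumed weighting: (term count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # An all-zero vector has no direction to compare.
    return dot / (norm_u * norm_v)


def common_count(cited_a, cited_b):
    """Number of distinct items (references, journals, or authors) cited by both papers."""
    return len(set(cited_a) & set(cited_b))


# Toy, made-up abstracts (already tokenized); not data from the study.
abstracts = [
    ["topic", "model", "text"],
    ["topic", "model", "citation"],
    ["protein", "folding", "cell"],
]
vectors = tfidf_vectors(abstracts)
print(cosine(vectors[0], vectors[1]))  # shared terms, so greater than zero
print(cosine(vectors[0], vectors[2]))  # no shared terms
print(common_count(["r1", "r2", "r3"], ["r2", "r3", "r4"]))
```

The same `cosine` function would apply to LDA topic vectors if they are stored as dicts keyed by topic index, and `common_count` covers all three citation counts depending on whether it is given reference IDs, journal names, or author names.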
4
Gold Standard Data
- 31 papers from 8 professors in Sociology
- 44 papers from 7 professors in Biology
- 45 papers from 7 professors in CS
Rating Scale:
Rating | Meaning | Count in Training Corpus
1 | Same paper | 120
2 | Highly related | 134
3 | Same subfield | 394
4 | Related subfields | 389
5 | Same discipline | 1661
6 | Related disciplines | 174
7 | Completely unrelated | 4385
5
Training & Validation
- Regression model:
  rating = β₁·tfidfAbstract + β₂·tfidfTitle + β₃·lda50 + β₄·lda100 + β₅·lda200 + β₆·cites + β₇·citeJournals + β₈·citeAuthors + β₉·sameJournal
- Ordinal Logistic Regression to learn optimal weights for features
- Ten-fold cross validation (comparing predicted rating to actual)
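To make the prediction and validation steps concrete, here is a minimal sketch of how an ordinal logistic model turns a weighted feature sum into a rating, and how ten folds might be formed. It assumes the common proportional-odds form, P(rating ≤ k) = sigmoid(θ_k − β·x), which the slides do not confirm. The weights and cutpoints below are made-up numbers, not fitted values, and fitting them (which needs a numerical optimizer) is not shown. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ordinal_probs(x, weights, cutpoints):
    """Class probabilities under an assumed proportional-odds model.

    P(rating <= k) = sigmoid(cutpoints[k] - weights . x), cutpoints increasing.
    Returns len(cutpoints) + 1 probabilities, one per rating class.
    """
    score = sum(w * xi for w, xi in zip(weights, x))
    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


def predict_rating(x, weights, cutpoints):
    """Most probable rating, numbered from 1."""
    probs = ordinal_probs(x, weights, cutpoints)
    return probs.index(max(probs)) + 1


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k near-equal held-out folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual rating exactly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Illustrative (not fitted) parameters for two similarity features and 7 classes.
# Negative weights: more similarity pushes the rating toward 1 ("Same paper").
weights = [-4.0, -2.0]
cutpoints = [-5.0, -4.0, -3.0, -2.0, -1.0, 0.0]

print(predict_rating([0.0, 0.0], weights, cutpoints))  # no similarity
print(predict_rating([1.0, 1.0], weights, cutpoints))  # maximal similarity
print([len(fold) for fold in k_fold_indices(120, k=10)])
```

In a full run, each fold would be held out in turn, the model fitted on the other nine, and `accuracy` computed on the held-out predictions. Whether the folds were drawn over papers or over pairs of papers is not stated in the slides.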
6
Results 1
Model | Accuracy (6 classes) | Accuracy (5 classes) | Accuracy (collapsed)
rating = tfidfAbstract | 63.83 | 63.65 | 66.84
rating = tfidfAbstract + tfidfTitle | 64.11 | 63.62 | 67.10
rating = lda50 | 65.37 | 60.31 | 69.09
rating = lda200 | 64.84 | 60.86 | 68.29
rating = lda50 + lda200 | 66.23 | 61.04 | 69.79
rating = lda50 + lda100 + lda200 | 70.82 | 61.18 | 70.37
rating = sameJournal | 61.91 | 60.31 | 64.20
rating = cites | 62.49 | 62.35 | 64.71
rating = cites + citeJournal | 71.45 | 63.04 | 75.60
rating = cites + citeJournal + citeAuthor | 71.27 | 63.14 | 75.49
rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 | 67.53 | 63.94 | 70.82
rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals | 70.81 | 65.11 | 74.90
rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors | 70.81 | 64.96 | 74.91
rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors + sameJournal | 70.82 | 65.03 | 74.91
7
Results 2
Model | Accuracy (all classes) | Accuracy (collapsed)
SOC ONLY: rating = tfidfAbstract + tfidfTitle | 70.97 | 82.80
SOC ONLY: rating = lda50 + lda100 + lda200 | 64.73 | 75.27
SOC ONLY: rating = cites + citeJournal | 73.33 | 86.88
SOC ONLY: rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors + sameJournal | 75.70 | 87.96
BIO ONLY: rating = tfidfAbstract + tfidfTitle | 59.20 | 76.43
BIO ONLY: rating = lda50 + lda100 + lda200 | 61.31 | 74.00
BIO ONLY: rating = cites + citeJournal | 61.73 | 74.95
BIO ONLY: rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors + sameJournal | 63.42 | 71.88
CS ONLY: rating = tfidfAbstract + tfidfTitle | 52.07 | 63.09
CS ONLY: rating = lda50 + lda100 + lda200 | 51.57 | 63.21
CS ONLY: rating = cites + citeJournal | 52.55 | 62.24
CS ONLY: rating = tfidfAbstract + tfidfTitle + lda50 + lda100 + lda200 + cites + citeJournals + citeAuthors + sameJournal | 57.14 | 67.62
8
Future Directions
- Improve ratings set:
  - Add more disciplines
  - Confirm ratings with more experts
- Develop a person-person distance measure, treating each person as the cluster of their papers
- Apply this measure to the study of paradigm shifts / scientific-intellectual movements
- Explore the role of organizational structure in these movements
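One possible reading of the person-person distance item, treating each person as the cluster of their papers, is an average-linkage distance: the mean paper-paper distance over every cross pair of papers from the two people. The slides do not specify the aggregation, so this is only a sketch of one option. The `person_distance` name and the `paper_distance` callable are hypothetical stand-ins for whatever paper-paper measure is eventually used.

```python
from itertools import product


def person_distance(papers_a, papers_b, paper_distance):
    """Mean paper-paper distance over all cross pairs (average linkage).

    paper_distance is any callable taking two papers and returning a number,
    for example a predicted 1-7 rating from a paper-paper model.
    """
    pairs = list(product(papers_a, papers_b))
    if not pairs:
        raise ValueError("each person needs at least one paper")
    return sum(paper_distance(p, q) for p, q in pairs) / len(pairs)


# Toy example: papers are plain numbers and distance is their absolute difference.
print(person_distance([1, 2], [3, 4], lambda p, q: abs(p - q)))  # 2.0
```

Other linkages (minimum, maximum, or centroid-based) would fit the same interface and may behave differently for authors whose work spans several fields.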