Author Name Disambiguation for Citations Using Topic and Web Correlation Citation : a collection of: coauthor, title, venue, topic, and Web attributes.


1 Author Name Disambiguation for Citations Using Topic and Web Correlation
Citation: a collection of coauthor, title, venue, topic, and Web attributes.

2 Prior work
Supervised classification approaches: model all authors' patterns from a set of training data.
Unsupervised classification approaches: ambiguous citations are clustered into groups of distinct authors by measuring the similarities between the attributes in the citations.

3 Proposed Approach
Topic Correlation
Web Correlation
Pair-Wise Grouping Algorithm

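4 Topic Correlation
Build a topic association network:
1. Use the Apriori algorithm to construct a directed graph whose edge weights are confidence values (the result is a hypergraph).
2. Use a k-way hypergraph partitioning algorithm to split the hypergraph into clusters.
3. These clusters are called the topic association network; the strength of the correlation between research topics is the distance between the citations in this network.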
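As a rough illustration of the first step, the confidence of a rule "topic A implies topic B" is the number of citations containing both topics divided by the number containing A. The sketch below covers only pairwise rules, so it yields a plain directed graph rather than the hypergraph described above, and it omits the partitioning step entirely. The function name `rule_confidences`, the `min_support` parameter, and the sample topic sets are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def rule_confidences(transactions, min_support=2):
    """Return {(a, b): confidence of the rule a -> b} for topic pairs.

    Only pairs that co-occur in at least `min_support` citations are kept.
    This is the 2-itemset level of Apriori, not the full algorithm.
    """
    item_count = Counter()
    pair_count = Counter()
    for topics in transactions:
        unique = set(topics)
        item_count.update(unique)
        # Count each ordered pair once per citation.
        pair_count.update(permutations(sorted(unique), 2))
    return {
        (a, b): n / item_count[a]
        for (a, b), n in pair_count.items()
        if n >= min_support
    }


# Hypothetical topic sets, one per citation.
citations = [
    {"databases", "data mining"},
    {"databases", "data mining"},
    {"databases", "information retrieval"},
    {"databases"},
    {"data mining", "machine learning"},
]
edges = rule_confidences(citations)
```

Confidence is asymmetric, which is why the resulting graph is directed: here "databases" appears in four citations and "data mining" in three, so the two directions of the same pair get different weights.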

5 Web Correlation
Use each title to query a search engine.
Filter the URLs of several digital libraries.
If two citations appear at the same URL, we use them as an instance of Web correlation.
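A minimal sketch of the Web correlation check follows. It assumes the search results have already been collected as a set of URLs per citation title, and it reads "filter" as excluding digital-library pages, which is an interpretation rather than something the slide states. The host list and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: hosts treated as digital libraries and therefore ignored.
DIGITAL_LIBRARY_HOSTS = {"dblp.org", "dl.acm.org"}


def filtered_urls(urls, excluded_hosts=DIGITAL_LIBRARY_HOSTS):
    """Drop URLs whose host is in the excluded set."""
    return {u for u in urls if urlparse(u).hostname not in excluded_hosts}


def web_correlated(urls_a, urls_b):
    """True if two citations share at least one non-library URL."""
    return bool(filtered_urls(urls_a) & filtered_urls(urls_b))


# Hypothetical search results for two citation titles.
urls_title_1 = {"https://example.edu/~author/pubs.html", "https://dblp.org/pid/x"}
urls_title_2 = {"https://example.edu/~author/pubs.html", "https://example.org/news"}
urls_title_3 = {"https://dblp.org/pid/x"}

same_page = web_correlated(urls_title_1, urls_title_2)
library_only = web_correlated(urls_title_1, urls_title_3)
```

In this example the first two titles share a personal publication page and count as correlated, while the first and third share only a digital-library URL and do not.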

6 Pair-Wise Grouping Algorithm
1. Generate pairs of citations by using similarity metrics.
2. Use the training data to train a binary classifier.
3. Apply the classifier to determine whether the pairs are matched.
4. Combine the predicted results to group the citations into appropriate clusters.
5. Filter out the pairs that would make the clusters sparse.

7 Pair-Wise Similarity Metrics
Similarity metrics for coauthor, title, and venue: 1. CSM 2. MSF
Similarity metric for topic correlation: TSM
Similarity metric for Web correlation: MNDF

8 Binary Classifier
A binary classifier is used to learn the distribution of pair-wise vectors.
The pairs predicted as matched are used to build citation clusters (by constructing an undirected graph).
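One way to turn matched pairs into clusters is to treat each pair as an edge of an undirected graph and take its connected components. The sketch below does this with a small union-find structure; the classifier itself is not shown, and the citation identifiers and the `matched_pairs` input are hypothetical stand-ins for its output.

```python
def group_citations(citation_ids, matched_pairs):
    """Group citations into clusters: connected components of the match graph."""
    parent = {c: c for c in citation_ids}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for c in citation_ids:
        clusters.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical classifier output: pairs predicted as the same author.
ids = ["c1", "c2", "c3", "c4", "c5"]
pairs = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
clusters = group_citations(ids, pairs)
```

Note that connected components are transitive: c1 and c3 end up together even though the classifier never matched them directly, which is one reason a later filtering step is useful.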

9 Cluster Filter
A threshold is set for choosing which bridges should be removed.
A bridge is removed if the numbers of vertices in the two separate, but connected, components are above the given threshold.
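The filter can be sketched as follows, assuming a bridge is an edge whose removal disconnects the graph and that "above the threshold" must hold for both sides. This version tests each edge with a breadth-first search, which is simple but quadratic; a linear-time bridge-finding algorithm would be preferable on large graphs. The function names and the example graph are illustrative, and the edge list is assumed to contain no duplicates.

```python
from collections import deque


def component_of(start, adjacency, skipped_edge=None):
    """Vertices reachable from `start`, optionally ignoring one edge."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if skipped_edge in ((node, nxt), (nxt, node)):
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def filter_bridges(edges, threshold):
    """Drop bridges whose two sides both have more than `threshold` vertices."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    kept = []
    for a, b in edges:
        side_a = component_of(a, adjacency, skipped_edge=(a, b))
        if b in side_a:
            kept.append((a, b))  # still connected without it: not a bridge
            continue
        side_b = component_of(b, adjacency, skipped_edge=(a, b))
        if len(side_a) > threshold and len(side_b) > threshold:
            continue  # bridge joining two large groups: remove it
        kept.append((a, b))
    return kept


# Two triangles joined by the bridge (3, 4), plus a pendant vertex 7.
graph = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6), (6, 7)]
kept_edges = filter_bridges(graph, threshold=2)
```

Here the edge (3, 4) is removed because it joins two groups of more than two vertices each, while the bridge (6, 7) is kept because one side contains only a single vertex.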

10 Detecting Ambiguous Author Names in Crowdsourced Scholarly Data
Key feature: results can be obtained at query time.

11 Prior Work
Name disambiguation has been cast as the problem of clustering a set of publications into profiles such that each profile corresponds to a single author.

12 Name Variations and Citations
1. Extract the name variations from a collection of publications.
2. Sort them by number of citations.
3. Look at the percentage of the total citations that is attributed to the top name variations. (A high percentage suggests that the name is not ambiguous.)
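The citation-share signal described on this slide amounts to a short calculation. The sketch below assumes citation counts per name variation are already available; the function name `top_variation_share`, the `top_k` parameter, and the sample counts are invented for illustration, and the slide does not say what share counts as "high".

```python
def top_variation_share(citations_by_variation, top_k=1):
    """Fraction of all citations attributed to the top_k name variations."""
    total = sum(citations_by_variation.values())
    if total == 0:
        return 0.0
    ranked = sorted(citations_by_variation.values(), reverse=True)
    return sum(ranked[:top_k]) / total


# Hypothetical citation counts for variations of one author name.
counts = {"J. Smith": 900, "John Smith": 80, "J. A. Smith": 20}
share = top_variation_share(counts)
```

With these counts the top variation holds 900 of 1000 citations, so the share is 0.9; under the slide's heuristic, a value this high would suggest the name is not ambiguous.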

13 Topic Consistency
Leverage the discipline tags crowdsourced from the users of the Scholarometer system.
Detect different but related disciplines associated with an author name: map an author's publications to topics, and measure the similarity between these topics.
Derive an author's topic profile.

14 A brief survey of automatic methods for author name disambiguation
A classification and summary of recent work on the name ambiguity problem.
Bibliographic citation records: a set of bibliographic attributes, such as author and coauthor names, work title, and publication venue, of a particular publication.

15 Two problems
Synonyms: the same author may appear under distinct names.
Polysems: distinct authors may have similar names.

16 Proposed taxonomy

17 Author Grouping Methods
Defining a similarity function:
1. Using predefined functions: the Levenshtein distance, Jaccard coefficient, cosine similarity, soft-TFIDF, and others.
2. Learning a similarity function: use the training data to produce a similarity function S from R × R (R: the set of references) to {0, 1}, where 1 means that the two references refer to the same author and 0 means that they do not.
3. Exploiting graph-based similarity functions: create a coauthorship graph G = (V, E) for each ambiguous group. Each distinct coauthor name is represented by a vertex, and the weight of an edge reflects the number of articles coauthored by the author names represented by its two vertices.
Author grouping methods apply a similarity function to the attributes of the references to authors (or groups of references) to decide whether to group the corresponding references using a clustering technique.
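Two of the predefined functions named on slide 17 are short enough to sketch directly. The code below gives a standard dynamic-programming Levenshtein distance and a Jaccard coefficient over sets; how they would be applied to reference attributes (which fields, what normalization) is not specified in the slides, so the coauthor sets used here are made up for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Hypothetical coauthor sets taken from two references.
coauthors_1 = {"a. gupta", "b. li"}
coauthors_2 = {"b. li", "c. wang"}

name_distance = levenshtein("kitten", "sitting")
coauthor_overlap = jaccard(coauthors_1, coauthors_2)
```

Levenshtein distance suits short strings such as names, where small spelling differences matter, while the Jaccard coefficient suits set-valued attributes such as coauthor lists or title words.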

18 Author Grouping Methods
Clustering techniques:
1. Partitioning
2. Hierarchical agglomerative clustering
3. Density-based clustering
4. Spectral clustering
Author grouping methods apply a similarity function to the attributes of the references to authors (or groups of references) to decide whether to group the corresponding references using a clustering technique.

19 Author assignment methods
Classification: assign the references to their authors using a supervised machine learning technique.
Clustering: use probabilistic techniques to determine the author in an iterative way to fit the model.
Author assignment methods directly assign each reference to a given author by constructing a model that represents the author.

20 Explored evidence
Citation information: the attributes directly extracted from the citations, such as author/coauthor names, work title, publication venue title, year, and so on.
Web information: data retrieved from the Web that is used as additional information about an author's publication profile.
Implicit evidence: evidence inferred from visible elements of attributes, such as the latent topics of a citation.

21 Summary of characteristics-Author grouping methods

22 Summary of characteristics-Author assignment methods

23 Open challenges
Very little data in the citations
Very ambiguous cases: ambiguous references will have coauthors who also have ambiguous names (especially Asian names)
Citations with errors
Efficiency
Different knowledge areas: our focus is only on computer science
Incremental disambiguation
Author profile changes
New authors

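24 pandasearch: research plan for the name ambiguity problem
Read the related papers and identify the solution best suited to the current problem.
Focus on implicit evidence and Web information (especially scholars' personal homepages and CVs).
Work on both efficiency and accuracy, with the emphasis on accuracy.
Study the fundamentals of data mining and machine learning.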

25 pandasearch: implementation plan for the name ambiguity problem
Type of approach: author grouping methods, learning a similarity function.
Explored evidence: citation information, Web information, implicit evidence.

