Katholieke Universiteit Leuven – ESAT/SCD – Steunpunt O&O Indicatoren
Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
Frizo Janssens, Wolfgang Glänzel, Bart De Moor
Introduction Text mining Bibliometrics & network analysis Hybrid clustering Dynamic hybrid mapping of bioinformatics Conclusions Poster #17 2/24
Overview of the presentation
Introduction
  General context & objectives
  Clustering
Text mining framework
Bibliometrics, citation analysis
Hybrid (integrated) clustering
  Linear combination
  Fisher’s inverse chi-square method
Dynamic hybrid mapping of bioinformatics
Conclusions
  Further research
General context
Mapping of scientific and technological fields by using clustering algorithms and techniques from bibliometrics and text mining
Complementary views on the document set → other perceptions of similarity
  Textual information: number of words in common
  Citation networks, bibliometric properties
Goal: integrate text mining & bibliometrics (hybrid approach)
  Better clustering and classification performance
  Mapping the cognitive structure and dynamics of bioinformatics
Agglomerative hierarchical clustering
[Figure: toy example with 20 ‘objects’ (Person 1 … Person 20; 10 women, 10 men) described by features]
  (a) Features: length, hair color
  (b) Features: length, hair color, interest in football → more discriminative power (?)
  (c) Distance matrix (e.g., Euclidean) between P1 … P20, with zeros on the diagonal
Agglomerative hierarchical clustering with ‘linkage’ → binary tree, (hypothetical) dendrogram → 2 clusters
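The toy example above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the feature values, the choice of average linkage, and the function names `euclidean` and `agglomerative` are all assumptions made for the example, and the features are left unnormalized for brevity.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage choice).

    Returns a list of clusters, each a list of indices into `points`.
    """
    n = len(points)
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per object
    while len(clusters) > n_clusters:
        best = None  # (linkage distance, cluster index a, cluster index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                link = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

# Hypothetical objects: (length in metres, interest in football on a 0-1 scale).
persons = [(1.62, 0.1), (1.65, 0.2), (1.60, 0.0),
           (1.82, 0.9), (1.85, 0.8), (1.80, 1.0)]
two_clusters = agglomerative(persons, 2)
print(two_clusters)
```

Stopping the merging at `n_clusters` corresponds to cutting the dendrogram at a fixed number of clusters; recording the merge order instead would give the full binary tree.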
Indexing in Vector Space Model
Digital documents (Doc 1, Doc 2, Doc 3, …, Doc n) → text extraction (.txt)
Neglect structure, stop word removal, stemming, phrase detection, …
‘Bags of words’ remain
‘Indexing’, weighting (e.g., TF-IDF), …
Term-by-document matrix A: rows Term 1 … Term m (the vocabulary), columns Doc 1 … Doc n
Similarity between documents = cosine of the angle between their vectors
[Figure: example input document, “Towards Mapping Library and Information Science” (Frizo Janssens, Jacqueline Leta, Wolfgang …)]
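A minimal sketch of the indexing pipeline above, assuming whitespace tokenization and a plain `tf * log(n / df)` weighting. Stop word removal, stemming, and phrase detection are omitted, and the three example documents and the names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw strings into sparse TF-IDF vectors (dict: term -> weight)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 for a zero vector."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical mini-corpus; each string stands in for one extracted document.
docs = ["protein structure prediction",
        "rna structure prediction",
        "microarray gene expression analysis"]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

A text-based distance between two documents can then be taken as one minus the cosine similarity.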
Bibliometrics and network analysis
Bibliographic coupling
[Figure: two documents x and y coupled through shared cited references]
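Bibliographic coupling links two documents through the references they have in common. As a hedged sketch, the coupling strength can be normalized with a cosine-style measure over the two reference lists; the normalization and the name `bc_similarity` are assumptions for illustration and are not stated on the slide.

```python
import math

def bc_similarity(refs_x, refs_y):
    """Cosine-normalized bibliographic coupling between two reference lists.

    Number of shared references divided by the geometric mean of the two
    reference-list sizes (an assumed normalization).
    """
    set_x, set_y = set(refs_x), set(refs_y)
    if not set_x or not set_y:
        return 0.0  # a document without references cannot be coupled
    shared = len(set_x & set_y)
    return shared / math.sqrt(len(set_x) * len(set_y))

# Hypothetical reference identifiers for documents x and y.
refs_x = ["r1", "r2", "r3", "r4"]
refs_y = ["r3", "r4", "r5", "r6"]
similarity = bc_similarity(refs_x, refs_y)
distance = 1.0 - similarity  # a link-based distance for clustering
print(similarity, distance)
```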
Hybrid (integrated) clustering
Integrate complementary information
  Textual content
  Citations
  Other bibliometric indicators
Intermediate integration
  Pairwise distances calculated in separate spaces
  Incorporated before clustering
Hybrid clustering: intermediate integration
Inputs (documents × documents)
  Text-based distance matrix $D_{text}$ (text-based distances)
  Distance matrix based on bibliometrics $D_{bibl}$ (distances based on co-citation or bibliographic coupling)
Integration: weighted linear combination or Fisher’s inverse chi-square method → integrated distance matrix $D_i$ (integrated distances)
Hierarchical clustering using $D_i$
Internal validation: number of clusters?
  Dendrogram
  Silhouette curves
  Silhouette plot
  Stability diagram
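For the internal validation step, silhouette values can be computed directly from a distance matrix and a cluster assignment. The sketch below uses the standard silhouette definition and assumes at least two clusters; the 4 × 4 distance matrix and both labelings are hypothetical.

```python
def silhouette_values(dist, labels):
    """Silhouette value per object, given a square distance matrix and labels.

    s(i) = (b - a) / max(a, b), where a is the mean distance to the other
    members of i's cluster and b the smallest mean distance to another cluster.
    Requires at least two clusters.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Hypothetical integrated distances between four documents.
dist = [[0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.85, 0.9],
        [0.9, 0.85, 0.0, 0.2],
        [0.8, 0.9, 0.2, 0.0]]
good = silhouette_values(dist, [0, 0, 1, 1])
bad = silhouette_values(dist, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

Averaging the silhouette values for each candidate number of clusters gives one point of a silhouette curve; a higher mean suggests a better-separated solution.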
Weighted linear combination (linco)
$D_i = \alpha \cdot D_{text} + (1-\alpha) \cdot D_{bibl}$
Attractive, easy, and scalable
However, neglects differences in distributional characteristics!
  [Figure: histograms of mutual distances (<1) based on text (left) and BC (right)]
  Unequal or unfair contribution of data sources
  Implicitly favoring text over bibliometric information or vice versa
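The weighted linear combination is an element-wise blend of the two distance matrices. A minimal sketch, with a hypothetical function name and made-up 2 × 2 matrices:

```python
def linear_combination(d_text, d_bibl, alpha):
    """Element-wise alpha * D_text + (1 - alpha) * D_bibl for equal-shape matrices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [[alpha * t + (1.0 - alpha) * b for t, b in zip(row_t, row_b)]
            for row_t, row_b in zip(d_text, d_bibl)]

# Hypothetical distances between two documents in each space.
d_text = [[0.0, 0.4], [0.4, 0.0]]
d_bibl = [[0.0, 1.0], [1.0, 0.0]]
d_int = linear_combination(d_text, d_bibl, alpha=0.5)
print(d_int)
```

If the bibliometric distances pile up near 1 while the text distances are spread more evenly, the same `alpha` gives the two sources unequal influence, which is the distributional concern raised above.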
Fisher’s inverse chi-square method
‘Omnibus statistic’ from statistical meta-analysis
Combine p-values from multiple sources
Freed from distributional differences
Avoids overcompensation of either data source
Fisher’s inverse chi-square method
[Figure: integration pipeline for a document pair (y, z)]
  ‘Real’ text data (documents × terms) and ‘real’ citation data (documents × citations) → distance matrices $D_t$ and $D_{bc}$
  Randomize → randomized text data and randomized citation data → cumulative distribution (cdf) of distances in each
  Distance of (y, z) against each cdf → p-value $p_1$ (text) and p-value $p_2$ (citations); documents × documents matrices of p-values
  Fisher’s omnibus: $p_i = -2 \cdot \log(p_1^{\lambda} \cdot p_2^{1-\lambda})$
  Integrated p-values (documents × documents) → $D_i$
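The two steps above, reading a p-value off the cdf of randomized distances and combining the two p-values with the weighted omnibus formula, can be sketched as follows. The add-one smoothing in `empirical_p_value` (to avoid `log(0)`), the function names, and the null distance samples are assumptions for illustration; how the combined statistic is turned into the final $D_i$ is not specified here.

```python
import bisect
import math

def empirical_p_value(distance, null_distances_sorted):
    """Share of randomized-data distances that are <= the observed distance.

    Uses add-one smoothing (an assumption) so the p-value is never zero.
    """
    rank = bisect.bisect_right(null_distances_sorted, distance)
    return (rank + 1) / (len(null_distances_sorted) + 1)

def fisher_omnibus(p1, p2, lam=0.5):
    """Weighted omnibus statistic -2 * log(p1**lam * p2**(1 - lam))."""
    return -2.0 * (lam * math.log(p1) + (1.0 - lam) * math.log(p2))

# Hypothetical distances observed in randomized text and citation data.
null_text = sorted([0.6, 0.7, 0.8, 0.9])
null_bc = sorted([0.95, 0.97, 0.99, 1.0])

p1 = empirical_p_value(0.65, null_text)  # text distance of one document pair
p2 = empirical_p_value(0.96, null_bc)    # BC distance of the same pair
statistic = fisher_omnibus(p1, p2, lam=0.5)
print(p1, p2, statistic)
```

Because each source is first mapped to a p-value against its own randomized data, a text distance of 0.65 and a BC distance of 0.96 end up on the same scale here even though the raw values differ considerably.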
Fisher’s inverse chi-square method
[Figure: histogram of pairwise document distances for text and BC]
[Figure: histogram of p-values for real data w.r.t. randomized datasets]
Conclusions from previous research
Text-only >> cited references
SVD greatly ameliorates results, especially for text (LSI)
Best performance: integration!
Fisher's inverse chi-square
  Significantly > text-only, link-only, & concatenation
  No significant difference with lincos when SVD is used
  Generic: can incorporate distances with highly dissimilar distributions
Weighted linco: good option if LSI is used
F. Janssens, V. Tran Quoc, W. Glänzel, and B. De Moor. Integration of textual content and link information for accurate clustering of science fields. In Proceedings of the I International Conference on Multidisciplinary Information Sciences & Technologies (InSciT2006). Current Research in Information Sciences and Technologies, volume I, pages 615–619, Mérida, Spain, October 2006.
Dynamic hybrid mapping of bioinformatics
Total: 7401
Number of clusters and LSI factors
Number of clusters: stability diagram
Number of clusters: link-based Silhouette values
Dendrogram
1. RNA structure prediction
2. Protein structure prediction
3. Systems biology & molecular networks
4. Phylogeny & evolution
5. Genome sequencing & assembly
6. Gene/promoter/motif prediction
7. Molecular DBs & annotation platforms
8. Multiple sequence alignment
9. Microarray analysis
Dynamics
Dynamic term networks
Conclusions
Main contributions
  Hybrid clustering (of bioinformatics)
  Clustering and classification significantly improved
  Generic: other application domains
Further research
  Fuzzy clustering
  Semi-supervised clustering and active learning
  Spectral clustering
  Other matrix decompositions (e.g., NMF)
  Multilinear (tensor) algebra
  Mapping the world’s total yearly publication output
  Detect emerging and converging clusters & hot topics
  Science-technology interaction