Katholieke Universiteit Leuven – ESAT/SCD – Steunpunt O&O Indicatoren, 14-08-2007. Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Presentation transcript:

Slide 1/24: Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
Frizo Janssens, Wolfgang Glänzel, Bart De Moor
Katholieke Universiteit Leuven – ESAT/SCD – Steunpunt O&O Indicatoren

Slide 2/24: Overview of the presentation
Navigation bar (repeated on every slide): Introduction; Text mining; Bibliometrics & network analysis; Hybrid clustering; Dynamic hybrid mapping of bioinformatics; Conclusions. Poster #17.
Introduction
  General context & objectives
  Clustering
Text mining framework
Bibliometrics, citation analysis
Hybrid (integrated) clustering
  Linear combination
  Fisher’s inverse chi-square method
Dynamic hybrid mapping of bioinformatics
Conclusions
  Further research

Slide 3/24: General context
Mapping of scientific and technological fields by using clustering algorithms and techniques from bibliometrics and text mining.
Complementary views on a document set → other perceptions of similarity
  Textual information: number of words in common
  Citation networks, bibliometric properties
Goal: integrate text mining & bibliometrics (hybrid approach)
  Better clustering and classification performance
  Mapping the cognitive structure and dynamics of bioinformatics

Slide 4/24: Agglomerative hierarchical clustering
[Figure: toy example with 20 persons (the ‘objects’: 10 women, 10 men) described by features.
(a) Object-by-feature table (Person 1 … Person 20) with the features length and hair color.
(b) The same table with a third feature, interest in football: more discriminative power (?).
(c) Distance matrix (e.g. Euclidean) between P1 … P20, with zeros on the diagonal.
Agglomerative hierarchical clustering with a ‘linkage’ criterion turns the distance matrix into a binary tree, shown as a (hypothetical) dendrogram that is cut into 2 clusters.]
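The procedure on this slide can be sketched in plain Python: build a Euclidean distance matrix from an object-by-feature table, then merge the two closest clusters repeatedly until two remain. The six persons, their numeric feature values, and the use of average linkage are assumptions made for this example; the slide names neither a specific linkage criterion nor any data.

```python
import math

# Hypothetical toy data (not from the slides): one row per person with
# (length in metres, hair darkness 0..1, interest in football 0..1).
PEOPLE = ["P1", "P2", "P3", "P4", "P5", "P6"]
FEATURES = [
    (1.82, 0.2, 0.9),
    (1.78, 0.8, 0.8),
    (1.85, 0.5, 0.7),
    (1.65, 0.3, 0.2),
    (1.70, 0.9, 0.1),
    (1.62, 0.6, 0.3),
]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(rows):
    """Symmetric object-by-object distance matrix with zeros on the diagonal."""
    n = len(rows)
    return [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def agglomerate(dist, n_clusters):
    """Average-linkage agglomerative clustering on a precomputed distance matrix.

    Returns the final clusters (lists of object indices) and the merge history,
    which is the information a dendrogram would display.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)  # average linkage
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


dist = distance_matrix(FEATURES)
clusters, merges = agglomerate(dist, n_clusters=2)
for members in clusters:
    print([PEOPLE[i] for i in members])
```

In this toy table the football feature carries most of the separation, which mirrors the slide's question of whether an extra feature adds discriminative power.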

Slide 5/24: Indexing in Vector Space Model
[Figure: indexing pipeline.
Digital documents (Doc 1, Doc 2, Doc 3, …, Doc n) → text extraction (.txt) → neglect structure, stop word removal, stemming, phrase detection, … → ‘bags of words’ remain → ‘indexing’, weighting (e.g., TF-IDF) → term-by-document matrix A, with rows Term 1 … Term m (the vocabulary) and columns Doc 1 … Doc n.
Inset: document vectors plotted against the axes Term 1 and Term 2.
Example document shown: "Towards Mapping Library and Information Science", Frizo Janssens, Jacqueline Leta, Wolfgang …]
Similarity between documents = cosine of the angle between their vectors.
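A minimal sketch of the indexing step follows, assuming already-cleaned, whitespace-tokenized documents and the common weighting tf · log(N/df). The three toy documents and this particular TF-IDF variant are assumptions; the slide only says "e.g., TF-IDF", and stop word removal, stemming and phrase detection are omitted here.

```python
import math
from collections import Counter

# Hypothetical, already pre-processed documents (not from the slides).
DOCS = {
    "doc1": "protein structure prediction protein folding",
    "doc2": "protein structure alignment",
    "doc3": "microarray gene expression clustering",
}


def tfidf_vectors(docs):
    """Return the vocabulary and one TF-IDF vector per document.

    Stacking the vectors as columns gives the term-by-document matrix A.
    """
    tokenized = {name: text.split() for name, text in docs.items()}
    vocabulary = sorted({t for tokens in tokenized.values() for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = [tf[t] * math.log(n_docs / df[t]) for t in vocabulary]
    return vocabulary, vectors


def cosine(u, v):
    """Cosine of the angle between two document vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocabulary, vectors = tfidf_vectors(DOCS)
sim_12 = cosine(vectors["doc1"], vectors["doc2"])
sim_13 = cosine(vectors["doc1"], vectors["doc3"])
print(f"doc1 vs doc2: {sim_12:.3f}, doc1 vs doc3: {sim_13:.3f}")
```

A text-based distance for clustering can then be taken as one minus the cosine similarity, which is one common choice rather than something the slide states.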

Slide 6/24: Bibliometrics and network analysis
Bibliographic coupling
[Figure: documents x and y.]
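Two documents x and y are bibliographically coupled when their reference lists overlap. The sketch below counts shared references and, as an assumption not taken from the slide, normalizes the count with a cosine-style measure so that it can be turned into a distance between 0 and 1. The reference sets are hypothetical.

```python
import math

# Hypothetical reference lists (not from the slides): document -> cited references.
REFERENCES = {
    "x": {"r1", "r2", "r3", "r4"},
    "y": {"r3", "r4", "r5"},
    "w": {"r6"},
}


def coupling_strength(a, b):
    """Number of references that documents a and b have in common."""
    return len(REFERENCES[a] & REFERENCES[b])


def coupling_similarity(a, b):
    """Shared references divided by the geometric mean of the list lengths.

    This cosine-style normalization is an assumption made for the example.
    """
    size = math.sqrt(len(REFERENCES[a]) * len(REFERENCES[b]))
    return coupling_strength(a, b) / size if size else 0.0


def coupling_distance(a, b):
    """Distance in [0, 1]; 1 means no shared references."""
    return 1.0 - coupling_similarity(a, b)


print(coupling_strength("x", "y"), round(coupling_distance("x", "y"), 3))
```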

Slide 7/24: Hybrid (integrated) clustering
Integrate complementary information
  Textual content
  Citations
  Other bibliometric indicators
Intermediate integration
  Pairwise distances calculated in separate spaces
  Incorporated before clustering

Slide 8/24: Hybrid clustering: intermediate integration
[Figure: workflow.
Text-based distance matrix $D_{\text{text}}$ (documents × documents): text-based distances.
Distance matrix based on bibliometrics $D_{\text{bibl}}$ (documents × documents): distances based on co-citation or bibliographic coupling.
Both are merged, using a weighted linear combination or Fisher’s inverse chi-square method, into the integrated distance matrix $D_i$ (integrated distances), which is the input to hierarchical clustering.]
Internal validation: number of clusters?
  Dendrogram
  Silhouette curves
  Silhouette plot
  Stability diagram
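Of the internal validation tools listed on this slide, the Silhouette value is the easiest to show in a few lines. The sketch below computes it from a precomputed distance matrix and a cluster labeling, using the standard definition s = (b − a) / max(a, b). The 4 × 4 toy matrix and the convention of scoring singleton clusters as 0 are assumptions; the slides do not show how the authors computed their curves.

```python
def silhouette_values(dist, labels):
    """Silhouette value per object for a precomputed distance matrix.

    a = mean distance to the other members of the object's own cluster,
    b = smallest mean distance to the members of any other cluster.
    Requires at least two clusters.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values


# Hypothetical integrated distances: objects 0-1 and 2-3 form two tight pairs.
D = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]

good = silhouette_values(D, [0, 0, 1, 1])
bad = silhouette_values(D, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

Computing the mean Silhouette value for each candidate number of clusters and comparing them is one way to read off a sensible cut of the dendrogram.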

Slide 9/24: Weighted linear combination (linco)
$D_i = \alpha \cdot D_{\text{text}} + (1 - \alpha) \cdot D_{\text{bibl}}$
Attractive, easy, and scalable
However, neglects differences in distributional characteristics!
  [Figure: histograms of mutual distances (<1) based on text (left) and BC, i.e. bibliographic coupling (right).]
  Unequal or unfair contribution of data sources
  Implicitly favoring text over bibliometric information or vice versa
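The weighted linear combination is an element-wise blend of the two distance matrices. A minimal sketch, assuming both matrices are square lists of lists over the same documents in the same order; the toy values and the choice α = 0.5 are hypothetical.

```python
def linear_combination(d_text, d_bibl, alpha):
    """Element-wise D_i = alpha * D_text + (1 - alpha) * D_bibl."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * t + (1.0 - alpha) * b for t, b in zip(row_t, row_b)]
        for row_t, row_b in zip(d_text, d_bibl)
    ]


# Hypothetical distances for three documents (not from the slides).
D_TEXT = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
D_BIBL = [
    [0.0, 0.6, 1.0],
    [0.6, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

D_I = linear_combination(D_TEXT, D_BIBL, alpha=0.5)
for row in D_I:
    print([round(v, 2) for v in row])
```

In the toy data the bibliometric distances sit closer to 1 than the text distances, so with a fixed α one source shifts the combined values more than the other; this is the distributional problem the slide points out.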

Slide 10/24: Fisher’s inverse chi-square method
‘Omnibus statistic’ from statistical meta-analysis
Combine p-values from multiple sources
Freed from distributional differences
Avoids overcompensation of either data source

Slide 11/24: Fisher’s inverse chi-square method
[Figure: workflow.
‘Real’ text data (documents × terms) and ‘real’ citation data (documents × citations) yield the distance matrices $D_t$ and $D_{bc}$.
Both data sets are randomized (randomized text data, randomized citation data); the distances in the randomized data give one cumulative distribution function (cdf; axes: dist vs. cumul. share) per data source.
A real text distance $y$ and a real citation distance $z$ are read off against the corresponding cdf to obtain the p-values $p_1$ and $p_2$, stored in documents × documents p-value matrices.
Fisher’s omnibus: $p_i = -2 \cdot \log\left(p_1^{\lambda} \cdot p_2^{1-\lambda}\right)$ gives the integrated p-values (documents × documents), leading to $D_i$.]
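The workflow on this slide can be approximated as follows: look up each real distance in the empirical cdf of the distances obtained from randomized data, then combine the two resulting p-values with the weighted omnibus formula. This is a sketch under stated assumptions: the randomized distance samples are made-up numbers, the small floor that keeps log(0) from occurring is added for the example, and the final conversion of the statistic into $D_i$ is left out because the slide does not spell it out.

```python
import bisect
import math


def empirical_p_value(distance, sorted_random_distances, floor=1e-12):
    """Share of randomized distances that are <= the real distance (empirical cdf).

    The floor is an assumption of this sketch; it avoids log(0) later on.
    """
    rank = bisect.bisect_right(sorted_random_distances, distance)
    return max(rank / len(sorted_random_distances), floor)


def fisher_omnibus(p1, p2, lam=0.5):
    """Weighted omnibus statistic -2 * log(p1**lam * p2**(1 - lam))."""
    return -2.0 * (lam * math.log(p1) + (1.0 - lam) * math.log(p2))


# Hypothetical distances observed in randomized text and citation data.
RANDOM_TEXT = sorted([0.70, 0.75, 0.80, 0.85, 0.90, 0.92, 0.95, 0.97, 0.98, 0.99])
RANDOM_BC = sorted([0.90, 0.95, 0.97, 0.98, 0.99, 1.0, 1.0, 1.0, 1.0, 1.0])

# A pair of documents that is close in both views ...
p1_close = empirical_p_value(0.72, RANDOM_TEXT)
p2_close = empirical_p_value(0.93, RANDOM_BC)
stat_close = fisher_omnibus(p1_close, p2_close)

# ... and a pair that is no closer than random pairs.
p1_far = empirical_p_value(0.96, RANDOM_TEXT)
p2_far = empirical_p_value(1.0, RANDOM_BC)
stat_far = fisher_omnibus(p1_far, p2_far)

print(round(stat_close, 3), round(stat_far, 3))
```

Because both sources are first mapped to p-values on the same 0 to 1 scale, their different raw distance distributions no longer decide which source dominates; a larger statistic signals a pair that is unusually close in both views.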

Slide 12/24: Fisher’s inverse chi-square method
[Figure: histogram of pairwise document distances for text and BC.]
[Figure: histogram of p-values for real data w.r.t. randomized datasets.]

Slide 13/24: Conclusions from previous research
Text-only >> cited references
SVD greatly improves results, especially for text (LSI)
Best performance: integration!
Fisher's inverse chi-square
  Significantly > text-only, link-only, & concatenation
  No significant difference with lincos when SVD is used
  Generic: incorporates distances with highly dissimilar distributions
Weighted linco: good option if LSI is used
Reference: F. Janssens, V. Tran Quoc, W. Glänzel, and B. De Moor. Integration of textual content and link information for accurate clustering of science fields. In Proceedings of the I International Conference on Multidisciplinary Information Sciences & Technologies (InSciT2006). Current Research in Information Sciences and Technologies, volume I, pages 615–619, Mérida, Spain, October 2006.

Slide 14/24: Dynamic hybrid mapping of bioinformatics
[Figure; surviving label: "Total: 7401".]

Slide 15/24: Number of clusters and LSI factors
[Figure]

Slide 16/24: Number of clusters: stability diagram
[Figure]

Slide 17/24: Number of clusters: link-based Silhouette values
[Figure]

Slide 18/24: Dendrogram
[Figure: dendrogram with nine labeled clusters.]
1. RNA structure prediction
2. Protein structure prediction
3. Systems biology & molecular networks
4. Phylogeny & evolution
5. Genome sequencing & assembly
6. Gene/promoter/motif prediction
7. Molecular DBs & annotation platforms
8. Multiple sequence alignment
9. Microarray analysis

Slide 21/24: Dynamics
[Figure]

Slide 22/24: Dynamic term networks
[Figure]

Slide 23/24: Conclusions
Main contributions
  Hybrid clustering (of bioinformatics)
  Clustering and classification significantly improved
  Generic: other application domains
Further research
  Fuzzy clustering
  Semi-supervised clustering and active learning
  Spectral clustering
  Other matrix decompositions (e.g., NMF)
  Multilinear (tensor) algebra
  Mapping the world’s total yearly publication output
  Detect emerging and converging clusters & hot topics
  Science-technology interaction
