What's in a word ? Term-based approaches across bioinformatics, scientometrics and knowledge management Patrick Glenisson Bio-informatics group Dept Electrical.

Presentation transcript:

What's in a word? Term-based approaches across bioinformatics, scientometrics and knowledge management. Patrick Glenisson. Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium; Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium

2 Introduction

3 Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering. Bio-informatics research: clinical bioinformatics; gene regulation bioinformatics. Research on algorithms and software development for: text mining; Gibbs sampling; graphical models; classification & clustering.

4 Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering. Bio-informatics research: text mining research. Combine statistical approaches with domain-specific requirements. Knowledge discovery through literature analysis in various domains: bio-informatics; sciento- & technometrics; knowledge management.

5 Overview. Bio-informatics: gene profiling; multi-view learning. Scientific trend mapping: clustering and bibliometric indicators. Innovation & spillovers: tracing of persons in science & technology spaces. (Timing marks on the slide: 25’, 5-10’)

6 Overview. [Diagram; axes: text mining goals, text mining methodology, overall approach. Labels: Information Retrieval; Information Extraction; shallow statistics; shallow parsing; full NLP parsing; generic; domain-specific; problem-specific; document analysis & extraction of tokens]

7 Case 1: Literature & biological data

8

9 protein

10 ‘Post-genome’ biology. Focus shift: from single gene to gene groups; complex interactions within the cellular environment. Microarrays measure the simultaneous activity: gene expression measurement. [Figure: expression matrix of genes G1, G2, G3, .. and conditions C1, C2, C3, .., with sample annotations and gene annotations]

11 Clustering; Interpretation. [Figure: expression data; axes: gene, conditions]

12 Integrated analysis. [Figure labels: expression data (axes: gene, conditions); gene expression; databases; annotations and relations encoded as free text; PRIOR INFORMATION]

13 Hence, 2 views: Text analysis for interpretation (supportive role) Text analytics for ‘inference’ (active role)

14 A ‘historical’ quote: ‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001). GeneRIF: ‘VEGF is associated with the development and prognosis of colorectal cancer’; ‘PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression’; ‘Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex’. GO: cell proliferation; heparin binding; growth factor activity.

15 Increased awareness. Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems. Structured vocabularies are on the rise: GO; MeSH; eVOC. Standards are systematically being adopted to store biological concepts or annotations: HUGO for gene names; GOA; …

16 (GOF) Vector space model. Document processing: (1) remove punctuation & grammatical structure (‘bag of words’); (2) define a vocabulary: identify multi-word terms, e.g. tumor suppressor (phrases), eliminate low-content words, e.g. and, gene, ... (stopwords), map words with the same meaning (synonyms), strip plurals, conjugations, ... (stemming); (3) define a weighting scheme and/or transformations (tf-idf, SVD, ..). [Figure: gene index over vocabulary terms T1, T2, T3]
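
The processing steps on this slide can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stopword list, the phrase table, the plural-stripping rule and the function names `tokenize` and `tfidf_index` are all hypothetical and are not taken from the talk; synonym mapping and SVD are left out.

```python
import math
import re
from collections import Counter

# Hypothetical stopword and phrase lists, for illustration only.
STOPWORDS = {"and", "the", "of", "in", "a", "is", "gene"}
PHRASES = {"tumor suppressor": "tumor_suppressor"}

def tokenize(text):
    """Lowercase, merge known phrases, drop punctuation, stem naively, remove stopwords."""
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    words = re.findall(r"[a-z0-9_]+", text)
    # Naive stemming (an assumption): strip one plural "s" from longer words.
    stemmed = [
        w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
        for w in words
    ]
    return [w for w in stemmed if w not in STOPWORDS]

def tfidf_index(docs):
    """Return (vocabulary, one tf-idf vector per document)."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in token_lists for t in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in token_lists for t in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

docs = [
    "TP53 is a tumor suppressor gene.",
    "Growth factors and cell proliferation",
    "Tumor suppressor genes control cell proliferation",
]
vocab, vectors = tfidf_index(docs)
```

A gene-level index could then be obtained by aggregating the vectors of all documents linked to a gene, for instance by averaging; that aggregation choice is likewise an assumption.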

17 Validity of gene index. Genes that are functionally related should be close in text space: text-based coherence score, modeled with respect to a background distribution of  through random and permuted gene groups.
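
One way to read this slide is as a permutation-style test: score a gene group by how close its members are in text space and compare that score with the scores of randomly drawn groups of the same size. The sketch below assumes mean pairwise cosine similarity as the coherence score and an empirical p-value as the comparison; both choices, and the names `coherence` and `empirical_p_value`, are illustrative assumptions rather than the score defined in the talk.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def coherence(index, genes):
    """Mean pairwise cosine similarity of a gene group (needs at least 2 genes)."""
    pairs = list(combinations(genes, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def empirical_p_value(index, genes, n_random=1000, seed=0):
    """Compare the group's coherence with that of random groups of equal size."""
    rng = random.Random(seed)
    observed = coherence(index, genes)
    all_genes = sorted(index)
    hits = sum(
        coherence(index, rng.sample(all_genes, len(genes))) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_random + 1)

# Toy gene index: three genes sharing vocabulary, three that do not.
index = {
    "a1": [1.0, 0.0, 0.0], "a2": [1.0, 0.1, 0.0], "a3": [0.9, 0.0, 0.1],
    "b1": [0.0, 1.0, 0.0], "b2": [0.0, 0.0, 1.0], "b3": [0.0, 1.0, 1.0],
}
observed, p_value = empirical_p_value(index, ["a1", "a2", "a3"])
```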

18 Validity of gene index Genes that are functionally related should be close in text space:

19 Validity of gene index Genes that are functionally related should be close in text space:

20 Optimal number of clusters? Define ‘optimal’? Data-centered statistical scores: coherence vs separation of clusters; stability of a cluster solution when leaving out data. [Figure: clusters C1, C2, C3; text-based scoring]
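
As an illustration of a data-centered score that weighs coherence against separation, the sketch below computes the mean silhouette width of a cluster solution. The silhouette is a standard example of this family of scores; the talk does not say which score was used, so treat it as a stand-in. The function name `mean_silhouette` and the toy data are assumptions.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_silhouette(points, labels):
    """Mean silhouette width; requires at least two clusters."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the point's own cluster (coherence).
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster (separation).
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = mean_silhouette(points, [0, 0, 1, 1])  # matches the visible structure
bad = mean_silhouette(points, [0, 1, 0, 1])   # mixes the two groups
```

Evaluating such a score at several cutoffs of a cluster tree and keeping the cutoff with the best value is one way to make "optimal number of clusters" concrete.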

21 Optimal number of clusters? Define ‘optimal’? Data-centered statistical scores. Knowledge-based scores: enrichment of GO annotations in clusters; literature-based scoring.
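
Enrichment of a GO annotation in a cluster is commonly tested with a one-sided hypergeometric test: given how many genes carry the annotation overall, how surprising is the number found in the cluster? The sketch below assumes that test; the talk does not specify which statistic was used, and the function name and argument names are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster, annotated_in_cluster):
    """P(X >= annotated_in_cluster) under the hypergeometric distribution.

    population: number of genes considered in total
    annotated: genes in the population carrying the GO term
    cluster: size of the cluster
    annotated_in_cluster: genes in the cluster carrying the GO term
    """
    upper = min(annotated, cluster)
    total = comb(population, cluster)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(annotated_in_cluster, upper + 1)
    )
    return tail / total

# Toy numbers: 10 genes, 5 annotated, a cluster of 5 containing all 5 annotated genes.
p = enrichment_p_value(10, 5, 5, 5)
```

When many GO terms and clusters are tested, the resulting p-values would normally need a multiple-testing correction.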

22 Collaborative gene filtering

23 TXTGate: a platform that offers multiple ‘views’ on the vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications. It incorporates term-based indices and uses them as a starting point: to explore the text through the eyes of different domain vocabularies; to link out to other resources by query building; or to sub-cluster genes based on text.

24 Term-centric Gene-centric Domain vocabularies as ‘views’

25 Query building to external DB

26 Features of the approach. Flexible tool for analyzing gene groups (~100 genes), thanks to various term- and gene-centric vocabularies … that allow some level of interoperability with external annotation databases. Sub-clustering gene groups is useful to detect biological sub-patterns. Reasonably robust to corrupted groups. Gene index normalizes for unbalanced references.

27 Text analysis for interpretation (supportive role) Text analytics for ‘inference’ (active role)

28 Meta-clustering text & data. As multiple information sources are available when analyzing gene expression data, we pose the question: “How can we analyze data in an integrated fashion to extract more information than from the expression data alone?”

29 Mathematical integration

30 Integration of text & data. In each information space: appropriate preprocessing; choice of distance measures.

31 Combine data: a weight expresses the confidence attributed to either of the two data types. In the case of distances, it can be seen as a scaling constant between the norms of the data and text representations.
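
The formula on this slide did not survive in the transcript. A plausible reading, offered here only as an assumption, is a convex combination of the two distance matrices with a single weight; `combine_distances` and `weight` are hypothetical names.

```python
def combine_distances(d_expr, d_text, weight):
    """Weighted sum of two distance matrices of equal shape.

    weight in [0, 1] is the confidence placed in the expression data;
    1 - weight is the confidence placed in the text data (assumed form).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * e + (1.0 - weight) * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]

# Expression distances on a scale of tens, text distances within [0, 1].
d_expr = [[0.0, 10.0], [10.0, 0.0]]
d_text = [[0.0, 0.5], [0.5, 0.0]]
combined = combine_distances(d_expr, d_text, 0.5)
```

With equal weights the combined value here is driven almost entirely by the expression distance, which shows why the weight also has to absorb the difference in scale between the two representations.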

32 However, the distributions of the distances introduce a bias: a scaling problem. Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure). [Figure: expression distance histogram; text distance histogram]
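
A sketch of how an omnibus procedure can remove the scaling problem: each distance is first replaced by its empirical cumulative probability within its own distribution, and the two probabilities are then combined. The combination rule below is Fisher's method, which treats the two sources as independent; whether the talk used exactly this variant is an assumption, as are the function names.

```python
import bisect
import math

def empirical_cdf_values(distances):
    """Map each distance to the fraction of distances <= it, a value in (0, 1]."""
    ordered = sorted(distances)
    n = len(ordered)
    return [bisect.bisect_right(ordered, d) / n for d in distances]

def fisher_combine(p1, p2):
    """Fisher's method for two p-values.

    The statistic -2 * ln(p1 * p2) follows a chi-square distribution with
    4 degrees of freedom, whose upper tail has this closed form.
    """
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

def omnibus_distances(expr_distances, text_distances):
    """Combine two flat lists of pairwise distances given in the same pair order."""
    p_expr = empirical_cdf_values(expr_distances)
    p_text = empirical_cdf_values(text_distances)
    return [fisher_combine(a, b) for a, b in zip(p_expr, p_text)]

expr = [1.0, 50.0, 100.0]  # arbitrary scale
text = [0.1, 0.9, 0.5]     # different scale
combined = omnibus_distances(expr, text)
rescaled = omnibus_distances([1000 * x for x in expr], text)
```

Because only ranks within each distribution matter, rescaling either input leaves the combined values unchanged; small combined values mark gene pairs that are unusually close in both spaces.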

33 Optimal k? [Figure: M-score for expression data only vs M-score for integrated clustering, at various cutoffs k of the cluster tree]

34 A peek inside

35 A peek inside. [Figure: expression profile; text profile] Strong reinforcement.

36 Case 2: Sciento- & technometrics

37 Mapping of Science. Journal ‘Scientometrics’: full-text articles. Document cluster analysis; co-word mapping. Temporal dimension: clusters over time.

38 Mapping of Science. Coupling with bibliometric indicators: based on reference (hyperlink) information; mean reference age; number of serials.

39 Domain studies in patent space. 30 technology classes; ‘seed’ patent; similarities.

40 User profiling & author-inventor linkage. Name resolution: same persons (variants, mistakes); different persons (similar initials, or even full name). Examples: Van Veldhoven; Veldhoven, Van; Wim Van Veldhoven; Walter Van Veldhoven; Wim Van Veldhoven; Vanveldhoven; Van Veldhoven.
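
The two halves of the name resolution problem can be illustrated with a small normalization sketch: surname variants should collapse to one key, while different first names should keep persons apart. The rules below (comma reordering, dropping spaces and case, treating an initial as compatible with any first name that starts with it) are simplifying assumptions, not the procedure used in the talk.

```python
import re

def surname_key(surname):
    """Collapse spelling variants such as 'Veldhoven, Van' and 'Vanveldhoven'."""
    if "," in surname:
        # 'Veldhoven, Van' -> 'Van Veldhoven'
        main, prefix = [part.strip() for part in surname.split(",", 1)]
        surname = f"{prefix} {main}"
    # Keep ASCII letters only; accented letters are dropped in this sketch.
    return re.sub(r"[^a-z]", "", surname.lower())

def first_names_compatible(a, b):
    """An initial is compatible with any name starting with it; full names must match."""
    a, b = a.strip(". ").lower(), b.strip(". ").lower()
    if not a or not b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b

def candidate_match(record_a, record_b):
    """Records are (first_name, surname) tuples; True means 'possibly the same person'."""
    return (
        surname_key(record_a[1]) == surname_key(record_b[1])
        and first_names_compatible(record_a[0], record_b[0])
    )

same = candidate_match(("Wim", "Van Veldhoven"), ("W.", "Vanveldhoven"))
different = candidate_match(("Wim", "Van Veldhoven"), ("Walter", "Van Veldhoven"))
```

A record with only the initial "W." stays compatible with both Wim and Walter, so string rules alone leave such cases ambiguous.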

41 Content-based name matching. Detect spillovers and entrepreneurial activities at (e.g.) university level. Matching of ‘inventors’ & ‘authors’ is time-consuming, hence a semi-automated approach. [Figure: patent DB; publication DB; relevance ranking]

42 Acknowledgements. Steunpunt O&O Statistieken: Debackere K, Glänzel W. ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D. ESAT / BioI: Moreau Y, De Moor B.

43 Thanks ! ? ? CONTACT INFO: