1
What's in a word? Term-based approaches across bioinformatics, scientometrics and knowledge management
Patrick Glenisson
Bio-informatics group, Dept. of Electrical Engineering, K.U.Leuven, Belgium
Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium
2
2 Introduction
3
3 Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering
Bio-informatics research:
–clinical bioinformatics
–gene regulation bioinformatics
Research on algorithms and software development for:
–Text mining
–Gibbs sampling
–Graphical models
–Classification & clustering
4
4 Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering
Bio-informatics research: text mining research
–Combine statistical approaches with domain-specific requirements
–Knowledge discovery through literature analysis in various domains: Bio-informatics, Sciento- & Technometrics, Knowledge management
5
5 Overview
Bio-informatics:
–gene profiling
–multi-view learning
Scientific trend mapping:
–clustering and bibliometric indicators
Innovation & spillovers:
–tracing of persons in science & technology spaces
25’ 5-10’
6
6 Overview: overall approach
[Diagram: text mining goals (Information Retrieval, Information Extraction; generic, domain-specific, problem-specific) and text mining methodology (shallow statistics, shallow parsing, full NLP parsing); document analysis & extraction of tokens]
7
7 Case 1: Literature & biological data
8
8
9
9 protein
10
10 ‘Post-genome’ biology
Focus shift:
–from single gene to gene groups
–complex interactions within the cellular environment
Microarrays measure the simultaneous activity of many genes.
[Figure: gene expression measurement matrix, genes G1, G2, G3, ... by conditions C1, C2, C3, ..., with sample annotations and gene annotations]
11
11 Clustering, Interpretation
[Figure: expression data matrix, genes by conditions]
12
12 Integrated analysis
[Figure: expression data (genes by conditions) combined with PRIOR INFORMATION: gene expression databases, annotations and relations encoded as free text]
13
13 Hence, 2 views:
–Text analysis for interpretation (supportive role)
–Text analytics for ‘inference’ (active role)
14
14 A ‘historical’ quote: ‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001)
GeneRIF:
12133521 VEGF is associated with the development and prognosis of colorectal cancer.
12168088 PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
11866538 Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
GO: cell proliferation, heparin binding, growth factor activity
15
15 Increased awareness
Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
Structured vocabularies are on the rise:
–GO
–MeSH
–eVOC
Standards are systematically being adopted to store biological concepts or annotations:
–HUGO for gene names
–GOA
–…
16
16 (GOF) Vector space model
Document processing:
–Remove punctuation & grammatical structure (‘bag of words’)
–Define a vocabulary: identify multi-word terms (phrases, e.g., tumor suppressor); eliminate low-content words (stopwords, e.g., and, gene, ...); map words with the same meaning (synonyms); strip plurals, conjugations, ... (stemming)
–Define a weighting scheme and/or transformations (tf-idf, SVD, ...)
[Figure: gene index as a vector in the space spanned by vocabulary terms T1, T2, T3]
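The document-processing steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the talk: the stopword list, the phrase table, the crude plural-stripping rule (a stand-in for real stemming) and the toy gene texts are all made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical resources; a real system would use curated lists and a proper stemmer.
STOPWORDS = {"and", "the", "of", "in", "is", "a", "gene", "genes"}
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}

def tokenize(text):
    """Lowercase, drop punctuation, merge known phrases, remove stopwords, strip plurals."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    tokens = []
    for w in merged:
        if w in STOPWORDS:
            continue
        # Crude stand-in for stemming: strip a trailing plural "s".
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        tokens.append(w)
    return tokens

def tfidf_index(docs):
    """Map each document id to a sparse, length-normalised tf-idf vector (a dict)."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)
    n = len(docs)
    index = {}
    for doc_id, c in counts.items():
        vec = {t: tf * math.log(n / df[t]) for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        index[doc_id] = {t: v / norm for t, v in vec.items()}
    return index

def cosine(u, v):
    """Cosine similarity of two length-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy free-text annotations, one "document" per gene.
gene_texts = {
    "VEGF": "Growth factor that modulates angiogenesis in tumors.",
    "PTEN": "Tumor suppressor gene; regulates angiogenesis and growth factors.",
    "ACTB": "Actin is a structural protein of the cytoskeleton.",
}
index = tfidf_index(gene_texts)
print(round(cosine(index["VEGF"], index["PTEN"]), 3))
print(round(cosine(index["VEGF"], index["ACTB"]), 3))
```

In this toy index the two angiogenesis-related genes share weighted terms, while the structural gene shares none with VEGF, which is the behaviour a gene index in text space is meant to capture.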
17
17 Validity of gene index
Genes that are functionally related should be close in text space.
Text-based coherence score: modeled with respect to a background distribution obtained through random and permuted gene groups.
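One way to picture such a coherence score is sketched below. The slide does not give the exact statistic, so this example assumes the mean pairwise cosine similarity within a gene group and compares it with randomly drawn groups of the same size (permuted groups are not covered). The function names, the number of random draws and the toy index are hypothetical.

```python
import itertools
import math
import random

def cosine(u, v):
    """Cosine similarity of two sparse term vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(group, index):
    """Mean pairwise cosine similarity of the genes in `group` (needs >= 2 genes)."""
    pairs = list(itertools.combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def coherence_p_value(group, index, n_random=1000, seed=0):
    """Empirical p-value: share of random same-size groups at least as coherent."""
    rng = random.Random(seed)
    observed = coherence(group, index)
    genes = sorted(index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    return observed, (hits + 1) / (n_random + 1)

# Toy gene index: g1-g3 share angiogenesis-related terms, the rest do not.
index = {
    "g1": {"angiogenesis": 1.0, "growth": 0.8},
    "g2": {"angiogenesis": 0.9, "growth": 1.0},
    "g3": {"angiogenesis": 1.0, "vascular": 0.5},
    "g4": {"actin": 1.0, "cytoskeleton": 0.7},
    "g5": {"ribosome": 1.0, "translation": 0.9},
    "g6": {"kinase": 1.0, "phosphorylation": 0.6},
    "g7": {"histone": 1.0, "chromatin": 0.8},
    "g8": {"transport": 1.0, "membrane": 0.9},
}
score, p = coherence_p_value(["g1", "g2", "g3"], index)
print(score, p)
```

A functionally related group should score well above most random groups, giving a small empirical p-value.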
18
18 Validity of gene index Genes that are functionally related should be close in text space:
19
19 Validity of gene index Genes that are functionally related should be close in text space:
20
20 Define ‘optimal’? Optimal number of clusters?
Data-centered statistical scores:
–Coherence vs. separation of clusters
–Stability of a cluster solution when leaving out data
Text-based scoring
[Figure: clusters C1, C2, C3]
21
21 Define ‘optimal’? Optimal number of clusters?
Data-centered statistical scores
Knowledge-based scores:
–Enrichment of GO annotations in clusters
–Literature-based scoring
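Enrichment of a GO annotation in a cluster is commonly assessed with a hypergeometric tail probability. The slide does not say which test was used, so the sketch below is an assumption, and the gene counts are invented for illustration.

```python
from math import comb

def enrichment_p_value(n_genes, n_annotated, cluster_size, n_hits):
    """P(X >= n_hits) when X counts annotated genes in a random cluster.

    X follows a hypergeometric distribution: `cluster_size` genes drawn
    without replacement from `n_genes`, of which `n_annotated` carry the term.
    """
    total = comb(n_genes, cluster_size)
    upper = min(n_annotated, cluster_size)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, cluster_size - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 1000 genes, 40 carry the GO term,
# and a cluster of 50 genes contains 10 of them.
p = enrichment_p_value(1000, 40, 50, 10)
print(f"{p:.2e}")
```

Only about 2 annotated genes would be expected in such a cluster by chance, so 10 hits yields a very small p-value; repeating the test over many GO terms would call for a multiple-testing correction.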
22
22 Collaborative gene filtering
23
23 TXTGate
A platform that offers multiple ‘views’ on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.
It incorporates term-based indices and uses them as a starting point
–to explore the text through the eyes of different domain vocabularies
–to link out to other resources by query building, or
–to sub-cluster genes based on text.
24
24 Domain vocabularies as ‘views’: term-centric and gene-centric
25
25 Query building to external DB
26
26 Features of the approach
–Flexible tool for analyzing gene groups (~100 genes), thanks to various term- and gene-centric vocabularies
–… that allow some level of interoperability with external annotation databases
–Sub-clustering of gene groups is useful to detect biological sub-patterns
–Reasonably robust to corrupted groups
–Gene index normalizes for unbalanced references
27
27 Two views:
–Text analysis for interpretation (supportive role)
–Text analytics for ‘inference’ (active role)
28
28 Meta-clustering text & data
As multiple information sources are available when analyzing gene expression data, we pose the question: “How can we analyze data in an integrated fashion to extract more information than from the expression data alone?”
29
29 Mathematical integration
30
30 Integration of text & data
In each information space:
–Appropriate preprocessing
–Choice of distance measures
31
31 Combine data: a weight expresses the confidence attributed to either of the two data types. In the case of distances, it can be seen as a scaling constant between the norms of the data and text representations.
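The formula on this slide did not survive in the transcript. A minimal sketch, assuming the combination is a convex mix of the two distance matrices with a single weight, could look as follows; the function name, the weight parameter and the toy matrices are hypothetical.

```python
def combine_distances(d_expr, d_text, weight):
    """Convex combination of two distance matrices given as lists of lists.

    `weight` is the confidence placed in the expression data;
    the text data receives 1 - weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * e + (1.0 - weight) * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]

# Toy distances between two genes, on deliberately different scales.
d_expr = [[0.0, 2.0], [2.0, 0.0]]
d_text = [[0.0, 0.4], [0.4, 0.0]]
combined = combine_distances(d_expr, d_text, weight=0.5)
print(combined)
```

Because the expression distances here are five times larger than the text distances, an equal weight still lets expression dominate, which is why the weight also acts as a scaling constant between the two representations.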
32
32 Scaling problem
However, the distributions of the distances introduce a bias.
Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure).
[Figure: expression distance histogram; text distance histogram]
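The slide names an omnibus procedure without giving details. As a sketch under that caveat, the example below assumes Fisher's omnibus statistic: each distance is first turned into an empirical p-value within its own distribution, which removes the scale difference, and the two p-values are then combined. The helper names and the toy distances are hypothetical.

```python
import math
from bisect import bisect_right

def empirical_p_values(distances):
    """Map each distance to the fraction of distances that are <= it.

    Small distances get small p-values, so both sources end up on the
    same (0, 1] scale regardless of their original units.
    """
    ordered = sorted(distances)
    n = len(ordered)
    return [bisect_right(ordered, d) / n for d in distances]

def fisher_omnibus(p_expr, p_text):
    """Fisher's omnibus statistic -2 * (ln p1 + ln p2); larger means closer."""
    return [-2.0 * (math.log(a) + math.log(b)) for a, b in zip(p_expr, p_text)]

# Toy pairwise distances for the same five gene pairs, on very different scales.
expr_dist = [12.0, 3.5, 40.0, 8.0, 25.0]
text_dist = [0.30, 0.05, 0.90, 0.60, 0.45]

scores = fisher_omnibus(empirical_p_values(expr_dist), empirical_p_values(text_dist))
closest = max(range(len(scores)), key=scores.__getitem__)
print(closest, [round(s, 2) for s in scores])
```

A gene pair that is close in both spaces receives two small p-values and therefore the largest combined statistic, without either data type dominating because of its units.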
33
33 Optimal k?
[Figure: M-score for expression data only and M-score for integrated clustering, at various cutoffs k of the cluster tree]
34
34 A peek inside
35
35 A peek inside
[Figure: expression profile and text profile]
Strong reinforcement
36
36 Case 2: Sciento- & technometrics
37
37 Mapping of Science
Journal ‘Scientometrics’: full-text articles
–Document cluster analysis
–Co-word mapping
–Temporal dimension: clusters over time
38
38 Mapping of Science
Coupling with bibliometric indicators:
–Based on reference (hyperlink) information
–Mean reference age
–Number of serials
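As a small worked example of one such indicator, the mean reference age of a paper can be taken as the average difference between its publication year and the years of the works it cites. That definition is assumed here, and the years are invented.

```python
from statistics import mean

def mean_reference_age(publication_year, reference_years):
    """Average of (publication year - cited year) over a paper's reference list."""
    return mean(publication_year - y for y in reference_years)

# Hypothetical paper from 2004 citing four earlier works.
age = mean_reference_age(2004, [2003, 2001, 1998, 1990])
print(age)
```

Averaging this value over the documents of a cluster gives one number per cluster that can be set beside the text-based map.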
39
39 Domain studies in patent space
30 technology classes
[Figure: similarities to a ‘seed’ patent]
40
40 User profiling & Author-Inventor linkage
Name resolution:
–Same persons (variants, mistakes)
–Different persons (similar initials, or even full name)
Examples: Van Veldhoven / Veldhoven, Van / Wim Van Veldhoven / Walter Van Veldhoven / Wim Van Veldhoven / Vanveldhoven / Van Veldhoven
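A rule-based first pass at this name resolution can be sketched as below. It is an illustration only: the particle list is hypothetical and incomplete, and the compatibility rule (same surname key, non-conflicting first names) is an assumption rather than the method used in the talk.

```python
import re

PARTICLES = {"van", "de", "der", "den", "von"}  # hypothetical, incomplete list

def parse_name(name):
    """Return (surname_key, given_names) for a raw name string."""
    if "," in name:
        # Undo inversion: "Veldhoven, Van" -> "Van Veldhoven"
        last, rest = (part.strip() for part in name.split(",", 1))
        name = f"{rest} {last}"
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    # The surname starts at the first particle, else it is the last token.
    start = next((i for i, t in enumerate(tokens) if t in PARTICLES), len(tokens) - 1)
    return "".join(tokens[start:]), tokens[:start]

def compatible(a, b):
    """Judge whether two raw names could refer to the same person."""
    key_a, given_a = parse_name(a)
    key_b, given_b = parse_name(b)
    if key_a != key_b:
        return False
    if not given_a or not given_b:
        return True  # no given name available to contradict the match
    first_a, first_b = given_a[0], given_b[0]
    if len(first_a) == 1 or len(first_b) == 1:
        return first_a[0] == first_b[0]  # initial only: compare first letters
    return first_a == first_b

print(compatible("Van Veldhoven", "Veldhoven, Van"))
print(compatible("Vanveldhoven", "Van Veldhoven"))
print(compatible("Wim Van Veldhoven", "Walter Van Veldhoven"))
print(compatible("W. Van Veldhoven", "Walter Van Veldhoven"))
```

Spelling variants collapse to one surname key and distinct full first names are kept apart, but a bare initial remains ambiguous between two different persons, which is where content has to decide.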
41
41 Content-based name matching
Detect spillovers and entrepreneurial activities at (e.g.) university level.
Matching of ‘inventors’ & ‘authors’ is time-consuming, hence a semi-automated approach:
[Figure: Patent DB and Publication DB linked through relevance ranking]
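The relevance-ranking step can be pictured as follows: for a patent by a given inventor, candidate authors with a matching name are ordered by how similar their publications are to the patent text, and a human reviews the top of the list. The sketch assumes plain word-count vectors with cosine similarity; the candidate labels and texts are invented.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Word counts of a text, lowercased, letters only."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity of two Counters (missing terms count as 0)."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def rank_candidates(patent_text, candidate_publications):
    """Sort candidate authors by similarity of their publications to the patent text."""
    query = bag_of_words(patent_text)
    scored = [
        (cosine(query, bag_of_words(text)), author)
        for author, text in candidate_publications.items()
    ]
    return sorted(scored, reverse=True)

# Hypothetical inventor patent and two same-name author candidates.
patent = "Method for clustering gene expression microarray data"
candidates = {
    "author_A": "Clustering of microarray gene expression data using text",
    "author_B": "Medieval trade routes in Flanders",
}
ranking = rank_candidates(patent, candidates)
print(ranking)
```

The ranking only proposes matches; confirming that an inventor and an author are the same person stays a manual step in a semi-automated setup.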
42
42 Acknowledgements
Steunpunt O&O Statistieken: Debackere K, Glänzel W
ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D
ESAT / BioI: Moreau Y, De Moor B
43
43 Thanks ! ? ? CONTACT INFO: Patrick.glenisson@econ.kuleuven.be