
1 What's in a word? Term-based approaches across bioinformatics, scientometrics and knowledge management. Patrick Glenisson. Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium; Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium

2 Introduction

3 Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering. Bio-informatics research: clinical bioinformatics; gene regulation bioinformatics. Research on algorithms and software development for: text mining; Gibbs sampling; graphical models; classification & clustering

4 Introduction: K.U. Leuven, Faculty of Applied Sciences, Department of Electrical Engineering, Bio-informatics research. Text mining research: combine statistical approaches with domain-specific requirements. Knowledge discovery through literature analysis in various domains: bio-informatics; sciento- & technometrics; knowledge management

5 Overview. Bio-informatics: –gene profiling –multi-view learning. Scientific trend mapping: –clustering and bibliometric indicators. Innovation & spillovers: –tracing of persons in science & technology spaces. (Timing: 25’ and 5-10’)

6 Overview (diagram labels): Information Retrieval; Information Extraction; Full NLP parsing; Shallow Statistics; Shallow Parsing; Generic; Domain-specific; Problem specific; Document analysis & Extraction of tokens. Axes: → Text mining goals → Text mining methodology → Overall approach

7 Case 1: Literature & biological data

8

9 protein

10 ‘Post-genome’ biology → Focus shift: - from single gene to gene groups - complex interactions within the cellular environment → Microarrays measure the simultaneous activity of many genes. [Figure: gene expression measurement matrix of genes G1, G2, G3, ... by conditions C1, C2, C3, ..., with sample annotations and gene annotations]

11 Expression data (genes × conditions); Clustering; Interpretation

12 Expression data (genes × conditions); gene expression. Databases: annotations and relations encoded as free text. PRIOR INFORMATION. Integrated analysis

13 Hence, 2 views: text analysis for interpretation (supportive role); text analytics for ‘inference’ (active role)

14 A ‘historical’ quote: ‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001). GeneRIF: 12133521: VEGF is associated with the development and prognosis of colorectal cancer. 12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression. 11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex. GO: cell proliferation; heparin binding; growth factor activity

15 Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems. Structured vocabularies are on the rise: GO, MeSH, eVOC. Standards are systematically being adopted to store biological concepts or annotations: HUGO for gene names, GOA, … Increased awareness

16 (GOF) Vector space model. Document processing: –Remove punctuation & grammatical structure (‘Bag of words’) –Define a vocabulary: identify multi-word terms (phrases, e.g., tumor suppressor); eliminate low-content words (stopwords, e.g., and, gene, ...); map words with the same meaning (synonyms); strip plurals, conjugations, ... (stemming) –Define a weighting scheme and/or transformations (tf-idf, SVD, ...). [Figure: gene index over vocabulary terms T1, T2, T3]
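The document-processing steps on this slide can be sketched in a few lines. This is a minimal illustration only, not the index used in the talk: the stopword list, the phrase table, the crude plural-stripping rule and the function names `tokenize` and `tfidf_index` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Illustrative (assumed) resources; a real index would use curated lists.
STOPWORDS = {"a", "and", "the", "of", "in", "is", "by", "with", "gene"}
PHRASES = {"tumor suppressor": "tumor_suppressor"}


def tokenize(text):
    """Bag of words: lowercase, merge phrases, drop punctuation and stopwords."""
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    tokens = [t for t in re.findall(r"[a-z0-9_]+", text) if t not in STOPWORDS]
    # Crude stand-in for stemming: strip a plural/conjugation "s" from longer words.
    return [
        t[:-1]
        if len(t) > 3 and t.endswith("s") and not t.endswith(("ss", "is", "us"))
        else t
        for t in tokens
    ]


def tfidf_index(docs):
    """Return (vocabulary, one sparse tf-idf vector per document)."""
    bags = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*bags))
    df = Counter(term for bag in bags for term in bag)  # document frequency
    idf = {t: math.log(len(docs) / df[t]) for t in vocab}
    vectors = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    return vocab, vectors
```

Each returned vector is a dict mapping a vocabulary term to its weight, so a gene's text profile could be built by summing or averaging the vectors of the documents linked to that gene.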

17 Validity of gene index. Genes that are functionally related should be close in text space: → Text-based coherence score, modeled with respect to a background distribution obtained through random and permuted gene groups
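A coherence score with a random-group background could look like the sketch below. The slide does not give the exact score, so the choice of mean pairwise cosine similarity, the sampling scheme and the names `coherence` and `coherence_pvalue` are assumptions for illustration.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def coherence(group, index):
    """Mean pairwise cosine similarity of the text profiles in a gene group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)


def coherence_pvalue(group, index, n_random=1000, seed=0):
    """Compare a group's coherence with that of random groups of the same size."""
    rng = random.Random(seed)
    genes = sorted(index)
    observed = coherence(group, index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the empirical p-value strictly positive.
    return observed, (hits + 1) / (n_random + 1)
```

Here `index` maps a gene identifier to its term vector, and a small p-value indicates that the group is closer in text space than random groups of the same size.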

18 Validity of gene index. Genes that are functionally related should be close in text space

19 Validity of gene index. Genes that are functionally related should be close in text space

20 Define ‘optimal’? Optimal number of clusters? → Data-centered statistical scores: → coherence vs separation of clusters → stability of a cluster solution when leaving out data. Text-based scoring. [Figure: clusters C1, C2, C3]

21 Define ‘optimal’? Optimal number of clusters? → Data-centered statistical scores → Knowledge-based scores: → enrichment of GO annotations in clusters → literature-based scoring
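Enrichment of a GO annotation in a cluster is commonly tested with a hypergeometric tail probability. The slide does not say which test was used, so the sketch below is an assumption; the function name `enrichment_pvalue` and its parameters are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_genes, n_annotated, cluster_size, n_hits):
    """P(X >= n_hits) when a cluster is drawn at random from n_genes genes,
    of which n_annotated carry the GO term of interest."""
    upper = min(n_annotated, cluster_size)
    total = comb(n_genes, cluster_size)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, cluster_size - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total
```

For example, if 3 of 10 genes carry a term and a cluster of 3 genes contains all of them, the probability of that happening by chance is 1/120.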

22 Collaborative gene filtering

23 TXTGate: a platform that offers multiple ‘views’ on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications. It incorporates term-based indices and uses them as a starting point –to explore the text through the eyes of different domain vocabularies, –to link out to other resources by query building, or –to sub-cluster genes based on text.

24 Domain vocabularies as ‘views’: term-centric; gene-centric

25 Query building to external DB

26 Features of the approach. Flexible tool for analyzing gene groups (~100 genes) thanks to various term- and gene-centric vocabularies … that allow some level of interoperability with external annotation databases. Sub-clustering gene groups is useful to detect biological sub-patterns. Reasonably robust to corrupted groups. Gene index normalizes for unbalanced references

27 Text analysis for interpretation (supportive role); text analytics for ‘inference’ (active role)

28 Meta-clustering text & data. As multiple information sources are available when analyzing gene expression data, we pose the question: “How can we analyze data in an integrated fashion to extract more information than from the expression data alone?”

29 Mathematical integration

30 Integration of text & data. In each information space: –appropriate preprocessing –choice of distance measures

31 Combine data: a weight expresses the confidence attributed to either of the two data types; in the case of distances, we can see it as a scaling constant between the norms of the data and text representations.
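The slide's own formula did not survive in the transcript. One common way to write such a weighted combination, given here purely as an assumed form, uses $d_E$ and $d_T$ for the expression and text distances between genes $i$ and $j$ and a hypothetical confidence weight $\lambda$:

```latex
d_{\mathrm{comb}}(i,j) \;=\; \lambda \, d_E(i,j) \;+\; (1-\lambda)\, d_T(i,j),
\qquad 0 \le \lambda \le 1
```

Because $d_E$ and $d_T$ generally live on different scales, $\lambda$ in this form does double duty: it expresses confidence in one source and also rescales one representation against the other.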

32 However, the distributions of distances introduce a bias → scaling problem. Therefore, use a technique from statistical meta-analysis (the so-called omnibus procedure). [Figures: expression distance histogram; text distance histogram]
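To make the scaling problem concrete, the sketch below first maps each distance to its rank-based position in its own distribution and then combines the two values with Fisher's method, one standard omnibus procedure from meta-analysis. The slide does not state which omnibus variant was used, so Fisher's method and the function names here are assumptions.

```python
import math
from bisect import bisect_right


def empirical_pvalues(distances):
    """Map each distance to the fraction of observed distances that are <= it."""
    ordered = sorted(distances)
    n = len(ordered)
    return [bisect_right(ordered, d) / n for d in distances]


def fisher_combine(p_values):
    """Fisher's omnibus: -2 * sum(ln p) follows chi-square with 2k d.o.f."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the Fisher statistic
    # Closed-form chi-square survival function for an even number (2k) of d.o.f.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def integrated_distances(expr_dist, text_dist):
    """Combine two lists of pairwise distances given in the same pair order."""
    p_expr = empirical_pvalues(expr_dist)
    p_text = empirical_pvalues(text_dist)
    return [fisher_combine([pe, pt]) for pe, pt in zip(p_expr, p_text)]
```

The combined values lie in (0, 1] and depend only on ranks, so multiplying either input by a constant leaves the result unchanged, which is the property that removes the need for an explicit scaling constant.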

33 [Figure: M-score for expression data only vs M-score for integrated clustering, at various cutoffs k of the cluster tree] Optimal k?

34 A peek inside

35 A peek inside: expression profile; text profile; strong reinforcement

36 Case 2: Sciento- & technometrics

37 Mapping of Science. Journal ‘Scientometrics’: full-text articles; document cluster analysis; co-word mapping; temporal dimension: clusters over time

38 Mapping of Science. Coupling with bibliometric indicators: –based on reference (hyperlink) information –mean reference age –number of serials

39 Domain studies in patent space: 30 technology classes; ‘seed’ patent; similarities

40 User profiling & author-inventor linkage. Name resolution: –same persons (variants, mistakes) –different persons (similar initials, or even full name). Examples: Van Veldhoven / Veldhoven, Van; Wim Van Veldhoven / Walter Van Veldhoven; Wim Van Veldhoven; Vanveldhoven / Van Veldhoven
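A first step in name resolution is often to collapse spelling variants onto a shared key before any finer matching. The sketch below is one such normalization under stated assumptions: the particle list is illustrative and incomplete, the functions `surname_key` and `group_variants` are hypothetical, and names are assumed to be written either "First Particle Surname" or "Surname, Particle".

```python
import re
import unicodedata
from collections import defaultdict

PARTICLES = {"van", "de", "der", "den", "von"}  # illustrative, not exhaustive


def surname_key(raw):
    """Collapse a name variant to a lowercase surname key without spaces."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    if "," in text:
        # "Veldhoven, Van" -> "van veldhoven": move the part after the comma forward.
        last, rest = text.split(",", 1)
        text = rest + " " + last
    tokens = re.findall(r"[a-z]+", text)
    if not tokens:
        return ""
    # Keep the final token plus any particles directly before it; drop first names.
    key = [tokens[-1]]
    for tok in reversed(tokens[:-1]):
        if tok not in PARTICLES:
            break
        key.insert(0, tok)
    return "".join(key)


def group_variants(names):
    """Group raw name strings that share the same surname key."""
    groups = defaultdict(list)
    for name in names:
        groups[surname_key(name)].append(name)
    return dict(groups)
```

A key like this only handles the "same person, different spelling" case. Separating different persons who share a surname, such as Wim and Walter Van Veldhoven, needs additional evidence such as first names, affiliations or document content.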

41 Content-based name matching. Detect spillovers and entrepreneurial activities at (e.g.) university level. Matching of ‘inventors’ & ‘authors’ is time-consuming → semi-automated approach: Patent DB; Publication DB; relevance ranking
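The relevance-ranking step of such a semi-automated approach could be sketched as follows: candidate authors are ordered by how similar the vocabulary of their publications is to that of an inventor's patents, and a human then checks the top of the list. Raw term counts with cosine similarity, and the names `profile` and `rank_authors`, are assumptions for illustration and not the system described in the talk.

```python
import math
import re
from collections import Counter


def profile(texts):
    """Term-count profile over all texts attributed to one person."""
    return Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter profiles."""
    dot = sum(c * b[w] for w, c in a.items())  # Counter yields 0 for missing terms
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_authors(inventor_texts, author_texts):
    """Return (score, author) pairs, most similar candidate first."""
    query = profile(inventor_texts)
    scored = [
        (cosine(query, profile(texts)), author)
        for author, texts in author_texts.items()
    ]
    return sorted(scored, reverse=True)
```

In practice the candidates passed in as `author_texts` would already be restricted to authors whose names plausibly match the inventor, so the content score only has to order a short list.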

42 Acknowledgements. Steunpunt O&O Statistieken: Debackere K, Glänzel W. ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D. ESAT / BioI: Moreau Y, De Moor B

43 Thanks! Questions? Contact info: Patrick.glenisson@econ.kuleuven.be

