1 Overview of Text Mining Expertise @ SCD

2 Text Mining @ SCD: Introduction
• Text mining team @ SCD
  • Started around 2000
  • Currently 1 postdoc, 4 PhD students
• Tailored, generic text mining analysis
• Diverse application areas
• Several collaborations and projects
• Supported by more general SCD expertise in, among others, data mining, numerical linear algebra and optimization

3 Strategic mission
• To consolidate, deepen and extend SCD's text mining expertise
• By combining statistical approaches and domain-specific information
• To support knowledge discovery through literature analysis in various domains:
  • Bio-informatics
  • Knowledge management
  • Mapping of science and technology
  • Bibliometrics

4 Problem setting
• Given a set of documents,
• compute a representation, called an index,
• to retrieve, summarize, classify or cluster them
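The problem setting above can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: it assumes a crude regex tokenizer, TF-IDF weighting, and cosine similarity, none of which are specified on the slide. The function names (`tokenize`, `build_index`, `cosine`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters (deliberately crude)."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Return one TF-IDF weighted term vector (a dict) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for c in counts for term in c)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
            for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, docs, index):
    """Return the documents with non-zero similarity to the query, best first."""
    q = Counter(tokenize(query))
    scored = sorted(((cosine(q, vec), i) for i, vec in enumerate(index)),
                    reverse=True)
    return [docs[i] for score, i in scored if score > 0]
```

The same index vectors could feed classification or clustering; only retrieval is shown here.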

5 Problem setting - 2
[Diagram; labels grouped by the three axes named on the slide]
• Text mining goals: information retrieval, information extraction
• Text mining methodology: shallow statistics, shallow parsing, full NLP parsing
• Overall approach: generic, problem-specific, domain-specific
• Document analysis & extraction of tokens

6 Overview
• Bio-informatics
• Knowledge management
• Bibliometrics & scientometrics

8 Document-centered mining
• Given a set of documents,
• compute a representation, called an index,
• to retrieve, summarize, classify or cluster them

9 Gene-centered mining
• Given a set of genes (and their literature),
• compute a representation, called a gene index,
• to retrieve, summarize, classify or cluster them
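One plausible way to obtain a gene index from a document index is to average the term vectors of all documents linked to a gene. The sketch below assumes exactly that; the slide does not say how the aggregation is done, and the function name `gene_index` and the sample identifiers are hypothetical.

```python
from collections import Counter

def gene_index(doc_vectors, gene_to_docs):
    """Average the term vectors of every document linked to each gene.

    doc_vectors:  {doc_id: {term: weight}}
    gene_to_docs: {gene: [doc_id, ...]}
    """
    index = {}
    for gene, doc_ids in gene_to_docs.items():
        total = Counter()
        for doc_id in doc_ids:
            total.update(doc_vectors[doc_id])  # adds weights term by term
        index[gene] = ({t: w / len(doc_ids) for t, w in total.items()}
                       if doc_ids else {})
    return index

# Illustrative data only; identifiers are made up for the example.
doc_vectors = {"d1": {"apoptosis": 2.0, "kinase": 1.0},
               "d2": {"apoptosis": 1.0}}
gene_to_docs = {"geneA": ["d1", "d2"], "geneB": ["d2"]}
profiles = gene_index(doc_vectors, gene_to_docs)
```

The resulting gene vectors live in the same term space as the documents, so the same retrieval and clustering machinery applies to them.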

10 Patient-centered mining
• Given a set of patients (and their records),
• compute a representation, called a patient index,
• to retrieve or classify them
• and/or associate this information with genes

11 Functional genomics: gene profiling
• Profile documents, genes, … using vocabularies (bag-of-words approach)
• Tailored vocabularies reflect the 'knowledge' of a certain domain:
  + noise reduction (irrelevant words are left out)
  + direct link with other knowledge bases (e.g. Gene Ontology)
[Figure: a gene profiled against vocabulary terms T1, T2, T3]
(Bert Coessens)
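A vocabulary-restricted bag of words can be illustrated as follows. This is a sketch under simple assumptions: single-word vocabulary terms and a regex tokenizer. Real domain vocabularies such as Gene Ontology contain multi-word phrases, which this example does not handle. The function name `profile` and the sample vocabulary are hypothetical.

```python
import re
from collections import Counter

def profile(text, vocabulary):
    """Count only the tokens that belong to the tailored vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t in vocabulary)

# Illustrative vocabulary; a real one would be derived from a knowledge base.
vocabulary = {"apoptosis", "kinase", "transcription"}
abstract = "The kinase regulates apoptosis; the kinase is itself regulated."
gene_profile = profile(abstract, vocabulary)
```

Words outside the vocabulary ("the", "regulates", ...) never enter the profile, which is the noise-reduction effect named on the slide.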

12 Functional genomics - TXTGate
[Screenshot labels: distance matrix & clustering; other vocabulary]
(Bert Coessens; Steven Van Vooren)
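The "distance matrix & clustering" step can be sketched as below. The slide does not state which distance or clustering algorithm TXTGate uses, so this example assumes cosine distance and single-linkage agglomerative clustering with a distance threshold. All function names and gene labels are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / (norm_u * norm_v) if norm_u and norm_v else 1.0

def distance_matrix(profiles):
    """Return (sorted names, square matrix of pairwise distances)."""
    names = sorted(profiles)
    return names, [[cosine_distance(profiles[a], profiles[b]) for b in names]
                   for a in names]

def single_linkage(names, dist, threshold):
    """Merge the two closest clusters until the gap exceeds the threshold."""
    clusters = [{i} for i in range(len(names))]
    while len(clusters) > 1:
        # Distance between clusters = smallest distance between their members.
        d, x, y = min((min(dist[i][j] for i in a for j in b), x, y)
                      for x, a in enumerate(clusters)
                      for y, b in enumerate(clusters) if x < y)
        if d > threshold:
            break
        clusters[x] |= clusters[y]
        del clusters[y]
    return [sorted(names[i] for i in c) for c in clusters]

# Illustrative gene profiles over a small vocabulary.
profiles = {"g1": {"apoptosis": 1, "kinase": 1},
            "g2": {"apoptosis": 1, "kinase": 2},
            "g3": {"membrane": 1}}
names, dist = distance_matrix(profiles)
clusters = single_linkage(names, dist, threshold=0.5)
```

Here `g1` and `g2` share vocabulary terms and end up in one cluster, while `g3` stays on its own.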

13 Functional genomics - Networks from literature
• Gene networks
• Term networks
(Bert Coessens; Frizo Janssens)
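A simple way to derive a network from literature is co-occurrence: two genes (or terms) are linked each time they appear in the same document. The slide does not say how the networks are built, so the sketch below is only one possible construction; `cooccurrence_network` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(doc_to_items):
    """Return weighted edges {(a, b): number of documents mentioning both}."""
    edges = Counter()
    for items in doc_to_items.values():
        # Each unordered pair is counted once per document.
        for a, b in combinations(sorted(set(items)), 2):
            edges[(a, b)] += 1
    return edges

# Illustrative input: which genes are mentioned in which document.
doc_to_genes = {"d1": ["A", "B", "C"], "d2": ["B", "A"], "d3": ["C"]}
network = cooccurrence_network(doc_to_genes)
```

Passing a document-to-terms mapping instead yields a term network with the same function.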

14 Human genetics
• Collaboration with the Human Genetics Centre @ University Hospital KU Leuven
• Mining on the clinical profile and chromosomal footprint of patients (CGH microarrays)
• Knowledge discovery for genomic annotation
• Aiming at tools and standards for reporting, data entry and visualisation that support experts in exploring hypotheses linking phenotypes to genotypes and in inferring novel gene candidates
[Diagram labels: Data Analysis; Text Analysis; NLP; Ontologies]
(Steven Van Vooren)

15 Human genetics
• Knowledge discovery for genomic annotation
  • From µA-CGH profiles
  • From biomedical text
• Similarity measures for biomedical text
  • What: patient records, literature, genes, loci, clones
  • Why: retrieval, clustering, inference
  • Clustering similar patients, genes, loci, documents
  • Finding genes associated by patient records
• Extracting entities from text: gene name symbols, loci, diseases, phenotypes, clinical entities, karyotypes
• Text summarization: profiling of patients, genes, loci, clones, and clusters of these
(Steven Van Vooren)

16 Overview
• Bio-informatics
• Knowledge management
• Bibliometrics & scientometrics

17 McKnow Project
• Automated and user-oriented methods and algorithms for knowledge management
• Collaboration with the Center for Industrial Management, KUL
Clustering and classification are focal points, as is scalability, given the huge corpora now available. We incorporate user profiles, and as such regard both users and documents as points in a high-dimensional vector space. Furthermore, because environments are typically dynamic, care is taken that the methods used are easy to update.
(Dries Van Dromme; Frizo Janssens)
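Treating users as points in the same vector space as documents, with cheap updates, could look like the sketch below. It assumes a user profile is the running mean (centroid) of the vectors of the documents associated with that user; the slide does not specify how McKnow builds or updates profiles, and the class name `UserProfile` is hypothetical.

```python
class UserProfile:
    """A user represented as the centroid of the documents they are linked to."""

    def __init__(self):
        self.centroid = {}   # term -> mean weight over all documents seen
        self.n_docs = 0

    def add_document(self, doc_vector):
        """Update the centroid incrementally, without revisiting old documents."""
        self.n_docs += 1
        for term in set(self.centroid) | set(doc_vector):
            old = self.centroid.get(term, 0.0)
            new = doc_vector.get(term, 0.0)
            # Running-mean update: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
            self.centroid[term] = old + (new - old) / self.n_docs

# Illustrative usage with made-up term weights.
user = UserProfile()
user.add_document({"cad": 2.0})
user.add_document({"fem": 4.0})
```

Because each update touches only the stored centroid and the new document, the cost does not grow with the number of documents already seen, which matches the slide's emphasis on updatability.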

18 Case studies knowledge management
Dimensionality of clustered text-mining cases (documents x terms):
• sista papers: electronically available publications (ps, pdf), full text. 1,024 x 49,237
• De Standaard: full-text newspaper articles, many of them very short. 1,776 x 39,363, but much more data available
• kuleuven papers: electronically available papers from researchers in different departments (pdf, word, ...). 576 x 68,257 (fewer documents, broader spectrum)
• patent abstracts: international patent abstracts and titles. 16,488 x 21,019 (many more documents, denser spectrum)
• PMA papers: full-text publications of the K.U.Leuven dept. of Mechanics. 380 x 18,206
• Locuslink "known genes with proteins": gene documents from MEDLINE abstracts. 12,263 x 58,924
(Dries Van Dromme)

19 Overview
• Bio-informatics
• Knowledge management
• Bibliometrics & scientometrics

20 Scope
• Bibliometrics: the application of mathematical and statistical methods to books and other media of communication
• Scientometrics: the application of quantitative methods to the analysis of science viewed as an information process
• Patent analysis and mining: the analysis of patent information is considered one of the best established, directly available and historically reliable methods of quantifying the output of a science and technology system
• Collaboration with Steunpunt O&O Statistieken

21 Projects
1. Domain analysis
  • Mapping of the nanotechnology field from USPTO/EPO patents
  • Text-based clustering; identification of sub-domains
  • Comparison with IPC (International Patent Classification)
  • Comparison with FTC (Fraunhofer Technology Classification)
2. Science-technology mapping
  • Link scientific publications (WoS) and new technologies (patents)
  • Text-based clustering & analysis of citation network structure
  • Case study: Ljung
3. Trend detection
  • Assess trends & emerging fields from "change over time" in the structure and characterization of clusters & the citation network
(Dries Van Dromme; Frizo Janssens)
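As a rough illustration of detecting trends from "change over time" in cluster structure, the sketch below compares each cluster's share of documents between two years and flags clusters whose share grew by at least a chosen margin. This is an assumed, simplified criterion; the slide does not describe the actual method, and the function names, cluster labels and threshold are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_shares(records):
    """records: iterable of (year, cluster). Returns {year: {cluster: share}}."""
    per_year = defaultdict(Counter)
    for year, cluster in records:
        per_year[year][cluster] += 1
    return {year: {c: n / sum(counts.values()) for c, n in counts.items()}
            for year, counts in per_year.items()}

def emerging(records, first_year, last_year, min_growth=0.1):
    """Clusters whose share of documents grew by at least min_growth."""
    shares = cluster_shares(records)
    start = shares.get(first_year, {})
    end = shares.get(last_year, {})
    return sorted(c for c in end if end[c] - start.get(c, 0.0) >= min_growth)

# Illustrative records: (publication year, assigned cluster label).
records = [(2000, "nanotubes"), (2000, "lithography"),
           (2000, "lithography"), (2000, "lithography"),
           (2003, "nanotubes"), (2003, "nanotubes"),
           (2003, "nanotubes"), (2003, "lithography")]
growing = emerging(records, 2000, 2003)
```

A fuller analysis would also track changes in the citation network, which this example leaves out.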

22 Software
• Preprocessing & indexing
  • Lucene & TextPack
• Search engine and web services
  • TXTGate and McKnow

23 Publications: targeted submissions by Dec
• Bio-informatics (1-2)
  • (BMC) Bioinformatics, special issues, .. (BC)
  • More biological journals (BC, SVV)
• Knowledge management (1)
  • Scientometrics, SIAM DM
• Bibliometrics & scientometrics (1)
  • Case study
• Bioinformatics, Trends in..
• IEEE transactions, engineering, web mining journals
• SIAM DM
[Legend: high, moderate, fair impact]

24 Collaborations
• Formalized
  • GBOU-McKnow: partner CIB, led by Joost Duflou (Joris Vertommen, Dries Cleymans); User Committee (ICMS, Verhaert, LMS, TriSoft, WTCM)
  • IWT with Joris V (Steven: to be completed/corrected)
  • Steunpunt O&O Statistieken, INCENTIM: patent clustering and detection of emerging trends
• Informal
  • M-F Moens (SBO ?)
  • IBM - Bart VL
  • Gasthuisberg and Peter M: TXTGate as a 'course'
  • J&J

