Overview of Text Mining SCD

Presentation transcript:

Overview of Text Mining SCD

Text SCD: Introduction
- Text mining at SCD
- Started around 2000
- Currently 1 postdoc, 4 PhD students
- Tailored, generic text mining analysis
- Diverse application areas
- Several collaborations and projects
- Supported by more general SCD expertise in, among others: data mining, numerical linear algebra, optimization

Strategic mission
- To consolidate, deepen and extend SCD's text mining expertise
- By combining statistical approaches and domain-specific information
- To support knowledge discovery through literature analysis in various domains:
  - Bio-informatics
  - Knowledge management
  - Mapping of science and technology
  - Bibliometrics

Problem setting
- Given a set of documents,
- compute a representation, called an index,
- to retrieve, summarize, classify or cluster them.
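The problem setting above can be sketched in a few lines. This is a minimal illustration, not the SCD implementation: it assumes a TF-IDF weighted bag-of-words index and cosine ranking, and the names `tokenize`, `build_index` and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase alphabetic tokens are a good enough unit.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Return (vectors, idf): one sparse TF-IDF dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, vectors, idf):
    """Rank document ids by cosine similarity to the query."""
    tf = Counter(tokenize(query))
    q = {t: tf[t] * idf[t] for t in tf if t in idf}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0.0]
    return [i for s, i in sorted(scored, key=lambda x: (-x[0], x[1]))]
```

The same vectors could feed a classifier or a clustering step; retrieval is shown only because it is the shortest of the four tasks to demonstrate.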

Problem setting (2)
[Diagram. Axes: text mining goals; text mining methodology; overall approach. Labels: information retrieval, information extraction; shallow statistics, shallow parsing, full NLP parsing; document analysis & extraction of tokens; generic, problem-specific, domain-specific.]

Overview
- Bio-informatics
- Knowledge management
- Bibliometrics & scientometrics

Document-centered mining
- Given a set of documents,
- compute a representation, called an index,
- to retrieve, summarize, classify or cluster them.

Gene-centered mining
- Given a set of genes (and their literature),
- compute a representation, called a gene index,
- to retrieve, summarize, classify or cluster them.
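One way to picture a gene index is as an aggregate of the document vectors linked to each gene. The sketch below assumes simple averaging; the slides do not say which aggregation is used, and `gene_index` and its arguments are hypothetical names.

```python
from collections import defaultdict

def gene_index(doc_vectors, gene_to_docs):
    """Average the sparse term vectors of the documents linked to each gene.

    doc_vectors:  {doc_id: {term: weight}}
    gene_to_docs: {gene_id: [doc_id, ...]}
    """
    index = {}
    for gene, doc_ids in gene_to_docs.items():
        summed = defaultdict(float)
        for doc_id in doc_ids:
            for term, weight in doc_vectors[doc_id].items():
                summed[term] += weight
        k = len(doc_ids)
        # Assumption: a gene without literature gets an empty profile.
        index[gene] = {t: w / k for t, w in summed.items()} if k else {}
    return index
```

Because a gene profile lives in the same term space as a document vector, the retrieval, classification and clustering steps that work on documents carry over unchanged.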

Patient-centered mining
- Given a set of patients (and their records),
- compute a representation, called a patient index,
- to retrieve or classify them, and/or associate this information with genes.

Functional genomics: gene profiling
- Profile documents, genes, ... using vocabularies (bag-of-words approach)
- Tailored vocabularies reflect the 'knowledge' of a certain domain:
  + noise reduction (i.e. irrelevant words)
  + direct link with other knowledge bases (e.g. Gene Ontology)
[Diagram: a gene profiled against vocabulary terms T1, T2, T3]
Bert Coessens
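Profiling against a tailored vocabulary amounts to counting only the terms that the vocabulary contains. The sketch below assumes single-word vocabulary entries (real vocabularies such as Gene Ontology contain multi-word terms, which would need phrase matching); `profile` is a hypothetical name.

```python
import re
from collections import Counter

def profile(text, vocabulary):
    """Bag-of-words profile restricted to a domain vocabulary.

    Words outside the vocabulary are dropped, which is the noise
    reduction the slide refers to.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(t for t in tokens if t in vocabulary))
```

Swapping in a different vocabulary yields a different view of the same text without touching the rest of the pipeline.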

Functional genomics: TXTGate
[Figure: TXTGate. Labels: distance matrix & clustering; other vocabulary]
Bert Coessens; Steven Van Vooren
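The "distance matrix & clustering" step can be illustrated as follows. This is a sketch under stated assumptions, not TXTGate's actual algorithm: it uses cosine distance between sparse term profiles and a single-linkage cut at a fixed threshold, and all function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 1.0  # assumption: empty profiles are maximally distant
    return 1.0 - dot / (na * nb)

def distance_matrix(profiles):
    n = len(profiles)
    return [[cosine_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

def single_linkage(dist, threshold):
    """Group items linked by a chain of distances below the threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Recomputing the profiles with another vocabulary and rerunning these two functions gives the alternative clustering that the "other vocabulary" label suggests.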

Functional genomics: networks from literature
- gene networks
- term networks
Bert Coessens; Frizo Janssens
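The slides do not say how the networks are derived. A common baseline, assumed here purely for illustration, is co-occurrence: two genes (or terms) are linked when they are mentioned in the same document. `cooccurrence_network` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(doc_entities, min_count=1):
    """Build weighted edges from entities mentioned in the same document.

    doc_entities: one list of gene symbols (or terms) per document.
    Returns {(a, b): count} with a < b, keeping edges seen at least
    min_count times.
    """
    edges = Counter()
    for entities in doc_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return {pair: c for pair, c in edges.items() if c >= min_count}
```

Passing term lists instead of gene lists gives a term network from the same function.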

Human genetics
- Collaboration with Human Genetics, University Hospital KU Leuven
- Mining on clinical profile and chromosomal footprint of patients (CGH microarrays)
- Knowledge discovery for genomic annotation
- Aiming at tools and standards for reporting, data entry and visualisation that support experts in exploring hypotheses linking phenotypes to genotypes and in inferring novel gene candidates
[Diagram labels: data analysis; text analysis; NLP; ontologies]
Steven Van Vooren

Human genetics
- Knowledge discovery for genomic annotation
  - From µA-CGH profiles
  - From biomedical text
- Similarity measures for biomedical text
  - What: patient records, literature, genes, loci, clones
  - Why: retrieval, clustering, inference
  - Clustering similar patients, genes, loci, documents
  - Finding genes associated by patient records
- Extracting entities from text
  - Gene name symbols, loci, diseases, phenotypes, clinical entities, karyotypes
- Text summarization
  - Profiling of patients, genes, loci, clones, and clusters of these
Steven Van Vooren

Overview
- Bio-informatics
- Knowledge management
- Bibliometrics & scientometrics

McKnow project
- Automated and user-oriented methods and algorithms for knowledge management
- Collaboration with Center for Industrial Management, KUL
Clustering and classification are focal points, as is scalability, given the huge corpora available nowadays. We incorporate user profiles, and thus regard both users and documents as points in a high-dimensional vector space. Furthermore, because environments are typically dynamic, care is taken that the methods used are easy to update.
Dries Van Dromme; Frizo Janssens
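To make the "users as points in the same vector space" and "easily updatable" ideas concrete, here is a minimal sketch. It assumes a user profile is the running mean of the term vectors of the documents that user has worked with; this is an illustrative choice, not the McKnow method, and `UserProfile` is a hypothetical class.

```python
class UserProfile:
    """User represented as the centroid of the documents they used."""

    def __init__(self):
        self.n = 0          # number of documents seen so far
        self.centroid = {}  # sparse {term: mean weight}

    def add_document(self, vec):
        # Incremental mean: no need to revisit earlier documents,
        # so the profile stays cheap to update as new data arrives.
        self.n += 1
        for term in set(self.centroid) | set(vec):
            old = self.centroid.get(term, 0.0)
            self.centroid[term] = old + (vec.get(term, 0.0) - old) / self.n
```

Since the centroid uses the same term space as the documents, any document-to-document similarity measure can also compare a user with a document or with another user.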

Case studies: knowledge management
Dimensionality of clustered text-mining cases:
- sista papers: electronically available publications (ps, pdf), full text. 1024 x
- De Standaard: full-text newspaper articles, many of them very short. 1776 x (but much more data available)
- kuleuven papers: electronically available papers pertaining to researchers from different departments (pdf, word, ...). 576 x (fewer documents, broader spectrum)
- patent abstracts: international patent abstracts and titles. x (many more documents, denser spectrum)
- PMA papers: full-text publications of the K.U.Leuven dept. of Mechanics. 380 x
- LocusLink "known genes with proteins": gene documents from MEDLINE abstracts. x
Dries Van Dromme

Overview
- Bio-informatics
- Knowledge management
- Bibliometrics & scientometrics

Scope
- Bibliometrics: the application of mathematical and statistical methods to books and other media of communication
- Scientometrics: the application of quantitative methods to the analysis of science viewed as an information process
- Patent analysis and mining: the analysis of patent information is considered one of the best established, directly available and historically reliable methods of quantifying the output of a science and technology system
- Collaboration with Steunpunt O&O Statistieken

Projects
1. Domain analysis
   - Mapping of the nanotechnology field from USPTO/EPO patents
   - Text-based clustering; identification of sub-domains
   - Comparison with IPC (International Patent Classification)
   - Comparison with FTC (Fraunhofer Technology Classification)
2. Science-technology mapping
   - Link scientific publications (WoS) and new technologies (patents)
   - Text-based clustering and analysis of citation network structure
   - Case study: Ljung
3. Trend detection
   - Assess trends and emerging fields from "change over time" in the structure and characterization of clusters and the citation network
Dries Van Dromme; Frizo Janssens
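The trend-detection idea (project 3) can be illustrated with a very small sketch. It assumes that each document already carries a year and a cluster label, and it flags a cluster as emerging when its share of the yearly output grows by a chosen margin. The threshold, the share-based measure and the function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def cluster_shares(assignments):
    """assignments: iterable of (year, cluster). Returns {year: {cluster: share}}."""
    per_year = defaultdict(Counter)
    for year, cluster in assignments:
        per_year[year][cluster] += 1
    shares = {}
    for year, counts in per_year.items():
        total = sum(counts.values())
        shares[year] = {c: k / total for c, k in counts.items()}
    return shares

def emerging(shares, start, end, min_gain=0.1):
    """Clusters whose share grew by at least min_gain between two years."""
    gains = {}
    for cluster in set(shares[start]) | set(shares[end]):
        gain = shares[end].get(cluster, 0.0) - shares[start].get(cluster, 0.0)
        if gain >= min_gain:
            gains[cluster] = gain
    return sorted(gains, key=gains.get, reverse=True)
```

A fuller analysis would also track changes in cluster vocabulary and in the citation network, as the slide indicates; cluster size is only the simplest signal of change over time.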

Software
- Preprocessing & indexing
  - Lucene & TextPack
- Search engine and web services
  - TXTGate and McKnow

Publications: targeted submissions by Dec
- Bio-informatics (1-2)
  - (BMC) Bioinformatics, special issues, ... (BC)
  - More biological journals (BC, SVV)
- Knowledge management (1)
  - Scientometrics, SIAM DM
- Bibliometrics & scientometrics (1)
  - Case study
Candidate venues:
- Bioinformatics, Trends in ..
- IEEE transactions, engineering, web mining journals
- SIAM DM
Legend: high, moderate, fair impact

Collaborations
- Formalized
  - GBOU-McKnow: partner CIB, led by Joost Duflou (Joris Vertommen, Dries Cleymans); User Committee (ICMS, Verhaert, LMS, TriSoft, WTCM)
  - IWT with Joris V (Steven: to complete/correct)
  - Steunpunt O&O Statistieken, INCENTIM: patent clustering and detection of emerging trends
- Informal
  - M-F Moens (SBO ?)
  - IBM: Bart VL
  - Gasthuisberg and Peter M: TXTGate as a 'course'
  - J&J