Copyright © 2005 by Limsoon Wong Building Gene Networks by Information Extraction, Cleansing, & Integration Limsoon Wong Institute for Infocomm Research.

Slides:



Advertisements
Similar presentations
Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1.
Advertisements

Copyright © 2005 by Limsoon Wong Building Gene Networks by Information Extraction, Cleansing, & Integration Limsoon Wong Institute for Infocomm Research.
Annotation standards in ORegAnno (Draft) Obi Griffith The RegCreative Jamboree Nov 29, 2006 Ghent, Belgium.
Mining External Resources for Biomedical IE Why, How, What Malvina Nissim
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and.
LESSONS FROM THE BIOCREATIVE PROTEIN- PROTEIN INTERACTION (PPI) TASK RegCreative Jamboree, Friday, December, 1st, (2006) MARTIN KRALLINGER, 2006 LESSONS.
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
Bioinformatics for biomedicine Summary and conclusions. Further analysis of a favorite gene Lecture 8, Per Kraulis
Interoperation of Molecular Biology Databases Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International Menlo Park, CA
Bioinformatics: a Multidisciplinary Challenge Ron Y. Pinter Dept. of Computer Science Technion March 12, 2003.
The Cell, Central Dogma and Human Genome Project.
Use of Ontologies in the Life Sciences: BioPax Graciela Gonzalez, PhD (some slides adapted from presentations available at
IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.
August 29, 2002InforMax Confidential1 Vector PathBlazer Product Overview.
QuASI: Question Answering using Statistics, Semantics, and Inference Marti Hearst, Jerry Feldman, Chris Manning, Srini Narayanan Univ. of California-Berkeley.
Bioinformatics Student host Chris Johnston Speaker Dr Kate McCain.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall, Ph.D. Senior InterPro Curator Protein Sequence Database:
Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.
Mining the Medical Literature Chirag Bhatt October 14 th, 2004.
1 iProLINK: An integrated protein resource for literature mining and literature-based curation 1. Bibliography mapping - UniProt mapped citations 2. Annotation.
A Tool for Supporting Integration Across Multiple Flat-File Datasets Xuan Zhang, Gagan Agrawal Ohio State University.
On line (DNA and amino acid) Sequence Information
AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.
Bioinformatics.
Knowledge Discovery in Biomedicine Limsoon Wong Institute for Infocomm Research.
BLAST What it does and what it means Steven Slater Adapted from pt.
Machine-learning in building bioinformatics databases for infectious diseases Victor Tong Institute for Infocomm Research A*STAR, Singapore ASEAN-China.
Information Resources for Bioinformatics 1 MARC: Developing Bioinformatics Programs July, 2008 Alex Ropelewski Hugh Nicholas
© Wiley Publishing All Rights Reserved. Protein and Specialized Sequence Databases.
NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015.
Accomplishments and Challenges in Literature Data Mining for Biology L. Hirschman et al. Presented by Jing Jiang CS491CXZ Spring, 2004.
Outline Quick review of GS Current problems with GS Our solutions Future work Discussion …
Biological Databases By : Lim Yun Ping E mail :
Finish up array applications Move on to proteomics Protein microarrays.
Part I: Identifying sequences with … Speaker : S. Gaj Date
Function first: a powerful approach to post-genomic drug discovery Stephen F. Betz, Susan M. Baxter and Jacquelyn S. Fetrow GeneFormatics Presented by.
Knowledge Discovery from Biological and Clinical Data: BASIC BACKGROUND.
Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.
Relevance Detection Approach to Gene Annotation Aid to automatic annotation of databases Annotation flow –Extraction of molecular function of a gene from.
Organizing information in the post-genomic era The rise of bioinformatics.
Biological Databases Biology outside the lab. Why do we need Bioinfomatics? Over the past few decades, major advances in the field of molecular biology,
Ontologies GO Workshop 3-6 August Ontologies  What are ontologies?  Why use ontologies?  Open Biological Ontologies (OBO), National Center for.
Playing Biology ’ s Name Game: Identifying Protein Names In Scientific Text Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer Pac Symp.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching.
Biological Signal Detection for Protein Function Prediction Investigators: Yang Dai Prime Grant Support: NSF Problem Statement and Motivation Technical.
Limsoon Wong Laboratories for Information Technology Singapore From Informatics to Bioinformatics.
Labeling and Enhancing Life Science Links S. Heymann*, F. Naumann*, L. Raschid +, P. Rieger * * Humboldt Universität zu Berlin + University of Maryland.
MedKAT Medical Knowledge Analysis Tool December 2009.
Bioinformatics and Computational Biology
Exploring and Exploiting the Biological Maze Zoé Lacroix Arizona State University.
Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
Copyright © 2005 by Limsoon Wong Some Interesting Issues in Constructing Gene/Protein Networks Limsoon Wong Institute for Infocomm Research Singapore.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
1 of 28 Evaluating Genes and Transcripts (“Genebuild”)
Finding genes in the genome
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
GENBANK FILE FORMAT LOCUS –LOCUS NAME Is usually the first letter of the genus and species name, followed by the accession number –SEQUENCE LENGTH Number.
PROTEIN INTERACTION NETWORK – INFERENCE TOOL DIVYA RAO CANDIDATE FOR MASTER OF SCIENCE IN BIOINFORMATICS ADVISOR: Dr. FILIPPO MENCZER CAPSTONE PROJECT.
Literature Mining and Database Annotation of Protein Phosphorylation Using a Rule-based System Z. Z. Hu 1, M. Narayanaswamy 2, K. E. Ravikumar 2, K. Vijay-Shanker.
1 Survey of Biodata Analysis from a Data Mining Perspective Peter Bajcsy Jiawei Han Lei Liu Jiong Yang.
Introduction to PubChem BioAssay
Language Identification and Part-of-Speech Tagging
Concept Grounding to Multiple Knowledge Bases via Indirect Supervision
CRF &SVM in Medication Extraction
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

Copyright © 2005 by Limsoon Wong Building Gene Networks by Information Extraction, Cleansing, & Integration Limsoon Wong Institute for Infocomm Research Singapore

Copyright © 2005 by Limsoon Wong. Plan Motivation for study of gene networks Example Efforts at I 2 R –Disease Pathweaver –Dragon ERG Solution Technical Challenges Involved –Name entity recognition –Co-reference resolution –Protein interaction extraction –Database cleansing

Copyright © 2005 by Limsoon Wong Motivation

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhou Zhang, Suisheng Tang, & See-Kiong Ng Some Useful Information Sources Disease-centric resources –OMIM [NCBI OMIM, 2004]NCBI OMIM, 2004 –Gene2Disease [Perez-Iratxeta et al., Nature, 2002]Perez-Iratxeta et al., Nature, 2002 –MedGene [Hu et al., J. Proteome Res., 2003]Hu et al., J. Proteome Res., 2003 Emphasized direct gene- disease relationships –Provide lists of disease- related genes –Do not provide info on gene-gene interactions & their networks Related interaction resources –KEGG [Kanehisa, NAR, 2000]Kanehisa, NAR, 2000 Manually constructed protein interaction networks Mostly metabolic pathways and few disease pathways –Only 7 disease pathways

Monogenic  Heterogenic Why Gene Network? Many common diseases are –not caused by a genetic variation within a single gene But are influenced by –complex interactions among multiple genes –environmental & lifestyle factors Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhuo Zhang, Suisheng Tang, & See-Kiong Ng

Copyright © 2005 by Limsoon Wong. Desired Outcome of Gene Network Study Help scientists understand the mechanism of complex diseases by –Greatly reducing work load for primary study of genetic diseases, broaden the scope of molecular studies –Easily identifying key players in the gene network, help in finding potential drug targets Scalable framework –Extend to many genetic diseases –Include other resources of gene interactions Adapted w/ permission from Zhou Zhang, Suisheng Tang, & See-Kiong Ng

Copyright © 2005 by Limsoon Wong Some Gene Network Study Efforts at I 2 R Disease Pathweaver Dragon ERG Solution

Copyright © 2005 by Limsoon Wong Some Gene Network Study Efforts at I 2 R Disease Pathweaver Dragon ERG Solution

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhou Zhang, Suisheng Tang, & See-Kiong Ng Disease Pathweaver, Zhang et al. APBC 2005 Automatically constructing disease pathways –Identify core genes –Mine info on core genes –Construct interaction network betw core genes Data sources: –Online literature –High-thru’put biological data

Copyright © 2005 by Limsoon Wong. Disease Pathweaver: The Portal Portal for human nervous diseases gene networks – Statistics –37 Human Nervous System Disorders –7 ~ 60 core genes per disease –2 ~ 320 core interactions per disease Adapted w/ permission from Zhou Zhang, Suisheng Tang, & See-Kiong Ng

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhuo Zhang, Suisheng Tang, & See-Kiong Ng Disease Pathweaver: A Tour

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhuo Zhang, Suisheng Tang, & See-Kiong Ng Disease Pathweaver: A Tour

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhuo Zhang, Suisheng Tang, & See-Kiong Ng Disease Pathweaver: A Tour

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhuo Zhang, Suisheng Tang, & See-Kiong Ng Disease Pathweaver: A Tour

Disease Pathweaver: DPW vs KEGG on Huntington Disease Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhuo Zhang, Suisheng Tang, & See-Kiong Ng

Copyright © 2005 by Limsoon Wong Some Gene Network Study Efforts at I 2 R Disease Pathweaver Dragon ERG Solution

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Suisheng Tang & Vlad Bajic Estrogen-Responsive Genes Why –Affects human physiology in many aspects –Related to many diseases –Widely used in clinic Challenges –Multiple pathways –Difficult to predict ERE –Many estrogen-responsive genes but only a few are well-studied –Difficult to keep up w/ speed of knowledge accumulation  Needed –Tools to predict ERE & estrogen-responsive genes –Database of useful info –Systems to predict impt regulatory units, associate gene functions, & generate global view of gene network

Copyright © 2005 by Limsoon Wong. Dragon ERG Solution ERE dependent E2 dependent ERs bind to other TFs Membrane receptors E2 independent ERE independent Coming soon! ERE Finder ERG Finder ERGDB ERG Explorer Adapted w/ permission from Suisheng Tang & Vlad Bajic text mining here!

 Predict functional ERE in genomic DNA  One prediction in 13.3k bp  Allow further analysis Dragon ERG Solution: Dragon ERE Finder, Bajic et al, NAR, 2003 Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Suisheng Tang & Vlad Bajic

 Only for human genome  Using 117 bp ERE frame  Evaluated by PubMed & microarray data Dragon ERG Solution: Dragon Estrogen-Responsive Gene Finder, Tang et al, NAR, 2004a Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Suisheng Tang & Vlad Bajic

 Contains >1000 genes  Manually curated  Basic gene info  Experimental evidence  Full set of references  ERE sites annotated Dragon ERG Solution: Dragon Estrogen-Responsive Genes Database, Tang et al, NAR, 2004b Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Suisheng Tang & Vlad Bajic

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Suisheng Tang & Vlad Bajic Dragon ERG Solution: DEERGF

Copyright © 2005 by Limsoon Wong. Dragon ERG Solution: Case-Specific TF relation networks, Pan et al, NAR, 2004 Analyse abstracts Stemming, POS tagging Use ANNs, SVM, discriminant analysis Simplified rules for sentence analysis Constraints on the forms of sentences Sensitivity ~75% Precision ~82% Produce reports & direct links to PubMed docs, & graphical presentations of entity links Adapted w/ permission from Vlad Bajic

Copyright © 2005 by Limsoon Wong Technical Challenges Named entity recognition Co-reference resolution Protein interaction extraction Data cleansing

Copyright © 2005 by Limsoon Wong Technical Challenges Named entity recognition Co-reference resolution Protein interaction extraction Data cleansing

Bio Entity Name Recognition, Zhou et al., BioCreAtIvE, 2004 Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Jian Su

Copyright © 2005 by Limsoon Wong. Bio Entity Name Recognition: Ensemble Classification Approach Features considered –orthographic, POS, morphologic, surface word, trigger words (TW 1 : receptor, enhancer, etc. TW 2 : activation, stimulation, etc.) SVM –Context of 7 words –Each word gives 5 features, plus its position HMMs –3 features used (orthographic, POS, surface word) –HMM 1 & HMM 2 use POS taggers trained on diff corpora HMM 1 balanced precision & recall HMM 2 low precision & high recall SVM high precision & low recall Majority Voting Orthographic features: capital letters, numeric characters, & their combinations Morphologic features: prefix, suffix, etc.

Bio Entity Name Recognition: Performance at BioCreAtIvE 2004 Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Zhou et al., BioCreAtIve 2004

Copyright © 2005 by Limsoon Wong Technical Challenges Named entity recognition Co-reference resolution Protein interaction extraction Data cleansing

Co-Reference Resolution, Yang et al., IJCNLP, 2004 Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Yang et al., IJCNLP 2004

Co-Reference Resolution: Baseline Features Used Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Yang et al., IJCNLP 2004

Co-Reference Resolution: New Features Used & Performance Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Yang et al., IJCNLP 2004 Base Classifier: C5.0

Copyright © 2005 by Limsoon Wong Technical Challenges Named entity recognition Co-reference resolution Protein interaction extraction Data cleansing

The max entropy model: where –o is the outcome –h is the feature vector –Z(h) is normalization function –f j are feature functions –  j are feature weights Protein Interaction Extraction, Xiao et al, submitted Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Xiao et al., IJCNLP 2004

Copyright © 2005 by Limsoon Wong. Protein Interaction Extraction: Features Used Adapted w/ permission from Xiao et al.

Copyright © 2005 by Limsoon Wong. Protein Interaction Extraction: Performance on IEPA Corpus Adapted w/ permission from Xiao et al.

Copyright © 2005 by Limsoon Wong Technical Challenges Named entity recognition Co-reference resolution Protein interaction extraction Data cleansing

Copyright © 2005 by Limsoon Wong. Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic Data Cleansing, Koh et al, DBiDB, types & 28 subtypes of data artifacts –Critical artifacts (vector contaminated sequences, duplicates, sequence structure violations) –Non-critical artifacts (misspellings, synonyms) > 20,000 seq records in public contain artifacts Identification of these artifacts are impt for accurate knowledge discovery Sources of artifacts –Diverse sources of data Repeated submissions of seqs to db’s Cross-updating of db’s –Data Annotation Db’s have diff ways for data annotation Data entry errors can be introduced Different interpretations –Lack of standardized nomenclature Variations in naming Synonyms, homonyms, & abbrevn –Inadequacy of data quality control mechanisms Systematic approaches to data cleaning are lacking

Data Cleansing: A Classification of Errors Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

RECORD SINGLE SOURCE DATABASE Invalid values Ambiguity Incompatible schema ATTRIBUTE Spelling errors Format violation Annotation error Dubious sequences Sequence redundancy Data Provenance flaws Cross- annotation error Sequence structure violation Usually typo errors Occurs in different fields of the record We identified 569 possible misspelled words affecting up to 20,505 nucleotide records in Entrez. Vector contaminated sequence Erroneous data transformation MULTIPLE SOURCE DATABASE Data Cleansing: Example Spelling Errors Context of the misspellingsCorrectionsMisspellings EMBL:Y18050 E.faecium pbp5 gene TITLE Modification of penicillin-binding protein 5 asociated with high level ampicillin resistance in Enterococcus faecium gi| |emb|X |EFPBP5G associatedasociated Swiss-Prot:P03385 Env polyprotein precursor DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70); Tranmembrane protein (TM) (p15E); R protein]. gi|119478|sp|P03385|ENV_MLVMO transmembranetranmembrane Patent Database:A76783 Sequence 11 from Patent WO CDS < /note="gene cassete encoding intercalating jun-zipper and linker" gi| |emb|A ||pat|WO| |11[ ] Cassette Cassete GenBank:AAD26534 nectin-1 [Rattus norvegicus] TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited to Cadherin-based Adherens Junctions through Interaction with Afadin, a PDZ Domain-containing Protein gi| |gb|AAD Immunoglobulin Immunoglobin Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

RECORD SINGLE SOURCE DATABASE Invalid values Ambiguity Incompatible schema ATTRIBUTE Uninformative sequences Undersized sequences Annotation error Dubious sequences Sequence redundancy Data provenance flaws Cross- annotation error Sequence structure violation Vector contaminated sequence Erroneous data transformation MULTIPLE SOURCE DATABASE Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep, 2004). In Nov 2004, the total number of undersized protein sequences increases to 3,350. Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep, 2004). In Nov 2004, the total number of undersized nucleotide sequences increases to 1,711. Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic Data Cleansing: Example Meaningless Seqs

RECORD SINGLE SOURCE DATABASE Invalid values Ambiguity Incompatible schema ATTRIBUTE Overlapping intron/exon Annotation error Dubious sequences Sequence redundancy Data Provenance flaws Cross- annotation error Sequence structure violation Vector contaminated sequence Erroneous data transformation MULTIPLE SOURCE DATABASE Syn7 gene of putative polyketide synthase in NCBI TPA record BN has overlapping intron 5 and exon 6. rpb7+ RNA polymerase II subunit in GENBANK record AF has overlapping exon 1 and exon 2. Data Cleansing: Example Overlapping Intron/Exon Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

RECORD SINGLE SOURCE DATABASE Invalid values Ambiguity Incompatible schema ATTRIBUTE Annotation error Dubious sequences Sequence redundancy Data provenance flaws Cross- annotation error Sequence structure violation Vector contaminated sequence Erroneous data transformation MULTIPLE SOURCE DATABASE Replication of sequence information Different views Overlapping annotations of the same sequence Submission of the same sequence to different databases Repeated submission of the same sequence to the same database Initially submitted by different groups Protein sequences may be translated from duplicate nucleotide sequences =protein&list_uids= &dopt=GenPept Data Cleansing: Example Seqs w/ Identical Info Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

RECORD SINGLE SOURCE DATABASE Invalid values Ambiguity Incompatible schema ATTRIBUTE Annotation error Dubious sequences Sequence redundancy Data provenance flaws Cross- annotation error Sequence structure violation Vector contaminated sequence Erroneous data transformation MULTIPLE SOURCE DATABASE Replication of sequence information Different views Overlapping annotations of the same sequence Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic Data Cleansing: Overlapping Annotations of the Same Seqs

SINGLE- SOURCE DATABASE ATTRIBUTE Schema remapping Sequence Structure Parser RECORD Concatenated values Spelling errors Format violation Synonyms Homonyms/Abbreviations Misuse of fields Features do not correspond with sequence Dictionary lookup Integrity constraints Undersized sequences Uninformative sequences Replication of sequence information Different views Overlapping annotations of the same sequence Fragments Duplicate detection Mis-fielded values Comparative analysis Sequence structure violation MULTI- SOURCE DATABASE Vector contaminated sequences Vector screening Putative features Cross-annotation error Trying our hands on a data cleansing problem... Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Copyright © 2005 by Limsoon Wong. Select matching criteria Compute similarity scores from known duplicate pairs Generate association rules Detect duplicates using the rules Association Rule Mining Approach Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Features to Match Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Copyright © 2005 by Limsoon Wong. AAG39642 AAG39643 AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0 AAG39642 Q9GNG8 AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0 P00599 PSNJ1W AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0 P01486 NTSREB AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0 O57385 CAA11159 AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0 S32792 P24663 AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0 P45629 S53330 AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0 LE1.0 PD0 SQ1.0 (99.7%) SP1 PD0 SQ1.0 (97.1%) SP1 LE1.0 PD0 SQ1.0 (96.8%) DB0 PD0 SQ1.0 (93.1%) DB0 LE1.0 PD0 SQ1.0 (92.8%) DB0 SP1 PD0 SQ1.0 (90.4%) DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%) RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%) RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%) Similarity scores of known duplicate pairs Association rule mining Frequent item-set with support Association Rule Mining Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Copyright © 2005 by Limsoon Wong. Scorpion venom dataset containing 520 records 695 duplicate pairs are collectively identified. Snake PLA2 venom dataset containing 780 records Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB) 251 duplicate pairs 444 duplicate pairs scorpion AND (venom OR toxin) serpentes AND venom AND PLA2 Expert annotation Dataset Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Copyright © 2005 by Limsoon Wong. Exam 3 rules w/ highest support S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)Rule 7 S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)Rule 6 S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)Rule 5 S(Seq)=1^ M(PDB)=0 ^ M(DB)=0 (93.1%)Rule 4 S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)Rule 3 S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)Rule 2 S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)Rule 1 Results Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Copyright © 2005 by Limsoon Wong. Rule 1S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%) Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates. Rule 2S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%) Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates. Rule 3S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%) Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates. What else do we learn? Definition of the sequence records do not play a role in identifying the record duplicates Results Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Copyright © 2005 by Limsoon Wong. References Zhang et al., “Toward discovering disease-specific gene networks from online literature”, APBC, 3: , 2005 Pan et al., “Dragon TF association miner: A system for exploring transcription factor associations through text mining”, NAR, 32:W230-W234, 2004 Tang et al., “Computational method for discovery of estrogen- responsive genes”, NAR, 32: , 2004a Tang et al., “ERGDB: Estrogen-responsive genes database”, NAR, 32:D533-D536, 2004b Bajic et al., “Dragon ERE finded ver.2: A tool for accurate detection and analysis of estrogen-response elements in vertebrate genomes”, NAR, 31: , 2003 Koh et al., “A Classification of Biological Data Artifacts”, DBiBD, 2005

Copyright © 2005 by Limsoon Wong. References Zhou et al., “Recognition of protein and gene names from text using an ensemble of classifiers and effective abbreviation resolution”, Proc. BioCreAtIvE Workshop, pp 26-30, 2004 Yang et al., “Improving Noun Phrase Co-reference Resolution by Matching Strings”, IJCNLP, 1: , 2004 Xiao et al., “Protein-protein interaction extraction: A supervised learning approach”, submitted

I2RI2R Communications & DevicesServices & ApplicationsMedia Processing Human Centric Media Semantics Infocomm Security Context-Aware Systems Knowledge Discovery Radio SystemsNetworkingLightwaveEmbedded Systems Digital Wireless Acknowledgements Copyright © 2005 by Limsoon Wong Data Cleansing: Judice Koh, Vladimir Brusic, Mong Li Lee, Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Wilson Goh, Songsak Tongchusak, Kavitha Gopalakrishnan Pathweaver: Zhuo Zhang, See-Kiong Ng, Suisheng Tang, Chris Tan Dragon ERG & TF Miner: Suisheng Tang, Vlad Bajic, Zuo Li, Pan Hong, Vidhu Chaudhary, Raja Kangasa Info Extraction: Guodong Zhou, Jian Su, ChewLim Tan, Juan Xiao, Xiaofeng Yang, Chris Tan, Dan Shen, Jie Zhang