Information Theoretical Probe Selection for Hybridisation Experiments

Presentation transcript:

Information Theoretical Probe Selection for Hybridisation Experiments Ralf Herwig et al. Bioinformatics, Vol.16, No. 10, 2000 Summarized by Sun Kim SNU Biointelligence Lab.

Introduction (1/2)
Oligonucleotide fingerprinting
- A fingerprint is characteristic of the individual clone.
- Probes should be informative for the clone sequences, in the sense that all different genes can be distinguished by their fingerprints.
- Probes should occur within the clone sequences with considerable frequency.
- Probes should not be too similar to each other.
Probe selection according to high frequencies
- Leads to an agglomeration of probes that are highly similar to each other.
- The gain in information is not significantly increased when selecting such probes.
(C) 2001, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Introduction (2/2)
This paper
- An information-theoretical approach: probe design based on entropy maximization.
- Probe performance is assessed with respect to clustering sequences, by evaluating pairwise similarities of their fingerprints.

System and Methods

Data Preparation (1/2)
- Training set → probe set
- Test set → binary fingerprints: $x_{ij} = 1$ if probe j or its reverse complementary sequence matches clone sequence i, and $x_{ij} = 0$ otherwise
- Five different test sets
- 685 different cDNA sequences from the GenBank database
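The binary fingerprint definition can be illustrated with a minimal Python sketch. It assumes that "matches" means an exact substring match over the alphabet A, C, G, T. The function names `reverse_complement` and `fingerprint` are hypothetical and do not come from the paper.

```python
# Sketch only: exact substring matching is an assumption, not the paper's stated rule.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complementary sequence of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def fingerprint(clone, probes):
    """Binary fingerprint of one clone: entry j is 1 if probe j or its
    reverse complement occurs in the clone sequence, and 0 otherwise."""
    return [
        1 if (probe in clone or reverse_complement(probe) in clone) else 0
        for probe in probes
    ]


if __name__ == "__main__":
    # The first probe matches through its reverse complement CGTT.
    print(fingerprint("TTTCGTTAAA", ["AACG", "GGGG"]))
```

Stacking the fingerprints of all clones row by row gives the sequences-by-probes binary matrix on which the remaining steps operate.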

Data Preparation (2/2)
Noise
- Introduced by flipping the respective fraction of digits of the binary fingerprints
Parameters
- 20% of true positives → 0
- 20% of true negatives → 1
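One possible reading of this noise model, as a hedged Python sketch: a fixed fraction of the 1-entries and of the 0-entries of a fingerprint is chosen at random and flipped. Whether the original simulation flipped an exact fraction or flipped each digit independently with that probability is not stated on the slide, so the exact-fraction choice is an assumption. The name `add_noise` and its parameters are hypothetical.

```python
import random


def add_noise(fingerprint, pos_rate=0.2, neg_rate=0.2, seed=0):
    """Flip a fraction pos_rate of the 1s to 0 and neg_rate of the 0s to 1.

    Assumption: an exact fraction (rounded) of each class is flipped.
    """
    rng = random.Random(seed)
    ones = [i for i, bit in enumerate(fingerprint) if bit == 1]
    zeros = [i for i, bit in enumerate(fingerprint) if bit == 0]
    to_flip = rng.sample(ones, round(pos_rate * len(ones)))
    to_flip += rng.sample(zeros, round(neg_rate * len(zeros)))
    noisy = list(fingerprint)
    for i in to_flip:
        noisy[i] = 1 - noisy[i]
    return noisy


if __name__ == "__main__":
    clean = [1] * 10 + [0] * 20
    print(add_noise(clean))
```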

Clustering
Sequential k-means algorithm
- Uses mutual information as a pairwise similarity measure for the binary fingerprints.
- Sequentially assigns each data point to the most similar cluster centroid from a set of previously calculated cluster centroids; the centroid is then updated with the data point.
- Enriched by heuristics and algorithmic parameters that allow the merging of clusters and the introduction of new clusters in each step of the clustering process.
- No need to fix the number of different clusters in advance.
- Simulation pipeline → Herwig et al. (1999)
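As an illustration of the similarity measure, the following sketch computes the plain empirical mutual information (in bits) between two binary fingerprints of equal length. The slide does not say whether the authors normalize or otherwise modify this quantity, so this is the textbook definition, not necessarily the exact measure used in the paper. The name `mutual_information` is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (bits) between two equal-length
    binary fingerprints, estimated from joint digit frequencies."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


if __name__ == "__main__":
    print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # identical fingerprints
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent fingerprints
```

In a sequential k-means setting, a value like this would be computed between each incoming fingerprint and every current centroid, and the fingerprint assigned to the centroid with the highest score.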

Validation of Clustering (1/2)
- True clustering → T: $T_{ij} = 1$ if sequences i and j belong to the same cluster, and $T_{ij} = 0$ otherwise
- Calculated clustering → C
- A 2x2 contingency table is used to measure clustering quality

Validation of Clustering (2/2)
- Measure → Jaccard coefficient
- Perfect clustering: J(C,T) = 1
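The formula itself did not survive in this transcript. The sketch below uses the standard pair-counting form of the Jaccard coefficient, J = n11 / (n11 + n10 + n01), where n11 counts sequence pairs placed together in both clusterings and n10, n01 count pairs placed together in only one of them. That this matches the paper's exact definition is an assumption. The name `jaccard` and the label-list input format are hypothetical.

```python
from itertools import combinations


def jaccard(calc_labels, true_labels):
    """Pair-counting Jaccard coefficient between a calculated clustering
    and the true clustering, each given as a list of cluster labels."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_calc = calc_labels[i] == calc_labels[j]
        same_true = true_labels[i] == true_labels[j]
        if same_calc and same_true:
            n11 += 1
        elif same_calc:
            n10 += 1
        elif same_true:
            n01 += 1
    denom = n11 + n10 + n01
    # Convention chosen here: if no pair is co-clustered anywhere, return 1.0.
    return n11 / denom if denom else 1.0


if __name__ == "__main__":
    print(jaccard([0, 0, 1, 1], [5, 5, 7, 7]))  # same partition, different label names
    print(jaccard([0, 0, 0, 0], [0, 0, 1, 1]))  # everything merged into one cluster
```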

Algorithm and Implementation (1/3)
- The fingerprints obtained with a single probe partition the N sequences into two subsets:
  - those sequences that match the probe sequence or its reverse complementary sequence, and
  - those that do not.
- The amount of information of the probe w.r.t. the set of sequences → entropy: $H = -\sum_{k=1}^{2} p_k \log p_k$, where $p_k$ is the proportion of sequences that fall in the respective subset.
- The entropy is maximal when the two subsets are of equal size.
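A small worked example of the single-probe entropy, as a sketch: the input is one probe's column of the binary fingerprint matrix (one digit per sequence). The base-2 logarithm is an assumption, since the slide does not give the base. The name `probe_entropy` is hypothetical.

```python
from math import log2


def probe_entropy(column):
    """Entropy (bits) of the two-subset partition induced by one probe.

    column[i] is 1 if the probe matches sequence i, else 0.
    """
    n = len(column)
    p_match = sum(column) / n
    entropy = 0.0
    for p in (p_match, 1.0 - p_match):
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * log2(p)
    return entropy


if __name__ == "__main__":
    print(probe_entropy([1, 1, 0, 0]))  # balanced split
    print(probe_entropy([1, 0, 0, 0]))  # unbalanced split, lower entropy
    print(probe_entropy([1, 1, 1, 1]))  # probe matches everything, no information
```

A probe that matches every sequence, or none, scores zero. This is why very frequent probes are not automatically the most informative ones.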

Algorithm and Implementation (2/3)
- The number of possible fingerprints increases as $2^p$ with the number p of probes.
- Screening all possibilities is computationally infeasible → approximation suggested by R. Mott (Meier-Ewert, 1994)

Algorithm and Implementation (3/3)
Approximation
1. Find the probe that best partitions the set of known sequences into two groups.
2. Find the second probe that, together with the previously selected one, best partitions the training set into four groups.
3. Find the next probe that, together with the previously selected ones, best partitions the training set.
4. Stop if the number of selected probes exceeds a given threshold or if each partition contains only one sequence.
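The greedy approximation above can be sketched as follows. The sketch assumes that "partitions best" means maximizing the entropy of the partition induced by all probes selected so far, and that ties go to the first candidate. The input `matrix` is a hypothetical sequences-by-probes binary match matrix. The names `partition_entropy` and `greedy_select` are illustrative and not taken from the paper.

```python
from collections import Counter
from math import log2


def partition_entropy(group_keys):
    """Entropy (bits) of a partition given one group key per sequence."""
    n = len(group_keys)
    return -sum((c / n) * log2(c / n) for c in Counter(group_keys).values())


def greedy_select(matrix, max_probes):
    """Greedily pick probe indices that maximize partition entropy.

    matrix[i][j] is 1 if candidate probe j matches training sequence i.
    """
    n_seqs = len(matrix)
    n_probes = len(matrix[0])
    groups = [()] * n_seqs  # partial fingerprint of each sequence so far
    selected = []
    while len(selected) < max_probes:
        if len(set(groups)) == n_seqs:
            break  # every group already holds a single sequence
        best_j, best_h = None, -1.0
        for j in range(n_probes):
            if j in selected:
                continue
            candidate = [g + (matrix[i][j],) for i, g in enumerate(groups)]
            h = partition_entropy(candidate)
            if h > best_h:
                best_j, best_h = j, h
        if best_j is None:
            break  # no unused probes left
        selected.append(best_j)
        groups = [g + (matrix[i][best_j],) for i, g in enumerate(groups)]
    return selected


if __name__ == "__main__":
    toy = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
    print(greedy_select(toy, max_probes=3))
```

Each round costs one pass over all remaining candidates, so selecting s probes from m candidates needs on the order of s times m entropy evaluations instead of a search over all probe subsets.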

Parameters
- LEN: length of probes; default = 8
- N_GC: minimal number of G+C in each probe; default = 2
- COMP: minimal complexity of probes; default = 0.5
- OVL: maximal length of a common stretch of base pairs shared by any two probes; default = 6
- SEL: number of probes to be determined; default = 200
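How such parameters might act as a filter on candidate probes can be sketched as below. Only LEN, N_GC and OVL are covered. COMP is left out because the slide does not define the complexity measure. OVL is interpreted here as the longest common contiguous substring between two probes, which is an assumption, and reverse complements are not considered. All function names are hypothetical.

```python
def gc_count(probe):
    """Number of G or C bases in the probe."""
    return sum(base in "GC" for base in probe)


def longest_common_stretch(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for base_a in a:
        cur = [0] * (len(b) + 1)
        for k, base_b in enumerate(b, start=1):
            if base_a == base_b:
                cur[k] = prev[k - 1] + 1
                best = max(best, cur[k])
        prev = cur
    return best


def admissible(probe, chosen, length=8, n_gc=2, ovl=6):
    """Check a candidate against LEN, N_GC and OVL (defaults from the slide)."""
    if len(probe) != length or gc_count(probe) < n_gc:
        return False
    return all(longest_common_stretch(probe, c) <= ovl for c in chosen)


if __name__ == "__main__":
    print(admissible("ACGTACGT", []))            # passes LEN and N_GC
    print(admissible("ACGTACGT", ["CGTACGTA"]))  # shares a 7-base stretch
```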

Trained Probe Sequences

Results: Frequency of Probes

Results: Comparison of Probe Sets (1/2)
Comparison of probe sets tested on human data and rodent data.

Results: Comparison of Probe Sets (2/2)
Comparison of probe sets trained on human, rodent, and plant sequences, respectively.

Results: Variation of Algorithmic Parameters

Conclusion
- Probe selection based on entropy.
- The result depends on the training set: it should be chosen as close to the organism under analysis as possible.
- Good hybridization quality can be achieved, e.g. by G+C-rich probes.
- The proposed algorithm can be applied to any experiment in which texts (sequences) are characterized by words (probes). Might it be used to select characteristic keywords?