Algorithms for Smoothing Array CGH data

Presentation transcript:

Algorithms for Smoothing Array CGH data Kees Jong (VU, CS and Mathematics) Elena Marchiori (VU, Computer Science) Aad van der Vaart (VU, Mathematics) Gerrit Meijer (VUMC) Bauke Ylstra (VUMC) Marjan Weiss (VUMC) Estimated time: < 1 minute

Tumor Cell Chromosomes of a tumor cell: Normal cells have 23 pairs of chromosomes. Each chromosome is built from two strands of bases labeled A, G, T and C. Two connected bases are referred to as a base pair. Male: XY; Female: XX. Some pieces of chromosomal DNA form genes. Processes in the cell ultimately form proteins from genes, which determine the way a cell behaves. Each chromosome is labeled with a different color (SKY experiment). You can see this cell does not at all have 2 copies of each chromosome: pieces are lost, pieces have attached to other chromosomes, and some pieces have more than 2 copies.

CGH Data [Figure: copy number (vertical axis) plotted against clones/chromosomes (horizontal axis)] Estimated time: 1 minute Explain axes. About the data: normalized (average 1), log2 (why?). “It is preferable to work with logged intensities rather than absolute intensities for a number of reasons, including the facts that: (i) the variation of logged intensities and ratios of intensities is less dependent on absolute magnitude; (ii) normalization is additive for logged intensities; (iii) taking logs evens out highly skew distributions; and (iv) taking logs gives a more realistic sense of variation.” Point out some gains, losses and amplifications.

Naïve Smoothing
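The figure for this slide is not reproduced in the transcript. “Naïve” smoothing here presumably means something like a simple moving average over neighbouring clones; a minimal sketch (the window size is an arbitrary choice, not from the talk):

```python
import numpy as np

def naive_smooth(values, window=5):
    """Moving-average smoothing of log2 ratios over neighbouring clones.

    This is a generic sliding-window average, not the model-based
    smoothing developed later in the talk.
    """
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    # mode="same" keeps the output aligned with the input probes
    return np.convolve(values, kernel, mode="same")
```

Such a smoother blurs breakpoints rather than locating them, which is one motivation for the discrete, model-based approach on the following slides.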

“Discrete” Smoothing Copy numbers are integers

Why Smoothing? Noise reduction Detection of loss, normal, gain, amplification Breakpoint analysis Estimated time: about 1 minute Explain and point at breakpoints: a breakpoint is a place where the value changes. Smoothing also makes it possible to do analysis on breakpoints. Recurrent (over tumors) aberrations may indicate an oncogene or a tumor suppressor gene.

Is Smoothing Easy? Measurements are relative to a reference sample Printing, labeling and hybridization may be uneven Tumor sample is inhomogeneous Estimated time: 2 minutes Unfortunately there are some problems with the experiment just described that prevent us from simply reading the copy number from the ratios. 1) Not every spot on the slide has the same amount of clone DNA on it. 2) In a tumor not every cell is the same: normal cells may occur, and tumor cells of an earlier stage of the tumor may occur. The proportions of the cell types are unknown. The amount of DNA material taken from the tumor cells may not be exactly equal to the amount taken from normal cells. The “green” and “red” molecules do not stick equally well. 3) Not every labeled fragment can stick to a clone, because it may lie on a part of the chromosome that does not occur on any clone. Not every fragment that can stick to a clone actually does stick. 4) Every clone has three spots on the slide. If we assume that the red and green fragments have the same distribution over the spots, the three spots should have the same value, which is almost the case, since there is only a small standard deviation. Problem: assume the lowest line is normal. What is the real situation? With 1 cell type, the middle line is a single gain and the highest line is a double gain. With 2 cell types, possibly type 1 occurs twice as much as type 2 and has a gain corresponding to the upper line, while type 2 has a gain corresponding to the middle line. The vertical scale is relative, so expect only a few levels.

Smoothing: example Estimated time: < 20 seconds.

Problem Formalization A smoothing can be described by a number of breakpoints and corresponding levels A fitness function scores each smoothing according to its fit to the data An algorithm finds the smoothing with the highest fitness score Estimated time: ? Details of the function to optimize later. What are we trying to model? Do we want to find smoothings that resemble the expert, or remove experimental noise? Unfortunately not much is known about the properties of the processes that take place during the experiments. We assume… We do not have tumors with detailed data about the real composition of cell types, but the model seems to resemble the expert quite well, as we will see. We do have CGH data of a normal-normal experiment. A normal probability plot of that data shows that the noise in that experiment follows a normal distribution quite well.

Smoothing [Figure: example smoothing, annotated with breakpoints, levels and variance]

Fitness Function We assume that the data are a realization of a Gaussian noise process and use the maximum likelihood criterion, adjusted with a penalization term to take model complexity into account Estimated time: ? We could use better models given insight in tumor pathogenesis.

Fitness Function (2) likelihood: CGH values: x1, ..., xn breakpoints: 0 < y1 < … < yN < n levels: μ1, ..., μN error variances: σ1², ..., σN² Estimated time: ? Assume independence of all observations… likelihood:
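The likelihood formula itself appears to have been an image and did not survive the transcript. Under the Gaussian assumption stated on the previous slide it would read, in the slide's notation (a reconstruction; note that N breakpoints partition the n values into N + 1 segments, with the conventions y0 = 0 and yN+1 = n):

```latex
L(\mu,\sigma^2)
  = \prod_{j=1}^{N+1} \; \prod_{i = y_{j-1}+1}^{y_j}
    \frac{1}{\sqrt{2\pi\sigma_j^2}}
    \exp\!\left( -\frac{(x_i - \mu_j)^2}{2\sigma_j^2} \right)
```

Each segment j contributes an independent Gaussian factor with its own level μj and variance σj².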

Fitness Function (3) Maximum likelihood estimators of μ and σ² can be found explicitly Need to add a penalty to the log likelihood to control the number N of breakpoints Estimated time: ? * Goal: Add a penalty to make sure not every index becomes a breakpoint.
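A minimal Python sketch of this penalized fitness, assuming a penalty that is simply linear in the number of breakpoints (the slides do not give its exact form) and using the explicit per-segment maximum likelihood estimators:

```python
import numpy as np

def fitness(x, breakpoints, penalty=2.0):
    """Penalized Gaussian log-likelihood of a smoothing.

    `breakpoints` are the indices at which a new segment starts.
    The per-segment mean and variance are the explicit maximum
    likelihood estimators; `penalty` per breakpoint is a placeholder
    for the complexity term, whose exact form the talk leaves open.
    """
    x = np.asarray(x, dtype=float)
    bounds = [0] + sorted(breakpoints) + [len(x)]
    loglik = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = x[lo:hi]
        mu_hat = seg.mean()                # MLE of the segment level
        var_hat = max(seg.var(), 1e-12)    # MLE of the variance, floored
        # maximized Gaussian log-likelihood of the segment
        loglik += -0.5 * len(seg) * (np.log(2 * np.pi * var_hat) + 1)
    return loglik - penalty * len(breakpoints)
```

On a step-shaped profile, the true segmentation scores higher than an unsegmented fit, while the penalty keeps every index from becoming a breakpoint.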

Algorithms Maximizing the fitness is computationally hard Use a genetic algorithm + local search to find an approximation to the optimum Estimated time: ?

Algorithms: Local Search choose N breakpoints at random while (improvement) - randomly select a breakpoint - move the breakpoint one position to the left or to the right Estimated time: ?
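A Python sketch of this loop, assuming a fitness callable with a hypothetical signature fitness(x, breakpoints), such as a penalized likelihood:

```python
import random

def local_search(x, breakpoints, fitness, rng=random.Random(0)):
    """Hill climbing as on the slide: repeatedly pick a breakpoint at
    random and try moving it one position left or right, keeping the
    move only when the fitness improves."""
    bps = sorted(breakpoints)
    improved = True
    while improved:
        improved = False
        for _ in range(len(bps)):
            i = rng.randrange(len(bps))
            for step in (-1, 1):
                cand = sorted(bps[:i] + [bps[i] + step] + bps[i + 1:])
                inside = 0 < cand[0] and cand[-1] < len(x)
                distinct = len(set(cand)) == len(cand)
                if inside and distinct and fitness(x, cand) > fitness(x, bps):
                    bps, improved = cand, True
                    break
    return bps
```

The loop terminates once a full pass over the breakpoints finds no improving single-step move, i.e. at a local optimum.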

Genetic Algorithm Given a “population” of candidate smoothings, create a new smoothing by: - select two “parents” at random from the population - generate “offspring” by combining the parents (e.g. “uniform crossover” or “union”) - apply mutation to each offspring - apply local search to each offspring - replace the two worst individuals with the offspring Estimated time: ? Encoding: 0/1 per clone, labeling a clone as a breakpoint or not. Termination criterion (both GAs): the best fitness in the pool does not improve and the “individuals” in the pool are very similar (there is no pair of individuals that have, after smoothing, a pair of clones that differ by at least epsilon in smoothed CGH value (the average of the section)). The mutation flips a coin to either remove the breakpoint between two consecutive sections whose removal gives the best score afterwards, or insert a breakpoint in the middle of the section with the highest standard deviation.
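The speaker notes mention a 0/1-per-clone encoding (bit i marks clone i as a breakpoint). On that encoding, the two combination operators named on the slide can be sketched as follows; the bit-flip mutation shown here is a generic placeholder, simpler than the remove/insert mutation the notes describe:

```python
import random

def uniform_crossover(p1, p2, rng=random.Random(1)):
    """Uniform crossover: each position is copied from either parent
    with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]

def union_crossover(p1, p2):
    """'Union' crossover: a clone is a breakpoint in the offspring if
    it is a breakpoint in either parent (bitwise OR)."""
    return [a | b for a, b in zip(p1, p2)]

def bitflip_mutation(ind, rate=0.01, rng=random.Random(2)):
    """Flip each bit independently with a small probability (the rate
    is an arbitrary choice, not from the talk)."""
    return [b ^ (rng.random() < rate) for b in ind]
```

Uniform crossover preserves positions shared by both parents, while union crossover only ever adds breakpoints, leaving their removal to local search or a join step.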

Experiments Comparison of GLS GLSo Multi Start Local Search (mLS) Multi Start Simulated Annealing (mSA) GLS is significantly better than the other algorithms. Estimated time: ? Data used: about 25 tumors, about 2000 clones/tumor, 23 chromosomes/tumor. Run the algorithm for each chromosome separately. Explain mLS. Explain mSA. Test: sign test.

Comparison to Expert [Figure: smoothings produced by the algorithm and by the expert]

Relating to Gene Expression

Relating to Gene Expression


Conclusion Breakpoint identification as model fitting: search for the most likely model given the data Genetic algorithms + local search perform well Results comparable to those produced by hand by the local expert Future work: analyse the relationship between chromosomal aberrations and gene expression Estimated time: 1 minute

Example of a-CGH [Figure: tumor profile, value (vertical axis) plotted against clones/chromosomes (horizontal axis)] Estimated time: 1 minute Explain axes. About the data: normalized (average 1), log2. Point out some gains, losses and amplifications.

a-CGH vs. Expression a-CGH: DNA, in the nucleus, the same for every cell; DNA on the slide; measures copy number variation. Expression: RNA, in the cytoplasm, different per cell; cDNA on the slide; measures gene expression. Estimated time: <= 2 minutes This slide explains the difference between array CGH and microarray gene expression experiments. Explain copy number: the copy number of a piece of DNA on the genome is the number of times this piece occurs in the nucleus of a cell.

Breakpoint Detection Identify possibly damaged genes: These genes will not be expressed anymore Identify recurrent breakpoint locations: Indicates fragile pieces of the chromosome Accuracy is important: Important genes may be located in a region with (recurrent) breakpoints Estimated time: ?

Experiments Both GAs are robust: over different randomly initialized runs, breakpoints are (mostly) placed at the same locations. Both GAs converge: the “individuals” in the pool are very similar. The final result looks very much like the smoothing conducted by the local expert (mean error = 0.0513). Estimated time: ?

Genetic Algorithm 1 (GLS) initialize population of candidate solutions randomly while (termination criterion not satisfied) - select two parents using roulette wheel - generate offspring using uniform crossover - apply mutation to each offspring - apply local search to each offspring - replace the two worst individuals with the offspring Estimated time: ?

Genetic Algorithm 2 (GLSo) initialize population of candidate solutions randomly while (termination criterion not satisfied) - select two parents using roulette wheel - generate offspring using OR crossover - apply local search to offspring - apply “join” to offspring - replace the worst individual with the offspring Estimated time: ? join: repeatedly select the breakpoint whose removal results in the biggest improvement of fitness, until the fitness no longer improves.
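The “join” operator from the notes, sketched in Python, assuming a fitness callable with a hypothetical signature fitness(x, breakpoints):

```python
def join(x, breakpoints, fitness):
    """GLSo 'join' step: repeatedly remove the breakpoint whose removal
    gives the biggest improvement in fitness, stopping as soon as no
    removal improves it."""
    bps = sorted(breakpoints)
    while bps:
        # score every single-breakpoint removal
        scored = [(fitness(x, bps[:i] + bps[i + 1:]), i)
                  for i in range(len(bps))]
        best_score, best_i = max(scored)
        if best_score <= fitness(x, bps):
            break  # no removal helps any more
        bps = bps[:best_i] + bps[best_i + 1:]
    return bps
```

Because the OR crossover only adds breakpoints, this greedy pruning is what keeps GLSo's individuals from accumulating spurious ones.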
