Published by Hilda Sherman. Modified over 9 years ago.
1
Algorithms in Computational Biology (236522), Spring 2002
Lecturer: Shlomo Moran, Taub 639, tel. 4363. Office hours: Wednesday 16:30-17:30
TA: Ydo Wexler, Taub 431, tel. 4927. Office hours: Monday 10:30-11:30
Lecture: Tuesday 11:30-13:30, Taub 2
Tutorial: Monday 9:30-10:30, Taub 4
2
2 Course Information
Requirements & Grades:
- 15-25% homework, in five theoretical question sets (each to be submitted within two weeks). Homework is obligatory.
- 75-85% test. You must score above 55 on the test for the homework grade to count.
- Exam date: 7.7.04.
3
3 Bibliography
- Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998
- Introduction to Molecular Biology, J. Setubal, J. Meidanis, PWS Publishing Company, 1997
- Phylogenetics, C. Semple, M. Steel, Oxford Press, 2003
- URL: www.cs.technion.ac.il/~cs236522
4
4 Course Prerequisites
Computer Science and Probability Background
- Data structure 1 (cs234218)
- Algorithms 1 (cs234247)
- Probability (any course)
Some Biology Background
- Formally: none, to allow CS students to take this course.
- Recommended: Molecular Biology 1 (especially for those in the Bioinformatics track), or a similar Biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material (see the course web site).
Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.
5
5 Relations to Some Other Courses
Bioinformatics Software (cs236523). The course Introduction to Bioinformatics covers practical aspects and hands-on experience with many web-based bioinformatics programs. Although it is not a formal requirement, it is recommended that you look at the web site www.cs.technion.ac.il/~cs236606 and examine the relevant software.
Bioinformatics algorithms (cs236522). This is the current course, which focuses on modeling some bioinformatics problems and presents algorithms for their solution.
Bioinformatics project (cs5236524). Developing bioinformatics tools under close guidance.
6
Biological Background
This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran.
First homework assignment: Read the first chapter (pages 1-30) of Setubal et al., 1997 (a copy is available in the Taub building library, and one for loan at Fishbach). Solve questions 1-3, p. 30 (to be on the course web site).
Due time: tutorial class of 22.3.04 (~2 weeks from today), or earlier in the teaching assistant’s mail slot.
7
7 Computational Biology
Computational biology is the application of computational tools and techniques to (primarily) molecular biology. It enables new ways of study in the life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics.
Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly, restricting the field to molecular Biology only.
8
8 Examples of Areas of Interest
- Building evolutionary trees from molecular (and other) data
- Efficiently constructing genomes of various organisms
- Understanding the structure of genomes (SNP, SSR, genes)
- Understanding the function of genes in the cell cycle and disease
- Deciphering the structure and function of proteins
SNP: Single Nucleotide Polymorphism
SSR: Simple Sequence Repeat
9
9 Exponential growth of biological information: growth of sequences, structures, and literature.
10
10 Four Aspects
Biological
- What is the task?
Algorithmic
- How to perform the task at hand efficiently?
Learning
- How to adapt/estimate/learn parameters and models describing the task from examples?
Statistics
- How to differentiate true phenomena from artifacts?
11
11 Example: Sequence Comparison
Biological
- Evolution preserves sequences, thus similar genes might have similar function
Algorithmic
- Consider all ways to “align” one sequence against another
Learning
- How do we define “similar” sequences? Use examples to define similarity
Statistics
- When we compare to ~10^6 sequences, what is a random match and what is a true one?
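As a rough illustration of the algorithmic aspect, the sketch below scores the best way to align two sequences using unit-cost edit distance (each substitution, insertion, or deletion costs 1). The slide does not name a scoring scheme, so the unit costs and the function name `edit_distance` are assumptions made for this example only.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

Dynamic programming considers all alignments implicitly, in time proportional to the product of the two sequence lengths, instead of enumerating them one by one.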
12
12 Course Goals
- Learn about computational tools for (primarily) molecular biology
- Cover computational tasks that are posed by modern molecular biology
- Discuss the biological motivation and setup for these tasks
- Understand the kinds of solutions that exist and what principles justify them
13
13 Topics I
Dealing with DNA/Protein sequences:
- Genome projects and how sequences are found
- Finding similar sequences
- Models of sequences: Hidden Markov Models
- Transcription regulation
- Protein families
- Gene finding
14
14 Topics II
Models of genetic change:
- Long term: evolutionary changes among species
- Reconstructing evolutionary trees from sequences
- Short term: genetic variations in a population
- Finding genes by linkage and association
15
15 Topics III (if time allows)
Protein World:
- How proteins fold: secondary & tertiary structure
- How to predict protein folds from sequence data
- How to analyze protein changes from raw experimental measurements (MassSpec)
16
16 Human Genome
Most human cells contain 46 chromosomes:
- 2 sex chromosomes (X, Y): XY in males, XX in females.
- 22 pairs of chromosomes named autosomes.
17
17 DNA Organization Source: Alberts et al
18
18 The Double Helix Source: Alberts et al
19
19 DNA Components
Four nucleotide types:
- Adenine
- Guanine
- Cytosine
- Thymine
Hydrogen bonds (electrostatic connection):
- A-T
- C-G
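The A-T and C-G pairing can be written down as a small lookup table. The sketch below is only an illustration; the names `PAIR`, `complement`, and `reverse_complement` are made up for this example.

```python
# Base pairing from the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Opposite strand read in its own direction (the two strands are antiparallel)."""
    return complement(strand)[::-1]
```

For instance, `complement("AAGC")` gives `"TTCG"`, and reading that strand in the opposite direction gives `"GCTT"`.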
20
20 Genome Sizes
- E. coli (bacteria): 4.6 x 10^6 bases
- Yeast (simple fungi): 15 x 10^6 bases
- Smallest human chromosome: 50 x 10^6 bases
- Entire human genome: 3 x 10^9 bases
21
21 Genetic Information
- Genome: the collection of genetic information.
- Chromosomes: storage units of genes.
- Gene: the basic unit of genetic information. Genes determine the inherited characters.
22
22 Genes
The DNA strings include:
- Coding regions (“genes”)
  - E. coli has ~4,000 genes
  - Yeast has ~6,000 genes
  - C. elegans has ~13,000 genes
  - Humans have ~32,000 genes
- Control regions
  - These typically are adjacent to the genes
  - They determine when a gene should be “expressed”
- “Junk” DNA (unknown function; ~90% of the DNA in human chromosomes)
23
23 The Cell All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.
24
24 Example: Tissues in Stomach How is this variety encoded and expressed ?
25
25 Central Dogma
Gene -> (transcription) -> mRNA -> (translation) -> Protein
Cells express different subsets of the genes in different tissues and under different conditions.
26
26 Transcription
- Coding sequences can be transcribed to RNA
- RNA nucleotides:
  - Similar to DNA, slightly different backbone
  - Uracil (U) instead of Thymine (T)
Source: Mathews & van Holde
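At the level of letters, the only change from DNA to RNA is U in place of T. The sketch below assumes the input is the coding strand (the one whose sequence matches the mRNA); the function name `transcribe` is hypothetical, and the biochemistry of transcription is ignored entirely.

```python
def transcribe(coding_strand: str) -> str:
    """Spell a DNA coding strand in the RNA alphabet: Uracil (U) replaces Thymine (T)."""
    return coding_strand.upper().replace("T", "U")
```

For example, `transcribe("ATGGCT")` gives `"AUGGCU"`.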
27
27 Transcription: RNA Editing
1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists
Exons hold information; they are more stable during evolution. This process takes place in the nucleus. The mRNA molecules then diffuse through the nuclear membrane into the cytoplasm.
28
28 RNA roles
- Messenger RNA (mRNA)
  - Encodes protein sequences. Every three nucleotides (a codon) translate to one amino acid (the protein building block).
- Transfer RNA (tRNA)
  - Decodes the mRNA molecules to amino acids. It connects to the mRNA on one side and holds the appropriate amino acid on its other side.
- Ribosomal RNA (rRNA)
  - Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the hanging amino acid from the tRNA to the amino acid chain being created.
- ...
29
29 Translation
- Translation is mediated by the ribosome
- The ribosome is a complex of protein & rRNA molecules
- The ribosome attaches to the mRNA at a translation initiation site
- The ribosome then moves along the mRNA sequence and in the process constructs a sequence of amino acids (a polypeptide), which is released and folds into a protein.
30
30 Genetic Code
There are 20 amino acids from which proteins are built.
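The mapping from nucleotide triplets to amino acids can be sketched as a lookup table. The code below builds the standard genetic code in one-letter amino acid notation, with `*` marking the three stop codons; the names `CODON_TABLE` and `translate` are chosen for this example, and the sketch simply reads from the first nucleotide rather than searching for an initiation site.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed with the first base varying slowest (UUU, UUC, UUA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":  # stop codon: translation ends here
            break
        protein.append(amino)
    return "".join(protein)
```

The table has 64 codons but only 20 distinct amino acids, so most amino acids are encoded by more than one codon.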
31
31 Protein Structure
- Proteins are polypeptides of 70-3000 amino acids
- A protein’s structure is (mostly) determined by the sequence of amino acids that make up the protein
32
32 Protein Structure
33
33 Evolution
- Related organisms have similar DNA
  - Similarity in sequences of proteins
  - Similarity in organization of genes along the chromosomes
- Evolution plays a major role in biology
  - Many mechanisms are shared across a wide range of organisms
  - During the course of evolution, existing components are adapted for new functions
34
34 Evolution
Evolution of new organisms is driven by:
- Diversity
  - Different individuals carry different variants of the same basic blueprint
- Mutations
  - The DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.
- Selection bias
35
35 The Tree of Life Source: Alberts et al
36
36 Example for Phylogenetic Analysis
Input: four nucleotide sequences, AAG, AAA, GGA, AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
One answer (the parsimony principle): pick a tree that has a minimum total number of substitutions of symbols between species and their originator in the evolutionary tree (also called a phylogenetic tree).
[Figure: a tree over AGA, AAA, GGA, AAG with an ancestor labeled AAA; edge substitution counts 2, 1, 1; total #substitutions = 4]
37
37 Example Continued
There are many possible trees. For example:
[Figure, left tree: leaves AGA, GGA, AAA, AAG with ancestors labeled AAA, AGA, AAA; edge substitution counts 1, 1, 1; total #substitutions = 3]
[Figure, right tree: leaves GGA, AAA, AGA, AAG with an ancestor labeled AAA; edge substitution counts 1, 1, 2; total #substitutions = 4]
The left tree is “better” than the right tree.
Questions:
- Does this principle yield realistic phylogenetic trees? (Evolution)
- How can we compute the best tree efficiently? (Computer Science)
- What is the probability of substitutions given the data? (Learning)
- Is the best tree found significantly better than others? (Statistics)
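The substitution totals in this example can be checked with a short sketch. It uses Fitch's small-parsimony algorithm, which the slides do not name, and it assumes a rooted binary tree with the four species at the leaves, which may differ from the drawn trees. The species names `s1` to `s4` and the nested-tuple tree encoding are hypothetical.

```python
def fitch_site(tree, column):
    """Return (candidate states, substitutions) for one sequence position.

    tree is a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], column)
    right_set, right_cost = fitch_site(tree[1], column)
    common = left_set & right_set
    if common:  # children can agree: no extra substitution needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Minimum total number of substitutions over all ancestor labelings of the tree."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )

seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
best = parsimony_score((("s1", "s2"), ("s3", "s4")), seqs)   # groups AAG with AAA
other = parsimony_score((("s1", "s3"), ("s2", "s4")), seqs)  # groups AAG with GGA
```

Under these assumptions the grouping of AAG with AAA and of GGA with AGA needs 3 substitutions, while the other two groupings of the four leaves need 4, which agrees with the totals on the slides.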
38
38 Characters in Species
- A (discrete) character is a property which distinguishes between species (e.g. dental structure, a certain gene).
- A character state is a value of the character (e.g. human dental structure).
- Problem: Given a set of species, specified by their characters, reconstruct their evolutionary tree.
39
39
Species ≡ Vertices
States ≡ Colors
Characters ≡ Colorings
Evolutionary tree ≡ A tree with many colorings, containing the given vertices
40
40 Evolutionary trees should avoid reversal transitions
- A species regains a state its direct ancestor has lost.
- Famous examples:
  - Teeth in birds.
  - Legs in snakes.
41
41 Evolutionary trees should avoid convergence transitions
- Two species possess the same state while their least common ancestor possesses a different state.
- Famous example: the marsupials.
42
42
43
43 Common Assumption
Characters with reversal or convergent transitions are highly unlikely in the evolutionary tree.
A character that exhibits neither reversals nor convergence is called homoplasy free.
44
44 A character is Homoplasy Free ↕ The corresponding coloring is convex (each color induces a block)
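Convexity of a total coloring can be tested directly from the definition: the vertices of each color must form one connected block of the tree. The sketch below assumes the tree is given as an edge list and the coloring as a dictionary from vertex to color; the function name `is_convex` and the example data are hypothetical.

```python
from collections import defaultdict

def is_convex(edges, color):
    """True if every color class induces a connected subgraph of the tree."""
    adjacent = defaultdict(set)
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    blocks = defaultdict(set)
    for vertex, c in color.items():
        blocks[c].add(vertex)
    for block in blocks.values():
        # Depth-first search that never leaves the current color class.
        start = next(iter(block))
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adjacent[u]:
                if w in block and w not in seen:
                    seen.add(w)
                    stack.append(w)
        if seen != block:  # some vertex of this color is cut off from the rest
            return False
    return True

path = [("a", "b"), ("b", "c"), ("c", "d")]
```

On the path a-b-c-d, the coloring R, R, B, B is convex, while R, B, R, B is not, because the two R vertices are separated by a B vertex.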
45
45 A partial coloring is convex if it can be completed to a (total) convex coloring
46
46 The Perfect Phylogeny Problem
- Input: a set of species, and many characters, each assigning states (colors) to the species.
- Question: is there a tree T containing the species as vertices, in which all the characters (colorings) are convex?
47
47 The Perfect Phylogeny Problem (combinatorial setting)
Input: some colorings (C_1,…,C_k) of a set of vertices (in the example: 3 colorings, left, center, and right, each using the same two colors).
Problem: is there a tree T which includes these vertices, such that (T, C_i) is convex for i = 1,…,k?
[Figure: three two-color (R/B) colorings of the vertex set]
NP-hard in general; in P for some special cases.
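One well-known tractable case is when every character has only two states. Then, by the classical pairwise-compatibility result for binary characters, a perfect phylogeny exists exactly when no pair of characters shows all four state combinations across the species. The slides do not say which special cases they have in mind, so this sketch is offered as an assumption; the function name and the example colorings are hypothetical.

```python
from itertools import combinations

def binary_perfect_phylogeny_exists(characters):
    """Pairwise test for two-state characters.

    characters: list of equal-length strings, one per character;
    position i of each string is the state (color) of species i.
    """
    for c1, c2 in combinations(characters, 2):
        # All four combinations of states present: the two characters conflict.
        if len(set(zip(c1, c2))) == 4:
            return False
    return True
```

For example, the colorings `"RRBB"` and `"RBBB"` of four species are compatible, while `"RRBB"` and `"RBRB"` are not, since together they produce the pairs RR, RB, BR, and BB.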
48
Werner’s Syndrome
A successful application of genetic analysis for Gene Hunting
49
49 The Disease
- First references in 1960s
- Causes premature ageing
- Autosomal recessive
- Linkage studies from 1992
- WRN gene cloned in 1996
- Subsequent discovery of mechanisms involved in wild-type and mutant proteins
50
50 Identifying the Marker/s
Marker    Distance from prior    Distance from first
DHS133    -                      0.0
D8S136    7.6                    7.6
D8S137    7.4                    15.0
D8S131    0.9                    15.9
D8S339    6.7                    22.6
D8S259    1.6                    24.2
FGFR      2.5                    26.7
D8S255    2.8                    29.5
ANK       2.1                    31.6
PLAT      2.8                    34.4
D8S165    11.4                   45.8
D8S166    1.0                    46.8
D8S164    43.8                   90.6
- Match the most ‘likely’ cumulative distance against the cumulative distances from the marker file.
- Distance 22.6 cM (centimorgans) fell exactly on the marker D8S339.
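The "distance from first" column is the running sum of the "distance from prior" column, and the matching step is a nearest-value lookup. The sketch below reproduces both from the table values; the variable names and the helper `marker_at` are made up for this example, and the first marker's missing prior distance is taken as 0.0.

```python
from itertools import accumulate

markers = ["DHS133", "D8S136", "D8S137", "D8S131", "D8S339", "D8S259", "FGFR",
           "D8S255", "ANK", "PLAT", "D8S165", "D8S166", "D8S164"]
# Distance from the previous marker, in cM (0.0 assumed for the first marker).
from_prior = [0.0, 7.6, 7.4, 0.9, 6.7, 1.6, 2.5, 2.8, 2.1, 2.8, 11.4, 1.0, 43.8]

# Running sum, rounded to one decimal to hide floating-point noise.
from_first = [round(d, 1) for d in accumulate(from_prior)]

def marker_at(distance_cm):
    """Marker whose cumulative distance is closest to the given distance."""
    return min(zip(markers, from_first), key=lambda mc: abs(mc[1] - distance_cm))[0]
```

With these values `from_first` ends at 90.6, matching the last row of the table, and `marker_at(22.6)` returns `"D8S339"`.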
51
51 Locating D8S339
- The position of marker D8S339 was unknown.
- But the positions of the adjacent markers D8S131 and D8S259 were known.
- Recombination distances from D8S339 to both D8S131 and D8S259 are given.
- By assuming that recombination distance is proportional to physical distance, we estimate the position of D8S339 in the next drawing.
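Under the proportionality assumption, the estimate is a linear interpolation between the two flanking markers. The sketch below uses the recombination distances from the marker table (6.7 cM to D8S131 and 1.6 cM to D8S259); the physical positions in the usage note are placeholders, not the real coordinates, and the function name is hypothetical.

```python
def interpolate_position(pos_left, pos_right, cm_to_left, cm_to_right):
    """Estimate a physical position between two markers with known positions.

    Assumes recombination distance is proportional to physical distance.
    """
    fraction = cm_to_left / (cm_to_left + cm_to_right)  # share of the interval from the left marker
    return pos_left + fraction * (pos_right - pos_left)
```

For example, with made-up flanking positions 0 and 8,300,000 bp, `interpolate_position(0, 8_300_000, 6.7, 1.6)` places D8S339 at about 6,700,000 bp, that is, roughly 81% of the way from D8S131 to D8S259.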
52
52 Results
[Figure labels: D8S131 marker, known position; D8S259 marker, known position; D8S339, estimated position (1993); WRN, actual position (1996)]
http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr8:32213515-38608031
Linkage accuracy: ~1,250,000 bp