Welcome to CSE 527: Computational Biology


1 Welcome to CSE 527: Computational Biology
Lecture 1 – Sep 27, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022

2 Who is the instructor? Prof. Su-In Lee Research interests
Assistant Professor A joint faculty member Computer Science & Engineering, Genome Sciences Office hours: Wednesday 1:30-2:30 Research interests Developing machine learning techniques applied to Computational Biology (genetics, systems biology) Predictive Medicine, Translational Medicine

3 Teaching assistant Christopher Miles (CSE PhD student) Office: TBA
Office hours: Monday 1:30-2:30

4 What is the Coolest Thing a Computational/Mathematical Scientist Can Do?
Curing cancer. Understanding how the blueprint of life (DNA) determines important traits (e.g. diseases). Predicting your disease susceptibilities based on your biological information, including DNA sequence. Predicting sudden changes in the condition of patients in the ICU (intensive care unit). Determining the order of A, G, C and T in my 3-billion-long DNA sequence. … CSE 527 will provide you with basic concepts and ML/statistical techniques that you can use to realize these goals.

5 More and More of Biology is Becoming an Information Science
A cell’s biological state can be described by millions of numbers! Which biological discoveries we can make depends heavily on the computational method we use to analyze the data. Machine learning techniques provide very effective tools.
[Figure: the cell, the basic unit of life. DNA carries genes (~30,000 in human); under gene regulation, genes are transcribed into RNA (gene expression), RNA is translated into protein or degraded (RNA degradation), and genes form a gene interaction map. Example: RNA AUGUGGAUUGUU gives protein MWIV, AUGCGCGUC gives MRV, AUGUUACGCACCUAC gives MLRTY, AUGAUUGAU gives MID.]
Biological information (data): DNA sequence information; RNA levels of 30K genes; protein levels of 30K genes; DNA molecule’s 3D structure; …

6 Outline Course logistics
A zero-knowledge based introduction to biology Potential project topics

7 Goals of this course Introduction to Computational Biology
Basic concepts and scientific questions Basic biology for computational scientists In-depth coverage of ML techniques Current active areas of research Useful machine learning (ML) algorithms Probabilistic graphical models, clustering, classification Learning techniques (MLE, EM)

8 Topics in CSE 527 Part 1: Basic ML algorithms
Introduction to probabilistic models Bayesian networks, Hidden Markov models Representation and learning Part 2: Topics in computational biology and areas of active research Genetics, systems biology, predictive medicine, sequence analysis Finding genetic factors for complex biological traits Inferring biological networks from data Comparative genomics DNA/RNA sequence analysis

9 Course responsibilities
Class participation and attendance (10%) Good answers to the questions asked in class Initiating a productive discussion. Homework assignments (40%) Four problem sets Due at beginning of class Up to 3 late days (24-hr period) for the quarter Collaboration allowed Teams of 2 or 3 students Individual writeups Final project (50%) A group of up to two students.

10 Project overview (1/2) Topic Project deliverables
Choose from the list of project topics on the course website, or come up with your own. Open-ended Project deliverables Project proposal (due 10/19) Midterm report (due 11/16) Final report (due 12/14) Final presentations or poster session (12/7)

11 Project overview (2/2) Final report Short report (up to 10 pages)
Conference-style presentation Successful project reports can be submitted to computational biology/ ML conferences (ISMB, RECOMB, NIPS, ICML) Or journals (PLoS journals, Nature journals, PNAS, Genome Research and so on)

12 Reading material Lecture notes Biological background
Mostly based on recent papers & old seminal papers. Biological background: The Cell: A Molecular Approach by Cooper; Genetics: From Genes to Genomes by Hartwell, and more; Principles of Population Genetics by Hartl & Clark. Computational background: Probabilistic Graphical Models by Profs. Daphne Koller & Nir Friedman; Prof. Andrew Ng’s machine learning lecture notes (cs229.stanford.edu). No textbook required for the course.

13 Class resources Course website – cs.washington.edu/527 Mailing list
Lecture notes, assignments, project topics Deadlines of assignments and projects Mailing list

14 Outline Course logistics
A zero-knowledge based introduction to biology Prepared by George Asimenos (PhD student, Stanford) for CS262 Computational Genomics by Prof. Serafim Batzoglou (Stanford). Potential project topics

15 Cells: Building Blocks of Life
cell, nucleus, cytoplasm, mitochondrion
Eukaryotes: Plants, animals, humans. DNA resides in the nucleus. Contain other compartments for other specialized functions.
Prokaryotes: Bacteria. Do not contain compartments. Little recognizable substructure.
Humans have 100 trillion cells (10^14).
Prokaryotes: Bacteria and archaea. Lack a cell nucleus (and are usually unicellular). Instead they have a nucleoid, which is where all the genetic material is (usually circular double-stranded DNA). Eukaryotes have a nucleus.
Cytoplasm: Entire contents of the cell within the plasma membrane: cytosol, organelles, and inclusions (random garbage like silicon dioxide just floating around).
Cytoskeleton: Provides structure for the cell.
ER: Does protein translation (ribosomes stuck on rough ER), protein folding and transport.
Extracellular matrix: Provides structural support for cells.
Golgi apparatus: Modifies proteins delivered from rough ER, transports lipids, creates lysosomes.
Lysosome: Digests foreign bodies and old organelles.
Mitochondria: Produce energy (ATP) via the citric acid cycle. Have their own genome.
Nucleus: Holds all chromosomes.
Peroxisome: Breaks down fatty acid molecules.
Plasma membrane: Like skin, protects the inner contents of the cell from the outside environment.
© Coriell Institute for Medical Research

16 DNA: “Blueprints” for a cell
Genetic information encoded in long strings of double-stranded DNA Deoxyribo Nucleic Acid comes in only four flavors: Adenine, Cytosine, Guanine, Thymine

17 Deoxyribose, nucleotide, base, A, C, G, T, 3’, 5’
[Figure: a nucleotide: a phosphate group (bond to previous nucleotide), a deoxyribose sugar with its 5’ and 3’ carbons labeled (3’ bond to next nucleotide), and a bond to the base.]
Bases: Adenine (A), Guanine (G), Thymine (T), Cytosine (C). Let’s write “AGACC”!

18 “AGACC” (backbone)

19 “AGACC” (DNA) deoxyribonucleic acid (DNA) 3’ 5’ 5’ 3’

20 DNA is double stranded
strand, reverse complement
5’ AGACC 3’
3’ TCTGG 5’
DNA is always written 5’ to 3’: AGACC or GGTCT
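
The AGACC / GGTCT example on this slide can be reproduced with a few lines of Python. This is only a minimal sketch; the function names `complement` and `reverse_complement` are illustrative choices, not part of the lecture.

```python
# Watson-Crick base pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the paired bases position by position (reads 3' to 5')."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand written in the usual 5' to 3' direction."""
    return complement(strand)[::-1]


print(complement("AGACC"))          # TCTGG (the lower strand, read 3' to 5')
print(reverse_complement("AGACC"))  # GGTCT (the same strand, read 5' to 3')
```

Reversing after complementing is what turns the 3’ to 5’ reading TCTGG into the conventional 5’ to 3’ spelling GGTCT.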

21 DNA Packaging
histone, nucleosome, chromatin, chromosome, centromere, telomere
[Figure labels: telomere, centromere, nucleosome, H1, DNA, H2A, H2B, H3, H4, chromatin.]
DNA wraps around the histone octamer, making 1 3/4 turns around the protein complex. The amount of DNA associated with the histone octamer is ~146 bp. The octamer plus the DNA comprise what is called the nucleosome core.

22 The Genome
The genome is the full set of hereditary information for an organism. Humans bundle two copies of the genome into 46 chromosomes in every cell: 46 = 2 x (22 + X/Y)

23 Building an organism Every cell has the same sequence of DNA
Subsets of the DNA sequence determine the identity and function of different cells

24 From DNA To Organism ? Proteins do most of the work in biology, and are encoded by subsequences of DNA, known as genes.

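25 RNA
ribonucleotide, U
[Figure: a ribonucleotide: a phosphate group (bond to previous ribonucleotide), a ribose sugar with its 5’ and 3’ carbons labeled and an OH group (3’ bond to next ribonucleotide), and a bond to the base.]
Bases: Adenine (A), Guanine (G), Uracil (U), Cytosine (C). T → U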

26 Genes & Proteins gene, transcription, translation, protein
Double-stranded DNA
5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’
(transcription)
Single-stranded RNA
AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
(translation)
protein

27 Gene Transcription
promoter
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Transcription occurs at a rate of ~20-50 nt per second

28 Gene Transcription
transcription factor, binding site, RNA polymerase
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
Transcription factors (TFs): a type of protein that binds to DNA and helps initiate gene transcription. Transcription factor binding sites: short sequences of DNA (6-20 bp) recognized and bound by TFs. RNA polymerase binds a complex of TFs in the promoter.

29 Gene Transcription
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
The two strands are separated

30 Gene Transcription
5’ G A T T A C A . . . 3’
5’ G A U U A C A 3’ (RNA)
3’ C T A A T G T . . . 5’
An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template

31 Gene Transcription
5’ G A T T A C A . . . 3’
3’ C T A A T G T . . . 5’
pre-mRNA: 5’ G A U U A C A . . . 3’
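
The GATTACA slides can be mimicked in code. This is a rough sketch, assuming the template strand is given in its 3’ to 5’ reading; the function names are made up for illustration, and the real process (promoter recognition, RNA polymerase, termination) is of course not modeled.

```python
# Pairing used when RNA is built against a DNA template: A->U, T->A, C->G, G->C.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_from_template(template_3_to_5: str) -> str:
    """Build the RNA (5' to 3') by pairing against the template read 3' to 5'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3_to_5)


def transcribe_from_coding(coding_5_to_3: str) -> str:
    """Shortcut: the RNA equals the 5'->3' (coding) strand with T replaced by U."""
    return coding_5_to_3.replace("T", "U")


print(transcribe_from_template("CTAATGT"))  # GAUUACA
print(transcribe_from_coding("GATTACA"))    # GAUUACA
```

Both routes give the same RNA, which is why the slide can describe the transcript as a copy of the 5’→3’ sequence even though it is synthesized from the other strand.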

32 RNA Processing
5’ cap, polyadenylation, exon, intron, splicing, UTR, mRNA
[Figure labels: 5’ cap, 5’ UTR, exon, intron, 3’ UTR, poly(A) tail, mRNA.]
The 5’ cap is a specially altered nucleotide on the 5’ end of precursor messenger RNA and some other primary RNA transcripts as found in eukaryotes.

33 Gene Structure
[Figure: a gene drawn 5’ to 3’, with labels promoter, 5’ UTR, exons, introns, 3’ UTR, and regions marked coding or non-coding.]

34 How many? (Human Genome)
Genes: ~20,000. Exons per gene: ~8 on average (max: 148). Nucleotides per exon: 170 on average (max: 12k). Nucleotides per intron: 5,500 on average (max: 500k). Nucleotides per gene: 45k on average (max: 2.2M).

35 From RNA to Protein
Proteins are long strings of amino acids joined by peptide bonds. Translation from RNA sequence to amino acid sequence is performed by ribosomes. 20 amino acids → 3 RNA letters required to specify a single amino acid.
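
Why three letters? With a 4-letter RNA alphabet, words of length k can spell 4^k distinct codes, so the counting argument behind the slide can be checked directly. A small sketch (the function name is an illustrative choice):

```python
def min_codon_length(n_amino_acids: int, alphabet_size: int = 4) -> int:
    """Smallest word length k such that alphabet_size**k >= n_amino_acids."""
    k = 1
    while alphabet_size ** k < n_amino_acids:
        k += 1
    return k


# 4**1 = 4 and 4**2 = 16 are too few for 20 amino acids; 4**3 = 64 is enough.
print(min_codon_length(20))  # 3
```

The 64 available triplets exceed the 20 amino acids, which leaves room for several codons to encode the same amino acid and for stop codons.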

36 Amino acid
There are 20 standard amino acids:
Alanine, Arginine, Asparagine, Aspartate, Cysteine, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine
[Figure: general structure of an amino acid, H2N-CH(R)-COOH, where R is the side chain.]

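37 Proteins
N-terminus, C-terminus
[Figure: an amino acid (aa) within a protein chain, bonded to the previous aa on one side and to the next aa on the other.]
N-terminus (start) to C-terminus (end), read from 5’ to 3’ along the mRNA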

38 Translation ribosome, codon mRNA P site A site
The ribosome (a complex of protein and RNA) synthesizes a protein by reading the mRNA in triplets (codons). Each codon is translated to an amino acid.

39 The genetic code Mapping from a codon to an amino acid

40 Translation
5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’
A U U: UTR. A U G: start codon (Met). G C C: Ala. U G G: Trp. A C U: Thr. U G A: stop codon.
Translation occurs at a rate of ~15-20 AA per second
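
The translation on this slide can be replayed in code. The sketch below builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, `*` for stop) and makes a simplifying assumption that translation starts at the first AUG found; the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first stop codon (one-letter codes)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon: nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# The slide's example: UTR (AUU), then Met-Ala-Trp-Thr, then stop (UGA).
print(translate("AUUAUGGCCUGGACUUGA"))  # MAWT
```

M, A, W and T are the one-letter codes for Met, Ala, Trp and Thr, matching the slide.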

41 Translation
amino acid, tRNA, polysomes
5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’
[Figure: tRNAs carrying the amino acids Met, Ala and Trp pair with successive codons of the mRNA.]

42 Errors? mutation What if the transcription / translation machinery makes mistakes? What is the effect of mutations in coding regions? mutations are changes in DNA/RNA (failure to repair)

43 Reading Frames
G C U U G U U U A C G A A U U A G

44 Synonymous Mutation
synonymous (silent) mutation, fourfold site
G C U U G U U U A C G A A U U A G: Ala Cys Leu Arg Ile
G C U U G U U U G C G A A U U A G: Ala Cys Leu Arg Ile

45 Missense Mutation
G C U U G U U U A C G A A U U A G: Ala Cys Leu Arg Ile
G C U U G G U U A C G A A U U A G: Ala Trp Leu Arg Ile

46 Nonsense Mutation
G C U U G U U U A C G A A U U A G: Ala Cys Leu Arg Ile
G C U U G A U U A C G A A U U A G: Ala STOP

47 Frameshift
G C U U G U U U A C G A A U U A G: Ala Cys Leu Arg Ile
G C U U G U U A C G A A U U A G: Ala Cys Tyr Glu Leu
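
The four mutation slides can be checked mechanically by translating the reference and mutated sequences and comparing the results. This is a simplified sketch: it reads from position 0 in a fixed frame, ignores any trailing partial codon, and treats every length change that is not a multiple of 3 as a frameshift. The names `translate_frame` and `classify_mutation` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(rna: str) -> str:
    """Translate from position 0; '*' marks a stop codon and ends translation."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


def classify_mutation(reference: str, mutant: str) -> str:
    """Label a mutation by its effect on the translated protein (simplified)."""
    if (len(mutant) - len(reference)) % 3 != 0:
        return "frameshift"
    ref_protein = translate_frame(reference)
    mut_protein = translate_frame(mutant)
    if ref_protein == mut_protein:
        return "synonymous"
    if "*" in mut_protein and "*" not in ref_protein:
        return "nonsense"
    return "missense"


REFERENCE = "GCUUGUUUACGAAUUAG"  # Ala Cys Leu Arg Ile
for mutant in ("GCUUGUUUGCGAAUUAG", "GCUUGGUUACGAAUUAG",
               "GCUUGAUUACGAAUUAG", "GCUUGUUACGAAUUAG"):
    print(mutant, classify_mutation(REFERENCE, mutant))
```

The four mutants are the sequences from slides 44 to 47, and the loop labels them synonymous, missense, nonsense and frameshift in that order.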

48 Transcription and translation
Let’s see how this happens! Transcription: Translation: Illustration from Radboud University Nijmegen

49 Gene Expression Regulation
Regulation, signal transduction When should each gene be expressed? Regulate gene expression Examples: Make more of gene A when substance X is present Stop making gene B once you have enough Make genes C1, C2, C3 simultaneously Why? Every cell has same DNA but each cell expresses different proteins. Signal transduction: One signal converted to another Cascade has “master regulators” turning on many proteins, which in turn each turn on many proteins, ...

50 Gene Regulation Gene expression is controlled at many levels
Levels of control: DNA chromatin structure, transcription, post-transcriptional modification, RNA transport, translation, mRNA degradation, post-translational modification.
Post-transcriptional modification: Process in which primary transcript RNA is converted into mature RNA (e.g. splicing, 5’ capping, 3’ polyadenylation).
RNA transport: Transcription occurs in the nucleus (where the chromosomes are). Translation occurs in the cytoplasm.
Translation: Ribosome picks which mRNAs to translate next.
mRNA degradation: After a certain amount of time (dependent on the particular mRNA; in mammals, several minutes to days), the mRNA degrades and can no longer be translated into protein.
Post-translational modification: Attaches functional groups (lipids, carbohydrates, phosphates) to existing proteins, enzymes cleave proteins, etc.

51 Transcription regulation
Much gene regulation occurs at the level of transcription. Primary players: Binding sites (BS) in cis-regulatory modules (CRMs) Transcription factor (TF) proteins RNA polymerase II Primary mechanism: TFs link to BSs Complex of TFs forms Complex assists or inhibits formation of the RNA polymerase II machinery

52 Transcription Factor Binding Sites
Short, degenerate DNA sequences recognized by particular TFs For complex organisms, cooperative binding of multiple TFs required to initiate transcription Binding Sequence Logo
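
In a sequence logo, the height of each column is commonly the information content of that position in bits. The sketch below computes it from a handful of aligned binding sites. The sites are made up purely for illustration, and the calculation omits the small-sample correction that logo tools usually apply.

```python
import math
from collections import Counter


def information_content(sites):
    """Per-column information in bits: 2 - entropy, for a 4-letter DNA alphabet."""
    n_sites = len(sites)
    bits = []
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        entropy = -sum(
            (c / n_sites) * math.log2(c / n_sites) for c in counts.values()
        )
        bits.append(2.0 - entropy)
    return bits


# Hypothetical aligned binding sites for one TF (not real data).
example_sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
for position, value in enumerate(information_content(example_sites), start=1):
    print(position, round(value, 2))
```

Fully conserved positions reach the maximum of 2 bits, while degenerate positions score lower, which is what makes them appear shorter in the logo.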

53 Summary All hereditary information encoded in double-stranded DNA
Each cell in an organism has the same DNA. DNA → RNA → protein. Proteins have many diverse roles in the cell. Gene regulation diversifies protein products within different cells.

54 Outline Course logistics
A zero-knowledge based introduction to biology Potential project topics

55 Which Drug Patient X Should Be Treated With?
Example project topic #1 (1/3)
Say that a cancer patient X undergoes chemotherapy. There are >200 drugs patient X can be treated with. How do doctors choose which drug to use in chemotherapy treatment?
Chemotherapy drugs: 5-Iodotubercidin, Acrichine, ARQ-197, Arsenic trioxide, AS101, AS, AT-7519, Axitinib, Azacitidine, …
[Figure labels: Patient X, A few histologic features, Follicular lymphoma, Diffuse large B cell lymphoma.]
How can we improve this?
Note: the final project is an important part of this course (50% of the grade).

56 Which Drug Patient X Should Be Treated With?
Example project topic #1 (2/3)
Beyond a few histologic features, much more data is available for patient X: DNA sequence (…ACGTAGCTAGCTAGCTAGCTGATGCTAGCTACGTGCT…), epigenetics (methylation), RNA levels of genes, protein levels of genes.
Chemotherapy drugs: 5-Iodotubercidin, Acrichine, ARQ-197, Arsenic trioxide, AS101, AS, AT-7519, Axitinib, Azacitidine, …
[Figure labels: Patient X, A few histologic features, Follicular lymphoma, Diffuse large B cell lymphoma.]
Doctors cannot handle millions of numbers! How about computers?

57 Let’s Build a Prediction Model
Example project topic #1 (3/3)
This is a pure machine learning problem!
Goal: realizing personalized cancer treatment.
Features: RNA levels of 30,000 genes (g1, g2, …, g30,000) in cancer cells. 30,000 features! (feature selection)
Data: publicly available RNA level data (>3000 patients); ~100 patients at UWMC who are fighting cancer. Transfer learning, feature reconstruction.
Outputs: drug sensitivity test on 160 drugs (Drug 2, Drug 3, …, Drug i, …, Drug 160).
Prior knowledge on drugs’ targets.
In collaboration with Tony Blau, Pam Becker, Ray Monnat, David Hawkins (Medicine)
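
To make the "feature selection" note concrete, here is one very simple baseline: rank genes by the absolute correlation between their RNA level and the measured drug sensitivity. This is only an illustrative sketch on synthetic data, not the method used in the project; the sizes are scaled down (200 genes instead of 30,000), and gene index 7 is planted as the one informative feature.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_genes(expression, response, top_k):
    """Return indices of the top_k genes by |correlation| with the response.

    expression[p][g] is the RNA level of gene g in patient p.
    """
    n_genes = len(expression[0])
    scores = []
    for g in range(n_genes):
        gene_levels = [row[g] for row in expression]
        scores.append((abs(pearson(gene_levels, response)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:top_k]]


# Synthetic data (assumption): 100 patients, 200 genes, gene 7 drives sensitivity.
random.seed(0)
N_PATIENTS, N_GENES, INFORMATIVE_GENE = 100, 200, 7
expression = [[random.gauss(0, 1) for _ in range(N_GENES)] for _ in range(N_PATIENTS)]
sensitivity = [row[INFORMATIVE_GENE] + random.gauss(0, 0.1) for row in expression]

top_genes = rank_genes(expression, sensitivity, top_k=5)
print(top_genes)
```

With far more features than patients, some uninformative genes will look correlated by chance, which is one reason the slide points to transfer learning from the larger public data set and to prior knowledge on drug targets.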

58 How Well Can We Predict Disease- related Traits Based on DNA?
Example project topic #2 (1/2)
DNA sequences of N instances (Individual1, Individual2, Individual3, …, IndividualN-1, IndividualN), with p ≈ 10^6 sequence variations s1, s2, …, sp:
…ACTCGGTAGACCTAAATTCGGCCCGG…
…ACCCGGTAGACCTTTATTCGGCCCGG…
…ACCCGGTAGACCTTAATTCGGCCGGG…
:
…ACCCGGTAGTCCTATATTCGGCCCGG…
…ACTCGGTAGTCCTATATTCGGCCGGG…
Trait: obesity.
Standard approach: find a simple rule! For example, A → thin, T → fat. Causality?
The cell is a complex system, and there are environmental factors. Too weak to be detected?
The standard approach failed to detect the DNA affecting many important traits.
One of the most important research problems in this area is to develop new computational methods that can represent more complicated interaction between sequence variation and trait.

59 How Well Can We Predict Disease-related Traits Based on DNA?
Example project topic #2 (2/2)
~2000 subjects, longitudinal study.
Sequence information: s1, s2, s3, s4, …, sP; p ≈ 10^6! (feature selection)
…ACTCGGACCTAAATCCCG…
…ACCCGGACCTTAATGCGG…
…ACCCGGACCTATATGCCG…
…ACCCGGACCTTTATGCCG…
…ACCCGGTCCTATATGCCG…
…ACTCGGTCCTTAATGCGG…
…ACTCGGTCCTATATGCGG…
:
Environmental factors: age, sex, smoking status.
Phenotype data at Year 0 and at Year 25: cholesterol, fatty acid, glucose, insulin, …
Age-specific genetic influence. Structural learning.
In collaboration with Alex Reiner (Epidemiology)

60 More project topics at the course website!
Questions?

