Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TA: Eugene Fratkin Tuesday&Thursday 2:45-4:00 Skilling Auditorium.



2 Welcome to CS262: Computational Genomics
Instructor: Serafim Batzoglou
TA: Eugene Fratkin
Tuesday & Thursday 2:45-4:00, Skilling Auditorium

3 Goals of this course
Introduction to Computational Biology & Genomics
  - Basic concepts and scientific questions
  - Why does it matter?
  - Basic biology for computer scientists
  - In-depth coverage of algorithmic techniques
  - Current active areas of research
Useful algorithms
  - Dynamic programming
  - String algorithms
  - HMMs and other graphical models for sequence analysis

4 Topics in CS262
Part 1: Basic Algorithms
  - Sequence Alignment & Dynamic Programming
  - Hidden Markov models, Context Free Grammars, Conditional Random Fields
Part 2: Topics in computational genomics and areas of active research
  - DNA sequencing
  - Comparative genomics
  - Genes: finding genes, gene regulation
  - Proteins, families, and evolution
  - Networks of protein interactions

5 Course responsibilities
Homeworks
  - 4 challenging problem sets, 4-5 problems/pset
      Due at beginning of class
      Up to 3 late days (24-hr periods) for the quarter
  - Collaboration allowed; please give credit
      Teams of 2 or 3 students
      Individual writeups
      If individual (no team), then drop score of worst problem per problem set
(Optional) Scribing
  - Due one week after the lecture, except by special permission
  - Scribing grade replaces 2 lowest problems from all problem sets
      First come, first served; email staff list to sign up

6 Reading material
Books
  - “Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison
      Chapters 1-4, 6, 7-8, 9-10
  - “Algorithms on strings, trees, and sequences” by Gusfield
      Chapters 5-7, 11-12, 13, 14, 17
Papers
Lecture notes

7 Birth of Molecular Biology
[Figure labels: DNA; Phosphate Group; Sugar; Nitrogenous Base (A, C, G, T); Physicist; Ornithologist]

8 [Figure labels: DNA: T C A C T G G C G A G T C A G C; RNA: G A G U C A G C]
Base pairing: A - T, G - C; in RNA, T → U

9 DNA
DNA is written 5’ to 3’ by convention.
AGACC = GGTCT (the same double-stranded DNA, read 5’ to 3’ from either strand)
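The identity AGACC = GGTCT on the slide above can be checked with a short sketch. The function name `reverse_complement` and the lookup table are illustrative choices, not something taken from the lecture; the sketch assumes an uppercase or lowercase string over A, C, G, T only.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3' (hypothetical helper)."""
    # Reading the partner strand 5' to 3' means walking the input backwards.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original sequence, which is one way to see that both spellings describe the same molecule.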

10 Chromosomes
[Figure labels: H1; DNA; H2A, H2B, H3, H4; ~146 bp; nucleosome; chromatin; centromere; telomere]
In humans: 2x22 autosomes; X, Y sex chromosomes

11 The Genetic Dogma
Double-stranded DNA:
5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’
(transcription)
Single-stranded RNA:
AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
(translation)
protein

12 DNA to RNA to Protein to Cell
DNA, ~3x10^9 nucleotides long in humans
Contains ~22,000 genes
[Figure labels: messenger-RNA G A G U C A G C; transcription; translation; folding]

13 Gene Transcription
5’ G A T T A C A ... 3’
3’ C T A A T G T ... 5’

14 Gene Transcription
The promoter lies upstream of a gene.
Transcription factors recognize transcription factor binding sites and bind to them, forming a complex.
RNA polymerase binds the complex.
5’ G A T T A C A ... 3’
3’ C T A A T G T ... 5’

15 Gene Transcription
The two strands are separated.
5’ G A T T A C A ... 3’
3’ C T A A T G T ... 5’

16 Gene Transcription
An RNA copy of the 5’ → 3’ sequence is created from the 3’ → 5’ template.
5’ G A T T A C A ... 3’
3’ C T A A T G T ... 5’
RNA: G A U U A C A

17 Gene Transcription
5’ G A T T A C A ... 3’
3’ C T A A T G T ... 5’
pre-mRNA: 5’ G A U U A C A ... 3’
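The transcription step on the slides above (an RNA copy of the 5’ → 3’ strand built from the 3’ → 5’ template) can be sketched as base-by-base pairing. The name `transcribe` and the pairing table are illustrative assumptions; the sketch ignores promoters, transcription factors, and RNA processing entirely.

```python
# RNA base that pairs with each DNA template base; U takes the place of T in RNA.
RNA_PAIR = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe(template_3to5: str) -> str:
    """Build the RNA 5'->3' from a template strand given in 3'->5' order (hypothetical helper)."""
    # Walking the template 3'->5' produces the RNA in 5'->3' order,
    # so no reversal is needed here.
    return "".join(RNA_PAIR[base] for base in template_3to5.upper())


print(transcribe("CTAATGT"))  # GAUUACA
```

The result is the 5’ → 3’ DNA strand GATTACA with every T written as U, which matches the RNA shown on the slides.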

18 RNA Processing
[Figure labels: pre-mRNA; exon; intron; mRNA; 5’ cap; 5’ UTR; 3’ UTR; poly(A) tail]

19 Gene Structure
[Figure labels: 5’; 3’; promoter; 5’ UTR; exons; introns; 3’ UTR; coding; non-coding]

20 How many?
Genes: ~22,000 in the human genome
Exons per gene: ~8 on average (max: 148)
Nucleotides per exon: 170 on average (max: 12k)
Nucleotides per intron: 5,500 on average (max: 500k)
Nucleotides per gene: 45k on average (max: 2.2M)

21 Proteins
Composed of a chain of amino acids.

     R
     |
H2N--C--COOH
     |
     H

R: 20 possible groups
Alanine, Arginine, Asparagine, Aspartate, Cysteine, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine

22 Proteins

     R                  R
     |                  |
H2N--C--COOH       H2N--C--COOH
     |                  |
     H                  H

23 Dipeptide

     R  O      R
     |  ||     |
H2N--C--C--NH--C--COOH
     |         |
     H         H

The C--NH link is a peptide bond.

24 Protein structure Linear sequence of amino acids folds to form a complex 3-D structure The structure of a protein is intimately connected to its function

25 Translation
The ribosome synthesizes a protein by reading the mRNA in triplets (codons).
Each codon is translated to an amino acid.
[Figure labels: mRNA; P site; A site]

26 The Genetic Code
(Rows: first base. Columns: second base. Last column: third base.)

1st | U                         | C                   | A                       | G                    | 3rd
U   | UUU Phenylalanine (Phe)   | UCU Serine (Ser)    | UAU Tyrosine (Tyr)      | UGU Cysteine (Cys)   | U
U   | UUC Phe                   | UCC Ser             | UAC Tyr                 | UGC Cys              | C
U   | UUA Leucine (Leu)         | UCA Ser             | UAA STOP                | UGA STOP             | A
U   | UUG Leu                   | UCG Ser             | UAG STOP                | UGG Tryptophan (Trp) | G
C   | CUU Leucine (Leu)         | CCU Proline (Pro)   | CAU Histidine (His)     | CGU Arginine (Arg)   | U
C   | CUC Leu                   | CCC Pro             | CAC His                 | CGC Arg              | C
C   | CUA Leu                   | CCA Pro             | CAA Glutamine (Gln)     | CGA Arg              | A
C   | CUG Leu                   | CCG Pro             | CAG Gln                 | CGG Arg              | G
A   | AUU Isoleucine (Ile)      | ACU Threonine (Thr) | AAU Asparagine (Asn)    | AGU Serine (Ser)     | U
A   | AUC Ile                   | ACC Thr             | AAC Asn                 | AGC Ser              | C
A   | AUA Ile                   | ACA Thr             | AAA Lysine (Lys)        | AGA Arginine (Arg)   | A
A   | AUG Methionine (Met) or START | ACG Thr         | AAG Lys                 | AGG Arg              | G
G   | GUU Valine (Val)          | GCU Alanine (Ala)   | GAU Aspartic acid (Asp) | GGU Glycine (Gly)    | U
G   | GUC Val                   | GCC Ala             | GAC Asp                 | GGC Gly              | C
G   | GUA Val                   | GCA Ala             | GAA Glutamic acid (Glu) | GGA Gly              | A
G   | GUG Val                   | GCG Ala             | GAG Glu                 | GGG Gly              | G

27 Translation (tRNA)
[Figure labels: tRNA; Tryptophan; anticodon C C A]

28 Translation
5’ ... A U U A U G G C C U G G A C U U G A ... 3’
[Figure labels: UTR; Start Codon; Met; Ala; Trp; Thr]

29 Translation 5’... A U U A U G G C C U G G A C U U G A... 3’

30 Translation
5’ ... A U U A U G G C C U G G A C U U G A ... 3’
[Figure labels: Met; Ala; Trp]
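The codon-by-codon reading shown on the translation slides can be sketched in a few lines. This is a simplified illustration, not a model of the ribosome: it assumes translation starts at the first AUG in the string and stops at the first in-frame stop codon. The names `CODON_TABLE` and `translate` are hypothetical, and amino acids are written with one-letter codes (M = Met, A = Ala, W = Trp, T = Thr) to keep the table compact.

```python
# Standard genetic code in the same U, C, A, G order as the slide's table.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""  # no start codon found
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[pos:pos + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


print(translate("AUUAUGGCCUGGACUUGA"))  # MAWT, i.e. Met Ala Trp Thr
```

On the slide's example the leading AUU is skipped as untranslated sequence, and UGA ends the protein after Thr.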

31 Errors? What if the transcription / translation machinery makes mistakes? What is the effect of mutations in coding regions?

32 Reading Frames G C U U G U U U A C G A A U U A G
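A short sketch of what the three reading frames of this sequence look like when split into codons. The function name `reading_frames` is an illustrative choice; only the forward strand is considered, and trailing bases that do not fill a codon are dropped.

```python
def reading_frames(rna: str) -> list[list[str]]:
    """Split a sequence into codons at offsets 0, 1 and 2 (hypothetical helper)."""
    frames = []
    for offset in range(3):
        # Step through the sequence three bases at a time from this offset.
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames


for offset, codons in enumerate(reading_frames("GCUUGUUUACGAAUUAG")):
    print(offset, " ".join(codons))
```

Each offset groups the same bases into different codons, so the same RNA can encode three different amino acid sequences depending on where reading starts.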

33 Synonymous Mutation
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutated:  GCU UGU UUG CGA AUU AG    Ala Cys Leu Arg Ile
(A replaced by G in the third codon; the amino acid is unchanged)

34 Missense Mutation
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutated:  GCU UGG UUA CGA AUU AG    Ala Trp Leu Arg Ile
(U replaced by G in the second codon; Cys becomes Trp)

35 Nonsense Mutation
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutated:  GCU UGA UUA CGA AUU AG    Ala STOP
(U replaced by A in the second codon; Cys becomes a stop codon)

36 Frameshift
Original: GCU UGU UUA CGA AUU AG    Ala Cys Leu Arg Ile
Mutated:  GCU UGU UAC GAA UUA G     Ala Cys Tyr Glu Leu
(one U deleted; every codon after the deletion is read differently)
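The four mutation types on the preceding slides can be told apart by comparing translations before and after the change. The sketch below is a rough classifier under simplifying assumptions: both sequences are read from position 0 in frame, and any length change that is not a multiple of three is called a frameshift. The names `translate_frame` and `classify_mutation`, and the extra label "in-frame indel", are hypothetical and do not appear in the lecture.

```python
# Standard genetic code in U, C, A, G order; '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_frame(rna: str) -> str:
    """Translate from position 0, one letter per codon, ending after the first stop."""
    protein = []
    for pos in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[pos:pos + 3]]
        protein.append(amino)
        if amino == "*":
            break
    return "".join(protein)


def classify_mutation(original: str, mutated: str) -> str:
    """Label a change with the categories named on the slides."""
    if len(original) != len(mutated):
        # A length change that is not a multiple of 3 shifts the reading frame.
        return "frameshift" if (len(original) - len(mutated)) % 3 else "in-frame indel"
    before, after = translate_frame(original), translate_frame(mutated)
    if before == after:
        return "synonymous"
    # A stop codon that appears earlier than in the original truncates the protein.
    if "*" in after and ("*" not in before or after.index("*") < before.index("*")):
        return "nonsense"
    return "missense"


original = "GCUUGUUUACGAAUUAG"
for mutated in ("GCUUGUUUGCGAAUUAG", "GCUUGGUUACGAAUUAG",
                "GCUUGAUUACGAAUUAG", "GCUUGUUACGAAUUAG"):
    print(mutated, classify_mutation(original, mutated))
```

Run on the four slide examples, the loop prints synonymous, missense, nonsense, and frameshift in that order.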

37 Noncoding RNA
5’ G A T T A C A ... 3’
3’ C T A A T G T ... 5’
RNA: 5’ G A U U A C A ... 3’

38 Genetics in the 20th Century

39 21st Century
AGTAGCACAGACTACGACGAGA CGATCGTGCGAGCGACGGCGTA GTGTGCTGTACTGTCGTGTGTG TGTACTCTCCTCTCTCTAGTCT ACGTGCTGTATGCGTTAGTGTC GTCGTCTAGTAGTCGCGATGCT CTGATGTTAGAGGATGCACGAT GCTGCTGCTACTAGCGTGCTGC TGCGATGTAGCTGTCGTACGTG TAGTGTGCTGTAAGTCGAGTGT AGCTGGCGATGTATCGTGGT
AGTAGGACAGACTACGACGAGACGAT CGTGCGAGCGACGGCGTAGTGTGCTG TACTGTCGTGTGTGTGTACTCTCCTC TCTCTAGTCTACGTGCTGTATGCGTT AGTGTCGTCGTCTAGTAGTCGCGATG CTCTGATGTTAGAGGATGCACGATGC TGCTGCTACTAGCGTGCTGCTGCGAT GTAGCTGTCGTACGTGTAGTGTGCTG TAAGTCGAGTGTAGCTGGCGATGTAT CGTGGT

40 Computational Biology
Organize & analyze massive amounts of biological data
  - Enable biologists to use data
  - Form testable hypotheses
  - Discover new biology
AGTAGCACAGACTACGACGAGA CGATCGTGCGAGCGACGGCGTA GTGTGCTGTACTGTCGTGTGTG TGTACTCTCCTCTCTCTAGTCT ACGTGCTGTATGCGTTAGTGTC GTCGTCTAGTAGTCGCGATGCT CTGATGTTAGAGGATGCACGAT GCTGCTGCTACTAGCGTGCTGC TGCGATGTAGCTGTCGTACGTG TAGTGTGCTGTAAGTCGAGTGT AGCTGGCGATGTATCGTGGT

41 DNA to RNA to Protein to Cell
DNA, ~3x10^9 nucleotides long in humans
Contains ~22,000 genes
[Figure labels: messenger-RNA G A G U C A G C; transcription; translation; folding]

42 Some Topics in CS262
1. Sequencing
AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
3x10^9 nucleotides
~500 nucleotides

43 Some Topics in CS262
1. Sequencing
AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
3x10^9 nucleotides
Computational Fragment Assembly
  - Introduced ~1980
  - 1995: assemble DNA pieces up to 1,000,000 nucleotides long
  - 2000: assemble whole human genome
A big puzzle: ~60 million pieces
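The numbers on the sequencing slides can be tied together with a quick back-of-the-envelope calculation. This assumes the ~60 million puzzle pieces are fragments of roughly 500 nucleotides each; the slides give both figures but do not state that connection explicitly. The function name `coverage` is an illustrative choice.

```python
def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of times each genome position is covered by a fragment."""
    return num_reads * read_length / genome_length


GENOME_LENGTH = 3_000_000_000  # ~3x10^9 nucleotides (from the slides)
READ_LENGTH = 500              # ~500 nucleotides per fragment (from the slides)
NUM_READS = 60_000_000         # ~60 million pieces (from the slides)

print(coverage(NUM_READS, READ_LENGTH, GENOME_LENGTH))  # 10.0
print(GENOME_LENGTH // READ_LENGTH)  # 6000000 fragments to cover the genome once
```

Under that assumption each position is sequenced about ten times on average, and that redundancy is what lets overlapping fragments be pieced back together.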

44 Complete genomes today More than 300 complete genomes have been sequenced

45 2. Gene Finding
Where are the genes?
In humans: ~22,000 genes, ~1.5% of human DNA

46 atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag
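The signals listed above include a start codon (atg) and a stop codon (tga). As a very rough illustration of signal-based gene finding, the sketch below scans the three forward reading frames for stretches that begin with ATG and end at an in-frame stop. This is not the gene-finding method taught in the course: it ignores introns, splice signals such as gt and ag, and the reverse strand, so it would miss most human genes. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ATG...stop stretches, end exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current stretch
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Keep the stretch if it has enough codons before the stop.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


dna = "ccatggcctggacttgacc"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

On the toy sequence this reports a single stretch, atggcctggacttga, at positions 2 to 17.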

47 3. Molecular Evolution

48 Evolution at the DNA level
[Figure labels: OK; X; X; Still OK?; next generation]

49 4. Sequence Comparison
Sequence conservation implies function
Sequence comparison is key to
  - Finding genes
  - Determining function
  - Uncovering the evolutionary processes

50 Sequence Comparison: Alignment
Input sequences:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
     || |||||||  ||||x|  || |||  |||||
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Sequence Alignment
  - Introduced ~1970
  - BLAST: 1990, one of the most cited papers in history
  - Still very active area of research
[Figure labels: query; DB; BLAST]
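The course outline pairs sequence alignment with dynamic programming. A minimal sketch of a global alignment score in the Needleman-Wunsch style is shown below. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_alignment_score` are assumptions chosen for illustration; the slides do not specify a scoring scheme, and BLAST itself uses heuristics rather than this full table.

```python
def global_alignment_score(x: str, y: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best score over all global alignments of x and y (linear gap penalty)."""
    rows, cols = len(x) + 1, len(y) + 1
    # score[i][j] = best score aligning the first i letters of x with the first j of y.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # x prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # y prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            score[i][j] = max(diagonal,                  # align x[i-1] with y[j-1]
                              score[i - 1][j] + gap,     # gap in y
                              score[i][j - 1] + gap)     # gap in x
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATACA"))  # 5: six matches and one gap
```

The table has (len(x)+1) by (len(y)+1) entries and each is filled in constant time, so running time grows with the product of the two sequence lengths.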

51 5. RNA Structure
Given: an unfolded RNA sequence + aligned homologs
    AGCAGAGUGG …
    AGCACAGUGA …
    ACUAGACAGG …
    CGCCGAGUCG …
    AGCAGUGUGG …
Predict: which nucleotides base pair?
[Figure labels: helix (stem); hairpin loop; bulge loop; internal loop; multi-branch loop]
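One classic way to approach the base-pairing question is to maximize the number of nested base pairs by dynamic programming, in the style of the Nussinov algorithm. The sketch below is a swapped-in textbook technique, not necessarily the method presented in the course: it uses a single sequence (no aligned homologs), allows A-U, G-C and G-U pairs, and requires at least `min_loop` unpaired bases inside a hairpin. All names and the default of 3 are assumptions.

```python
# Allowed pairs: Watson-Crick A-U and G-C, plus the G-U wobble pair (an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in rna."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = most pairs that can form within rna[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between them
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    value = max(value, left + 1 + best[k + 1][j - 1])
            best[i][j] = value
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # 3: a three-pair stem closing an AAA hairpin loop
```

Counting pairs is only a crude stand-in for structural stability, so this is best read as an illustration of how the prediction problem can be posed as a dynamic program.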

52 6. Protein networks
Fresh research area
  - Construct networks from multiple data sources
  - Navigate networks
  - Compare networks across organisms

53 Computer Scientists vs Biologists

54 Computer scientists vs Biologists
Nothing is ever true or false in Biology.
Everything is true or false in computer science.

55 Computer scientists vs Biologists
Biologists strive to understand the complicated, messy natural world.
Computer scientists seek to build their own clean and organized virtual worlds.

56 Computer scientists vs Biologists
Biologists are obsessed with being the first to discover something.
Computer scientists are obsessed with being the first to invent or prove something.

57 Computer scientists vs Biologists
Biologists are comfortable with the idea that all data have errors.
Computer scientists are not.

58 Computer scientists vs Biologists
Computer scientists get high-paid jobs after graduation.
Biologists typically have to complete one or more 5-year post-docs...

59 Computer Science is to Biology what Mathematics is to Physics

