1. Mona Singh: What is computational biology?

2. Genome: the entire hereditary information content of an organism.

3. DNA
A string over the 4-letter alphabet A, T, G, C.
An organism's genome is distributed over chromosomes (e.g., 46 chromosomes in human: 22 pairs plus X and Y).
Genome size: the number of base pairs in an organism's genome.

4. Genome Sizes
Human            3 billion bps
Mouse            3 billion bps
Fruit fly        165 million bps
Nematode worm    97 million bps
Yeast            15 million bps
E. coli          5 million bps
~400 genomes sequenced

5. How are genomes sequenced?
Can only sequence a few hundred base pairs at a time.
Make many copies of the DNA and cut them into smaller (overlapping) pieces.
Assemble the pieces: certain substrings occur in multiple fragments, and these shared substrings show which fragments overlap.
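The overlap idea on this slide can be sketched in a few lines of Python. This is a toy greedy merge under strong assumptions (no sequencing errors, no repeats, all fragments from the same strand), not how real assemblers work. The three fragments and the names `overlap` and `greedy_assemble` are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; return the remaining pieces
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical fragments; neighbouring fragments share 5 letters.
fragments = ["ATGCCTTACG", "TTACGTACCC", "TACCCTGCGG"]
print(greedy_assemble(fragments))  # ['ATGCCTTACGTACCCTGCGG']
```

Greedy merging is only one simple strategy; it can go wrong when the genome contains repeated substrings, which is one reason real assembly is hard.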

6. Genomes to Life
ATGCCTTAC GTACCCTGC GGCAGCACT ? Genome

7. Portions of the DNA are genes, which carry the information for making proteins.
Proteins play key roles in most biological processes (e.g., signaling, catalysis, immune response).

8. Gene Finding
gucgcuaccauuaccaguuggucuggugucaaaaauaauaau
aaccgggcaggccaugucugcccguauuucgcguaaggaaau
ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg
guuuauucuuuuucuuuuacuuuuuuaucaugggagccuacu
ucccguuuuucccgauuuggcuacaugacaucaaccauauca
gcaaaagugauacggguauuauuuuugccgcuauuucucugu
ucucgcuauuauuccaaccgcuguuuggucugcuuucugaca
aacucgggcugcgcaaauaccugcuguggauuauuaccggca
uguuagugauguuugcgccguucuuuauuuuuaucuucgggc
cacuguuacaauacaacauuuuaguaggaucgauuguuggug
guauuuaucuaggcuuuuguuuuaacgccggugcgccagcag
uagaggcauuuauugagaaagucagccgucgcaguaauuucg
aauuuggucgcgcgcggauguuuggcuguguuggcugggcgc
ugugugccucgauugucggcaucauguucaccaucaauaauc
aguuuguuuucuggcugggcucuggcugugcacucauccucg
ccguuuuacucuuuuucgccaaaacggaugcgcccucuucug
ccacgguugccaaugcgguaggugccaaccauucggcauuua
gccuuaagcuggcacuggaacuguucagacagccaaaacugu
gguuuuugucacuguauguuauuggcguuuccugcaccuacg
auGuuuuugaccaacaguuugcuaauuucuuuacuucguucu
gucaggugaa...gcaaucaaugucggaugcggcgcgacgcu

9. Gene Finding
The same mRNA sequence as on slide 8, shown together with the protein sequence:
MYYLKNTNFWMFGLFFFFYFFIMGAY
FPFFPIWLHDINHISKSDTGIIFAAI
SLFSLLFQPLFGLLSDKLGLRKYLLW
IITGMLVMFAPFFIFIFGPLLQYNIL
VGSIVGGIYLGFCFNAGAPAVEAFIE
KVSRRSNFEFGRARMFGCVGWALCAS
IVGIMFTINNQFVFWLGSGCALILAV
LLFFAKTDAPSSATVANAVGANHSAF
SLKLALELFRQPKLWFLSLYVIGVSC
TYDVFDQQFANFFTSFFATGEQGTRV
FGYVTTMGELLNASIMFFAPLIINRI
GGKNALLLAGTIMSVRIIGSSFATSA
LEVVILKTLHMFEVPFLLVGCFKYIT

10. The Genetic Code (Stryer, Biochemistry)
AUG = Methionine/start
UUA = Leucine
UUG = Leucine
UAA = Stop
UAG = Stop
UGA = Stop

11. Gene Finding
(The mRNA sequence from slide 8, shown again.)

12. Gene Finding
Reading off from the 1st start triplet:
aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa...
Translating (3-letter amino acid code):
Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys...
(1-letter code):
M   S   A   R   I   S   R   K   E   I   H   Y   V   L   F   K...

13. Gene Finding
Reading off from the 1st start triplet:
aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa...
Translating (3-letter amino acid code):
Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys...
(1-letter code):
M   S   A   R   I   S   R   K   E   I   H   Y   V   L   F   K...
Actual protein sequence:
M   Y   Y   L   K   N   T   N   F   W   M   F   G   L   F   F...
The actual protein does not match the translation from the 1st start triplet: it begins at a later start triplet, in a different reading frame.
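To make the reading-frame point concrete, here is a small Python sketch that translates an excerpt of the mRNA from slide 8, starting at two different start triplets. The codon table is the standard genetic code written compactly. The function name `translate` and the choice of excerpt (the second and third 42-letter rows of the slide) are ours, not part of the lecture.

```python
BASES = "ucag"
# Standard genetic code with codons ordered uuu, uuc, uua, uug, ucu, ..., ggg.
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna, start):
    """Translate mrna codon by codon from index start.

    Stops at the first stop codon or when fewer than 3 letters remain.
    """
    mrna = mrna.lower()
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


excerpt = ("aaccgggcaggccaugucugcccguauuucgcguaaggaaau"
           "ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg")

first_start = excerpt.find("aug")      # 1st start triplet in the excerpt
later_start = excerpt.find("auguac")   # the start triplet the real protein uses

print(translate(excerpt, first_start))  # begins MSARISRKEIHYVLFK (slide 12)
print(translate(excerpt, later_start))  # MYYLKNTNFWMF (start of the actual protein)
```

The two start positions differ by 34 letters, which is not a multiple of 3, so the two translations read the same letters in different frames.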

14. Computational Gene Finding Methods
Statistical bias: protein-coding regions “look different”; compare coding vs. non-coding regions (Hidden Markov Models, neural nets).
Sequence similarity: is the region similar to a known protein?

15. Gene finding is hard
In some genomes, only a small portion of the genome codes for protein (needle in a haystack).
Some genes contain introns and exons; the exons are the parts that actually encode the protein, and exons can be short.
The precise boundaries have to be found to get the correct protein.

16. Number of genes
Human            ~30,000
Mouse            ~30,000
Fruit fly        ~13,500
Nematode worm    ~19,000
Yeast            ~6,000
E. coli          ~4,000

17. Predicting Protein Function
MYYLKNTNFWMFGLFFFFYFFIMGAY
FPFFPIWLHDINHISKSDTGIIFAAI
SLFSLLFQPLFGLLSDKLGLRKYLLW
IITGMLVMFAPFFIFIFGPLLQYNIL
VGSIVGGIYLGFCFNAGAPAVEAFIE
KVSRRSNFEFGRARMFGCVGWALCAS
IVGIMFTINNQFVFWLGSGCALILAV
LLFFAKTDAPSSATVANAVGANHSAF
SLKLALELFRQPKLWFLSLYVIGVSC
TYDVFDQQFANFFTSFFATGEQGTRV
FGYVTTMGELLNASIMFFAPLIINRI
GGKNALLLAGTIMSVRIIGSSFATSA
LEVVILKTLHMFEVPFLLVGCFKYIT
DNA binding protein

18. Functions of Human Proteins (Science, 2001)

19. Sequence similarity
Ex: cystic fibrosis gene (CF) and bacterial nickel transport gene (NT)

CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----
NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR

CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL---------

CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP

CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV
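One simple way to put a number on the similarity in an alignment like this is percent identity: the fraction of aligned positions where both sequences have the same letter. The sketch below applies it to the last row of the alignment above. The function name `identity` and the choice to skip gap columns are our assumptions; the lecture does not specify a scoring scheme.

```python
def identity(a, b, gap="-"):
    """Count identical positions between two aligned sequences.

    Columns where either sequence has a gap are skipped.
    Returns (number of identical columns, number of columns compared).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    matches = sum(x == y for x, y in columns)
    return matches, len(columns)


# Last row of the CF / NT alignment from the slide.
cf = "CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV"
nt = "FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV"

matches, compared = identity(cf, nt)
print(f"{matches}/{compared} identical ({matches / compared:.0%})")  # 13/36 identical (36%)
```

Percent identity treats all mismatches alike; database search tools typically use substitution scores that reward chemically similar amino acids as well.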

20. Database Searches: http://www.ncbi.nlm.nih.gov

21. Database Searches
Sequences producing significant alignments:                             E-Value
gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase             4e-84
gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase   1e-77
gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission...          3e-59
gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast      3e-59
gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces          4e-43
gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo      1e-41
gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc...          1e-41
gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase...          4e-41

22. Protein Structure
Sequence: KETAAAKFERQHMDSSTSAASSSN…
Structure: [figure]

23. Proteins (Lehninger, Principles of Biochemistry)
Primary structure: amino acids
Secondary structure: α helix
Tertiary structure: polypeptide chain
Quaternary structure: assembled subunits

24. Protein Structure Prediction
Physics-based methods
Statistics-based methods

25. Statistics & Protein Structure Prediction
Given a new sequence and a library of folds, figure out which fold (if any) is a good fit to the sequence.

26. Secondary structure prediction
Given a protein sequence, can you tell its secondary structure?
E.g., LKVVAKRELVQNNQ: aaaa bbbb aaaaaaa (a = alpha, b = beta)
~70% accuracy (neural nets or other learning techniques)

27. Genome annotation
Many other important features of DNA, e.g., proteins bind DNA regulatory elements, which determines which genes are “on” and when.
Statistical & comparative approaches for finding them, e.g., motif finding.

28. Universal phylogenetic tree (Woese et al.)
Figure labels: Prokaryotes, Eukaryotes

29. Building phylogenetic trees
Use DNA (or protein) sequences from various organisms, e.g.,
human  ATCGAGGC
mouse  ATCCAGCC
yeast  ATTAAGTA
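A first step toward a tree is a pairwise distance between the sequences. The sketch below uses the number of mismatching positions (Hamming distance), which is an assumption on our part, although it does reproduce the numbers in the distance matrix on slide 30. The function name `hamming` is ours.

```python
from itertools import combinations

# The three example sequences from the slide.
seqs = {
    "human": "ATCGAGGC",
    "mouse": "ATCCAGCC",
    "yeast": "ATTAAGTA",
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


distances = {
    (a, b): hamming(seqs[a], seqs[b])
    for a, b in combinations(seqs, 2)
}
for (a, b), d in distances.items():
    print(f"{a}-{b}: {d}")
# human-mouse: 2
# human-yeast: 4
# mouse-yeast: 4
```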

30. Building phylogenetic trees
E.g., distance matrix:
         Human  Mouse  Yeast
Human      0      2      4
Mouse      2      0      4
Yeast      4      4      0
Tree: leaves Human, Mouse, Yeast; branch lengths 1, 1, 1, 2
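The slide does not say which tree-building method was used. As one possibility, UPGMA (average-linkage clustering) applied to this matrix yields branch lengths 1, 1, 1 and 2. The sketch below is our illustration under that assumption, not necessarily the lecture's method; the function name `upgma` and the Newick-style output are our choices.

```python
def upgma(names, pair_distances):
    """Build a rooted tree by UPGMA and return it as a Newick string.

    pair_distances maps frozenset({a, b}) to the distance between a and b.
    """
    dist = dict(pair_distances)
    # cluster label -> (number of leaves, height of the cluster's root)
    info = {name: (1, 0.0) for name in names}
    while len(info) > 1:
        pair = min(dist, key=dist.get)       # closest pair of clusters
        a, b = sorted(pair)
        height = dist[pair] / 2              # new node sits halfway between them
        (size_a, h_a), (size_b, h_b) = info.pop(a), info.pop(b)
        merged = f"({a}:{height - h_a:g},{b}:{height - h_b:g})"
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        updated = {}
        for other in info:
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            updated[frozenset((merged, other))] = (
                (size_a * d_a + size_b * d_b) / (size_a + size_b)
            )
        dist = {p: v for p, v in dist.items() if a not in p and b not in p}
        dist.update(updated)
        info[merged] = (size_a + size_b, height)
    return next(iter(info)) + ";"


matrix = {
    frozenset(("Human", "Mouse")): 2,
    frozenset(("Human", "Yeast")): 4,
    frozenset(("Mouse", "Yeast")): 4,
}
print(upgma(["Human", "Mouse", "Yeast"], matrix))
# ((Human:1,Mouse:1):1,Yeast:2);
```

In the output, Human and Mouse each hang on a branch of length 1, their common ancestor has a branch of length 1, and Yeast has a branch of length 2, so every path length in the tree equals the corresponding matrix entry.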

31. Intracellular networks

32. Network of cells

33. fn

34. Lecture Notes: www.cs.princeton.edu/~mona/computational_biology_notes.html

