Published by Nathan Ryan. Modified over 8 years ago.
1
1 Mona Singh What is computational biology?
2
2 Mona Singh Genome: the entire hereditary information content of an organism
3
3 Mona Singh DNA: a string over the 4-letter alphabet A, T, G, C. An organism's genome is distributed over chromosomes (e.g., 46 chromosomes in human: 22 pairs plus XX or XY). Genome size: the number of base pairs in an organism's genome.
4
4 Mona Singh Genome Sizes
Human: 3 billion bp
Mouse: 3 billion bp
Fruit fly: 165 million bp
Nematode worm: 97 million bp
Yeast: 15 million bp
E. coli: 5 million bp
~400 genomes sequenced
5
5 Mona Singh How are genomes sequenced? Only a few hundred base pairs can be sequenced at a time. Make many copies of the DNA and cut them into smaller (overlapping) pieces. Assemble the pieces: because the pieces overlap, certain substrings occur in multiple fragments, and these shared substrings show how the fragments fit together.
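The assembly idea on this slide can be sketched as a greedy merge: repeatedly join the two fragments whose suffix/prefix overlap is longest. This is only an illustrative sketch, not the method used by real assemblers, which must also handle sequencing errors, repeats, and both strands. The function names `overlap` and `greedy_assemble` and the read set are assumptions made for the example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str]) -> str:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Assumes at least one fragment is given.
    """
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the shared substring only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]


# Hypothetical overlapping reads, chosen for illustration only.
reads = ["CTTACGTAC", "CGTACCCTGC", "ATGCCTTAC"]
assembled = greedy_assemble(reads)
print(assembled)
```

Greedy merging is easy to follow but can pick a wrong join when the genome contains repeated substrings, which is one reason assembly is harder in practice than this toy suggests.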
6
6 Mona Singh Genomes to Life ATGCCTTAC GTACCCTGC GGCAGCACT ? Genome
7
7 Mona Singh Portions of DNA are genes, which carry the information for making proteins. Proteins play key roles in most biological processes (e.g., signaling, catalysis, immune response).
8
8 Mona Singh gucgcuaccauuaccaguuggucuggugucaaaaauaauaau aaccgggcaggccaugucugcccguauuucgcguaaggaaau ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg guuuauucuuuuucuuuuacuuuuuuaucaugggagccuacu ucccguuuuucccgauuuggcuacaugacaucaaccauauca gcaaaagugauacggguauuauuuuugccgcuauuucucugu ucucgcuauuauuccaaccgcuguuuggucugcuuucugaca aacucgggcugcgcaaauaccugcuguggauuauuaccggca uguuagugauguuugcgccguucuuuauuuuuaucuucgggc cacuguuacaauacaacauuuuaguaggaucgauuguuggug guauuuaucuaggcuuuuguuuuaacgccggugcgccagcag uagaggcauuuauugagaaagucagccgucgcaguaauuucg aauuuggucgcgcgcggauguuuggcuguguuggcugggcgc ugugugccucgauugucggcaucauguucaccaucaauaauc aguuuguuuucuggcugggcucuggcugugcacucauccucg ccguuuuacucuuuuucgccaaaacggaugcgcccucuucug ccacgguugccaaugcgguaggugccaaccauucggcauuua gccuuaagcuggcacuggaacuguucagacagccaaaacugu gguuuuugucacuguauguuauuggcguuuccugcaccuacg auGuuuuugaccaacaguuugcuaauuucuuuacuucguucu gucaggugaa...gcaaucaaugucggaugcggcgcgacgcu Gene Finding
9
9 Mona Singh gucgcuaccauuaccaguuggucuggugucaaaaauaauaau aaccgggcaggccaugucugcccguauuucgcguaaggaaau ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg guuuauucuuuuucuuuuacuuuuuuaucaugggagccuacu ucccguuuuucccgauuuggcuacaugacaucaaccauauca gcaaaagugauacggguauuauuuuugccgcuauuucucugu ucucgcuauuauuccaaccgcuguuuggucugcuuucugaca aacucgggcugcgcaaauaccugcuguggauuauuaccggca uguuagugauguuugcgccguucuuuauuuuuaucuucgggc cacuguuacaauacaacauuuuaguaggaucgauuguuggug guauuuaucuaggcuuuuguuuuaacgccggugcgccagcag uagaggcauuuauugagaaagucagccgucgcaguaauuucg aauuuggucgcgcgcggauguuuggcuguguuggcugggcgc ugugugccucgauugucggcaucauguucaccaucaauaauc aguuuguuuucuggcugggcucuggcugugcacucauccucg ccguuuuacucuuuuucgccaaaacggaugcgcccucuucug ccacgguugccaaugcgguaggugccaaccauucggcauuua gccuuaagcuggcacuggaacuguucagacagccaaaacugu gguuuuugucacuguauguuauuggcguuuccugcaccuacg auGuuuuugaccaacaguuugcuaauuucuuuacuucguucu gucaggugaa...gcaaucaaugucggaugcggcgcgacgcu MYYLKNTNFWMFGLFFFFYFFIMGAY FPFFPIWLHDINHISKSDTGIIFAAI SLFSLLFQPLFGLLSDKLGLRKYLLW IITGMLVMFAPFFIFIFGPLLQYNIL VGSIVGGIYLGFCFNAGAPAVEAFIE KVSRRSNFEFGRARMFGCVGWALCAS IVGIMFTINNQFVFWLGSGCALILAV LLFFAKTDAPSSATVANAVGANHSAF SLKLALELFRQPKLWFLSLYVIGVSC TYDVFDQQFANFFTSFFATGEQGTRV FGYVTTMGELLNASIMFFAPLIINRI GGKNALLLAGTIMSVRIIGSSFATSA LEVVILKTLHMFEVPFLLVGCFKYIT Gene Finding
10
10 Mona Singh The Genetic Code (Stryer, Biochemistry)
AUG = Methionine/start
UUA = Leucine
UUG = Leucine
UAA = Stop
UAG = Stop
UGA = Stop
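The codon assignments listed on this slide are a few entries of the standard genetic code, which maps all 64 RNA triplets to amino acids or stop signals. A minimal sketch of that lookup table follows; the names `BASES`, `AMINO_ACIDS`, and `CODON_TABLE` are chosen for this example, and `*` is used here to mark a stop codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Look up the codons shown on the slide.
for codon in ["AUG", "UUA", "UUG", "UAA", "UAG", "UGA"]:
    print(codon, CODON_TABLE[codon])
```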
11
11 Mona Singh Gene Finding gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgg gcaggccaugucugcccguauuucgcguaaggaaauccauuauguacu auuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuu acuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuac augacaucaaccauaucagcaaaagugauacggguauuauuuuugccg cuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuu cugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggca uguuagugauguuugcgccguucuuuauuuuuaucuucgggccacugu uacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuag gcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugaga aagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuug gcuguguuggcugggcgcugugugccucgauugucggcaucauguuca ccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacuca uccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucug ccacgguugccaaugcgguaggugccaaccauucggcauuuagccuua agcuggcacuggaacuguucagacagccaaaacugugguuuuugucac uguauguuauuggcguuuccugcaccuacgauguuuuugaccaacagu uugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugu cggaugcggcgcgacgcu
12
12 Mona Singh Gene Finding
Reading off from 1st start triplet: aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa...
Translating (3-letter amino acid code): Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys...
Translating (1-letter code): M S A R I S R K E I H Y V L F K...
13
13 Mona Singh Gene Finding
Reading off from 1st start triplet: aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa...
Translating (3-letter amino acid code): Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys...
Translating (1-letter code): M S A R I S R K E I H Y V L F K...
Actual protein sequence: M Y Y L K N T N F W M F G L F F...
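The mismatch on this slide comes from where translation starts: the first "aug" in the sequence is not the start of the actual gene. The sketch below translates a short stretch of the slide's sequence from every "aug" it contains, so the two readings can be compared. The helper names `start_positions` and `translate_from` are assumptions for this example, and stopping at the first stop codon (or at the end of the snippet) is a simplification.

```python
from itertools import product

# Standard genetic code; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def start_positions(rna: str) -> list[int]:
    """Indices of every AUG triplet, in any reading frame."""
    rna = rna.upper()
    return [i for i in range(len(rna) - 2) if rna[i:i + 3] == "AUG"]


def translate_from(rna: str, start: int) -> str:
    """Translate codon by codon from `start` until a stop codon or the end."""
    rna = rna.upper()
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# A short stretch copied from the sequence on the Gene Finding slides.
snippet = ("ggccaugucugcccguauuucgcguaaggaaau"
           "ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg")

starts = start_positions(snippet)
translations = [translate_from(snippet, pos) for pos in starts]
for pos, protein in zip(starts, translations):
    print(pos, protein)
```

The translation from the first start triplet begins M S A R I S R K..., while the translation from the second one begins M Y Y L K N T N F W M F..., which is how the actual protein sequence on the slide begins.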
14
14 Mona Singh Computational Gene Finding Methods. Statistical bias: protein-coding regions "look different", so compare the statistics of coding vs. non-coding regions (Hidden Markov Models, neural nets). Sequence similarity: is the candidate region similar to a known protein?
15
15 Mona Singh Gene finding is hard. In some genomes, only a small portion of the genome codes for protein (needle in a haystack). Some genes contain introns and exons; only the exons encode the protein, and exons can be short. The boundaries must be located precisely to get the correct protein.
16
16 Mona Singh Number of genes
Human: ~30,000
Mouse: ~30,000
Fruit fly: ~13,500
Nematode worm: ~19,000
Yeast: ~6,000
E. coli: ~4,000
17
17 Mona Singh MYYLKNTNFWMFGLFFFFYFFIMGAY FPFFPIWLHDINHISKSDTGIIFAAI SLFSLLFQPLFGLLSDKLGLRKYLLW IITGMLVMFAPFFIFIFGPLLQYNIL VGSIVGGIYLGFCFNAGAPAVEAFIE KVSRRSNFEFGRARMFGCVGWALCAS IVGIMFTINNQFVFWLGSGCALILAV LLFFAKTDAPSSATVANAVGANHSAF SLKLALELFRQPKLWFLSLYVIGVSC TYDVFDQQFANFFTSFFATGEQGTRV FGYVTTMGELLNASIMFFAPLIINRI GGKNALLLAGTIMSVRIIGSSFATSA LEVVILKTLHMFEVPFLLVGCFKYIT Predicting Protein Function DNA binding protein
18
18 Mona Singh Functions of Human Proteins (Science, 2001)
19
19 Mona Singh Sequence similarity
Ex: cystic fibrosis gene (CF) and bacterial nickel transport gene (NT)
CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----
NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR
CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL---------
CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP
CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV
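One simple way to quantify the similarity in an alignment like this one is percent identity: the share of aligned columns in which both sequences carry the same residue. The sketch below applies it to the first pair of CF/NT rows from the slide. The function name `percent_identity` and the choice to skip columns containing a gap are assumptions for this example; real database searches use substitution matrices and gap penalties rather than plain identity.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent of ungapped alignment columns whose two residues are identical."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# First pair of aligned rows from the slide.
cf = "EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----"
nt = "QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR"
print(f"{percent_identity(cf, nt):.1f}% identity over ungapped columns")
```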
20
20 Mona Singh Database Searches http://www.ncbi.nlm.nih.gov
21
21 Mona Singh Database Searches
Sequences producing significant alignments: E-Value
gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase 4e-84
gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase 1e-77
gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission... 3e-59
gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast 3e-59
gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces 4e-43
gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo 1e-41
gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc... 1e-41
gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase... 4e-41
22
22 Mona Singh Protein Structure Sequence: KETAAAKFERQHMDSSTSAASSSN… Structure:
23
23 Mona Singh Proteins: levels of structure (Lehninger, Principles of Biochemistry). Primary: amino acids. Secondary: helix. Tertiary: polypeptide chain. Quaternary: assembled subunits.
24
24 Mona Singh Protein Structure Prediction: physics-based methods; statistics-based methods
25
25 Mona Singh Statistics & Protein Structure Prediction Given a new sequence and a library of folds, figure out which (if any) is a good fit to the sequence.
26
26 Mona Singh Secondary structure prediction. Given a protein sequence, can you tell its secondary structure? E.g., sequence LKVVAKRELVQNNQ with labels aaaa bbbb aaaaaaa (a = alpha, b = beta). Accuracy: ~70% (neural nets or other learning techniques).
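An accuracy figure like the ~70% above is typically a per-residue score: the fraction of positions at which the predicted label equals the observed one. A minimal sketch of that arithmetic follows. The two label strings are hypothetical and are not taken from the slide; `-` is used here as an assumed label for residues that are neither alpha nor beta.

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted label equals the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    if not observed:
        return 0.0
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Hypothetical labels for a 14-residue sequence (a = alpha, b = beta, - = other).
observed = "aaaa-bbbb-aaaa"
predicted = "aaaaabbbb-aaa-"
accuracy = per_residue_accuracy(predicted, observed)
print(f"{accuracy:.0%} of residues labeled correctly")
```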
27
27 Mona Singh Genome annotation. DNA has many other important features, e.g., regulatory elements that proteins bind, which determine which genes are "on" and when. Statistical and comparative approaches, such as motif finding, are used to find them.
28
28 Mona Singh Universal phylogenetic tree (Woese et al.): Prokaryotes, Eukaryotes
29
29 Mona Singh Building phylogenetic trees. Use DNA (or protein) sequences from various organisms, e.g.:
human ATCGAGGC
mouse ATCCAGCC
yeast ATTAAGTA
30
30 Mona Singh Building phylogenetic trees
E.g., Distance Matrix:
        Human  Mouse  Yeast
Human   0      2      4
Mouse   2      0      4
Yeast   4      4      0
Tree (figure): branch lengths 1, 1, 1, 2; leaves Human, Mouse, Yeast
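The entries of this distance matrix agree with a simple count of mismatched positions between the example sequences human ATCGAGGC, mouse ATCCAGCC, and yeast ATTAAGTA. The sketch below recomputes the matrix that way. Using the mismatch count (Hamming distance) is an assumption about how the slide's numbers were obtained, and the function name `hamming` is chosen for this example; real analyses usually apply evolutionary models to aligned sequences.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Example sequences from the preceding slide.
sequences = {
    "Human": "ATCGAGGC",
    "Mouse": "ATCCAGCC",
    "Yeast": "ATTAAGTA",
}

names = list(sequences)
matrix = [[hamming(sequences[r], sequences[c]) for c in names] for r in names]

# Print the matrix with row and column labels.
print(" " * 6 + "".join(f"{name:>6}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:<6}" + "".join(f"{d:>6}" for d in row))
```

A tree-building method then looks for a tree whose path lengths between leaves reproduce these distances as closely as possible.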
31
31 Mona Singh Intracellular networks
32
32 Mona Singh Network of cells
33
33 Mona Singh fn
34
34 Mona Singh Lecture Notes www.cs.princeton.edu/~mona/computational_biology_notes.html