Sequence Analysis with Artemis and Artemis Comparison Tool (ACT). Caribbean Bioinformatics Workshop, 18th-29th January, 2010.


1 Sequence Analysis with Artemis and Artemis Comparison Tool (ACT). Caribbean Bioinformatics Workshop, 18th-29th January, 2010

2 Genome Informatics Workshop Gene Finding

3 atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgt attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt 
ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa

4 atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacagatgt attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt 
ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaagtttttcttcattatcaaaaatatttatttcctaattttttttttttg taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaggtgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa
Sequencing is just the beginning of the process. Extracting information & interpreting: What's there? Where are the genes? Which genes? How to find them? SEQUENCE ANNOTATION

5 Strategies for sequence annotation: predictive methods (interpretation of the DNA sequence into genes according to rules); comparative methods; experimental methods.

6

7

8 Strategies for sequence annotation: predictive methods (interpretation of the DNA sequence into genes according to rules); comparative methods (interpretation of the DNA sequence into genes according to similarities with other sequences); experimental methods.

9

10 Strategies for sequence annotation: predictive methods (interpretation of the DNA sequence into genes according to rules); comparative methods (interpretation of the DNA sequence into genes according to similarities with other sequences); experimental methods (interpretation of the DNA sequence into genes according to experimental results, e.g. cDNA).

11 EST Blast Hit

12 Gene prediction programs: ORFs and CDSs. ORFs are not equivalent to CDSs: not all open reading frames are coding sequences.
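
A minimal sketch of why ORFs and CDSs differ, assuming an ORF is simply a stop-free stretch in one forward reading frame (start codons and the reverse strand are ignored for brevity). The name `find_orfs` and the `min_codons` threshold are illustrative assumptions, not taken from the slides or from Artemis.

```python
STOP_CODONS = {"taa", "tag", "tga"}

def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for stop-free stretches, forward strand only.

    Coordinates are 0-based and end-exclusive. Each stretch runs from the codon
    after one stop codon up to (not including) the next stop codon.
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        orf_start = frame
        # Index just past the last complete codon in this frame.
        last = frame + ((len(seq) - frame) // 3) * 3
        for pos in range(frame, last, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, pos, frame))
                orf_start = pos + 3
        # Stretch left open at the end of the sequence.
        if (last - orf_start) // 3 >= min_codons:
            orfs.append((orf_start, last, frame))
    return orfs
```

A scan like this usually reports overlapping stretches in more than one frame, which is why an ORF list on its own cannot be read as a gene list.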

13 Gene prediction: GeneFinder, Glimmer, Orpheus, PHAT, GeneMark

14 Gene finding: Accurately predict a sample set of genes (from sequence base composition; sequence alignment to a related gene, e.g. an orthologue; sequence alignment to transcript data, e.g. ESTs) -> training set -> gene finding software -> full gene set

15 Gene finding programs: Gene-finding software packages use Hidden Markov Models. They predict coding, intergenic and intron sequences. They need to be trained on a specific organism. Never perfect!
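
As a toy illustration of the Hidden Markov Model idea, the sketch below labels each base as non-coding (N) or coding (C) with the Viterbi algorithm. The two-state layout and every probability in it are invented for illustration (an AT-rich background with GC-richer coding regions); real gene finders use far richer models whose parameters are trained on the organism in question.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding
# All probabilities below are made-up toy values, not trained parameters.
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"a": 0.4, "t": 0.4, "g": 0.1, "c": 0.1},
    "C": {"a": 0.2, "t": 0.2, "g": 0.3, "c": 0.3},
}

def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    seq = seq.lower()
    # Log-probability of the best path ending in each state at the first base.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per later position
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Changing the emission table changes the labelling, which is the sense in which such a model has to be trained for a specific organism.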

16 Gene prediction programs: Problems. ORFs are not equivalent to CDSs. Gene prediction programs find new genes that share properties with a given set of genes. They can be confounded by:
– Sequence constraints (ribosomal proteins etc.)
– Sequence biases
– Different sets of genes
– Horizontal gene transfer
– Non-coding DNA

17 Gene prediction programs: Problems. Different gene training sets: Plasmodium falciparum (panels: original annotation, updated annotation)

18 Gene prediction programs: Problems. Non-protein coding regions: S. typhi ribosomal RNA genes (tracks: glimmer, genefinder, final, orpheus)

19 Gene prediction programs: Problems. Non-protein coding regions: N. meningitidis DNA repeats (tracks: glimmer, orpheus, final)

20 Gene prediction programs: Problems Pseudogenes M. leprae

21 Gene prediction programs: Problems Pseudogenes: M. leprae Glimmer

22 Gene prediction programs: Problems Pseudogenes: M. leprae ORPHEUS

23 Gene prediction programs: Problems Pseudogenes: M. leprae WUBLASTX vs. M. tuberculosis

24 Gene prediction programs: Problems Pseudogenes: M. leprae Final annotation

25 Campylobacter jejuni Neisseria meningitidis A Salmonella typhi Yersinia pestis Organism Size (Mb)G+C CDS prediction GlimmerORPHEUSotherFinal Mycobacterium leprae 1654176115181.64130.55 2121313420242.18451.81 1783 3 Start-to-stop >100 aa Gene prediction programs: Statistics 1605 intact 1115 pseudo 94944273.26857.80 G2 5679 3 TIGR CMR (http://www.tigr.org/) 4 4 4600519446664.80952.094973 5 GeneFinder (Krogh+Larson pers comm) 5 4011265443124.65447.64 http://pedant.mips.biochem.mpg.de/orpheus/index.html 2 http://www.tigr.org/softlab/glimmer/glimmer.html 1 112

26 The Gene Prediction Process DNA SEQUENCE ANALYSIS SOFTWARE Useful CDS Prediction Annotator AT content Gene finders Codon Usage BlastX FASTA ESTs

27 Eukaryotic gene (diagram labels: 5’UTR, ATG, Exon I, Exon II, Exon III, stop, 3’UTR; intron, with GT ... AG boundaries; mRNA with CAP and AAAAAAAAAA tail; cDNA, TTTTTTTTT; EST)

28 AT content: Coding regions have higher GC content in AT-rich genomes.
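
A small sketch of the kind of sliding-window GC calculation behind such a plot. The function name and the default `window` and `step` sizes are assumptions chosen for illustration, not Artemis settings.

```python
def gc_content_windows(seq, window=120, step=60):
    """Return (window_start, gc_fraction) pairs for windows sliding along seq.

    Only complete windows are reported; window_start is 0-based.
    """
    seq = seq.lower()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("g") + chunk.count("c")
        results.append((start, gc / window))
    return results
```

In an AT-rich genome, windows whose GC fraction stands out above the background are candidate coding regions worth inspecting.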

29 AT content

30 CODON USAGE: Codon bias is different for each organism. DNA content in coding regions is restricted, but it is not restricted in non-coding regions. The codon usage for any particular gene can influence expression.

31 Codon usage: All organisms have a preferred set of codons.
Codon  Malaria  Trypanosoma
GUU    0.41     0.28
GUC    0.06     0.19
GUA    0.42     0.14
GUG    0.11     0.39
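
Fractions like those for the four valine codons can be computed by counting codons in a coding sequence and normalising within the synonymous group. The sketch below assumes an in-frame DNA coding sequence as input and reports codons in RNA notation to match the slide; the function name is hypothetical.

```python
from collections import Counter

VALINE_CODONS = ("GUU", "GUC", "GUA", "GUG")

def synonymous_fractions(cds, codons=VALINE_CODONS):
    """Return the share of each codon within one synonymous group.

    cds is an in-frame coding sequence (DNA or RNA); a trailing partial
    codon is ignored.
    """
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3
    counts = Counter(rna[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}
```

For a genome-wide figure, the counts would be pooled over all annotated coding sequences before normalising.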

32 Codon Usage http://www.kazusa.or.jp/codon/

33 Codon Usage Table (fields: codon, frequency per thousand, (count))
UUU  34.3 (26847)   UCU 15.3 (11956)   UAU  45.6 (35709)   UGU 15.3 (11942)
UUC   7.3 ( 5719)   UCC  5.3 ( 4141)   UAC   5.5 ( 4340)   UGC  2.4 ( 1872)
UUA  49.2 (38527)   UCA 18.2 (14239)   UAA   1.0 (  813)   UGA  0.2 (  188)
UUG  10.1 ( 7911)   UCG  2.8 ( 2154)   UAG   0.2 (  123)   UGG  5.2 ( 4066)
CUU   8.7 ( 6776)   CCU  9.1 ( 7148)   CAU  19.5 (15287)   CGU  3.3 ( 2561)
CUC   1.7 ( 1354)   CCC  2.5 ( 1982)   CAC   3.9 ( 3020)   CGC  0.5 (  354)
CUA   5.4 ( 4217)   CCA 13.1 (10221)   CAA  25.1 (19650)   CGA  2.4 ( 1878)
CUG   1.3 ( 1044)   CCG  0.9 (  742)   CAG   3.3 ( 2598)   CGG  0.2 (  184)
AUU  34.0 (26611)   ACU 12.8 (10050)   AAU 105.5 (82591)   AGU 21.6 (16899)
AUC   5.9 ( 4636)   ACC  5.5 ( 4312)   AAC  18.5 (14518)   AGC  3.8 ( 2994)
AUA  44.7 (34976)   ACA 22.8 (17822)   AAA  90.5 (70863)   AGA 16.9 (13213)
AUG  20.9 (16326)   ACG  3.8 ( 2951)   AAG  19.2 (15056)   AGG  3.9 ( 3091)
GUU  18.1 (14200)   GCU 12.5 ( 9811)   GAU  55.5 (43424)   GGU 16.6 (12960)
GUC   2.6 ( 2063)   GCC  3.2 ( 2541)   GAC   8.6 ( 6696)   GGC  1.6 ( 1269)
GUA  18.2 (14258)   GCA 12.6 ( 9871)   GAA  65.8 (51505)   GGA 16.7 (13043)
GUG   4.9 ( 3806)   GCG  1.1 (  890)   GAG  10.1 ( 7878)   GGG  2.9 ( 2243)

34 Codon Usage in Artemis Forward frames Reverse frames

35 Codon usage & gene finding in : Leishmania

36 Transcriptional units in Leishmania: DNA strand-switches

37 GC frame plot: Plots the third-position GC content of each frame of a DNA sequence. In coding DNA the GC content of the 3rd base is often higher. Good prediction of coding in malaria and trypanosomes.
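
The calculation behind a GC frame plot can be sketched as follows: for each window, take every third base starting at codon position 3 of each forward frame and report its GC fraction. This is an illustrative reimplementation under assumed window and step sizes, not the code Artemis uses.

```python
def gc3_by_frame(seq, window=120, step=60):
    """Return (window_start, [gc3_frame1, gc3_frame2, gc3_frame3]) per window.

    Frames are counted from the window start, so keep step a multiple of 3
    if the three values should refer to the same frames in every window.
    """
    seq = seq.lower()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions = []
        for frame in range(3):
            third = chunk[frame + 2::3]  # third codon positions in this frame
            gc = sum(1 for base in third if base in "gc")
            fractions.append(gc / len(third) if third else 0.0)
        rows.append((start, fractions))
    return rows
```

A frame whose value rises well above the other two over a stretch of sequence is the likely coding frame there.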

38 GC frame plot of tubulin gene cluster on T. brucei Chr 1

39 Large-scale nucleotide plots in Artemis: S. typhi genome GC content, GC deviation, Karlin signature

40 Homology data: Coding regions are more conserved than non-coding regions due to selective pressure. Comparing all possible translations against all known proteins will give clues to known genes (BLASTX).
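
"All possible translations" means the six reading frames: three on the forward strand and three on the reverse complement. A minimal sketch using the standard genetic code is given below; the function names are illustrative, and a BLASTX search would then compare each translation against a protein database.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code, with codons enumerated in t, c, a, g order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    return seq.lower().translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing n) become X."""
    seq = seq.lower()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(seq):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    frames = {}
    for strand, s in (("+", seq.lower()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```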

41 Gene finding: using ACT TBLASTX comparisons P. knowlesi P. falciparum P. yoelii

42 Using FASTA / BLAST results: FASTA is a global alignment tool; BLAST is a local alignment tool. (Panels: BLAST, FASTA)
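
To make the global versus local distinction concrete, the sketch below scores two sequences with the textbook dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). It is not how FASTA or BLAST work internally, since both are heuristic search tools; the scoring values are arbitrary assumptions.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the best alignment score of a and b with a linear gap penalty.

    local=False: global score, both sequences aligned end to end.
    local=True:  local score, the best-scoring pair of substrings.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]
```

When two sequences share only a short conserved region, the local score picks out that region while the global score is dragged down by the unrelated flanks.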

43 Functional assignment: alignments of modular proteins A B A B C A B C

44 Gene finding by RNA-Seq (transcriptional landscape of Neospora caninum tachyzoites). Tracks: Day 3 tachyzoites (RNAseq); Day 4 tachyzoites (RNAseq)

45 Transcriptome sequencing in Neospora (RNAseq is useful for predicting/confirming UTR boundaries). Figure labels: Day 3 tachyzoites (RNAseq); Day 4 tachyzoites (RNAseq); N. caninum Chr08; T. gondii Chr08; 5’ UTR; 3’ UTR; TBLASTX matches visualised in ACT

46 RNA-Seq: correcting gene models. Panels: Before (%GC), After (%GC). Legend: 16hr, 32hr, 48hr

