Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.


1 Bioinformatics Workshops 1 & 2
1. use of public database/search sites
- range of data and access methods
- interpretation of search results
- understanding the meaning & effect of search (e.g. BLAST) parameters
2. functional analysis of single sequences
- i.e. how to work out what your unknown protein might be doing
- complex searches for (e.g.) patterns of motifs & secondary structure elements

2 Workshop 1. overall survey of data
Mutation between species -> orthologs
Mutation between duplications -> domains
Search methods – 2D vs. 3D
Search methods – similarity vs. models vs. comparative
Main data axes
Main Portals
Database searches vs. genome browsers
Finding similar sequences
BLAST, et al
E-values!
Biological origin of sequences
Genes vs. loci
Random sequences

3 Using Public Data Resources
There is (are!) data out there
There are methods out there
Quite often they are combined
– BLAST searches of sequence databases

4 Notes…
Sequence databases
– Entrez queries…
Genome browsers/databases
Regulatory Elements
SNPs
Functional Sequence Models (PFam domains, etc.)
Expression Data
– Array data
– in situ data

5 Notes II
BLAST parameters
– Low complexity: frameshifted cDNA
– miRNAs vs genome
– morpholinos for other genes
– -q-2 for EST vs EST alignments
– Entrez queries

6 What have we got…
[Diagram; labels: gene model, locus ~ gene, mRNA, protein, genome, primary transcript]

7 Derivative Sequences
mRNA: clone into cDNA library
3’ EST, 5’ EST: single pass sequence from each end of the clone
cDNA sequence: multiple pass sequencing over whole length of the clone

8 Initial Growth of Databases
Lots of ESTs were generated
Some clones were selected for full-insert sequencing -> cDNAs
cDNAs were translated to yield presumed protein sequences

9 Then Came Genomes
With increasingly larger fragments of genomic sequence came the ability to align cDNAs to create gene models
And then to apply our understanding of exon/intron structure to predict theoretical genes…

10 Introns and Exons
mRNA (gene model): CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
genome: CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
splice sites: donor GTAAG..., acceptor ...TTTCAG
structure: exon intron exon intron exon
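The splicing on this slide can be reproduced with a short Python sketch. The sequences are the ones shown above; the function name `splice_out` and the choice to pass introns as literal strings are assumptions made for illustration only, not part of the workshop material.

```python
# Genomic sequence from the slide, split at the exon/intron boundaries.
EXON_1 = "CTACCATCCATGCTAACCATTCTAC"
INTRON_1 = "GTAAGTCATCTATATCAATATTATTTCAG"
EXON_2 = "CATTTTATACTCATGCAACGGACCGT"
INTRON_2 = "GTCAGTATTACAG"
EXON_3 = "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA"

GENOME = EXON_1 + INTRON_1 + EXON_2 + INTRON_2 + EXON_3


def splice_out(genome, introns):
    """Remove each intron (first occurrence) and return the spliced sequence.

    Hypothetical helper: every intron must start with the donor dinucleotide
    GT and end with the acceptor dinucleotide AG.
    """
    mrna = genome
    for intron in introns:
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError("intron does not follow the GT...AG rule: " + intron)
        position = mrna.find(intron)
        if position < 0:
            raise ValueError("intron not found in sequence: " + intron)
        mrna = mrna[:position] + mrna[position + len(intron):]
    return mrna


MRNA = splice_out(GENOME, [INTRON_1, INTRON_2])
print(MRNA)
```

Removing the two GT...AG introns from the genomic sequence leaves the three exons joined end to end, which is the mRNA sequence given on the slide.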

11 Gene Predictions
Given:
- coding sequence must run from ATG – STOP codon in-frame
- introns GT...... AG can be spliced out
Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some ‘possible’ splice sites are more likely than others
scan genomic sequence
...CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA...
...CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA...
...CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA...
most likely gene model
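The two rules on this slide (ATG to an in-frame STOP, and GT...AG introns that can be spliced out) can be sketched in Python as follows. This covers only the rule-based part, not the statistical scoring of composition or splice sites. The function name `find_orfs` and the constant names are assumptions for illustration; the sequences are taken from the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence):
    """Return (start, end, orf) for every ATG followed by an in-frame stop codon."""
    orfs = []
    for start in range(len(sequence) - 2):
        if sequence[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the ATG until the first stop codon.
        for pos in range(start + 3, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3, sequence[start:pos + 3]))
                break
    return orfs


# Genomic fragment from the slide (surrounding dots omitted).
GENOMIC = ("CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
           "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA")

# One candidate GT...AG intron; removing it gives the slide's second sequence.
CANDIDATE_INTRON = "GTAGTACATCGGATCGGTATGGAATCATTTCAG"
SPLICED = GENOMIC.replace(CANDIDATE_INTRON, "", 1)

unspliced_orfs = find_orfs(GENOMIC)
spliced_orfs = find_orfs(SPLICED)
print(unspliced_orfs[0])
print(spliced_orfs[0])
```

Without splicing, the first ATG runs into a TAG stop after only four codons; splicing out the candidate intron changes where the reading frame first meets a stop. A real gene predictor would score many such alternatives rather than test a single one.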

12 Supporting Evidence!
[Diagram: EST evidence aligned to the genome and the gene model; exons: 1 2 3 4]
Even though there is good evidence for the existence of all four exons, there is no evidence that all of them appear together on a real transcript. An alternative transcript that skips exon 3 would be plausible, if a little unlikely. The ambiguity decreases as more ESTs become available, as clones are sequenced at both ends (which helps place distant exons in the same transcript), and eventually as full-length transcript sequences become available.

13 So What’s in the Databases Now?
At NCBI
– 15,000,000 EST sequences
– 3,329,110 non-redundant DNA sequences (excluding ESTs, etc.)
– 2,693,904 non-redundant translated coding sequences
– 954,378 Protein Reference Sequences (RefSeq)
But the majority of RefSeq may be translations of theoretical transcripts…

14 Main Data Axes
Europe: EBI/EMBL
– Swiss-Prot/TrEMBL/Ensembl/UniProt
US: NIH/NCBI
– GenBank/UniGene/RefSeq/Entrez
Japan: DNA Data Bank of Japan
– National Institute of Genetics

15 Synchronisation…
You submit a sequence: ATCGATCGATCATAGTATGCTAGCTGCTA
GenBank: BC009638.1 ATCGATCGATCATAGTATGCTAGCTGCTA
DDBJ: BC009638.1 ATCGATCGATCATAGTATGCTAGCTGCTA
EMBL: BC009638.1 ATCGATCGATCATAGTATGCTAGCTGCTA

16 Sequences, Accession Numbers and Genes
NM_001015922.1 gi=62860271 GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA
BC009638.1 gi=16307106 GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA
NM_001015922.2 gi=62860589 GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA
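The identifiers on this slide follow the accession.version pattern, and each version of a record carries its own gi number. A minimal Python sketch of splitting such identifiers and keeping the highest version per accession is given below. The names `Record`, `parse_identifier` and `latest_versions` are hypothetical helpers, not part of any NCBI tool; the identifiers and gi numbers are the ones on the slide.

```python
from typing import NamedTuple


class Record(NamedTuple):
    accession: str
    version: int
    gi: int


def parse_identifier(identifier, gi):
    """Split 'ACCESSION.VERSION' into its two parts and attach the gi number."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError("expected ACCESSION.VERSION, got: " + identifier)
    return Record(accession, int(version), gi)


def latest_versions(records):
    """Keep only the highest-numbered version of each accession."""
    latest = {}
    for record in records:
        current = latest.get(record.accession)
        if current is None or record.version > current.version:
            latest[record.accession] = record
    return latest


RECORDS = [
    parse_identifier("NM_001015922.1", 62860271),
    parse_identifier("BC009638.1", 16307106),
    parse_identifier("NM_001015922.2", 62860589),
]
LATEST = latest_versions(RECORDS)
for accession, record in LATEST.items():
    print(accession, record.version, record.gi)
```

On the slide's data this keeps version 2 of NM_001015922, whose sequence differs from version 1 at the third base, alongside the single version of BC009638.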

17 Main Data Portals
NCBI Entrez Databases
ExPASy Proteomics Server
DNA Data Bank of Japan DDBJ
EBI Ensembl Genome Browser
Santa Cruz Genome Browser

