Annotating: The data

There are two major parts of annotation:
• 1) Structural: Find out where the regions of interest (usually genes) are in the genome and what they look like. How many exons/introns? UTRs? Isoforms?
• 2) Functional: Find out what the regions do. What do they code for?

Open reading frames: just look at ORFs?
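To make the "just look at ORFs" idea concrete, here is a minimal sketch of a naive ORF scan. It assumes a forward-strand-only search, ATG as the only start codon, and a hypothetical `min_codons` cutoff; none of these choices come from the slides, and real genomes (introns, alternative starts, reverse strand) are exactly where this simple approach breaks down.

```python
# Naive ORF scan on the forward strand only (illustrative sketch, not a gene finder).
STOP_CODONS = {"TAG", "TAA", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs, 0-based and end-exclusive, for ATG..stop ORFs."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it is long enough (stop codon included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG"))  # → [(2, 14)]
```

The single hit corresponds to ATGAAATTTTAG in the third reading frame. Anything without a stop codon in frame is silently dropped, which is one of the reasons this is difficult in practice.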

Difficult in practice. Example from Apollo.

Transcriptomes are different but have their own challenges:
• No introns, but where are the start and stop codons?
• Still needs functional annotation

Data used - Proteins

• Conserved in sequence => conserved annotation with little noise
• Proteins from model organisms often used => bias?
• Proteins can be incomplete => problems, as many annotation procedures are heavily dependent on protein alignments

>ENSTGUP00000017616 pep:novel chromosome:taeGut3.2.4:8_random:2849599:2959678:-1 gene:ENSTGUG00000017338 transcript:ENSTGUT000000180
RSPNATEYNWHHLRYPKIPERLNPPAAAGPALSTAEGWMLPWGNGQHPLLARAPGKGRER
DGKELIKKPKTFKFTFLKKKKKKKKKTFK
>ENSTGUP00000017615 pep:novel chromosome:taeGut3.2.4:23_random:205321:209117:1 gene:ENSTGUG00000017337 transcript:ENSTGUT00000018017
PDLRELVLMFEHLHRVRNGGFRNSEVKKWPDRSPPPYHSFTPAQKSFSLAGCSGESTKMG
IKERMRLSSSQRQGSRGRQQHLGPPLHRSPSPEDVAEATSPTKVQKSWSFNDRTRFRASL
RLKPRIPAEGDCPPEDSGEERSSPCDLTFEDIMPAVKTLIRAVRILKFLVAKRKFKETLR
PYDVKDVIEQYSAGHLDMLGRIKSLQTRVEQIVGRDRALPADKKVREKGEKPALEAELVD
ELSMMGRVVKVERQVQSIEHKLDLLLGLYSRCLRKGSANSLVLAAVRVPPGEPDVTSDYQ
SPVEHEDISTSAQSLSISRLASTNMD
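One quick way to spot possibly incomplete proteins before using them as evidence is to check whether each sequence starts with methionine. The sketch below assumes plain FASTA text; the record names `protA` and `protB` and their sequences are made up for illustration. A missing leading M is only a heuristic hint (and a present one proves nothing about the C-terminus), so treat the result as a flag for inspection, not a filter.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {record id: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # id is the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def possibly_incomplete(records):
    """Ids of proteins that do not start with methionine (heuristic only)."""
    return [name for name, seq in records.items() if not seq.startswith("M")]

# Hypothetical example records, not taken from any database.
example = """>protA hypothetical complete protein
MKTAYIAKQR
QISFVKSHFS
>protB hypothetical fragment
RSPNATEYNW
"""
records = read_fasta(example)
print(possibly_incomplete(records))  # → ['protB']
```

Both zebra finch records shown above would be flagged by this check, since neither begins with M.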

RNA-seq
[Figure: DNA -> pre-mRNA -> mRNA. Exons and introns flanked by UTRs; introns start with GT and end with AG; start codon ATG; stop codons TAG, TAA, TGA. Transcription gives the pre-mRNA (with poly-A tail), splicing gives the mRNA, which is then translated.]

Data used - RNA-seq
• Should always be included in an annotation project
• From the same organism as the genomic data => unbiased
• Can be very noisy (tissue/species dependent), can include pre-mRNA
• Some filtering method often needed

Spliced reads
[Figure: DNA -> pre-mRNA -> mRNA. Exons and introns flanked by UTRs; introns start with GT and end with AG; start codon ATG; stop codons TAG, TAA, TGA. Steps: transcription, splicing, translation.]

Pre-mRNA
[Figure: DNA -> pre-mRNA -> mRNA. Exons and introns flanked by UTRs; intron starts marked GT; start codon ATG; stop codons TAG, TAA, TGA. Steps: transcription, splicing, translation.]

Annotating: The tools and plans

Our layout for this project: as you can see, MAKER appears in many places.

Combine data - use Maker!
• External data - proteins, RNA-seq (incl. ESTs)
• Ab initio gene finders
• (Lift-overs from closely related genomes)
=> Combined annotation

Ab initio gene finders are used in Maker
• Commonly used programs: Augustus, SNAP, GeneMark-ES, FGENESH, Genscan, GlimmerHMM, ...
• They use HMM models to figure out how introns, exons, UTRs etc. are structured
• These HMM models need to be trained!
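To give a feel for what an HMM-based gene finder does, here is a toy sketch of Viterbi decoding with just two states, "N" (non-coding) and "C" (coding). Everything in it is an assumption made for illustration: the state names, the GC-rich emission probabilities and the transition probabilities are invented numbers, whereas real tools such as those listed above use far richer state models whose parameters come from training.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq under the given HMM."""
    if not seq:
        return []
    # Work in log space to avoid numerical underflow on long sequences.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    back = []  # back[k][s] = best predecessor of state s at position k + 1
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy parameters (made up): coding regions are assumed GC-rich, states are sticky.
STATES = ("N", "C")
START = {"N": 0.9, "C": 0.1}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

print("".join(viterbi("ATATGCGCGCATAT", STATES, START, TRANS, EMIT)))  # → NNNNCCCCCCNNNN
```

"Training" in this picture means estimating the numbers in `START`, `TRANS` and `EMIT` from known genes of the organism, which is why an untrained or wrongly trained model gives poor predictions.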

Data used - Proteins
• Maker will align proteins for you: Blast -> Exonerate
• Blast is not structure-aware, Exonerate is (splice sites, start/stop codons)
• Preferred file format: FASTA

How to use RNA-seq
• Maker will align transcripts (ESTs), but these need to be assembled first.
• Cufflinks: mapped reads -> transcripts
• Trinity: assembles transcripts without a genome

General recommendations
• Always combine different types of evidence!
• One single method is not enough!
• Use Maker!

Annotating: Inside the pipeline
Let us look at all of this from inside our pipeline.

Overview
• Annotation = combining different lines of evidence into gene models
• Gene prediction, evidence, combining: MAKER
• This happens inside MAKER, but the same approach also works well for manual curation.

A bit of terminology first:
• Gene prediction. Goal: find the single most likely coding sequence (CDS)
• Gene annotation. Goal: identify the entire gene structure

Existing annotation pipelines – MAKER2
Maker was developed as an easy-to-use alternative to other pipelines.
Advantages over competing solutions:
• Almost unlimited parallelism built in (limited by data and hardware)
• Largely independent of the underlying system it is run on
• Everything is run through one command, no manual combining of data/outputs
• Follows common standards, produces GMOD-compliant output
• Annotation Edit Distance (AED) metric for improved quality control
• Provides a mechanism to train and retrain ab initio gene predictors
• Annotations can be updated by re-launching Maker with new evidence
But how does Maker work exactly?
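The AED metric mentioned above can be sketched in a few lines. This assumes the usual definition AED = 1 - (SN + SP) / 2, computed at the nucleotide level, where SN is the fraction of the evidence covered by the annotation and SP is the fraction of the annotation supported by evidence. The function names and the interval representation are hypothetical, and Maker's own implementation may differ in detail. An AED of 0 means perfect agreement with the evidence, 1 means no support at all.

```python
def covered_positions(intervals):
    """Set of base positions covered by (start, end) intervals, end-exclusive."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions

def annotation_edit_distance(annotation, evidence):
    """AED = 1 - (SN + SP) / 2, computed at the nucleotide level."""
    ann = covered_positions(annotation)
    evi = covered_positions(evidence)
    if not ann or not evi:
        return 1.0  # nothing to compare: treat as no support
    overlap = len(ann & evi)
    sensitivity = overlap / len(evi)  # fraction of the evidence captured by the annotation
    specificity = overlap / len(ann)  # fraction of the annotation backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0

# Hypothetical gene model with two exons; the evidence supports only half of exon 2.
model = [(100, 200), (300, 400)]
evidence = [(100, 200), (300, 350)]
print(annotation_edit_distance(model, evidence))  # → 0.125
```

Sets of positions keep the sketch short; for whole genomes an interval-based overlap computation would be the more memory-friendly choice.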

Existing annotation pipelines – MAKER2
Step 1: Raw compute phase
RepeatMasking: nucleotide repeats, transposons/viral proteins
• Soft-masking: ATGCGTTTGacgataataaaggGCATAGCCCT
• Hard-masking: ATGCGTTTGNNNNNNNNNNNNNGCATAGCCCT
=> Masked genome
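The difference between the two masking styles is easy to show in code. In this small sketch (function names are made up for illustration), soft-masked repeats are lowercase and hard-masking replaces each of them with N, so the sequence length and all coordinates stay the same.

```python
def hard_mask(soft_masked):
    """Replace lowercase (soft-masked) bases with N, one for one."""
    return "".join("N" if base.islower() else base for base in soft_masked)

def masked_fraction(soft_masked):
    """Fraction of bases marked as repeats in a soft-masked sequence."""
    if not soft_masked:
        return 0.0
    return sum(base.islower() for base in soft_masked) / len(soft_masked)

soft = "ATGCGTTTGacgataataaaggGCATAGCCCT"
print(hard_mask(soft))
print(masked_fraction(soft))
```

Soft-masking keeps the original bases available, so an aligner can still extend an alignment through a repeat, while hard-masking removes that information entirely.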

Existing annotation pipelines – MAKER2
Step 1: Raw compute phase
Masked genome -> Blastx (proteins), Blastn (ESTs)

Existing annotation pipelines – MAKER2
Step 2: Filter and cluster alignments
• Filtering is based on rules defined in the Maker configuration for a given project
• Example: EST alignment - 80% coverage and 85% identity
• Default settings are sensible for most projects, but can be changed!
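A filtering rule like the EST example above can be sketched as follows. The `Alignment` class, its field names and the way coverage and identity are computed here are assumptions for illustration; they are not Maker's internal data structures, and Maker's exact definitions of coverage and identity may differ.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    query_id: str
    query_length: int    # full length of the EST or protein
    aligned_length: int  # number of query positions placed in the alignment
    matches: int         # identical positions within the aligned part

    @property
    def coverage(self):
        return self.aligned_length / self.query_length if self.query_length else 0.0

    @property
    def identity(self):
        return self.matches / self.aligned_length if self.aligned_length else 0.0

def filter_alignments(alignments, min_coverage=0.80, min_identity=0.85):
    """Keep alignments that pass both thresholds (defaults mirror the EST example)."""
    return [a for a in alignments
            if a.coverage >= min_coverage and a.identity >= min_identity]

# Hypothetical alignments: only est1 passes both cutoffs.
hits = [
    Alignment("est1", 1000, 900, 880),  # coverage 0.90, identity ~0.98
    Alignment("est2", 1000, 500, 490),  # coverage 0.50: too short
    Alignment("est3", 1000, 950, 700),  # identity ~0.74: too divergent
]
print([a.query_id for a in filter_alignments(hits)])  # → ['est1']
```

Loosening the thresholds lets in more evidence from divergent species at the cost of more noise, which is the trade-off behind changing the defaults.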

Existing annotation pipelines – MAKER2
Step 2: Filter and cluster alignments
• Clustering groups evidence alignments into 'loci'

Existing annotation pipelines – MAKER2
Step 2: Filter and cluster alignments
• Problematic data can complicate clustering
• Needs to be fixed by a) cleaner data or b) manual curation

Existing annotation pipelines – MAKER2
Step 2: Filter and cluster alignments
• Clustering groups evidence alignments into 'loci'
• Amount of data in any given cluster is then collapsed to remove redundancy
• Threshold for the collapsing is also user-definable
• Performed for all lines of evidence
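The clustering and collapsing steps can be sketched as an interval-merging problem. The sketch assumes all alignments lie on one sequence and one strand and are given as hypothetical `(start, end, name)` tuples with end-exclusive coordinates. The collapse shown here only drops alignments with an identical span, which is a much cruder rule than a user-definable threshold.

```python
def cluster_into_loci(alignments):
    """Group (start, end, name) alignments that overlap into loci."""
    loci = []
    for start, end, name in sorted(alignments):
        if loci and start < loci[-1]["end"]:
            # Overlaps the current locus: extend it.
            locus = loci[-1]
            locus["end"] = max(locus["end"], end)
            locus["members"].append(name)
        else:
            loci.append({"start": start, "end": end, "members": [name]})
    return loci

def collapse_redundant(alignments):
    """Keep one representative per identical (start, end) span."""
    seen, kept = set(), []
    for start, end, name in sorted(alignments):
        if (start, end) not in seen:
            seen.add((start, end))
            kept.append((start, end, name))
    return kept

# Hypothetical evidence: est2 duplicates est1, est3 sits far away.
evidence = [(100, 500, "est1"), (450, 900, "prot1"),
            (100, 500, "est2"), (2000, 2600, "est3")]
for locus in cluster_into_loci(collapse_redundant(evidence)):
    print(locus["start"], locus["end"], locus["members"])
```

With this input the result is two loci: one spanning 100 to 900 supported by est1 and prot1, and one spanning 2000 to 2600 supported by est3. A single chimeric alignment bridging two genes would merge them into one locus, which is the kind of problematic data that complicates clustering.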

Existing annotation pipelines – MAKER2
Step 3: Polishing alignments
• Blast-based alignments are only approximations and need to be refined
• Exonerate is used to create splice-aware alignments

Existing annotation pipelines – MAKER2
Step 4: Synthesis
• Synthesis refers to the extraction of information to generate evidence for annotations
• Done by identifying genomic regions overlapping with sequence features

Existing annotation pipelines – MAKER2
Step 4: Synthesis ... and ab initio gene finding
• Evidence alignments provide support for the identification of gene loci
• Ab initio predictions can enhance these signals and fill gaps with no evidence

Existing annotation pipelines – MAKER2
Step 4: Synthesis ... and ab initio gene finding
• Ab initio predictions can be improved when evidence is provided (hints)
• Hints help refine and calibrate a computational inference for a given locus
• Hints: introns, intergenic sequence, CDS

Existing annotation pipelines – MAKER2
Step 5: Annotate
• Refined ab initio models may still be incomplete / partially wrong
• Need to reconcile with evidence so we don't miss information
  -> Limited by agreement between ab initio profile and evidence

Existing annotation pipelines – MAKER2
Step 5: Annotate
• Synthesized transcript structures are compared against evidence to find UTRs
Jacques Dainat, PhD, BILS genome annotation platform

What's next
• Computational pipelines make mistakes: they need to be run very conservatively or require manual curation
• Web Apollo