Annotating The data.

Slides:

Advertisements

Similar presentations

An Introduction to Bioinformatics Finding genes in prokaryotes.

Advertisements

CSCE555 Bioinformatics Lecture 3 Gene Finding Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:

Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.

Finding Eukaryotic Open reading frames.

Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.

1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.

Finding genes in human using the mouse Finding genes in mouse using the human Lior Pachter Department of Mathematics U.C. Berkeley.

Eukaryotic Gene Finding

UCSC Known Genes Version 3 Take 10. Overall Pipeline Get alignments etc. from database Remove antibody fragments Clean alignments, project to genome Cluster.

Eukaryotic Gene Finding

Genome Annotation BCB 660 October 20, From Carson Holt.

Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.

Why Manual Genome Annotation? Even the best gene predictors and genome annotation pipelines rarely exceed accuracies of 80% at the exon level, meaning.

Gene Structure and Identification

International Livestock Research Institute, Nairobi, Kenya. Introduction to Bioinformatics: NOV David Lynn (M.Sc., Ph.D.) Trinity College Dublin.

Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.

Genomics of Microbial Eukaryotes Igor Grigoriev, Fungal Genomics Program Head US DOE Joint Genome Institute, Walnut Creek, CA.

Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah) iPlant: Josh Stein (CSHL) Matt Vaughn.

Bikash Shakya Emma Lang Jorge Diaz.  BLASTx entire sequence against 9 plant genomes. RepeatMasker  55.47% repetitive sequences  82.5% retroelements.

NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)

BME 110L / BIOL 181L Computational Biology Tools October 29: Quickly that demo: how to align a protein family (10/27)

BME 110L / BIOL 181L Computational Biology Tools February 19: In-class exercise: a phylogenetic tree for that.

DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.

1 Transcript modeling Brent lab. 2 Overview Of Entertainment  Gene prediction Jeltje van Baren  Improving gene prediction with tiling arrays Aaron Tenney.

Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012.

Genome Annotation Rosana O. Babu.

Mark D. Adams Dept. of Genetics 9/10/04

From Genomes to Genes Rui Alves.

Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.

How can we find genes? Search for them Look them up.

Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.

Research about Alternative Splicing recently 楊佳熒.

Annotation of eukaryotic genomes

BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.

Work Presentation Novel RNA genes in A. thaliana Gaurav Moghe Oct, 2008-Nov, 2008.

Using DNA Subway in the Classroom Genome Annotation: Red Line.

Basics of Genome Annotation Daniel Standage Biology Department Indiana University.

Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.

Alternative Splicing. mRNA Splicing During RNA processing internal segments are removed from the transcript and the remaining segments spliced together.

1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.

Web Databases for Drosophila

RNA-Seq Primer Understanding the RNA-Seq evidence tracks on

Bacterial infection by lytic virus

Annotation for D. virilis

bacteria and eukaryotes

Bacterial infection by lytic virus

”Gene Finding in Eukaryotic Genomes”

EGASP 2005 Evaluation Protocol

Eukaryotic Gene Structure

VectorBase genome annotation

EGASP 2005 Evaluation Protocol

Genome Sequence Annotation Server

Genome Sequence Annotation Server

Gene architecture and sequence annotation

Genome Editing with Apollo

Gene Annotation with DNA Subway

Genome Annotation w/ MAKER

Introduction to Bioinformatics II

Cuong Nguyen, Deng Xin, Dongmei, Zheng Wang

Ensembl Genome Repository.

2 Unité de Biométrie et d’Intelligence Artificielle (UBIA) INRA

Reading Frames and ORF’s

Follow-up from last night: XSEDE credits

4. HMMs for gene finding HMM Ability to model grammar

Introduction to Alternative Splicing and my research report

Gene Structure.

DNA Crash course…..

Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.

Gene Structure.

Presentation transcript:

Annotating The data

• 2) Functional: Find out what the regions do. What do they code for? There are two major parts of annotation •  1) Structural: Find out where the regions of interest (usually genes) are in the genome and what they look like. How many exons/introns? UTRs? Isoforms? •  2) Functional: Find out what the regions do. What do they code for?

Open reading frames Just look at ORFs?

Diﬃcult in practice Example from apollo

No introns, but where are the start and stop codons? Transcriptomes are diﬀerent but have their own challenges No introns, but where are the start and stop codons? Still needs functional annotation

Data used - Proteins

Data used - Proteins Conserved in sequence => conserved annotation with little noise Proteins from model organisms onen used => bias? Proteins can be incomplete => problems as many annotation procedures are heavily dependent on protein alignments >ENSTGUP00000017616 pep:novel chromosome:taeGut3ti2ti4:8_random:2849599:2959678:-‐1 gene:ENSTGUG00000017338 transcript:ENSTGUT000000180 RSPNATEYNWHHLRYPKIPERLNPPAAAGPALSTAEGWMLPWGNGQHPLLARAPGKGRER DGKELIKKPKTFKFTFLKKKKKKKKKTFK >ENSTGUP00000017615 pep:novel chromosome:taeGut3ti2ti4:23_random:205321:209117:1 gene:ENSTGUG00000017337 transcript:ENSTGUT00000018017 PDLRELVLMFEHLHRVRNGGFRNSEVKKWPDRSPPPYHSFTPAQKSFSLAGCSGESTKMG IKERMRLSSSQRQGSRGRQQHLGPPLHRSPSPEDVAEATSPTKVQKSWSFNDRTRFRASL RLKPRIPAEGDCPPEDSGEERSSPCDLTFEDIMPAVKTLIRAVRILKFLVAKRKFKETLR PYDVKDVIEQYSAGHLDMLGRIKSLQTRVEQIVGRDRALPADKKVREKGEKPALEAELVD ELSMMGRVVKVERQVQSIEHKLDLLLGLYSRCLRKGSANSLVLAAVRVPPGEPDVTSDYQ SPVEHEDISTSAQSLSISRLASTNMD

RNA - seq DNA Pre - mRNA mRNA Exon Intron Exon Intron Exon Intron Exon UTR UTR GT AG GT AG GT AG ATG Start codon TAG, TAA, TGA Stop codon Transcription Pre - mRNA UTR UTR AA A A A A A ATG Start codon TAG, TAA, TGA Stop codon Splicing mRNA UTR UTR AAAAAAAAA ATG Start codon TAG, TAA, TGA Stop codon Translation

Should always be included in an annotation project Data used – RNA - seq Should always be included in an annotation project From the same organism as the genomic data => unbiased Can be very noisy (tissue/species dependent), can include pre - mRNA Some ﬁltering method often needed

Spliced reads DNA Pre - mRNA mRNA Exon Intron Exon Intron Exon Intron UTR UTR GT AG GT AG GT AG ATG Start codon TAG, TAA, TGA Stop codon Transcription Pre - mRNA UTR UTR AA A A A A A ATG Start codon TAG, TAA, TGA Stop codon Splicing mRNA UTR UTR AAAAAAAAA ATG Start codon TAG, TAA, TGA Stop codon Translation

Pre - mRNA DNA Pre - mRNA mRNA Exon Intron Exon Intron Exon Intron UTR UTR GT GT GT ATG Start codon TAG, TAA, TGA Stop codon Transcription Pre - mRNA UTR UTR ATG Start codon TAG, TAA, TGA Stop codon Splicing mRNA UTR UTR ATG Start codon TAG, TAA, TGA Stop codon Translation

Annotating The tools and plans

Our layout for this project: As you can see, MAKER is at many places.

External data -‐ proteins, RNA - seq (incl. ESTs) Combine data -‐ use Maker! External data -‐ proteins, RNA - seq (incl. ESTs) Ab‐initio gene ﬁnders (Lift‐overs from closely related genomes) Combined annotation

These HMM-models need to be trained! Ab initio gene ﬁnders are used in Maker Commonly used programs: Augustus, Snap, Genemark-ES, FGENESH, Genscan, Glimmer- HMM,… Uses HMM-models to ﬁgure out how introns, exons, UTRs etc. are structured These HMM-models need to be trained!

Maker will align proteins for you: Blast -‐> Exonerate Data used - Proteins Maker will align proteins for you: Blast -‐> Exonerate Blast is not structure aware, Exonerate is (splice sites, start/stop codons) Preferred ﬁle format: fasta

Cufflinks: mapped reads -‐> transcripts How to use RNA - seq Maker will align transcripts (ESTs), but these need to be assembled first. Cufflinks: mapped reads -‐> transcripts Trinity: assembles transcripts without a genome

Always combine diﬀerent types of evidence! General recommendations Always combine diﬀerent types of evidence! One single method is not enough! Use Maker!

Annotating Inside the pipeline If we were to look at all these from inside our pipeline. Inside the pipeline

Overview annotation = combining diﬀerent lines of evidence into gene models Gene prediction Evidence Combining – MAKER This happens inside MAKER, but is also great for a manual curation.

A bit of terminology ﬁrst: Gene prediction Goal: Finding the single most likely coding sequence (CDS) Gene annotation Goal: Identify the entire gene structure

Existing annotation pipelines – MAKER2 Maker – developed as an easy-to-use alternative to other pipelines Advantages over competing solutions: Almost unlimited parallelism built-in (limited by data and hardware) Largely independent from the underlying system where it is run on Everything is run through one command, no manual combining of data/outputs Follows common standards, produces GMOD compliant output annotation Edit Distance (AED) metric for improved quality control Provides a mechanism to train and retrain ab initio gene predictors annotations can be updated by re-launching Maker with new evidences But how does Maker work exactly?

Existing annotation pipelines – MAKER2 Step 1: Raw compute phase RepeatMasking Nucleotide repeats Transposons/viral proteins Soft-‐masking Hard-masking ATGCGTTTGacgataataaaggGCATAGCCCT ATGCGTTTGNNNNNNNNNNGCATAGCCCT Masked genome

Existing annotation pipelines – MAKER2 Step 1: Raw compute phase Masked genome Blastx Blastn Proteins ESTs

Existing annotation pipelines – MAKER2 Step 2: Filter and cluster alignments Filtering is based on rules deﬁned in the Maker conﬁguration for a given project Example: EST alignment – 80% coverage and 85% identity Default segngs sensible for most projects, but can be changed!

Existing annotation pipelines – MAKER2 Step 2: Filter and cluster alignments Clustering groups evidence alignments into ’loci’

Existing annotation pipelines – MAKER2 Step 2: Filter and cluster alignments Problematic data can complicate clustering Needs to be ﬁxed by a) cleaner data or b) manual curation

Existing annotation pipelines – MAKER2 Step 2: Filter and cluster alignments Clustering groups evidence alignments into ’loci’ Amount of data in any given cluster is then collapsed to remove redundancy Threshold for the collapsing is also user-‐deﬁnable

Existing annotation pipelines – MAKER2 Step 2: Filter and cluster alignments Clustering groups evidence alignments into ’loci’ Amount of data in any given cluster is then collapsed to remove redundancy Threshold for the collapsing is also user-‐deﬁnable Performed for all lines of evidence

Existing annotation pipelines – MAKER2 Step 3: Polishing alignments Blast - based alignments are only approximations, need to be reﬁned

Existing annotation pipelines – MAKER2 Step 3: Polishing alignments Blast-based alignments are only approximations, need to be reﬁned Exonerate is used to create splice-aware alignments

Existing annotation pipelines – MAKER2 Step 4: Synthesis Synthesis refers to the extraction of information to generate evidence for annotations Done by identifying genomic regions overlapping with sequence features

Existing annotation pipelines – MAKER2 Step 4: Synthesis

Existing annotation pipelines – MAKER2 Step 4: Synthesis…and Ab-initio gene ﬁnding Evidence alignments provide support for the identifcation of gene loci Ab-initio predictions can enhance these signals and ﬁll gaps with no evidence

Existing annotation pipelines – MAKER2 Step 4: Synthesis…and Ab-initio gene ﬁnding Ab-initio predictions can be improved when evidence is provided (hints) Help reﬁne and calibrate a computational inference for a given locus

Existing annotation pipelines – MAKER2 Step 4: Synthesistititiand Ab-initio gene ﬁnding Ab‐initio predictions can be improved when evidence is provided (hints) Help reﬁne and calibrate a computational inference for a given locus Hints: Introns, intergenic sequence, CDS

Existing annotation pipelines – MAKER2 Step 5: Annotate Reﬁned Ab-initio models may still be incomplete / partially wrong Need to reconcile with evidence so we don’t miss information -‐> Limited by agreement between Ab-initio proﬁle and evidence

Existing annotation pipelines – MAKER2 The BILS annota-on pla0orm Existing annotation pipelines – MAKER2 Step 5: Annotate Synthesized transcript structures are compared against evidence to ﬁnd UTRs Jacques Dainat, PhD BILS genome annotation pla<orm

Web Apollo What’s next Computational pipelines make mistakes -‐ Need to be run very conservatively or require manual curation Web Apollo