Ensembl Compara Perl API Stephen Fitzgerald EBI - Wellcome Trust Genome Campus, UK compara.

What is Ensembl Compara?
- A single database which contains precalculated comparative genomics data
- Access via the Perl API and MySQL
- A production system for generating that database (not covered in this presentation)

Compara data
- 46 species in Ensembl release 52
- Raw genomic sequence
- Whole genome alignments (tBLAT, BlastZ-net, PECAN)
- Syntenic regions (based on BlastZ-net)
- Protein sequences
- Raw protein alignments
- Protein family clusters
- Protein trees
- Gene orthology / paralogy predictions

Compara database & the Ensembl core databases
- Compara holds minimal primary data, so full access to the data requires re-establishing the external links to the core DBs
- Example: compara_52 must be linked with the Ensembl core_52 databases
- Proper Registry configuration is critical, and load_registry_from_db is probably the best way to get it

The Compara Perl API
- Written in object-oriented Perl
- Used to retrieve data from and store data into the ensembl-compara database
- Generalized to extend to non-Ensembl genomic data (UniProt)
- Follows the same Data Object & Object Adaptor / DBAdaptor design as the other Ensembl APIs

Compara object model overview
[Diagram: the object classes NCBITaxon, GenomeDB, DnaFrag, Member, MethodLinkSpeciesSet, GenomicAlign, GenomicAlignBlock, SyntenyRegion, DnaFragRegion, Homology, Family, Attribute, ProteinTree and AlignedMember, arranged under three headings: PRIMARY DATA, ANALYSIS, RESULTS]

Primary data

GenomeDB: relates to a particular Ensembl core DB
- Methods: name(), assembly(), genebuild(), taxon()
- Adaptor fetch methods: fetch_by_name_assembly(), fetch_by_registry_name(), fetch_by_Slice(), fetch_all()

DnaFrag: represents a top-level SeqRegion
- Methods: name(), length(), genome_db(), slice(), coord_system_name()
- Adaptor fetch methods: fetch_by_Slice(), fetch_by_GenomeDB_and_name()

Member: lists all Ensembl genes + SwissProt + SPTrEMBL
- Methods: source_name(), stable_id(), genome_db(), taxon(), sequence(), get_all_peptide_Members(), get_longest_peptide_Member(), gene_member()
- Adaptor fetch methods: fetch_by_source_stable_id()

Analysis

MethodLinkSpeciesSet: provides a handle to isolate specific data from the shared tables (homology, genomic_align_block)
- MethodLink: each individual analysis in Compara is tagged with a unique name called a method_link_type: BLASTZ_NET, TRANSLATED_BLAT, PECAN, SYNTENY, FAMILY, ENSEMBL_ORTHOLOGUES, ENSEMBL_PARALOGUES, PROTEIN_TREES
- SpeciesSet: the set of species, as (a reference to) an array of GenomeDBs
- Adaptor fetch methods: fetch_by_method_link_type_GenomeDBs(), fetch_by_method_link_type_registry_aliases()
- Methods: name(), method_link_type(), species_set(), source()

Exercises

GenomeDB
1. Find out the versions of the human and mouse genomes in the database
2. Print the names of all the GenomeDBs in the database

DnaFrag
1. Get the DnaFrag for chromosome 1 of the macaque genome (using a genome_db object as an argument)
2. Get the DnaFrag for chromosome X of the mouse genome (using a core slice object as an argument)

MethodLinkSpeciesSet
1. Find out how many analyses are stored in the database
2. Get the name of the MethodLinkSpeciesSet corresponding to the BlastZ-net analysis for human and mouse
3. Get the names of all the species using the mlss corresponding to the Pecan analyses

GenomeDB example code

    use strict;
    use Bio::EnsEMBL::Registry;

    # Connect to the public Ensembl server and register the available databases
    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    # Fetch the GenomeDB object for human via its registry alias
    my $genome_db_adaptor = $reg->get_adaptor("Multi", "compara", "GenomeDB");
    my $genome_db = $genome_db_adaptor->fetch_by_registry_name("human");

    print "Name: ", $genome_db->name, "\n";
    print "Assembly: ", $genome_db->assembly, "\n";
    print "GeneBuild: ", $genome_db->genebuild, "\n";

GenomeDB example code (output)

    $> perl genome_db1.pl
    Homo sapiens NCBI Ensembl
    Mus musculus NCBIM Ensembl

DnaFrag example code

    use strict;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $genome_db_adaptor = $reg->get_adaptor("Multi", "compara", "GenomeDB");
    my $genome_db = $genome_db_adaptor->fetch_by_registry_name("human");

    # Fetch human chromosome 13 as a DnaFrag, using the GenomeDB as an argument
    my $dnafrag_adaptor = $reg->get_adaptor("Multi", "compara", "DnaFrag");
    my $dnafrag = $dnafrag_adaptor->fetch_by_GenomeDB_and_name($genome_db, "13");

    print "Name:", $dnafrag->name, "\n";
    print "Length:", $dnafrag->length, "\n";
    print "CoordSystem:", $dnafrag->coord_system_name, "\n";

DnaFrag example code (output)

    $> perl test1.pl
    Name :13
    Length :
    CoordSystem :chromosome

MethodLinkSpeciesSet example code

    use strict;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    # Fetch the human-mouse BLASTZ_NET analysis via the species' registry aliases
    my $mlssa = $reg->get_adaptor("Multi", "compara", "MethodLinkSpeciesSet");
    my $mlss = $mlssa->fetch_by_method_link_type_registry_aliases(
        "BLASTZ_NET", ["human", "mouse"]);

    print $mlss->name, "\n";
    print "type: ", $mlss->method_link_type, "\n";

    # species_set() returns a reference to an array of GenomeDBs
    my $species_set = $mlss->species_set();
    foreach my $this_genome_db (@{$species_set}) {
        print $this_genome_db->name(), "\n";
    }

MethodLinkSpeciesSet example code (output)

    $> perl method_link_species_set.pl
    H.sap-M.mus blastz-net (on H.sap)

Genomic Alignments
- BlastZ-net: used to compare closely related pairs of species
  BlastZ-raw -> BlastZ-chain -> BlastZ-net
- Translated BLAT: used to compare more distant pairs of species
- Pecan: multiple global alignments
  all vs all coding exons wublastp -> Mercator -> Pecan on each syntenic block

GenomicAlignBlock
- Represents a genomic alignment; contains 1 GenomicAlign per sequence
- Adaptor fetch methods: fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice)
- Methods: method_link_species_set(), score(), length(), perc_id(), get_all_GenomicAligns(), get_SimpleAlign()

GenomicAlign
- Methods: dnafrag(), genome_db(), get_Slice(), dnafrag_start(), dnafrag_end(), dnafrag_strand(), aligned_sequence()

GenomicAlignBlock

    $all_GAlign  = $GABlock->get_all_GenomicAligns()    returns an array ref
    $Simplealign = $GABlock->get_SimpleAlign()          returns an object

- $Simplealign: a BioPerl object which contains the whole alignment; it can be printed in various formats using BioPerl modules
- $Galign: an object which represents only one of the sequences in the alignment

    Hsap.X : ACCTTC-A  <- $ga
    Cfam.X : ACC--CGA  <- $ga
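As a rough illustration of how the gapped strings in the toy Hsap.X / Cfam.X alignment relate to the underlying sequences, here is a minimal sketch in plain Perl that needs neither the Ensembl API nor BioPerl. The helper names `ungapped` and `column_counts` are hypothetical and not part of any library, and the column tally is only an illustration; it is not claimed to match how `perc_id()` is computed.

```perl
use strict;
use warnings;

# Hypothetical helper: remove gap characters to recover the ungapped sequence.
sub ungapped {
    my ($aligned) = @_;
    (my $seq = $aligned) =~ tr/-//d;
    return $seq;
}

# Hypothetical helper: tally alignment columns as identical, gapped or mismatched.
sub column_counts {
    my ($seq_x, $seq_y) = @_;
    die "aligned sequences must have equal length\n"
        unless length($seq_x) == length($seq_y);
    my %count = (identical => 0, gapped => 0, mismatched => 0);
    for my $i (0 .. length($seq_x) - 1) {
        my $x = substr($seq_x, $i, 1);
        my $y = substr($seq_y, $i, 1);
        if    ($x eq '-' or $y eq '-') { $count{gapped}++ }
        elsif ($x eq $y)               { $count{identical}++ }
        else                           { $count{mismatched}++ }
    }
    return \%count;
}

# The two aligned strings from the slide
my %aligned = ('Hsap.X' => 'ACCTTC-A', 'Cfam.X' => 'ACC--CGA');

for my $name (sort keys %aligned) {
    print "$name aligned: $aligned{$name}  ungapped: ", ungapped($aligned{$name}), "\n";
}

my $counts = column_counts($aligned{'Hsap.X'}, $aligned{'Cfam.X'});
print "$_ columns: $counts->{$_}\n" for sort keys %$counts;
```

Stripping the gaps is one way to see the difference between an aligned sequence and the original one: the gaps exist only in the alignment.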

Synteny
- Based on BlastZ-net alignments

SyntenyRegionAdaptor
- Adaptor fetch methods: fetch_all_by_MethodLinkSpeciesSet_Slice(), fetch_all_by_MethodLinkSpeciesSet_DnaFrag()
- Methods: get_all_DnaFragRegions(), method_link_species_set()

DnaFragRegion
- Methods: slice(), dnafrag(), dnafrag_start(), dnafrag_end(), dnafrag_strand()

Exercises

GenomicAlignBlock
1. Fetch all the BLASTZ_NET alignments between the first 130K nucleotides of human chromosome X and the mouse genome.
2. Print the exact location of the alignment blocks.
3. Compare the original and the aligned sequences.
4. Find the BLASTZ_NET alignments between the human gene BRCA2 and the mouse genome.
5. Print the BLASTZ_NET alignments between the rat gene ECSIT and the mouse genome.
6. Print the PECAN multiple alignments between the rat gene ECSIT and 11 other amniote vertebrates.
7. Print the constrained-element alignments within the rat ECSIT locus (use the constrained elements generated from the 12-way alignments).

Synteny
1. Get the human-mouse syntenic map for human chromosome X.

GenomicAlignBlock example code

    [...]
    my $slice_adaptor = $reg->get_adaptor("human", "core", "Slice");
    my $slice = $slice_adaptor->fetch_by_region("chromosome", "12", 1e4, 2e4);

    # Fetch every alignment block of this analysis that overlaps the slice
    my $gaba = $reg->get_adaptor("Multi", "compara", "GenomicAlignBlock");
    my $genomic_align_blocks = $gaba->fetch_all_by_MethodLinkSpeciesSet_Slice(
        $method_link_species_set, $slice);

    foreach my $this_gab (@{$genomic_align_blocks}) {
        # One GenomicAlign per aligned sequence in the block
        my $all_gas = $this_gab->get_all_GenomicAligns();
        foreach my $this_ga (@{$all_gas}) {
            print $this_ga->genome_db->name(), ":",
                $this_ga->get_Slice()->name(), "\n";
            print $this_ga->aligned_sequence(), "\n";
        }
        print "\n";
    }

GenomicAlignBlock example code (output)

    $> perl gab.pl
    Mus musculus:chromosome:NCBIM37:6: : :-1
    CCTCTTAATAAACATTATTGTCAA[…]
    Homo sapiens:chromosome:NCBI36:12:19128:19507:1
    CCTCTTAATAAGCACACATATCCT[..]

Synteny example code

    [...]
    my $synteny_region_adaptor = $reg->get_adaptor("Multi", "compara", "SyntenyRegion");
    my $synteny_regions = $synteny_region_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice(
        $human_mouse_synteny_method_link_species_set, $human_slice);

    foreach my $this_synteny_region (@{$synteny_regions}) {
        # Each syntenic region holds one DnaFragRegion per species
        my $these_dnafrag_regions = $this_synteny_region->get_all_DnaFragRegions();
        foreach my $this_dnafrag_region (@{$these_dnafrag_regions}) {
            print $this_dnafrag_region->dnafrag->genome_db->name, ": ",
                $this_dnafrag_region->slice->name, "\n";
        }
        print "\n";
    }

Homology
- e! 38:
  - Orthologue predictions based on best reciprocal BLAST hits
  - Paralogues for a selected set of species
  - No global view of the evolutionary history of the gene considered
- e! 39+:
  - Orthologues and paralogues are inferred from protein trees
  - Phylogeny: orthology/paralogy in one go

BSR: Blast Score Ratio. When two proteins P1 and P2 are compared, BSR = score(P1,P2) / max(self-score(P1), self-score(P2)). The default threshold used in the initial clustering step is 0.33.
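The ratio can be sketched in a few lines of plain Perl. This is only an illustration of the formula as stated on the slide: the sub name `blast_score_ratio` is hypothetical, the scores are made-up numbers, and treating a BSR equal to the threshold as passing is an assumption, since the slide does not say whether the comparison is strict.

```perl
use strict;
use warnings;

# Hypothetical helper: Blast Score Ratio of two proteins P1 and P2.
# BSR = score(P1,P2) / max(self-score(P1), self-score(P2))
sub blast_score_ratio {
    my ($score_p1_p2, $self_score_p1, $self_score_p2) = @_;
    my $max_self = $self_score_p1 > $self_score_p2 ? $self_score_p1 : $self_score_p2;
    die "self-scores must be positive\n" unless $max_self > 0;
    return $score_p1_p2 / $max_self;
}

my $threshold = 0.33;                              # default threshold quoted on the slide
my $bsr = blast_score_ratio(150, 400, 300);        # made-up scores, for illustration only

# Assumption: a BSR equal to the threshold counts as passing
my $verdict = $bsr >= $threshold ? "meets" : "is below";
printf "BSR = %.3f, which %s the threshold of %.2f\n", $bsr, $verdict, $threshold;
```

Dividing by the larger of the two self-scores keeps the ratio at or below 1 whenever the pairwise score does not exceed that self-score.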

Homology types

Homology
- A Homology object contains 1 pair of Member/Attribute per gene/protein
- Adaptor fetch methods: fetch_all_by_Member(), fetch_all_by_MethodLinkSpeciesSet(), fetch_all_by_Member_MethodLinkSpeciesSet()
- Methods: method_link_species_set(), description(), subtype(), perc_id(), get_all_Member_Attribute(), get_SimpleAlign()

Family
- Compara computes gene family clusters
- Runs on all Ensembl transcripts plus all UniProt/SWISSPROT and UniProt/SPTREMBL metazoan proteins
- The algorithm is based on:
  - All vs all blastp
  - MCL clustering
  - Muscle multiple aligner
- Results are stored in the family and family_member tables

Family
- A Family object contains 1 pair of Member/Attribute per gene/protein
- Adaptor fetch methods: fetch_all_by_Member()
- Methods: method_link_species_set(), description(), description_score(), get_all_Member_Attribute(), get_SimpleAlign()

Exercises

Members
1. Find the Member corresponding to SwissProt protein O
2. Find the Member for the human gene BRCA2
3. Find all the peptide Members corresponding to the human gene CTDP1

Homology
1. Get all the predicted homologues for the human gene BRCA2
2. Get all the mouse orthologues predicted for the human gene CTDP1

Family
1. Get the family predicted for the human gene BRCA2
2. Get the alignments corresponding to the family of the human gene HBEGF

Member example code

    use strict;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    # Fetch the gene Member by its source name and stable ID
    my $member_adaptor = $reg->get_adaptor("Multi", "compara", "Member");
    my $member = $member_adaptor->fetch_by_source_stable_id(
        "ENSEMBLGENE", "ENSG ");

    print "All proteins:\n";
    my $all_peptide_members = $member->get_all_peptide_Members();
    foreach my $this_peptide (@{$all_peptide_members}) {
        print $this_peptide->stable_id(), "\n";
    }

Member example code (output)

    $> perl test2.pl
    All proteins:
    ENSP
    ENSP
    ENSP

Homology example code

    [...]
    my $ma = $reg->get_adaptor("Multi", "compara", "Member");
    my $member = $ma->fetch_by_source_stable_id("ENSEMBLGENE", "ENSG ");

    my $homology_adaptor = $reg->get_adaptor("Multi", "compara", "Homology");
    my $homologies = $homology_adaptor->fetch_all_by_Member($member);

    foreach my $this_homology (@{$homologies}) {
        print $this_homology->description, "\n";
        my $member_attributes = $this_homology->get_all_Member_Attribute();
        foreach my $this_mem_attr (@{$member_attributes}) {
            # Each entry is a [Member, Attribute] pair
            my ($this_member, $this_attribute) = @{$this_mem_attr};
            print $this_member->genome_db->name, " ",
                $this_member->source_name, " ",
                $this_member->stable_id, "\n";
        }
        print "\n";
    }

Family example code

    [...]
    my $ma = $reg->get_adaptor("Multi", "compara", "Member");
    my $member = $ma->fetch_by_source_stable_id("ENSEMBLGENE", "ENSG ");

    my $family_adaptor = $reg->get_adaptor("Multi", "compara", "Family");
    my $families = $family_adaptor->fetch_all_by_Member($member);

    foreach my $this_family (@{$families}) {
        print $this_family->description, "\n";
        my $member_attributes = $this_family->get_all_Member_Attribute();
        foreach my $this_mem_attr (@{$member_attributes}) {
            # Each entry is a [Member, Attribute] pair
            my ($this_member, $this_attribute) = @{$this_mem_attr};
            print $this_member->taxon->binomial, " ",
                $this_member->source_name, " ",
                $this_member->stable_id, "\n";
        }
        print "\n";
    }

Getting More Information
- perldoc: viewer for inline API documentation
    shell> perldoc Bio::EnsEMBL::Compara::GenomeDB
    shell> perldoc Bio::EnsEMBL::Compara::DBSQL::MemberAdaptor
- online at:
- Tutorial document:
  cvs: ensembl-compara/docs/ComparaTutorial.pdf
- ensembl-dev mailing list:
- Exercise solutions:

Ensembl-dev mailing list and HelpDesk
- The ensembl-dev mailing list is great for questions about the API and the DB
- HelpDesk is very helpful
- Give detailed information on what you are trying to do
- Check that you have the modules installed ($PERL5LIB pointing to them)
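One possible way to do the last check before asking for help is a short plain-Perl script that tries to load a few of the modules used in this presentation and prints the module search path. The helper name `module_available` is hypothetical, and the list of modules checked is just a suggestion.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Bio::SimpleAlign -> Bio/SimpleAlign.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Suggested modules to check; adjust the list to what your script actually uses
my @modules = qw(
    Bio::EnsEMBL::Registry
    Bio::EnsEMBL::Compara::GenomeDB
    Bio::SimpleAlign
);

for my $module (@modules) {
    printf "%-35s %s\n", $module, module_available($module) ? "found" : "NOT found";
}

# Directories listed in $PERL5LIB show up at the front of @INC
print "Module search path (\@INC):\n";
print "  $_\n" for @INC;
```

Including this output in a message to the mailing list or HelpDesk shows at a glance whether the problem is an installation issue or an API question.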

Ensembl Team
- Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
- Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl, Andreas Kähäri
- BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
- Distributed Annotation System (DAS): Eugene Kulesha
- Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
- Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA)
- Comparative Genomics: Javier Herrero, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Albert Vilella, Leo Gordon
- Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Browen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerck Vogel, Kevin Howe, Felix Kokocinski, Stephen Rice, Simon White
- Zebrafish Annotation: Kerstin Jekosch, Mario Caccamo, Ian Sealy
- VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
- Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino
- Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
- Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard

A special case of ortholog