VISTA family of computational tools for comparative genomics. How can we leverage genome sequences from many species to learn about genome function? Microbial applications. Inna Dubchak, Genomics Division LBNL, JGI
Human Genome Annotation (figure label: Gene A). Only 1–2% coding. Efficient identification of regulatory sequences?
Sequence conservation implies function. [Figure: two sequences descended from a last common ancestor 80 million years ago. Divergence = non-functional; conservation = functional region. A stretch that stays conserved between the two sequences is a potential functional region.]
Comparative Genomics Introduction Human Drosophila Mouse Urchin Chimp Similar Genes Synteny Sequence Alignment
VISTA is an integrated system for global sequence alignment and visualization for comparative genomic analysis
Algorithm: Feature
AVID*: can handle draft sequence
LAGAN**: produces true multiple alignments
Shuffle-LAGAN**: handles rearrangements (inversions, translocations)
* Lior Pachter, UC Berkeley
** Michael Brudno, U. Toronto
How does VISTA Work: Global Genomic Alignments (sequence 1 vs. sequence 2)
1. Anchoring: identify regions of strong similarity
2. Chaining: join regions of weak or no similarity
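The two steps above can be illustrated with a small sketch. It is not the AVID or LAGAN implementation: it assumes that anchors are exact k-mers occurring once in each sequence, and that chaining means keeping the longest set of anchors that appear in the same order in both sequences. The function names and the parameter k are hypothetical. The weakly similar stretches between consecutive anchors would be aligned in a further step that is not shown.

```python
from bisect import bisect_left


def find_anchors(seq1, seq2, k=8):
    """Return sorted (i, j) pairs where a k-mer unique in seq1 (at i)
    also occurs exactly once in seq2 (at j). Assumption: exact unique
    k-mers stand in for 'regions of strong similarity'."""
    def unique_kmers(seq):
        pos = {}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # None marks a k-mer that has been seen more than once.
            pos[kmer] = None if kmer in pos else i
        return {kmer: i for kmer, i in pos.items() if i is not None}

    u1, u2 = unique_kmers(seq1), unique_kmers(seq2)
    return sorted((i, u2[kmer]) for kmer, i in u1.items() if kmer in u2)


def chain_anchors(anchors):
    """Keep the longest subset of anchors that increases in both
    coordinates (longest increasing subsequence on j, anchors sorted by i)."""
    tails = []       # tails[m]: smallest final j of any chain of length m + 1
    tails_idx = []   # index into anchors of the anchor stored in tails[m]
    prev = [None] * len(anchors)
    for idx, (_, j) in enumerate(anchors):
        m = bisect_left(tails, j)
        if m == len(tails):
            tails.append(j)
            tails_idx.append(idx)
        else:
            tails[m] = j
            tails_idx[m] = idx
        prev[idx] = tails_idx[m - 1] if m > 0 else None
    # Walk back from the end of the longest chain.
    chain = []
    idx = tails_idx[-1] if tails_idx else None
    while idx is not None:
        chain.append(anchors[idx])
        idx = prev[idx]
    return chain[::-1]
```

Anchors that are out of order relative to the chain, for example those created by an inversion or translocation, are dropped by this global chaining. That is the kind of case the slide attributes to Shuffle-LAGAN.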
Global Genomic Aligner Output. [Figure: pairwise nucleotide alignment of two genomic sequences printed in blocks, with "|" marking identical bases and "-" marking gaps.]
VISTA visualization. A “sliding window” measures sequence conservation along the alignment (default window size 100 bp). Sequence conservation is presented graphically as a “peaks-and-valleys” curve: the x-axis shows base sequence coordinates, the y-axis shows % identity, and the figure marks >70% identity. [Figure: example pairwise alignment above the conservation curve.]
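A minimal sketch of the sliding-window calculation behind such a curve follows. It assumes two gapped sequences of equal length taken from a pairwise alignment, counts a column as identical only when both bases match and neither is a gap, and treats the 70% cutoff as inclusive. The function names are hypothetical and this is not the VISTA source code.

```python
def window_identity(aln1, aln2, window=100):
    """Percent identity for every window of `window` alignment columns.
    Entry s of the result covers columns s .. s + window - 1."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    match = [int(a == b and a != "-") for a, b in zip(aln1, aln2)]
    n = len(match)
    if n < window:
        return []
    total = sum(match[:window])
    identities = [100.0 * total / window]
    for start in range(1, n - window + 1):
        # Slide by one column: add the entering column, drop the leaving one.
        total += match[start + window - 1] - match[start - 1]
        identities.append(100.0 * total / window)
    return identities


def conserved_regions(identities, window, threshold=70.0):
    """Merge overlapping windows at or above `threshold` into
    half-open (start, end) column intervals."""
    regions = []
    for start, pct in enumerate(identities):
        if pct >= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Plotting the returned identities against the window start coordinate gives the curve described on the slide; the merged intervals correspond to its peaks.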
VISTA homepage: VISTA Servers (submit your own data) VISTA Browsers (precomputed alignments) Other VISTA-related Projects Access servers, browsers, other information
VISTA Servers
mVISTA: Align and compare sequences
wgVISTA: Align and compare sequences, including microbial assemblies
rVISTA: Search for TFBS combined with a comparative sequence analysis
GenomeVISTA: Align DNA sequence to a genome
Precomputed Alignments
VISTA Browser: Browse through pre-computed whole-genome alignments
Whole Genome rVISTA: Whole genome analysis for conserved TFBS over-represented in upstream regions of genes
VISTA-Point: Browse and obtain sequence and alignment data
VISTA Browser: Access
VISTA Browser: Input Menu. Choose “base” genome (genome); select location (position); determine visualization preference (visualization): VISTA Browser, VISTA tracks on UCSC Browser, or VISTA-Point. Java 2, if needed.
VISTA Browser: Alignment Details. [Figure labels: direction, exon, repeats, alignment, SNPs, gene.]
VISTA Browser: Result. [Figure labels: position on chromosome; control panel; graphical display of genome alignments; color legend; cursor info; menu & icons; curve annotation (species), 1 row.]
VISTA Browser: Zooming vs. rhesus vs. dog
VISTA browser
VISTA Point: Access Overview
VISTA Point: Graphics Table
VISTA Point: Alignments Table (sequence)
Google map-like Dot-Plot
BlockView – Synteny Plot tool
Principal components
RegTransBase (experimental data): manually curated database of regulatory interactions captured from literature; 6000 papers. NAR database issue, 2007.
RegPrecise (computational predictions): manually curated database of regulons inferred by comparative genomics approach. NAR database issue, 2010; Featured Article.
RegPredict (web tool for regulon inference): integrated system for fast and accurate inference of regulons by comparative genomics. NAR Web Server issue, 2010; Featured Article.
mVISTA: Access
mVISTA: Interface Our example will show 3 sequences Align up to 100 sequences
mVISTA: Input of Sequences. Provide your e-mail address. Upload your sequences (upload file) or enter a GenBank ID.
mVISTA: Input Parameters
AVID: multiple pairwise alignments; accepts finished or draft sequences
LAGAN: true multiple alignments
Shuffle-LAGAN: multiple pairwise alignments; detects sequence rearrangements and inversions
mVISTA: Results (PDF, VISTA Browser, VISTA-Point)
wgVISTA: Microbial Assemblies Comparison. wgVISTA (whole genome VISTA) compares 2 sequences (up to 10 Mb). Draft or finished microbial assembly sequences can be used.
rVISTA: Access
Regulatory VISTA (rVISTA): prediction of transcription factor binding sites. Simultaneous searches of the major transcription factor binding site database (TRANSFAC) and the use of global sequence alignment to sieve through the data. An rVISTA search is run automatically when submitting to mVISTA or GenomeVISTA.
Regulatory VISTA (rVISTA):
Human  TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACAAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTGTCTCTCCCTTCCCCTCTG
Mouse  TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTCTCTCTTCCTCCCCCTCCA
Dog    TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCGATTTTCTACCTACGACCTCACTTTCTGTTGCGCTCACTCCCTTCCCCTGCA
Rat    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGTTCTCTCTTCCTCCCCCTCCA
Cow    TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCGTTCTCTCCCTTCCCCTCCT
Rabbit TGATTTCTCGGCAGCCAGGGAGGGCCCCACGAC-AAGCCATTCAAAATCCCAGAAGTGATTTTCTACTTACGACCTCACTTTCTGTTG----CTCTCTCCTTCCCTCCA
[Figure annotations: predicted sites Ikaros-2, Ikaros-2, NFAT, Ikaros-2; 20 bp dynamic shifting window, >80% ID.]
1. Identify potential transcription factor binding sites for each sequence using library of matrices (TRANSFAC)
2. Identify aligned sites using VISTA
3. Identify conserved sites using dynamic shifting window
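The three steps can be sketched for a pairwise alignment as follows. This is an illustration under stated assumptions, not the rVISTA implementation. An exact-match motif search stands in for TRANSFAC matrix scoring. A site counts as aligned when a site for the same motif starts at the same alignment column in the second sequence. The "dynamic shifting window" is interpreted here as taking the best 20-column window that fully contains the site, with an inclusive 80% cutoff. All function names are hypothetical.

```python
def find_sites(seq, motif):
    """Step 1 (simplified): start positions of exact motif matches.
    A real search would score position weight matrices instead."""
    m = len(motif)
    return [i for i in range(len(seq) - m + 1) if seq[i:i + m] == motif]


def column_of(aln_seq):
    """Map each ungapped sequence position to its alignment column."""
    return [c for c, ch in enumerate(aln_seq) if ch != "-"]


def best_window_identity(aln1, aln2, start, end, window=20):
    """Highest percent identity among all windows of `window` columns
    that fully contain the columns [start, end)."""
    best = 0.0
    lo = max(0, end - window)
    hi = min(start, len(aln1) - window)
    for w in range(lo, hi + 1):
        pairs = zip(aln1[w:w + window], aln2[w:w + window])
        matches = sum(a == b and a != "-" for a, b in pairs)
        best = max(best, 100.0 * matches / window)
    return best


def classify_sites(aln1, aln2, motif, window=20, threshold=80.0):
    """Label each site in sequence 1 as 'unaligned', 'aligned' or 'conserved'."""
    seq1, seq2 = aln1.replace("-", ""), aln2.replace("-", "")
    cols1, cols2 = column_of(aln1), column_of(aln2)
    # Step 2: alignment columns where a site starts in sequence 2.
    site_cols2 = {cols2[j] for j in find_sites(seq2, motif)}
    result = []
    for i in find_sites(seq1, motif):
        start = cols1[i]
        end = cols1[i + len(motif) - 1] + 1
        if start not in site_cols2:
            label = "unaligned"
        # Step 3: require a well-conserved window around the aligned site.
        elif best_window_identity(aln1, aln2, start, end, window) >= threshold:
            label = "conserved"
        else:
            label = "aligned"
        result.append((i, label))
    return result
```

The three labels correspond to progressively stricter filters: every predicted site, sites aligned in both sequences, and aligned sites that also sit in a conserved stretch.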
rVISTA: Interface. rVISTA sequence submission: set the number of sequences, then submit your address and sequences and set parameters. Key step: check the box for "Find potential transcription factors".
rVISTA: Select TRANSFAC Matrices
rVISTA: Mailed Results. Mailed results will provide a link. Choose which binding site matrices to display. You can then choose visualization options.
rVISTA: Results Graphic
Blue: all transcription factor (TF) binding sites
Red: TF sites which are aligned in both sequences
Green: TF sites which are aligned and in conserved regions
Whole Genome rVISTA: Access
Whole Genome rVISTA: Select Alignment IDs or symbols upstream range
Whole Genome rVISTA: Results sites found view genes
Examples of VISTA usage
Non-coding regulatory regions, for example enhancers
Genes from the same gene families
Alternative splicing
Transcriptional regulation
Genetic studies
Collected references are available through the Publications link at the VISTA home page
VISTA-related Publications
VISTA thanks
Genomics Division, LBNL, led by Dr. Edward Rubin
Biology: Dario Boffelli, Kelly Frazer, Gaby Loots, Len Pennacchio, Marcelo Nobrega, Axel Visel
Bioinformatics: Michael Brudno, Olivier Couronne, Simon Minovitsky, Igor Ratner, Alexander Poliakov, Lior Pachter (UCB), Shyam Prabhakar, Dmitriy Ryaboy, Nameeta Shah, Inna Dubchak