Nameeta Shah 1, Michael Teplitsky 2, Len A. Pennacchio 2,3, Philip Hugenholtz 3, Bernd Hamann 1,2, and Inna Dubchak 2,3

1 Institute for Data Analysis and Visualization (IDAV), Department of Computer Science, University of California, Davis, One Shields Ave., Davis, CA 95616; 2 Genomics Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720; 3 DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA

SNP-VISTA: AN INTERACTIVE SNPs VISUALIZATION TOOL

Single Nucleotide Polymorphisms (SNPs) are established genetic markers that aid in the identification of loci affecting quantitative traits and/or disease in a wide variety of eukaryotic species. In addition, SNPs have been used extensively in efforts to study the evolution of microbial populations. Such efforts have largely been confined to multi-locus sequence typing of clinical isolates of bacterial species. However, the recent application of random shotgun sequencing to environmental samples makes possible more extensive SNP analysis of co-occurring and co-evolving microbial populations. Tools for visualization of ecogenomics data are in their infancy.

An intriguing finding reported in the Tyson et al. study (2004) was the mosaic nature of the genomes of an archaeal population, inferred to be the result of extensive homologous recombination of three ancestral strains. This observation was based on a manual analysis of a small subset of the data (ca. 40 kbp) and remains to be verified across the whole genome.

We present an interactive visualization tool, SNP-VISTA, to aid in analyses of these types of data:
1) Large-scale resequencing data of disease-related genes, for discovery of associated and/or causative alleles (GeneSNP-VISTA);
2) Massive amounts of ecogenomics data, for studying homologous recombination in microbial populations (EcoSNP-VISTA).

Overview

The main features and capabilities of SNP-VISTA are:
1) Mapping of SNPs to gene structure;
2) Classification of SNPs based on their location in the gene, frequency of occurrence in samples, and allele composition;
3) Clustering based on user-defined subsets of SNPs, highlighting haplotypes as well as recombinant sequences;
4) Integration of protein conservation visualization; and
5) Display of automatically calculated recombination points that are user-editable.

The main advantage of SNP-VISTA comes from its graphical interface and visual representation of these data, which support interactive exploration and hence a better understanding of large-scale SNP data.

Tyson et al., Nature. 2004, 428(6978):

GeneSNP-VISTA for discovery of disease-related mutations in genes

Figure 1. GeneSNP-VISTA screenshot for the ABO blood group (transferase A, alpha 1-3-N-acetylgalactosaminyltransferase; transferase B, alpha 1-3-galactosyltransferase) gene.

Figure 2. EcoSNP-VISTA screenshot of scaffold 1 of the microbial genome of Ferroplasma II. We used the publicly available acid mine drainage dataset.

INPUT (EcoSNP-VISTA)

Alignment data
This file should contain the BLAST output obtained by blasting the consensus sequence against all reads in the database.

Annotation file
Similar to the GeneSNP-VISTA annotation file.

Recombination points (optional)
This file must be a tab-delimited file with four fields on each line.

Sample input files are available on the website.

The following modifications are made to GeneSNP-VISTA for application to ecogenomics data:

Nucleotide-based color scheme
Each cell in the array is colored based on the nucleotide at that SNP position. Once the reads are clustered, this representation allows a user to discern various SNP patterns, which probably correspond to different strains (2.A).

Recombination point calculation and visualization
A user can provide recombination points obtained from another program, or they can be calculated within SNP-VISTA.
The recombination point calculation is based on the Bellerophon program (Huber et al., 2004). Our tool displays recombination points on the coordinate bar as blue lines, giving a global view along with the frequency of SNPs (2.B). The array representation also shows the exact position of each recombination point with two black triangles (2.C). The reads can be examined closely in a window, as shown in figure 2.D. A user can visually verify the recombination points and accept or reject them. It is also possible to add a recombination point. Automatic recombination point calculation produces many false positives, whereas manual detection of recombination points is very tedious. SNP-VISTA combines both approaches to provide a feasible method for detecting recombination points.

EcoSNP-VISTA for discovery of recombination points in microbial populations

Figure 1 panels (GeneSNP-VISTA):
A. Coordinate bar showing the gene structure. The ABO gene is 23,758 base pairs long, and its seven exons are displayed as blue rectangles. The red rectangle is the user-selected region.
B. SNPs are represented as an array of samples (rows) x polymorphic sites (columns), where each cell is colored based on the SNP classification. Blue is used for a common homozygous SNP, yellow for a rare homozygous SNP, and red for a heterozygous SNP; a black dot marks a non-synonymous SNP.
C. Clustering results are shown as a hierarchical tree where each node can be collapsed or expanded.
D. A window displaying the protein alignment. The display is linked with the non-synonymous SNP selected by the user.

Figure 2 panels (EcoSNP-VISTA):
A. SNPs are represented as an array of reads (rows) x polymorphic sites (columns), where each cell is colored based on the nucleotide. Red is used for nucleotide T (Thymine), blue for A (Adenine), yellow for C (Cytosine), and green for G (Guanine).
B. Coordinate bar showing the global view of recombination points, drawn as blue lines, along with the frequency of SNPs, where black indicates higher frequency.
C. The array representation showing the exact position of the recombination point with two black triangles.
D. A window displaying the BLAST alignment for the selected region.

INPUT (GeneSNP-VISTA)

All file formats are available on the website.

Reference sequence
This file should contain the DNA sequence of the gene in FASTA format.

Annotation file
This file must be a tab-delimited file with annotation for exons and coding sequence (cds).

SNPs data
This file must be a tab-delimited file with four fields on each line.

Protein alignment
This file should contain the protein alignment in multi-FASTA format.

SNP-VISTA has the following features:

Mapping of SNPs to the gene structure
A SNP can be in a UTR, exon, intron, or splice site. Such information about the location of SNPs is very valuable to biologists. We map SNPs to the gene structure as shown in figure 1.A. A coordinate bar represents the ABO blood group gene, which is 23,758 bp long and has 7 exons, shown as blue rectangles. The red rectangle is the user-selected subregion of the gene. Green lines show the exact location of each SNP on the gene. On mouse-over, the connecting line is highlighted in red.

Classification of SNPs
A SNP can be homozygous, heterozygous, synonymous, or non-synonymous. We classify SNPs and use a different color for each class. The graphical representation is similar to VG2: the selected data are represented as an array of samples (rows) x polymorphic sites (columns), where each cell is colored according to the classification of the SNP based on its location in the gene, frequency of occurrence in samples, and allele composition (see figure 1.B). On mouse-over, detailed information about the selected SNP, such as sample id, position, and frequency, is displayed in a semi-transparent callout.
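
The classification and coloring described above can be illustrated with a minimal Python sketch. It is not SNP-VISTA's code. The colors are the ones given in the figure 1.B caption; everything else is an assumption made for illustration: the function name `classify_site`, the two-character genotype strings, the reading of "rare homozygous" as "homozygous for an allele whose frequency at the site is below a cutoff", and the cutoff value itself.

```python
from collections import Counter

# Hypothetical cutoff: the poster does not say how "rare" is defined.
RARE_THRESHOLD = 0.05


def classify_site(genotypes, rare_threshold=RARE_THRESHOLD):
    """Return one color name per sample for a single polymorphic site.

    genotypes: list of two-character strings such as "AA" or "AG",
    one per sample (an assumed input representation).
    """
    # Allele frequencies are estimated from all samples at this site.
    allele_counts = Counter(allele for g in genotypes for allele in g)
    total = sum(allele_counts.values())

    colors = []
    for g in genotypes:
        if g[0] != g[1]:
            colors.append("red")      # heterozygous
        elif allele_counts[g[0]] / total < rare_threshold:
            colors.append("yellow")   # homozygous for a rare allele
        else:
            colors.append("blue")     # homozygous for a common allele
    return colors


if __name__ == "__main__":
    site = ["AA"] * 18 + ["AG", "GG"]
    print(classify_site(site, rare_threshold=0.1)[-3:])
```

Applying the function to every column of the samples x sites array yields the color matrix that a display like figure 1.B would render; marking non-synonymous SNPs would additionally require the coding-sequence annotation and is left out here.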

Clustering
Clustering samples based on their SNP patterns allows a user to navigate the data easily. We use the levenstein software to perform the hierarchical clustering. Clustering can be performed using all the SNPs in the data or a user-selected subset. SNP-VISTA displays the hierarchical tree (see figure 1.C), where each node can be collapsed or expanded. Figure 1 shows the result of clustering samples using the SNPs in the last exon.

Integration of multiple alignments of homologous proteins in different species
One approach to assessing the significance of a SNP that changes an amino acid is to look at the conservation of that amino acid across multiple species. A SNP that changes a conserved amino acid is more likely to be a causative mutation. Integrating multiple alignments of homologous proteins allows a biologist to see whether a SNP has changed a conserved amino acid. SNP-VISTA displays the protein alignment along with an Entropy or Sum-of-Pairs similarity score in the protein alignment window (see figure 1.D). When a user selects a non-synonymous SNP, the corresponding amino acid is highlighted in green. In figure 1, the user has selected a heterozygous non-synonymous SNP in the last exon, which changes the amino acid Phenylalanine (F) to Isoleucine (I). The protein alignment window shows the conservation of this amino acid, which is 100% conserved.
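
To make the two conservation scores concrete, here is a minimal Python sketch of per-column scoring for a protein alignment. It is an illustration under stated assumptions, not SNP-VISTA's implementation: the function names are hypothetical, entropy is Shannon entropy in bits, the sum-of-pairs score uses simple identity (1 for a matching pair, 0 otherwise) instead of a substitution matrix, and gap characters are treated as ordinary symbols.

```python
import math
from collections import Counter


def alignment_columns(sequences):
    """Yield each alignment column as a string (sequences must be equal length)."""
    for column in zip(*sequences):
        yield "".join(column)


def column_entropy(column):
    """Shannon entropy in bits; 0 means the column is fully conserved."""
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())


def sum_of_pairs_identity(column):
    """Fraction of sequence pairs that share the same residue in this column."""
    n = len(column)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 1.0  # a single sequence is trivially conserved
    matches = sum(c * (c - 1) // 2 for c in Counter(column).values())
    return matches / pairs


if __name__ == "__main__":
    # Toy alignment of four homologous fragments (made-up data).
    alignment = ["MFK", "MFR", "MFK", "MIK"]
    for i, col in enumerate(alignment_columns(alignment)):
        print(i, col, column_entropy(col), sum_of_pairs_identity(col))
```

A column that is 100% conserved, like the one described for the selected SNP, has entropy 0 and an identity-based sum-of-pairs score of 1; higher entropy or a lower pair score indicates a more variable position.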