Lecture #4 ABPG+BRIM Exome sequencing project Alexei Fedorov


Lecture #4 ABPG+BRIM Exome sequencing project Alexei Fedorov TOPICS: Files: Fastq, sam, bam, vcf, bed; Reference Human Genome; Transferring files (FTP, SFTP, wget)

Input data: “Fastq” files SG5_R1.fastq_1000 (first 1000 lines of 11 GB) SG5_R2.fastq_1000 R1 and R2 stand for opposite strands. QUESTION (homework #1): Do corresponding R1 and R2 sequences from the same read (identifier) overlap? How? What is the total length of the read?
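A minimal sketch for looking at the two FASTQ files side by side, assuming the common layout of four lines per record (header starting with "@", sequence, "+" separator, quality string) and no line wrapping. The function name read_fastq and the toy records are illustrative assumptions, not part of the course materials.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines.

    Assumes four lines per record with no wrapped sequence lines.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Keep only the first token of the header as the read identifier.
        yield header[1:].split()[0], seq, qual


# Toy records (made up) standing in for the R1 and R2 files.
r1 = ["@read1 1:N:0:ACGT\n", "ACGTAC\n", "+\n", "IIIIII\n"]
r2 = ["@read1 2:N:0:ACGT\n", "GTACGT\n", "+\n", "IIIIII\n"]

for (id1, seq1, _), (id2, seq2, _) in zip(read_fastq(r1), read_fastq(r2)):
    # Mates are expected to appear in the same order in both files.
    assert id1 == id2
    print(id1, len(seq1), len(seq2))
```

With real data, pass open file handles instead of the toy lists, e.g. read_fastq(open("SG5_R1.fastq_1000")).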

How to download the data? FTP Credentials: Link: ftp.medgenome.com Port: 21 User: wuvonasi4074 Password: B!h#$21%arD5 Our attempts: ftp, wget, winscp (ftp graphical interface). FTP to NCBI (USER: anonymous; password =email [any works])

Practice: Download Human Chromosome 19, build 37.5 Use NCBI web page for downloading (ftp) User: anonymous; Password “your_email_address” HOMEWORK #2: Download the gbk file for one of the human chromosomes. Describe “gbk” format of the human genome representation (write 0.5-1 page)
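The anonymous FTP login described above can also be scripted. The sketch below uses Python's standard ftplib module. The host name ftp.ncbi.nlm.nih.gov, the placeholder remote path, and the function names are assumptions made for illustration; the actual location of the chromosome file has to be looked up on the NCBI site.

```python
from ftplib import FTP


def retr_command(remote_path):
    """Build the FTP RETR command for a remote file path."""
    return f"RETR {remote_path}"


def fetch_anonymous(host, remote_path, local_path, email="anonymous@example.com"):
    """Download one file over anonymous FTP (user 'anonymous', email as password)."""
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd=email)
        with open(local_path, "wb") as out:
            # retrbinary streams the file in blocks to the callback.
            ftp.retrbinary(retr_command(remote_path), out.write)


# Example call (not run here; the remote path is a hypothetical placeholder):
# fetch_anonymous("ftp.ncbi.nlm.nih.gov", "path/to/chromosome_file.gbk", "chr.gbk")
```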

Letter of agreement DataAccessAgreementAshkenazi.doc DataAccessAgreementUT_BioinfoStudent.doc every student must sign and return (ultimate HOMEWORK).

Our task: Variant calling (SNPs and INDELs) should be performed using standard procedures of GATK or Samtools. Summary reports of variants called from each sample should be provided; number of SNPs, INDELs, homozygous and heterozygous variants and transition to transversion ratio, read depth and quality distribution of identified variants in each sample should be provided. Variant calling should be done individually and together (all samples together to produce single VCF file for all samples). Variant filtration should be done using standard procedures and parameters.
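To make the requested summary numbers concrete, here is a small sketch of how SNP and INDEL counts and the transition-to-transversion ratio can be computed from (REF, ALT) allele pairs. It is an illustration only, not the GATK or Samtools procedure; the function names and the toy variants are assumptions.

```python
BASES = set("ACGT")
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def summarize(variants):
    """Count SNPs, INDELs, transitions and transversions in (ref, alt) pairs."""
    counts = {"snp": 0, "indel": 0, "ts": 0, "tv": 0}
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        if alt.startswith("<") or alt == "*":
            continue  # symbolic or spanning-deletion alleles are skipped
        if ref in BASES and alt in BASES and ref != alt:
            counts["snp"] += 1
            counts["ts" if is_transition(ref, alt) else "tv"] += 1
        elif len(ref) != len(alt):
            counts["indel"] += 1
    counts["ts_tv"] = counts["ts"] / counts["tv"] if counts["tv"] else None
    return counts


# Toy input (made up): two transitions, one transversion, two indels.
print(summarize([("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A"), ("G", "GTT")]))
```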

QUIZ: What is it? Why autosomal recessive? Why compound heterozygote? What is a consanguineous family?

Example of ‘SAM’ files Neandertal200.sam (first 200 lines) SAMV1.pdf (instruction to SAM format)
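As a reading aid for the SAM example, the sketch below splits one alignment line into the eleven mandatory tab-separated fields defined in the SAM specification and decodes a few of the FLAG bits. The function names and the toy alignment line are made up for illustration, and only a subset of flag bits is covered.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# A subset of the FLAG bits from the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line):
    """Return a dict of the mandatory fields, or None for '@' header lines."""
    if line.startswith("@"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    rec = {name: (int(value) if name in INT_FIELDS else value)
           for name, value in zip(SAM_FIELDS, cols)}
    rec["TAGS"] = cols[11:]  # optional TAG:TYPE:VALUE fields, left unparsed
    return rec


def decode_flag(flag):
    """List the names of the known bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Toy alignment line (made up).
line = "read1\t99\tchr19\t1000\t60\t6M\t=\t1200\t206\tACGTAC\tIIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], decode_flag(rec["FLAG"]))
```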

SAM <-> BAM file conversion The binary “BAM” file is ~3 times smaller than the equivalent SAM file, but it is not human-readable.

Integrative Genomics Viewer (IGV) https://www.broadinstitute.org/igv/ IGV user guide http://jura.wi.mit.edu/bio/education/hot_topics/igv/IGV.pdf https://www.youtube.com/watch?v=5kkPnCV06dE VIDEO https://www.youtube.com/watch?v=YeoHJFHnCrw

Reference Human Genome The first human genome sequence, produced by the 13-year international Human Genome Project and published in 2001, became the first Reference Human Genome. Under this project, scientists sequenced the diploid genomes of 6 anonymous people, so the reference genome is a mosaic of the most frequent alleles found in these six individuals. The initial sequence had a number of sequencing errors and unreadable regions (gaps), so the reference genome is regularly updated by the Genome Reference Consortium. Newer releases are more accurate, with fewer gaps and more annotations. It is important to know the exact version of the Reference Genome (e.g. GRCh38.p7 from March 2016) when an individual genome is represented as a table of differences from the reference (VCF format). Additional information from the Genome Reference Consortium: https://www.ncbi.nlm.nih.gov/grc Wikipedia: https://en.wikipedia.org/wiki/Reference_genome

Examples of VCF files head100_1000genome_VCF VCFdenisovCHR19_1000lines It saves a lot of space to present a genome as a table that lists only the differences between it and the Reference Human Genome link1.4. This popular format of genome representation is known as Variant Call Format (VCF). Each of its data lines represents one genetic variant. The files above are examples of this format.
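To show what "one variant per line" looks like in practice, the sketch below splits a single VCF data line into the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is for reading the example files only: it ignores many features of the format, so real analyses should rely on established tools rather than a home-made parser. The function name and the toy line are assumptions.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line):
    """Return a dict for one data line, or None for '#' header lines."""
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    rec = dict(zip(VCF_COLUMNS, cols))
    rec["POS"] = int(rec["POS"])          # 1-based position on CHROM
    rec["ALT"] = rec["ALT"].split(",")    # several ALT alleles are comma-separated
    info = {}
    if rec["INFO"] != ".":
        for item in rec["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # keys without '=' are flags
    rec["INFO"] = info
    rec["SAMPLES"] = cols[8:]  # FORMAT and per-sample columns, left unparsed
    return rec


# Toy data line (made up, not taken from the course files).
rec = parse_vcf_line("19\t12345\t.\tA\tG,T\t50\tPASS\tDP=20;DB")
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["INFO"])
```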

VCF format http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it
6. How to extract information from a VCF in a sane, (mostly) straightforward way
Use VariantsToTable. No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is. Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers. (Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)

Problems with VCF (examples)
/home/afedorov/1000GENOMES
esv2663150_10lines.txt and Excel file (esv2663150_10linesV2.xlsx)
vi resultJune8 (se nowrap)
example with deletions: esv2663150, deletion on chr 19 http://www.ncbi.nlm.nih.gov/dbvar/variants/esv2663150/
/home/afedorov/DENISOV
zcat T_hg19_1000g.19.mod.vcf.gz |more
rs59558746 at 544960

What are our chances of success? Exome sequencing? Quality data? Coverage? Compound heterozygote?

Our approach An example with the Galaxy web resource: watch YouTube (a nine-minute video): https://www.youtube.com/watch?v=MbW_f4eZNKM (TopHat alignment with Galaxy)

BED files and other formats https://genome.ucsc.edu/FAQ/FAQformat (nice explanation) Example BED: example_lncRNAhg38.BED GVF format (paper GVFformat.pdf)
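A short sketch of reading one BED line, assuming only the three required columns (chrom, chromStart, chromEnd) are needed. BED coordinates are 0-based and the end is exclusive, so the feature length is simply end minus start. The function name and the toy line are illustrative assumptions.

```python
def parse_bed_line(line):
    """Return chrom/start/end/length for a BED line, or None for non-data lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith(("#", "track", "browser")):
        return None
    cols = stripped.split()
    if len(cols) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    chrom, start, end = cols[0], int(cols[1]), int(cols[2])
    return {
        "chrom": chrom,
        "start": start,              # 0-based, inclusive
        "end": end,                  # exclusive
        "length": end - start,
        "first_base_1based": start + 1,  # same position in 1-based coordinates
        "extra": cols[3:],           # optional columns (name, score, strand, ...)
    }


# Toy line (made up).
print(parse_bed_line("chr19\t100\t200\texample_feature"))
```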

HOMEWORK ASSIGNMENTS SLIDE#2: QUESTION (homework #1) Do corresponding R1 and R2 sequences from the same read (identifier) overlap? How? What is the total length of the read? SLIDE#4: HOMEWORK #2: Download the gbk file for one of the human chromosomes. Describe “gbk” format of the human genome representation (write 0.5-1 page) SLIDE#5: DataAccessAgreementUT_BioinfoStudent.doc every student must sign and return (ultimate HOMEWORK).