Genome STRiP
ASHG Workshop demo materials
Bob Handsaker
October 19, 2014
Running Genome STRiP directly on AWS
Cloud demo: Genome STRiP command line
[Diagram labels: StarCluster, Cloud Storage, Sequencing data, Amazon Web Services, Genome STRiP]
Cloud computing scenarios
Why are people interested in Genome STRiP on the cloud?
  Increase compute and storage capacity for large-scale processing
    Large genome studies
    Economical and with short lead time
  Utilize data sets that are stored in the cloud
    Public data sets (e.g. 1000 Genomes)
    Data sharing with collaborators
    No need to download bulky data to each site
Cookbook recipe: Genotyping in 1000 Genomes Phase 1
  Inputs
    A site VCF file describing the variants (e.g. large deletions) to genotype
  Outputs
    Genotype VCF file
    Plots for quality control
  1000 Genomes data
    You choose the BAM file location:
      Cached copy on Amazon S3 storage
      HTTP from NCBI or EBI
  StarCluster
    Uses the StarCluster software from MIT for Amazon EC2 provisioning
    http://star.mit.edu/cluster
Demo
  Show input VCF file in local directory
  starcluster put gs-cluster example.vcf example.vcf
  starcluster sshmaster gs-cluster
  ./genotype-sites.sh example.vcf run1
  (show output)
  (log out)
  starcluster get gs-cluster run1 run1
  Show VCF in TextEdit
  Show genotyping plot PDF
Cloud computing support in Genome STRiP
  Remote BAM file access
    Support for multiple file access protocols in addition to local files
      HTTP / HTTPS
      FTP
      Amazon S3 protocol
  Pre-computed metadata for 1000 Genomes Phase 1 and Phase 3
    Eliminates the need to run Genome STRiP preprocessing
    Avoids the need to download the 1000 Genomes BAM files
    Metadata is relatively compact: 5 GB (Phase 1) and 13 GB (Phase 3)
    ftp://ftp.broadinstitute.org/pub/svtoolkit/public_metadata/
  Cookbook recipes for common scenarios
    Genotyping variants in 1000 Genomes samples
Genome STRiP cookbook
Sample genotyping output
  Standard VCF file with sample genotypes

    ##fileformat=VCFv4.1
    #CHROM  POS      ID           REF  ALT    QUAL  FILTER  INFO         FORMAT    HG00096      HG00100
    20      3821195  DEL_2_99615  A    <DEL>  .     .       END=3825137  GT:FT:GQ  0/0:PASS:71  0/1:PASS:14

  Genotyping plot for visual verification
    Histogram of normalized read depth
    Colors indicate confident calls (gray samples are below 95% confidence)
    Small numbers on plot indicate evidence from read pairs or split reads
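To make the per-sample FORMAT fields concrete, the sketch below lists GT, FT and GQ for each sample in a genotype VCF. It is an illustration only, not part of Genome STRiP: the function name list_genotypes is hypothetical, and it assumes a tab-delimited, uncompressed VCF laid out like the example record above.

```shell
# Hypothetical helper (not part of Genome STRiP): print one line per
# site and sample with the GT, FT and GQ values from a genotype VCF.
# Assumes a tab-delimited, uncompressed VCF file given as $1.
list_genotypes() {
    awk -F'\t' '
        /^##/ { next }                        # skip meta-information lines
        /^#CHROM/ {                           # header line: remember sample names
            for (i = 10; i <= NF; i++) sample[i] = $i
            next
        }
        {
            n = split($9, keys, ":")          # FORMAT column, e.g. GT:FT:GQ
            for (i = 10; i <= NF; i++) {
                split($i, vals, ":")          # per-sample values in the same order
                gt = "."; ft = "."; gq = "."
                for (k = 1; k <= n; k++) {
                    if (keys[k] == "GT") gt = vals[k]
                    if (keys[k] == "FT") ft = vals[k]
                    if (keys[k] == "GQ") gq = vals[k]
                }
                # columns: site ID, sample, genotype, filter, genotype quality
                printf "%s\t%s\t%s\t%s\t%s\n", $3, sample[i], gt, ft, gq
            }
        }
    ' "$1"
}

# Example usage (the output file path is an assumption):
#   list_genotypes run1/example.genotypes.vcf
```

For the record shown above, this would report HG00096 as 0/0 (PASS, GQ 71) and HG00100 as 0/1 (PASS, GQ 14).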
Command summary
  starcluster start gs-cluster -s 1                    Launch Amazon compute cluster
  starcluster put gs-cluster example.vcf example.vcf   Copy input file from local to cloud
  starcluster sshmaster gs-cluster                     Log in to remote cluster
  ./genotype_sites.sh example.vcf run1                 Run genotyping command script
  starcluster get gs-cluster run1 run1                 Copy output files from cloud to local
  starcluster terminate gs-cluster                     Shut down compute cluster
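The command sequence above could be wrapped in a small script, sketched here under stated assumptions. The function names run and run_genotyping_workflow and the DRY_RUN variable are hypothetical. The sketch also assumes that starcluster sshmaster accepts a remote command as a trailing argument, so the login and genotyping steps collapse into one non-interactive call. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Hypothetical wrapper around the demo commands (not shipped with Genome STRiP).
# With DRY_RUN unset or set to 1, commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: run_genotyping_workflow <cluster> <sites.vcf> <run-directory>
run_genotyping_workflow() {
    cluster=$1
    sites_vcf=$2
    run_dir=$3

    # Launch the cluster, copy the input, genotype, and fetch the results.
    # Assumption: sshmaster runs the quoted string as a command on the master node.
    run starcluster start "$cluster" -s 1 &&
    run starcluster put "$cluster" "$sites_vcf" "$sites_vcf" &&
    run starcluster sshmaster "$cluster" "./genotype_sites.sh $sites_vcf $run_dir" &&
    run starcluster get "$cluster" "$run_dir" "$run_dir"
    status=$?

    # Always attempt to shut the cluster down, even if an earlier step failed,
    # so that a failed run does not leave instances running.
    run starcluster terminate "$cluster"
    return $status
}

# Example usage (prints the five commands without contacting AWS):
#   run_genotyping_workflow gs-cluster example.vcf run1
```

Terminating the cluster unconditionally is a design choice in this sketch: a failed genotyping step should not leave billable instances running.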
For more information
  Bonus evening session
    Tonight (Monday) 6:30 – 8:00 PM
    Room 24, Upper Level
  Web site
    http://www.broadinstitute.org/software/genomestrip
  Support forum (Genome STRiP topic in GATK forum)
    http://gatkforums.broadinstitute.org/categories/genomestrip
  AWS support in Genome STRiP
    Seva Kashin
  Poster 603 T (Tuesday afternoon)
    Multi-allelic copy number variation in humans
    Early look at upcoming Genome STRiP functionality for duplications and multi-allelic CNVs
Intro Slides for Gabor
Genome STRiP
Genome STRucture in Populations
  Integrates multiple features of sequence data with population-based patterns across many individuals
  Handsaker, R.E., Korn, J.M., Nemesh, J. & McCarroll, S.A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet 43, 269-76 (2011)
Genome STRiP
Structural variation analysis from sequence data
  Integrative
    Combines multiple features of the sequence data (read pairs, read depth, split reads)
    Integrative approaches have consistently shown higher accuracy
  Population-aware
    Increases power and accuracy
    Particularly important for low-coverage genomes
  Modular architecture
    Discovery of new variants
    Genotyping of newly discovered variants and/or known variants
    Includes tools for QC / analysis
  Initial prototype developed for analyses in the 1000 Genomes Project
    Low false discovery rate and high sensitivity
Demo Slides