Cancer Genome Assemblies and Variations between Normal and Tumour Human Cells
Zemin Ning, The Wellcome Trust Sanger Institute



Outline of the Talk:
- Project Background
- Why De novo Assembly
- The New Phusion Pipeline
- Kmer Words Hashing
- Relational Matrix
- 454 Reads and Assembly
- Cancer Genome Assemblies from Solexa Reads
- Variations between Cell Samples

ICGC: International Cancer Genome Consortium

Large-Scale Studies of Cancer Genomes
- Johns Hopkins: > 18,000 genes analyzed for mutations; 11 breast and 11 colon tumors. L.D. Wood et al., Science, Oct
- Wellcome Trust Sanger Institute: 518 genes analyzed for mutations; 210 tumors of various types. C. Greenman et al., Nature, Mar
- TCGA (NIH): multiple technologies; brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma). F.S. Collins & A.D. Barker, Sci. Am., Mar. 2007

Melanoma cell line COLO-829 Paul Edwards, Departments of Pathology and Oncology, University of Cambridge

Melanoma-Skin Cancer Disease

Sequencing COLO-829 on Illumina: Strategy
- Tumour or normal genomic DNA
- Fragments of defined size: 0.2, 2, 3, 4 kb
- Sequencing (performed at Illumina): 75 bp reads for short insert, 50 bp reads for long insert
- Alignment using bwa, ssaha2
- Somatic mutations; germline variants
- De novo Assembly

Read Coverage COLO X tumour 32X normal

Why De novo Assemblies
- Reference is not complete: there are hundreds of contigs in the current form of the human genome reference, and the sequence representation is only ~90%.
- Reference is mosaic: the DNA samples of the current reference came from 8 individuals, although one dominant individual represents > 80%.
- Limitations of alignment against reference: read alignment can reliably call SNPs and short indels, where the detectable indel length depends on the read length, but it is very hard to find structural variants, particularly long novel insertion elements.
- Genomes without references
- Loss of one haplotype in a diploid sample

De Bruijn vs Read overlap

New Phusion Assembler
(pipeline diagram; labels: Solexa reads, 2x75 or 2x100; data process; reads group; assembly with Fuzzypath, Velvet, Phrap; contigs; PRono; long insert reads; supercontig)

Repetitive Contig and Read Pairs
(figure: depth profiles; grouped reads by Phusion)

Kmer Word Hashing (K = 12)
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA
Contiguous base hash:
  ATGGCGTGCAGT
  TGGCGTGCAGTC
  GGCGTGCAGTCC
  GCGTGCAGTCCA
  CGTGCAGTCCAT
Gap-Hash4x3 base hash:
  ATGGGCAGATGT
  TGGCCAGTTGTT
  GGCGAGTCGTTC
  GCGTGTCCTTCG
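The two word lists on this slide can be reproduced with a short Python sketch. The gapped pattern is not spelled out on the slide; reading the listed words against the sequence suggests three runs of 4 bases separated by 3 skipped bases, and the sketch assumes exactly that. The function names and parameters are hypothetical, chosen for illustration only.

```python
def contiguous_kmers(seq, k=12):
    """Every length-k window of seq, stepping one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_words(seq, block=4, gap=3, blocks=3):
    """Words made of `blocks` runs of `block` bases, skipping `gap` bases
    between runs (assumed reading of the slide's Gap-Hash4x3 pattern)."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap  # bases covered by one word
    words = []
    for i in range(len(seq) - span + 1):
        runs = [seq[i + b * step: i + b * step + block] for b in range(blocks)]
        words.append("".join(runs))
    return words


seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(contiguous_kmers(seq)[:5])
print(gapped_words(seq)[:4])
```

Both variants yield 12-base words, so they fit the same 2-bits-per-base encoding; the gapped word simply samples its 12 bases from an 18-base span.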

Word use distribution for the mouse sequence data at ~7.5 fold
(figure: Poisson curve and real data curve, with the useful region marked)

Sorted List of Each k-Mer and Its Read Indices
  k-mer        read
  ACAGAAAAGC   10h06.p1c
  ACAGAAAAGC   12a04.q1c
  ACAGAAAAGC   13d01.p1c
  ACAGAAAAGC   16d01.p1c
  ACAGAAAAGC   26g04.p1c
  ACAGAAAAGC   33h02.q1c
  ACAGAAAAGC   37g12.p1c
  ACAGAAAAGC   40d06.p1c
  ACAGAAAAGG   16a02.p1c
  ACAGAAAAGG   20a10.p1c
  ACAGAAAAGG   22a03.p1c
  ACAGAAAAGG   26e12.q1c
  ACAGAAAAGG   30e12.q1c
  ACAGAAAAGG   47a01.p1c
64-bit word layout: high bits (64 - 2k), low bits (2k)
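A minimal sketch of how a k-mer and a read index might share one 64-bit word, under stated assumptions. The slide gives only the field widths (64 - 2k high bits, 2k low bits); which field holds which value, the 2-bit base codes (A=0, C=1, G=2, T=3) and all function names below are assumptions made for illustration, not the author's implementation. Here the k-mer code goes in the low 2k bits and an integer read index in the high bits.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base encoding


def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def pack(kmer, read_index):
    """Assumed layout: read index in the high 64-2k bits, k-mer in the low 2k."""
    k = len(kmer)
    assert 2 * k < 64 and 0 <= read_index < (1 << (64 - 2 * k))
    return (read_index << (2 * k)) | encode_kmer(kmer)


def unpack(word, k):
    """Recover (kmer, read_index) from a packed word."""
    code = word & ((1 << (2 * k)) - 1)
    kmer = "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))
    return kmer, word >> (2 * k)


def sort_by_kmer(words, k):
    """Order packed words by k-mer first, then by read index."""
    mask = (1 << (2 * k)) - 1
    return sorted(words, key=lambda w: (w & mask, w >> (2 * k)))


words = [pack("ACAGAAAAGG", 3), pack("ACAGAAAAGC", 7), pack("ACAGAAAAGC", 2)]
for w in sort_by_kmer(words, 10):
    print(unpack(w, 10))
```

After sorting, all reads containing the same k-mer sit next to each other, which is the property the sorted list on the slide relies on.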

Relation Matrix
R(i,j): number of kmer words shared between read i and read j (N x N matrix over reads 1 … N)
Group 1: (1,2,3,5)
Group 2: (4,6)
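The relation matrix and the grouping it produces can be sketched in Python as follows. This is a toy illustration, not the Phusion code: the six reads are invented so that the result matches the slide's groups, the matrix is kept sparse as a dictionary, reads are joined into a group whenever they share at least `min_shared` words (a hypothetical threshold parameter), and reverse complements are ignored.

```python
from collections import defaultdict
from itertools import combinations


def relation_matrix(reads, k):
    """Sparse R(i, j): distinct k-mer words shared by reads i < j (1-based)."""
    readers = defaultdict(set)  # k-mer word -> indices of reads containing it
    for idx, seq in enumerate(reads, start=1):
        for p in range(len(seq) - k + 1):
            readers[seq[p:p + k]].add(idx)
    shared = defaultdict(int)
    for ids in readers.values():
        for i, j in combinations(sorted(ids), 2):
            shared[(i, j)] += 1
    return shared


def group_reads(n_reads, shared, min_shared=1):
    """Union reads linked by at least min_shared words; return sorted groups."""
    parent = list(range(n_reads + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for r in range(1, n_reads + 1):
        groups[find(r)].append(r)
    return sorted(groups.values())


# Invented toy reads: 1, 2, 3, 5 overlap one region; 4 and 6 overlap another.
reads = ["ACCGTTAG", "GTTAGGCA", "AGGCATCA", "TTTGACTC", "CGTTAGGC", "GACTCGAA"]
R = relation_matrix(reads, k=5)
print(group_reads(len(reads), R))
```

Reads 1 and 3 share no word directly but still land in one group, because each is linked through read 2; grouping is transitive over the matrix.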

Relation Matrix: R(i,j) – Implementation
(figure labels: read index; number of shared kmer words, < 63)

Stats of 454 Reads – NA12878
Number of reads: m
Total number of bases: 35.9 Gb
Reference genome size: 3.0 Gb
Sequencing platform: FLX & Titanium
Read length: bp
Average read length: 224 bp
Estimated read coverage: ~10X
Number of reads uniquely placed: m
Ratio of uniquely placed reads: 95.0%
Vector sources: Unknown

Stats of The Assembly
Contigs:
  Total assembled bases: 2.78 Gb
  Number of contigs: 526,437
  Average contig length: 5,280
  Contig N50: 11,000
  Largest contig: 85,538
Supercontigs:
  Total assembled bases: 3.17 Gb
  Number of contigs: 54,487
  Average contig length: 58,263
  Contig N50: 1,122,317
  Largest contig: 8,015,559
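For readers unfamiliar with the N50 figures quoted here, the following Python sketch shows the usual definition: the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of all assembled bases. The function name and the example lengths are illustrative and not taken from the talk.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


example = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # 54 bases in total
print(n50(example))
```

N50 weights long contigs more heavily than the mean does, which is why the contig N50 on the slide (11,000) is roughly twice the average contig length (5,280).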

Paired Reads Separated by “NN”

Error Bases Correction

Genome Assembly – Normal Cell
Solexa reads:
  Number of reads: 557 million
  Finished genome size: 3.0 Gb
  Read length: 2x75 bp
  Estimated read coverage: ~25X
  Insert size: 190/ bp
  Number of reads clustered: 458 million
Assembly features (contig stats):
  Total number of contigs: 1,020,346
  Total bases of contigs: Gb
  N50 contig size: 8,344
  Largest contig: 107,613
  Averaged contig size: 2,659
  Contig coverage over the genome: ~90%
  Mis-assembly errors: ?

Genome Assembly – Tumour Cell
Solexa reads:
  Number of reads: 562 million
  Finished genome size: 3.0 Gb
  Read length: 2x75 bp
  Estimated read coverage: ~25X
  Insert size: 190/ bp
  Number of reads clustered: 449 million
Assembly features (contig stats):
  Total number of contigs: 1,249,719
  Total bases of contigs: Gb
  N50 contig size: 6,073
  Largest contig: 72,123
  Averaged contig size: 2,152
  Contig coverage over the genome: ~90%
  Mis-assembly errors: ?

Deletions – Normal Cell
(figure labels: Alus, ~300 bp; LINEs, ~6,000 bp)

Deletions – Tumour Cell
(figure labels: Alus, ~300 bp; LINEs, ~6,000 bp)

Tumour Specific Indels
- Number of deletions: 18,449
- Number of insertions: 15,899
- The numbers (deletion/insertion) seem higher than expected.
- Experimental validation: ?

Acknowledgements:
- Jim Mullikin
- Yong Gu
- Tony Cox
- Elizabeth Murchison
- Erin Pleasance
- Mike Stratton