Bioinformatics at Molecular Epidemiology: new tools for identifying indels in sequencing data. Kai Ye
Data collection for osteoarthritis, cardiovascular disease and longevity: serum parameters; cellular characteristics (biobank); skin ageing; glycosylation; metabonomic; transcriptomic; genetic (GWAS/sequence); epigenetic. Data integration.
Genetic & epigenetic analyses; biochem analyses; expression analysis; metabonomic analysis; glycosylation; cell responses. Joost Kok; Erik vd Akker; Kai Ye. Statistical analysis.
About me. 1995 – 2003: B.S. and M.S. in biology and pharmaceutical science. 2004 – 2008: PhD (cum laude) at Leiden University; thesis title: Novel algorithms for protein sequence analysis. 2008 – 2009: Postdoc at the European Bioinformatics Institute, collaborating with scientists at the Sanger Institute. Currently: assistant professor at MolEpi.
A Pindel approach for identifying indels in next-gen sequencing data. Outline: paired-end reads in next-gen sequencing; indel detection algorithms; Pindel; cancer genome project; 1000 genomes project.
Paired-end reads in next-generation sequencing (figure label: ~ insert size)
Mapping paired-end reads (figure label: SNP). CNVs: copy number variations; INDELs: insertions and deletions; SVs: structural variations.
Gapped alignment for small indels (the gap marks the indel):
ATCCGTATCACGGTCA-CAGATCAGTCCAGT
ATCCGTATCACGGTCAGCAGATCAGTCCAGT
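The gapped alignment on this slide can be read off mechanically. The sketch below assumes the top row is the read and the bottom row is the reference (the slide does not say which is which), and the function name `indels_from_alignment` is hypothetical. It walks the two aligned rows and reports each run of gap characters as an insertion or a deletion.

```python
def indels_from_alignment(read_row, ref_row):
    """List (kind, ref_pos, bases) for every gap run in a pairwise alignment.

    Assumption: both rows have equal length and use '-' for gaps.
    ref_pos is the 0-based number of reference bases before the event.
    """
    if len(read_row) != len(ref_row):
        raise ValueError("aligned rows must have the same length")
    events = []
    ref_pos = 0
    i = 0
    while i < len(read_row):
        if read_row[i] == "-":
            # Gap in the read: reference bases are missing from the sample.
            j = i
            while j < len(read_row) and read_row[j] == "-":
                j += 1
            events.append(("deletion", ref_pos, ref_row[i:j]))
            ref_pos += j - i
            i = j
        elif ref_row[i] == "-":
            # Gap in the reference: the read carries extra bases.
            j = i
            while j < len(ref_row) and ref_row[j] == "-":
                j += 1
            events.append(("insertion", ref_pos, read_row[i:j]))
            i = j
        else:
            # Aligned column (match or mismatch): consume one reference base.
            ref_pos += 1
            i += 1
    return events


read_row = "ATCCGTATCACGGTCA-CAGATCAGTCCAGT"
ref_row = "ATCCGTATCACGGTCAGCAGATCAGTCCAGT"
print(indels_from_alignment(read_row, ref_row))
```

Under that row assignment the slide's example is a 1-base deletion; swapping the rows would make it a 1-base insertion.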
Read-depth for CNVs
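As a rough illustration of the read-depth idea, the sketch below compares the depth of each fixed-size window with the median depth and flags windows that look depleted or amplified. The function name and the ratio thresholds (0.5 and 1.5) are assumptions chosen for illustration, not values from the talk, and real callers also correct for GC content and mappability.

```python
from statistics import median


def call_cnv_windows(depths, low=0.5, high=1.5):
    """Flag windows whose depth ratio to the median suggests a copy number change.

    depths: per-window read depth along one chromosome.
    low/high: hypothetical ratio thresholds for loss and gain.
    """
    baseline = median(depths)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalise")
    calls = []
    for index, depth in enumerate(depths):
        ratio = depth / baseline
        if ratio <= low:
            calls.append((index, "loss", ratio))
        elif ratio >= high:
            calls.append((index, "gain", ratio))
    return calls


window_depths = [30, 31, 29, 14, 13, 30, 62, 30]
for index, kind, ratio in call_cnv_windows(window_depths):
    print(index, kind, round(ratio, 2))
```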
Read-pair approach for SVs (figure: three panels, No indel, Deletion and Insertion, each showing Sample against Reference)
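The read-pair idea can be sketched as a comparison between the distance at which the two reads of a pair map on the reference and the insert size expected from the library. A pair that maps much farther apart than expected suggests a deletion in the sample, and a pair that maps much closer suggests an insertion. The function name and the cut-off of three standard deviations are assumptions for illustration.

```python
def classify_read_pair(mapped_span, mean_insert, sd_insert, n_sd=3.0):
    """Classify one read pair by its mapped span on the reference.

    mapped_span: distance between the outer ends of the two mapped reads.
    mean_insert, sd_insert: insert size statistics of the library.
    n_sd: hypothetical cut-off in standard deviations.
    """
    upper = mean_insert + n_sd * sd_insert
    lower = mean_insert - n_sd * sd_insert
    if mapped_span > upper:
        # Reads are farther apart on the reference than in the sample fragment.
        return "deletion"
    if mapped_span < lower:
        # Reads are closer on the reference than in the sample fragment.
        return "insertion"
    return "no indel"


for span in (410, 900, 250):
    print(span, classify_read_pair(span, mean_insert=400, sd_insert=30))
```

This only works for events large enough to shift the span beyond the normal spread of insert sizes, and it does not give base-pair breakpoints.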
Mapping paired-end reads: read-pairs; read-depth; SNP or small indel
Pindel: Deletions. Figure labels: test; ref; 1 base - 1 million bases
18 May. Pindel: Deletions. Figure labels: ref; Anchor
Pindel: Deletions. Figure labels: ref; Anchor; 2 x average distance
Pindel: Deletions. Figure labels: ref; Anchor; 2 x average distance; expected maximum deletion size + read length (36)
Pindel: Deletions. Figure labels: reference; sample
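The deletion slides can be approximated in code. The sketch below is a simplified stand-in, not Pindel's actual implementation: it uses plain substring search (`str.find`) in place of whatever matching Pindel performs, and the function name, the `min_flank` parameter and the toy sequences are hypothetical. Starting from the anchor, it looks for the left part of the unmapped read within 2 x average distance, then looks for the remaining right part downstream within the expected maximum deletion size plus the read length.

```python
def find_deletion(ref, read, anchor_end, avg_distance, max_deletion, min_flank=8):
    """Return (deletion_start, deletion_length) on ref, or None.

    Assumes the read failed to map as a whole, so a zero-length gap between
    the two parts is not considered.
    """
    window_end = min(len(ref), anchor_end + 2 * avg_distance)
    # Try the longest left part first, keeping at least min_flank bases per side.
    for split in range(len(read) - min_flank, min_flank - 1, -1):
        left, right = read[:split], read[split:]
        left_start = ref.find(left, anchor_end, window_end)
        if left_start == -1:
            continue
        left_end = left_start + split
        # The right part must lie within max deletion size + read length.
        limit = min(len(ref), left_end + max_deletion + len(read))
        right_start = ref.find(right, left_end + 1, limit)
        if right_start != -1:
            return left_end, right_start - left_end
    return None


left_flank = "ACGTTGCAAGCTTAGGCTCA"
deleted = "TTTTTTTTTT"
right_flank = "GGATCCGATACGCTAGCATG"
ref = left_flank + deleted + right_flank
# A sample read that spans the deletion: 10 bases from each flank.
read = left_flank[-10:] + right_flank[:10]
print(find_deletion(ref, read, anchor_end=0, avg_distance=25,
                    max_deletion=20, min_flank=4))  # (20, 10)
```

Restricting both searches to windows near the anchor is what keeps the search tractable and makes short read fragments specific enough to place; a production tool would also require unique matches and support from several reads.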
African male: NA18507 (Bentley et al., Nature). Gb of sequence; ~4 billion paired 35-base reads. After preprocessing: 56,161,333 pairs of one-end mapped reads. Pindel: 142, bp insertions; 162,068 1bp-10kb deletions.
Deletion size distribution
Applications: cancer genome project; 1000 genomes project
Cancer genome: COLO-829 cells. Normal: ~30x paired-end 100bp reads. Tumor: ~40x paired-end 100bp reads. Goal: search for somatic (tumor-specific) indels.
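One simple way to picture the search for somatic indels is to call indels in both samples and keep only those seen in the tumor but not in the matched normal. The sketch below assumes calls are compared by exact chromosome, position, type and length; the `Indel` record and the function name are hypothetical, and a real comparison would likely tolerate small breakpoint shifts and check for supporting reads in the normal sample.

```python
from typing import NamedTuple


class Indel(NamedTuple):
    chrom: str
    pos: int      # start position on the reference
    kind: str     # "insertion" or "deletion"
    length: int   # size in bases


def somatic_indels(tumor_calls, normal_calls):
    """Keep tumor calls that have no identical call in the matched normal."""
    germline = set(normal_calls)
    return [call for call in tumor_calls if call not in germline]


normal = [Indel("chr1", 1000, "deletion", 3), Indel("chr2", 500, "insertion", 1)]
tumor = [
    Indel("chr1", 1000, "deletion", 3),   # also in normal, so germline
    Indel("chr7", 2500, "deletion", 12),  # tumor only, so a somatic candidate
]
print(somatic_indels(tumor, normal))
```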
1000 Genomes Project. Pilot 1: 180 people from 3 major geographic groups (YRI; CEU; CHB and JPT) at low coverage (~4x). Pilot 2: the genomes of two families (CEU and YRI; both parents and an adult child) at deep coverage (20x per genome). Pilot 3: sequencing the coding regions (exons) of 1,000 genes in 1,000 people at deep coverage (20x).