hard assembly
Jan Pačes, Institute of Molecular Genetics AS CR
problems
  genomes: high GC content; repetitions (short with low informational content, or long); polymorphic; "unreadable" sequences and "weird" structures
  technologies: nonrandom libraries; wrong sizes; erroneous or chimeric reads
sequencing technologies: ABI (Sanger); 454 (pyrosequencing); Solexa (reversible terminator); SOLiD (2-base ligation); PacBio (SMRT)
example of errors in one technology http://chevreux.org/mira_ex_454sanger.html
high GC regions are underrepresented Aird et al. Genome Biology 2011
protocol optimization for high GC content Aird et al. Genome Biology 2011
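To check whether GC-rich regions are underrepresented in a given data set, per-window GC content can be compared against read depth over the same windows. The following is only a minimal sketch of the GC part. The function names, the window size and the toy sequence are assumptions for illustration and are not taken from Aird et al.

```python
def gc_fraction(seq):
    """Return the fraction of G/C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Windows made only of ambiguous bases (e.g. N) are reported as 0.0.
    return gc / total if total else 0.0


def windowed_gc(seq, window=100, step=100):
    """Yield (start, gc_fraction) for windows of length `window` along seq."""
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        yield start, gc_fraction(seq[start:start + window])


if __name__ == "__main__":
    # Hypothetical 200 bp toy sequence: an AT-only half followed by a GC-only half.
    toy = "AT" * 50 + "GC" * 50
    for start, gc in windowed_gc(toy, window=100, step=100):
        print(start, round(gc, 2))
```

Plotting these values against the mean coverage of each window would show whether depth drops as GC content rises.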
repetitions (figure labels: scaffold, repetition)
repetitions
repetitions recognition
  RepeatMasker http://www.repeatmasker.org/
  RepeatModeler (RECON and RepeatScout) http://www.repeatmasker.org/RepeatModeler.html
  position-aware assemblers:
    MIRA http://sourceforge.net/projects/mira-assembler/
    MaSuRCA http://www.genome.umd.edu/masurca.html
    SPAdes http://bioinf.spbau.ru/spades
k-mer distribution
k-mer analysis
  JELLYFISH: fast, parallel k-mer counting for DNA http://www.cbcb.umd.edu/software/jellyfish/
  Quake: a package to correct substitution sequencing errors in experiments with deep coverage http://www.cbcb.umd.edu/software/quake/
  KHMER: trims off likely erroneous k-mers https://khmer-protocols.readthedocs.org/en/v0.8.2/
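The idea behind a k-mer distribution can be illustrated in a few lines of Python: count every k-mer in the reads, then tabulate how many distinct k-mers occur once, twice, and so on. This is only a sketch for small inputs. It does not reproduce how JELLYFISH, Quake or KHMER work internally, and the function names and toy reads are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only, no canonical form)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


if __name__ == "__main__":
    # Hypothetical toy reads; the second differs from the first in its last base.
    reads = ["ACGTACGT", "ACGTACGA"]
    spectrum = kmer_spectrum(count_kmers(reads, k=4))
    for multiplicity in sorted(spectrum):
        print(multiplicity, spectrum[multiplicity])
```

In real data with deep coverage, k-mers with very low multiplicity are often sequencing errors, while k-mers with multiplicity far above the main peak tend to come from repetitions.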
repetitions (figure labels: repetition, scaffold)
filling gaps
  GapCloser (part of SOAPdenovo) http://soap.genomics.org.cn/soapdenovo.html
  GapFiller (part of SSPACE) http://www.baseclear.com/lab-products/bioinformatics-tools/gapfiller/
  GapFiller http://sourceforge.net/projects/gapfiller/
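Before and after running a gap closer it is useful to list the gaps that remain in the scaffolds. The sketch below assumes that gaps are written as runs of N, which is the usual convention in scaffold FASTA files. The function name and the toy scaffold are hypothetical, and the code is not part of any of the tools listed above.

```python
import re


def find_gaps(scaffold, min_len=1):
    """Return half-open (start, end) coordinates of every run of N of at least min_len."""
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(scaffold)]


if __name__ == "__main__":
    # Hypothetical toy scaffold with a 4 bp gap and a 2 bp gap.
    scaffold = "ACGTNNNNACGTNNAC"
    for start, end in find_gaps(scaffold):
        print(start, end, end - start)
```

Comparing the number and total length of gaps before and after gap filling gives a quick measure of how much was actually closed.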
454 multiplicates
contig coverage by large libraries
illumina pe and mate-pairs libraries
highly polymorphic genomes (figure labels: two copies of polymorphic contigs; scaffold)
polymorphic assembly workflow
  1. normal assembly
  2. condensing alternative contigs
  3. mapping to identify SNPs
  4. "repair" reads
  5. second "polymorphic" assembly
  http://www.fishbrowser.org/software/L_RNA_scaffolder
G-quadruplex
Chicken p53 – coverage from RNAseq data AGCGACCCCCCCCCACCACCGCCACCACCACCTCTGCCATTGGCCGCCGCCGCCCCCCCCCCATTAAACCCCCCCACCCCCCCCCGCGCTGCCCCCTCCCCGGTGG Coverage > 13,000X
Chicken erythropoietin (EPO)– coverage from RNAseq data CCCGCCCACCCCCACCCCCACCCGCACCCCCCACTCTCCCACCCCCACCCCCTTTTCTCCCACCCCCTCTTCTCCCACCCCCTTTTCCCCCCCTTCCTCCCCCCACTCCG CCCCCCCCCCGCCCCCTCCCCCCCCCCAGGTGAGGACCCT Coverage > 500X from RNAseq (*EPO locus not completed even from 1000X coverage genomic Illumina data!)
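C-rich stretches like those in the two examples above correspond to G-rich stretches on the complementary strand, which is where a G-quadruplex could form. The sketch below scans both strands for candidate motifs. It assumes the commonly used pattern of four runs of at least three G separated by loops of 1 to 7 bases. This pattern, the function names and the toy sequences are assumptions for illustration, and a match is only a candidate, not evidence that the structure actually forms.

```python
import re

# Assumed motif: G{3,} N{1,7} G{3,} N{1,7} G{3,} N{1,7} G{3,}
G4_MOTIF = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_g4(seq):
    """Return (strand, start, end) for every non-overlapping motif hit.

    Coordinates refer to the strand that was searched: "+" is seq as given,
    "-" is its reverse complement.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in G4_MOTIF.finditer(s):
            hits.append((strand, m.start(), m.end()))
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from the chicken genome.
    print(find_g4("GGGAGGGAGGGAGGG"))
    print(find_g4("CCCTCCCTCCCTCCC"))
```

A C-rich input such as the second toy sequence is reported on the "-" strand, which is the situation in the p53 and EPO examples.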
chicken missing genes
that's it, thank you
many thanks also to: Daniel Elleder, Tomáš Hron, Michal Kolář, Hynek Strnad