A machine-learning approach to combined evidence validation of genome assemblies. Jeong-Hyeon Choi, Sun Kim, Haixu Tang, Justen Andrews, Don G. Gilbert, John K. Colbourne. Presented by F A Rezaur Rahman Chowdhury.
Mate-Pair Shotgun DNA Sequencing: DNA target sample; shear & size; clone & end sequence; end reads / mate pairs. [Figure labels: 550 bp, 10,000 bp]
Assembling the fragments: note that contig orientation and order are not determined at this stage.
Building Scaffolds: (1) break DNA into random fragments; (2) sequence the ends of the fragments; (3) assemble the sequenced ends; (4) build scaffolds. We need to determine the relative order and orientation of contigs; using forward-reverse constraints helps.
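A minimal sketch of how forward-reverse constraints can fix the relative orientation of contigs. It assumes each mate-pair link is given as a hypothetical tuple (contig_a, strand_a, contig_b, strand_b); the function name and input format are illustrative and not taken from any particular scaffolder. Mates from one clone should face each other, so two mates mapping to the same strand imply that one of the two contigs must be reverse-complemented. Ordering the contigs is a separate step not shown here.

```python
from collections import defaultdict, deque

def orient_contigs(links):
    """Assign +1 (keep) or -1 (reverse-complement) to each contig.

    links: iterable of (contig_a, strand_a, contig_b, strand_b),
    where each strand is '+' or '-'. Hypothetical input format.
    Raises ValueError if the links contradict each other.
    """
    graph = defaultdict(list)
    for a, strand_a, b, strand_b in links:
        # Mates on the same strand mean exactly one contig is flipped.
        flip = (strand_a == strand_b)
        graph[a].append((b, flip))
        graph[b].append((a, flip))

    orientation = {}
    for seed in list(graph):
        if seed in orientation:
            continue
        orientation[seed] = 1  # arbitrary reference orientation per component
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for neighbor, flip in graph[current]:
                wanted = -orientation[current] if flip else orientation[current]
                if neighbor not in orientation:
                    orientation[neighbor] = wanted
                    queue.append(neighbor)
                elif orientation[neighbor] != wanted:
                    raise ValueError(f"conflicting links at {neighbor}")
    return orientation
```

Conflicting links raise an error here; a real pipeline would more likely weigh links by how many mate pairs support them.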
Assembly Overview: assembly, then scaffolding.
Assembly Algorithms: greedy (TIGR, phrap, CAP3); graph-based greedy (Celera, Arachne); de Bruijn graph, Euler path based.
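To make the de Bruijn graph idea concrete, here is a minimal sketch of graph construction only: each k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name is illustrative, and error correction and the Euler path traversal used by real assemblers are left out.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph as {prefix (k-1)-mer: [suffix (k-1)-mers]}."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # one edge per k-mer occurrence
    return graph
```

An assembly then corresponds to a path through this graph that uses every edge, which is why the approach is described as Euler path based.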
Statistical Error Detection: significant deviations from the average read coverage flag regions that may be misassembled.
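A minimal sketch of coverage-based detection, assuming reads are given as half-open (start, end) intervals on one contig and that a plain Z-score against the contig-wide mean is an acceptable stand-in for "significant deviation". The threshold and function names are assumptions for illustration; this is not the specific statistic used in the paper.

```python
from statistics import mean, stdev

def coverage(contig_length, reads):
    """Per-base read coverage from half-open (start, end) intervals."""
    diff = [0] * (contig_length + 1)
    for start, end in reads:
        diff[start] += 1   # coverage rises where a read begins
        diff[end] -= 1     # and falls just past where it ends
    depth, running = [], 0
    for i in range(contig_length):
        running += diff[i]
        depth.append(running)
    return depth

def flag_positions(depth, threshold=2.0):
    """Positions whose coverage Z-score exceeds the threshold in absolute value."""
    mu, sigma = mean(depth), stdev(depth)
    if sigma == 0:
        return []  # perfectly uniform coverage, nothing to flag
    return [i for i, d in enumerate(depth) if abs(d - mu) / sigma > threshold]
```

Unusually high coverage can point to a collapsed repeat, and unusually low coverage to a weakly supported join.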
Distribution of clone length
Good and Bad Clones: an intra-contig or intra-scaffold clone is called good if the absolute Z-score of its length is smaller than a threshold; otherwise it is bad. Half-placed clones are also bad, as are clones whose paired-end reads are placed in the same or outward orientation.
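The good/bad rule can be sketched as follows. The names, the orientation labels ("inward", "same", "outward") and the default threshold of 3.0 are assumptions for illustration; the slide does not state the threshold value.

```python
from statistics import mean, stdev

def z_scores(lengths):
    """Z-score of each clone length against the sample mean and deviation."""
    mu, sigma = mean(lengths), stdev(lengths)
    return [(x - mu) / sigma for x in lengths]

def is_good_clone(z, both_ends_placed, orientation, threshold=3.0):
    """Apply the three rules: placement, orientation, then length Z-score."""
    if not both_ends_placed:       # half-placed clone
        return False
    if orientation != "inward":    # reads in the same or outward orientation
        return False
    return abs(z) < threshold
```

Note that a single extreme clone inflates the sample standard deviation, so in practice the mean and deviation would likely be estimated per clone library from well-placed clones.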
Machine-Learning Approach: combined-evidence assembly validation; features are taken from the statistical approaches; five different classifiers: J48 (decision tree), RF (random forest), RT (random tree), NB (naive Bayes), BN (Bayesian network).
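To illustrate how statistical features can be combined by a classifier, here is a minimal pure-Python Gaussian naive Bayes, one of the five classifier families listed. It is a generic sketch, not the authors' implementation; the two features (a coverage Z-score and a fraction of bad clones) and the toy training rows in the usage note are made up for illustration.

```python
import math
from statistics import mean, pvariance

def train_gaussian_nb(X, y):
    """Fit per-class priors and per-feature (mean, variance) pairs."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        prior = len(rows) / len(X)
        # Small constant keeps the variance strictly positive.
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]
        model[label] = (prior, stats)
    return model

def predict(model, x):
    """Return the label with the highest log posterior for feature vector x."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mu, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Usage with invented toy data, where each row is [coverage Z-score, fraction of bad clones]:

```python
X = [[0.1, 0.0], [0.3, 0.1], [-0.2, 0.05], [2.5, 0.6], [3.0, 0.8], [2.8, 0.7]]
y = ["ok", "ok", "ok", "misassembled", "misassembled", "misassembled"]
model = train_gaussian_nb(X, y)
print(predict(model, [2.9, 0.75]))
```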
Evaluation: simulated datasets with different error rates (0.001, 0.003, 0.005); draft assemblies of Drosophila (D. mojavensis, D. erecta, and D. virilis).
ROC Curve
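For reference, a ROC curve plots the true positive rate against the false positive rate as the decision threshold on a classifier score is lowered. A minimal sketch, assuming labels are 1 (positive, for example misassembled) or 0 (negative) and that both classes are present; the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, threshold high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Items with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

An area of 1.0 means the scores separate the two classes perfectly, while 0.5 corresponds to random ranking.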
Simulated Data
Drosophila Assembly
Cross Species
Questions?