Presentation is loading. Please wait.

Presentation is loading. Please wait.

Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye.

Similar presentations


Presentation on theme: "Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye."— Presentation transcript:

1 Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye

2 Two problematic libraries  M19107 & M21639 were resequenced due to low coverage  M19107 still had low coverage after re-seq  M21639 coverage increased to ~80X

3 De novo assembly – exponential dec.

4 M21639 still bad

5 Mixed culture?  Is this a mixed culture?  What does a mixed culture assembly look like?  newProject, addRun, addRun, runProject …  M19501 & M21127  M21127 & M21621

6 Mixed culture

7 Possible explanations for poor assembly  About 20% larger genome ◦ Recent plasmid or other large genome insertion  Also lost hemolytic ability and H 2 S production

8 De novo assembly – 36 is the best

9 De novo assembly  Limited by our data, 36 contigs is lower limit ◦ Number of repeat elements ◦ rRNA  20X coverage is sufficient

10 General assembly stats Genome Newbler de novo Newbler reference Mira3 de novo AMOScmp reference minimus2 - newbler/mira minimus2 - newbler/AMOScmp M19107.sff2171260208471123109 M19501.sff759881815212227 M21127.sff591013895383828 M21621.sff50986675152823 M21639.sff17512721755735469 M21709.sff52313831873732 M19107_1.sff13361361 M19107_2.sff4501006 M21639_1.sff2661165 M21639_2.sff1471282

11 Assembler Evaluation Strategy  Single assembler evaluation  Minimus2 assembler evaluation  Feedback by gene prediction group

12 Single Assembler Evaluation (“Hard Measurement”)

13 Single assembler evaluation  “Soft” measurement: satisfy gene prediction group's requirement.  RNA prediction group requires a file which can trace back the depth of the reads. For this, we use the.tsv file in the Newbler output.

14 454AlignmentInfo.tsv (-infoall/-info/-noinfo)  base consensus, quality, depth and flow-signal, at each position in each contig. A very useful file. eg: Position ConsensusQuality Unique AlignSignalSignal Score DepthDepthStdDev (incl. duplicates) >contig00008 1 G 64 26 320.98 0.05 2 A 64 27 330.94 0.13 3 T 64 27 331.97 0.14 4 T 64 27 331.97 0.14 5 G 64 27 330.97 0.06...etc... Newbler Output

15 Single Assembler Evaluation  For “hard” measurement: we focused mainly on “Total Contigs” “N50 Contigs Bases”, “Total Big Contigs”, “Big Contigs Percent Bases”, “Big Contig Reads”, “Singleton Reads”.  For “soft” measurement: we focused on the trace back of depth for each base pair.

16 Final Rank of Single Assembler  Combine the “hard” and “soft” measurement manually. We get as a result:  1. Newbler De Novo  2. Mira3  3. Amos  Eliminated : Newbler reference & velvet

17 Minimus2 Evaluation  Since our top choice is Newbler, we want to include Newbler’s results in the merged contigs. Thus, we analysed the statistics of: 1. Newbler merged with Amos 2. Newbler merged with Mira  Visulization tool: hawkeye

18 Minimus2 Evaluation

19

20 Gene prediction results feedback

21 Single assembler evaluation (Sample Data: contigs)

22 Single assembler evaluation (Sample Data: Big contigs)

23 Final Rankings of Single Assembler  Combined the “hard” and “soft” measurement manually and we got:  1. Newbler De Novo  2. Mira3  3. Amos  Eliminated : Newbler reference & velvet

24 Minimus2 Evaluation (sample data: contigs)

25 Minimus2 Evaluation (sample data: reads)

26 Minimus2 Evaluation  Using the same strategy as above, our rankings are:  1. Newbler merged with Mira;  2. Newbler merged with Amos;

27 Final Recommendation  1. Merged Contigs:  (1) Newbler merged with Mira  (2) Newbler merged with Amos  2. Single Assembler:  (1) Newbler  (2) Mira  (3) Amos

28 Feedback Strategy  Gene Prediction Group may use predicted genes and RNAs to evaluate our assembly results.

29 Minimus2 Overview  Minimus2 is a modified minimus pipeline  It is designed to merge one or two sequence sets hereafter referred to as S1 or S2  Uses Nucmer based Overlap Detector instead of the Smith-Waterman hash overlap Minimus1 uses (much faster)

30 Minimus2 Usage  minimus2 prefix \  -D REFCOUNT=n \ # Number of sequences is the first set  -D OVERLAP=n \ # Minimum overlap (Default 40bp)  -D CONSERR=f \ # Maximum consensus error (0..1) (Def 0.06)  -D MINID=n \ # Minimum overlap %id for align. (Def 94)  -D MAXTRIM=n # Maximum sequence trimming length (Def 20bp) ◦ Prefix refers to an.AFG filename

31 Minimus2 Usage, Cont’d.  REFCOUNT should be set to number of sequences in the first set  “all vs all” alignment is ran by default and sets REFCOUNT to 0 unless user-specified  S1 & S2 should be merged and converted to AMOS format using toAmos command  toAmos –s S1-S2.seq –o S1-S2.afg

32 Minimus2 Usage, Cont’d.  After you merge the data, you actually run minimus on it  Minimus2 S1-S2 –D REFCOUNT=##  Input  S1-S2.afg  Output  S1-S2.fasta (contig)  S1-S2.singletons.seq (single)

33 Nucmer Algorithm  A modification of the MUMmer package matching algorithm  Operates via building and then searching a suffix tree data structure  This is a significant upgrade from the minimus1 approach as searching using suffix trees is O(n) and minimus1’s method is O(n 2 )  Linear time versus polynomial time  MUMmer link : http://mummer.sourceforge.net/ http://mummer.sourceforge.net/

34 Nucmer Algorithm  The Nucmer strategy uses approximately 17 bytes of memory for each basepair in the reference sequence  The query supplied by the user is streamed past the reference suffix tree so that the memory requirements do not depend on the size of the query sequence  In English: Bigger query does not mean order of magnitude longer operating time  Unique algorithm that can be found and analyzed as it is open source on Sourceforge

35 MIRA  Based on OLC approach  Strategies: ◦ Preprocessing: high confidence region (HCR) ◦ Use a quick heuristic algorithm to alignment the HCR of reads ◦ Overlaps are reviewed with Smith-Waterman alignment algorithm ◦ Contigs can be be optionally analysed and corrected by an incorporated version of an automatic editor ◦ Repeats are resolved by searching for typical mis-assembly patterns ◦ Optional pre-assembly read extension step: the assembler can try to extend HCRs of reads by analysing the overlap pairs from the previous alignments.

36 MIRA outputs  d_results: this directory contains all the output files of the assembly in different formats.  d_info: this directory contains information files of the final assembly.  d_log: this directory contains log files and temporary assembly files.  d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped (not implemented yet, but soon)

37 d_results  out.padded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process. Positions in the consensus containing gaps (also called 'pads', denoted by an asterisk) are still present.  out.unpadded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process, put positions in the consensus containing gaps were removed.  qual files  Outputs with other formats ◦ caf, ace, gap4d

38 d_info  info_assembly.txt: some statistics as well as whether or not problematic areas remain in the result.  info_callparameters.txt: This file contains the parameters as given on the mira command line when the assembly was started.  info_contigstats.txt: This file contains in tabular format statistics  info_contigreadlist.txt: This file contains information which reads have been assembled into which contigs.  info_readstooshort: A list containing the names of those reads that have been sorted out of the assembly only due to the fact that they were too short, before any processing started.  error_reads_invalid: A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).

39 Newbler  Another OLC assembler ◦ Starts with `indexing`  Scans the.sff file, trims the reads,  Performs some checks for possible 3’ and 5’ primers. ◦ Finds overlaps between reads  Splits the phase between long reads and short reads.  Alignments proceed using seed and extend. ◦ Simplifies overlap graph and generates consensus contigs.  Uses the quality information for base calling.

40 Newbler: Metrics file  454NewblerMetrics.txt ◦ runData  Total number of reads and bases in the file, also the number of reads and bases after trimming. ◦ runMetrics  Number of searches, seeds and overlaps during the alignment phase of assembly ◦ readAlignmentResults  Number of reads and bases aligned to other reads,  Inferred read error – No. of errors, mainly indels, between the contigs and the reads. ◦ consensusDistribution  This section deals with base calling of the consensus contigs. ◦ consensusResults  A summary of the read alignments and assembly statistics

41 Newbler: Contigs  454AllContigs.fna (>100 bp default)  454LargeContigs.fna (>500 bp default) ◦ The fasta header:  Gives the the unique contig number, its length in bp and the number of reads in the alignment used to build this contig ◦ lower case bases correspond to quality values below 40.  454Contigs.ace ◦ All contigs in.ace format. Useful for visualization using for e.g. Eagleview >contig00001 length=381 numreads=158 >contig00002 length=144 numreads=560 GGGAGAACTCATCTCTTGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCG CACTTAGCTACCCGGCAATGCGTCTGGCGACACAACCGGAACACCAGTGaTGCGTCCACT CCGGTCCTCTCGTACTAGGAGCAG >contig00002 length=144 numreads=56064 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 20 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64

42 Newbler: The status files  454TrimStatus.txt: This file describes what (trimmed) part of the read was considered for alignment. ◦ read Id, trim points used, used trimmed length, raw length  454ReadStatus.txt: This file describes where reads ended up after assembly was complete. ◦ Id, Status (assembled, partially assembled, singleton, outlier, too short).  454AlignmentInfo.tsv: This file gives a consensus alignment overview for each position in each contig. ◦ Position, consensus, quality score, depth, signal, std. deviation

43 Automating assembly  Motivation ◦ Troublesome installation, number of dependencies ◦ Difficult to remember command line parameters  Automation ◦ install.sh : A script to install assemblers and their dependencies ◦ assembler.sh: Script to run assemblers with default arguments.

44 Future work  Continue to dialog with G.P. to determine assembly of choice.  Metrics: ◦ tRNA count ◦ rRNA count ◦ Protein coding regions

45 Questions?


Download ppt "Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye."

Similar presentations


Ads by Google