GENOME ASSEMBLY Candidatus Carsonella Ruddii. Problem: How can Eulerian graphs be used to assemble a genomic sequence? ■Real life scenario: multiple copies.


1 GENOME ASSEMBLY Candidatus Carsonella Ruddii

2 Problem: How can Eulerian graphs be used to assemble a genomic sequence?
■ Real-life scenario: multiple copies of the genome are fragmented at random points, giving reads with repeats and missing regions.
■ I simulated my own ‘reads’ starting from Candidatus Carsonella Ruddii, one of the smallest genomes.
– Full genome available at NCBI
■ Data simulation: 2 programs, kmerComp and readsDict
kmerComp
Input: string, integer k (length of the k-mers)
Output: dictionary (unordered) whose values are the k-mers in the string's k-mer composition
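A minimal sketch of what kmerComp might look like, assuming the dictionary keys are the start positions of the k-mers (as the example k-mer dictionary in the Data simulation slide suggests). The name `kmer_comp` and the 29-base test string, which is simply the five example k-mers from that slide joined back together, are stand-ins and not the author's code.

```python
def kmer_comp(text, k):
    """Return {start_index: k-mer} for every k-mer in text (hypothetical helper)."""
    if k <= 0 or k > len(text):
        raise ValueError("k must be between 1 and len(text)")
    # One k-mer starts at each position 0 .. len(text) - k.
    return {i: text[i:i + k] for i in range(len(text) - k + 1)}


# First 29 bases implied by the slide's five example 25-mers.
prefix = "ATGAATAATATTTTTGCAAAAATAACTGC"
kmers = kmer_comp(prefix, 25)
# kmers[0] == 'ATGAATAATATTTTTGCAAAAATAA'
# kmers[4] == 'ATAATATTTTTGCAAAAATAACTGC'
```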

3 Data simulation (cont.)
■ readsDict
Input: string, integer k (length of the k-mers), integer c (number of copies)
Output: dictionary of reads; some k-mers may be missing and some repeated.
My aim was to replicate what experimentally obtained reads may look like, with “c” corresponding to the number of copies of the original DNA. A nested FOR loop runs through each of the copies (the first FOR loop runs through the kmerComp output c times), selecting a number of k-mers from each.
Example from k-mer dictionary:
0: 'ATGAATAATATTTTTGCAAAAATAA',
1: 'TGAATAATATTTTTGCAAAAATAAC',
2: 'GAATAATATTTTTGCAAAAATAACT',
3: 'AATAATATTTTTGCAAAAATAACTG',
4: 'ATAATATTTTTGCAAAAATAACTGC',
Example from reads dictionary:
0: 'TTTTTTTTTTTAAAAAAAAAAATAT',
1: 'CTAATAGAAAAATAATTTTTTATTT',
2: 'GAACAAAATGATATAAAAAAAATTA',
3: 'TATGTGCTGGGACTTTTATTAATTC',
4: 'TTTAATTTAACAATGGAAAAACAAA',
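The read simulation described above could be sketched roughly as follows. The slide does not say how many k-mers are picked from each copy, so the `keep_fraction` parameter, the `seed` parameter, and the name `reads_dict` are assumptions made only for this illustration.

```python
import random


def reads_dict(text, k, c, keep_fraction=0.5, seed=None):
    """Simulate reads: sample a random subset of k-mers from each of c copies.

    keep_fraction (assumed, not from the slides) is the share of k-mers
    kept per copy, so some k-mers go missing and others appear repeatedly.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    if not kmers:
        raise ValueError("k must be between 1 and len(text)")
    rng = random.Random(seed)
    per_copy = max(1, int(len(kmers) * keep_fraction))
    reads = {}
    for _ in range(c):                            # outer loop: one pass per copy
        for kmer in rng.sample(kmers, per_copy):  # inner loop: random k-mers
            reads[len(reads)] = kmer              # keys are running read numbers
    return reads
```

With c greater than 1 the same k-mer can be drawn from several copies, which gives the repeated reads, while any k-mer never drawn from any copy becomes a missing region.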

4 Steps towards genome assembly:
■ adjList (Eulerian graph)
Input: list of k-mers
Output: dictionary in which each prefix is paired with its corresponding suffix
ADJACENCY LIST FROM K-MERS:
'TTTTGTGTTGGAAAATAATGATTT': 'TTTGTGTTGGAAAATAATGATTTA',
'TTGCAGGAATAAATGCAGCTAGAA': 'TGCAGGAATAAATGCAGCTAGAAA'
ADJACENCY LIST FROM READS:
'TTGCAGGAATAAATGCAGCTAGAA': 'TGCAGGAATAAATGCAGCTAGAAA',
'GCTAAAAATATAATTTTATGTGCT': 'CTAAAAATATAATTTTATGTGCTG'
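One way the prefix-to-suffix pairing could be built is sketched below. Unlike the single-suffix dictionary shown in the slide, this version stores a list of suffixes per prefix so that repeated or branching k-mers are not overwritten; that choice and the name `adj_list` are assumptions of the sketch.

```python
from collections import defaultdict


def adj_list(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it leads to."""
    graph = defaultdict(list)
    for kmer in kmers:
        # Each k-mer is one edge: prefix (all but last base) -> suffix (all but first).
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# The 25-mer behind the first adjacency entry on the slide.
example = adj_list(["TTTTGTGTTGGAAAATAATGATTTA"])
# example == {'TTTTGTGTTGGAAAATAATGATTT': ['TTTGTGTTGGAAAATAATGATTTA']}
```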

5 Genome assembly (cont.)
■ StringR()
Worked only for the k-mer composition (when each k-mer had only one possible path). Finds the first term (the one that is not the suffix of any other) and follows the path from there.
■ EulerianCycle
Input: adjacency list (dictionary)
Output: list of points (trajectory in the graph)
FOR loop: starts at each of the possible unused edges and modifies the list unusedEdges every time. It updates a second list (EdgeList) with each point it tries out.
Nested WHILE loop: runs through unusedEdges until none are left. 2 options: a point can be followed by only one other point, or by more than one (choose randomly). At the end of the WHILE loop, print EdgeList.
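For comparison with the random-walk approach described above, here is a sketch of Hierholzer's algorithm, a standard alternative that backtracks with a stack instead of choosing randomly and restarting. It is not the author's EulerianCycle program; `eulerian_path` and `spell_path` are hypothetical names, and the graph is assumed to map each node to a list of successors.

```python
def eulerian_path(graph):
    """Return a node list that uses every edge once (Hierholzer's algorithm)."""
    if not graph:
        return []
    remaining = {node: list(nbrs) for node, nbrs in graph.items()}  # working copy
    in_deg = {node: 0 for node in graph}
    for nbrs in graph.values():
        for nbr in nbrs:
            in_deg[nbr] = in_deg.get(nbr, 0) + 1
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(iter(graph))
    for node in graph:
        if len(graph[node]) - in_deg[node] == 1:
            start = node
            break
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit node, back up
    path.reverse()
    n_edges = sum(len(nbrs) for nbrs in graph.values())
    if len(path) != n_edges + 1:
        raise ValueError("graph has no Eulerian path")
    return path


def spell_path(path):
    """Rebuild the sequence from a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# A walk that takes A -> C first is stuck immediately; the stack recovers from it.
branching = {"A": ["B", "C"], "B": ["A"]}
trail = eulerian_path(branching)  # ['A', 'B', 'A', 'C']
```

Because dead ends are pushed onto the output only after all other edges from a node are used, a wrong early choice does not have to be undone by restarting.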

6 Results and Evaluation
■ Difficulty in dealing with incomplete or ‘excessive’ information.
■ Difficulty in taking ‘random’ decisions.
– In the program eulerianCycle, I was unable to deal with the ‘mistakes’: every time the program randomly took a path that was shorter than the number of edges, it would just give me an error.
■ I had expected the problems in the book chapter to be more clearly applicable to the sample genome.
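One possible way to diagnose the incomplete or ‘excessive’ information before attempting a walk is to check each node's degree balance: a connected graph has an Eulerian path only if every node has equal in- and out-degree, or exactly one node has one extra outgoing edge and one node has one extra incoming edge. The sketch below is an illustration under that assumption; `degree_imbalance` is a hypothetical helper, not part of the original programs.

```python
from collections import Counter


def degree_imbalance(graph):
    """Return {node: out_degree - in_degree} for every unbalanced node."""
    balance = Counter()
    for node, nbrs in graph.items():
        balance[node] += len(nbrs)   # outgoing edges
        for nbr in nbrs:
            balance[nbr] -= 1        # incoming edges
    return {node: diff for node, diff in balance.items() if diff != 0}


# Two surplus starts and two surplus ends: no single walk can use every edge.
report = degree_imbalance({"AT": ["TG"], "TG": ["GC", "GA"]})
```

Missing reads typically show up as extra unbalanced nodes, which is a sign that the graph will break into several separate paths (contigs) rather than one.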

