Published by Felicity Hubbard. Modified over 9 years ago.
1
Gao Song 2010/04/27
2
Outline
- Concepts
- Problem definition
- Non-error case
- Edge-error case
- Disconnected components
- Simulated data
- Future work
3
Concepts
- Contig:
- Edge (PET): library size
- Scaffolding: a sequence of contigs
- Happy edge:
  - The real distance is <= the expected distance
  - The orientations of both contigs are correct
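The happy-edge definition above can be sketched as a small check. This is only an illustration under stated assumptions: the `Placement` and `Edge` records, their field names, and the choice to measure the real distance as the gap between the two placed contigs are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record: where a contig sits in the scaffold."""
    start: int
    length: int
    forward: bool


@dataclass(frozen=True)
class Edge:
    """Hypothetical record: a PET-derived edge between two contigs."""
    a: str
    b: str
    a_forward: bool          # orientation of contig a implied by the edge
    b_forward: bool          # orientation of contig b implied by the edge
    expected_distance: int   # upper bound derived from the library size


def is_happy(edge, layout):
    """Return True if the edge is happy in `layout` (contig name -> Placement)."""
    pa, pb = layout[edge.a], layout[edge.b]
    # Both contigs must have the orientation the edge implies.
    if pa.forward != edge.a_forward or pb.forward != edge.b_forward:
        return False
    # Assumption: the real distance is the gap between the two contigs.
    first, second = (pa, pb) if pa.start <= pb.start else (pb, pa)
    real_distance = second.start - (first.start + first.length)
    return real_distance <= edge.expected_distance
```

An edge that fails either test would count toward the budget of p unhappy edges.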
4
Problem Definition
- Version 1: Given a set of contigs and a set of edges, find a scaffold that has at most p unhappy edges.
- Version 2: Given a set of contigs and a set of edges, find a scaffold that has at most p unhappy edges and is also the optimal solution.
5
Non-error Case
- Connected graph
- Partial layout:
- Dangling edge: an edge with only one end in the partial layout
- Active region: the sequence from the first contig that has dangling edges to the end of the partial layout; it is shorter than the library size
- Domain of a partial layout: all nodes in the partial layout
6
Non-error Case
Theorem: If two partial layouts l1 and l2 have the same active region and the same dangling set, then
(1) they have the same domain, and
(2) either both or neither of them can be extended to a solution.
Proof:
7
Procedure
- Find the unassigned nodes
- Select the nearest node as the next assigned node
- Update the current partial layout
  - Remove all dangling edges incident to the new node
  - Add the new dangling edges of the new node
  - Remove contigs from the active region
8
Main Procedure
- Find all nodes that have no ancestors and select one to start from
- From an active region, get all unassigned nodes and update the partial layout
- Remember every visited partial layout
- If the dangling edge set is empty, output the result
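The search with remembered partial layouts can be sketched on a deliberately simplified model. The assumptions here are loud ones: distances are replaced by positions in the ordering (an edge is happy if its endpoints are at most `width` positions apart), orientations are ignored, the graph is connected, and every unassigned node is tried as the next node instead of only the nearest ones. The function name `find_layout` and its parameters are hypothetical.

```python
def find_layout(nodes, edges, width):
    """Return an ordering where every edge spans <= width positions, or None.

    Simplified stand-in for the non-error case; assumes a connected graph.
    """
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Visited (active region, dangling edge set) states. By the theorem,
    # two partial layouts with the same state extend in the same way.
    visited = set()

    def extend(layout, placed):
        if len(layout) == len(nodes):
            return list(layout)
        # Active region: from the first node with a dangling edge to the end.
        region = ()
        for i, v in enumerate(layout):
            if adj[v] - placed:
                region = tuple(layout[i:])
                break
        # The next node lands len(region) positions after the first node of
        # the region, so a longer region can never make that edge happy.
        if len(region) > width:
            return None
        dangling = frozenset((u, w) for u in region for w in adj[u] - placed)
        state = (region, dangling)
        if state in visited:
            return None
        visited.add(state)
        for v in nodes:
            if v in placed:
                continue
            result = extend(layout + [v], placed | {v})
            if result is not None:
                return result
        return None

    return extend([], frozenset())
```

For example, a star with centre `x` and four leaves has a layout for `width=2` only when `x` sits in the middle, and none for `width=1`.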
9
Time and Space Complexity
- Two possibilities:
  - k vertices in the active region: one possible next node
  - Fewer than k vertices in the active region: n possible next nodes
- Complexity:
  - First case: O(n^k) * O(1)
  - Second case: O(n^(k-1)) * O(n)
- Total time complexity: O(n^k)
- Total space complexity: all visited partial layouts are stored
10
Introduce Edge Error
- Types of edge error
  - Chimeric PETs:
  - Mapping error
  - Misassembled contigs
- Solution
  - Filtering: filter chimeric PETs
    - Select x% of PETs
    - Shuffle them to get chimeric PETs
    - Cluster them to find the threshold
    - Local threshold ...
11
Introduce Edge Error
- There are p unhappy edges in the final scaffolding
- Partial layout
- Dangling edges: real dangling edges; wrong edges
12
Equivalent Class
- Active region, dangling edges’ set, count of current wrong edges
- Same domain
- Assumption: the partial order is a connected graph
13
Get Unassigned Nodes
- Sort the unassigned nodes
- Properties of nodes:
  - Steps to reach this node
  - Distance to the end of the active region
  - Unhappy edges introduced by this node
14
Sort Unassigned Nodes
- Breadth-first search
- Select the smallest possible distance: > threshold
- Sort nodes: for nodes fewer than 5 steps away, compare by distance; at the same distance, compare by unhappy edges
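The ordering rule above can be written as a sort key. This is a hedged sketch: the `Candidate` record and its field names are hypothetical, and the slides do not say where nodes 5 or more steps away should go, so placing them after the nearer nodes is an assumption.

```python
from dataclasses import dataclass

MAX_STEPS = 5  # "less than 5 steps" threshold from the slide


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for an unassigned node found by breadth-first search."""
    name: str
    steps: int          # steps needed to reach this node
    distance: int       # distance to the end of the active region
    unhappy_edges: int  # unhappy edges introduced by placing this node


def sort_candidates(candidates):
    """Order candidates by distance, breaking ties by unhappy edges."""
    def key(c):
        # Assumption: nodes 5 or more steps away are ranked after nearer ones.
        far = 0 if c.steps < MAX_STEPS else 1
        return (far, c.distance, c.unhappy_edges)
    return sorted(candidates, key=key)
```

Because the key is a tuple, the number of unhappy edges only matters when two candidates have the same distance, which matches the tie-breaking described on the slide.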
15
Update Partial Layout
- Check whether all incident dangling edges that are not marked as wrong are happy
  - If yes, remove all those edges and add the new node
  - If no, check whether marking all unhappy edges as omitted would disconnect the graph
    - If no, add the new node and remove the dangling edges
    - If yes, discard the current partial layout, to avoid inserting a disconnected component into the sequence
- Add the new dangling edges
- Remove all dangling edges that are not happy (check connectedness)
17
Main Procedure
- If the active region is empty, the current connected component is finished
- Check whether the dangling edge set is empty
  - If yes, output the result
  - If no, use the dangling edges to find a new node and start another scaffold
18
Disconnected Components
- First find all the connected components and sort them by number of nodes.
- Starting from the first component, find a solution, which omits p_1 edges.
- For the i-th component, if there is no solution that omits at most p - sum(p_1, ..., p_(i-1)) edges, remember all the stop points, return to the (i-1)-th component, and check whether it has a solution that omits fewer than p_(i-1) edges. If yes, continue from the stop point of the i-th component.
19
If the i-th component finishes the whole search and finds more than one solution, remember only the solution with the minimum p_i. Later, whenever the search reaches this component again, use this solution as part of the partial result.
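The first step for disconnected graphs, finding the connected components and sorting them by size, can be sketched as follows. The function name is hypothetical, and since the slides do not say whether the sort is ascending or descending, largest-first is an assumption.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components as lists of nodes, largest first."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)

    # Assumption: process the largest component first.
    components.sort(key=len, reverse=True)
    return components
```

Each component would then be scaffolded in turn, with the budget left for the i-th component being p minus the edges already omitted in the earlier ones.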
20
Optimal Solution
- Branch and bound
- P’ edges
21
Simulated Data Result
- Number of nodes: 1522
- Contig length: 600 - 10,000

Wrong edges   p   Time (ms)
0             0   2765
1             1   2984
2             2   4984
3             3   6562
4             4   7000
5             5   7328
6             6   7281
7             7   7343
8             8   7406
9             9   51813
10                216984
22
Future Work
- Find the optimal solution
- Wrong contigs
- Repeats
- How to deal with large p
- Find a good way to sort the unassigned nodes
23
Thank you