Keep all significant matches

Presentation transcript:

Bootstrapping produces 1000 trees

Break trees into clusters
Each bootstrap tree is broken into clusters of sequences. This is repeated for all bootstrap trees.

[Figure: Example of a tree, with sequences A to G, broken into clusters of sequences.]

Generate "probes"
Every combination of sequences seen as a cluster in any bootstrap tree is used as a search "probe". In the boxed example, (A B) and (C D E F G) would be among the probes.

[Table: membership of sequences A to G in clusters 1 to 8; the column positions of the entries did not survive extraction.]

Search each tree using probes
Each probe is compared with the sequence clusters in every bootstrap tree. How well a probe matches a cluster is measured by the formula:

Score = number of sequences common to the probe and the cluster ÷ total number compared

The higher the score, the better the match. For each probe, the highest scoring cluster in every tree is recorded.

A sample comparison between probes and a second tree, showing the highest scoring clusters:

Probe        Best match   Score
(ABCDEFG)    (ABCDEFG)    1
(ABCD)       (ABC)        0.75
(ABEFG)      (ABCE)       0.5
(CDEFG)      (DEFG)       0.8
(EFG)        (FG)         0.66
(AB)         (AB)         1
(CD)         (DFG)        0.33
             (ABC)        0.33
(FG)         (FG)         1

Refine probes
High-scoring clusters will not always include every sequence that appears in the probe (for example, probe ABEFG might score highest against cluster ABFG; sequence E is missing). For each probe, sequences that appear in fewer than 95% of the clusters are removed.

[Figure: table of probe 2 against the sequences present in trees 1, 2, 3, 4, ..., with a final 95% row; the column positions of the entries did not survive extraction. Caption: Example showing the final refined probe (A B E), which is present in all high-scoring clusters shown.]

Count how many times each probe appears in its entirety
The % of high-scoring clusters in which all remaining probe sequences appear is an index of the probe's significance.

Determination of significance threshold
The significance threshold for this method was determined empirically by simulation (see figure caption). Refined probes whose sequences were seen in ≥ 90% of high-scoring clusters were shown to be ≥ 99% accurate.

Keep all significant matches
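The scoring, refinement and significance steps described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, clusters are modelled as sets of sequence labels, and "total number compared" is assumed to mean the number of distinct sequences in the probe and the cluster together (the slide does not define it).

```python
from typing import FrozenSet, List

Cluster = FrozenSet[str]


def match_score(probe: Cluster, cluster: Cluster) -> float:
    """Shared sequences divided by all distinct sequences compared.

    ASSUMPTION: the denominator is the size of the union of probe and
    cluster; the source only says "total number compared".
    """
    compared = probe | cluster
    if not compared:
        return 0.0
    return len(probe & cluster) / len(compared)


def best_match(probe: Cluster, tree_clusters: List[Cluster]) -> Cluster:
    """Return the highest-scoring cluster of one bootstrap tree."""
    return max(tree_clusters, key=lambda c: match_score(probe, c))


def refine_probe(probe: Cluster, best_clusters: List[Cluster],
                 keep_fraction: float = 0.95) -> Cluster:
    """Drop sequences seen in fewer than keep_fraction of the best matches."""
    needed = keep_fraction * len(best_clusters)
    return frozenset(
        s for s in probe
        if sum(s in c for c in best_clusters) >= needed
    )


def significance(refined: Cluster, best_clusters: List[Cluster]) -> float:
    """Fraction of best-match clusters containing every refined sequence."""
    if not refined or not best_clusters:
        return 0.0
    hits = sum(refined <= c for c in best_clusters)
    return hits / len(best_clusters)


if __name__ == "__main__":
    # Hypothetical best matches for probe ABEFG in four bootstrap trees.
    probe = frozenset("ABEFG")
    best = [frozenset("ABE"), frozenset("ABEF"),
            frozenset("ABCE"), frozenset("ABE")]
    refined = refine_probe(probe, best)
    print(sorted(refined), significance(refined, best))  # ['A', 'B', 'E'] 1.0
```

Under the 90% threshold quoted above, a refined probe would be kept when `significance(...)` returns at least 0.9. In practice `best` would hold one cluster per bootstrap tree, each obtained with `best_match`.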

Bootstrapping produces 1000 trees

Choose a set of major clades
The choice depends on the specific phylogenetic question we wish to address.

Nominate one sequence to represent each clade
One sequence from each major clade (e.g. plant, intracellular etc.) is selected.

Divide up the trees
All bootstrap trees are divided into subtrees, each containing one of the nominated sequences. This division is unique.

[Figure: Example of a tree divided into subtrees.]

[Table: Distribution of the sequences A, D, E & F in the subtrees (slices H, B, G, C) shown above; the entries did not survive extraction.]

Measure the distribution of sequences in subtrees, among all bootstrap trees

Final distribution of sequences in all bootstrap trees (% of trees):

Seq.   H    B    G    C
A      -    86   14
D      (2, 3, 95; column positions not recoverable)
E      77   10   8    5
F      (23, 75; column positions not recoverable)

For example, sequence A spends 86% of the time in subtree B, and 14% of the time in subtree G. The relationships in red will be reported; the one in blue would have been picked up already by the comparison method.

Report sequence partitioning
Where a sequence occurs in a defined portion of the tree (half of the subtrees or less) 95% of the time, this is reported. Those sequences which occur in a single slice are ignored, as these would have been identified using the comparison method.
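The reporting rule for sequence partitioning can be sketched as follows. This is a minimal illustration under assumptions the slide does not spell out: "half of the subtrees or less" is read as the most frequent half of the slices, a sequence "in a single slice" is read as one slice alone reaching the 95% threshold, and the function name `partition_report` is hypothetical.

```python
from typing import Dict, List, Optional


def partition_report(percent_by_slice: Dict[str, float], n_slices: int,
                     threshold: float = 95) -> Optional[List[str]]:
    """Return the slices to report for one sequence, or None.

    percent_by_slice maps a slice label to the % of bootstrap trees in
    which the sequence fell into that slice.
    ASSUMPTION: only the n_slices // 2 most frequent slices are combined.
    """
    ranked = sorted(percent_by_slice.items(),
                    key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    # A single dominant slice is left to the comparison method.
    if ranked[0][1] >= threshold:
        return None
    top = [(s, p) for s, p in ranked[: n_slices // 2] if p > 0]
    if sum(p for _, p in top) >= threshold:
        return [s for s, _ in top]
    return None


if __name__ == "__main__":
    # Rows A and E from the distribution table (slices H, B, G, C).
    print(partition_report({"B": 86, "G": 14}, n_slices=4))
    # -> ['B', 'G']
    print(partition_report({"H": 77, "B": 10, "G": 8, "C": 5}, n_slices=4))
    # -> None
```

With these inputs sequence A is reported as split between subtrees B and G, while sequence E is too widely spread (77% + 10% is below 95%) to be reported.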