Typing Staphylococcus aureus using the protein A gene Phaedra Agius – January, 2008, completed at RPI in New York in collaboration with Barry Kreiswirth,


Typing Staphylococcus aureus using the protein A gene Phaedra Agius – January, 2008, completed at RPI in New York in collaboration with Barry Kreiswirth, Steve Naidich, Kristin Bennett

- Introduction
- What is staph?
- Typing methods and the spA gene
- The data
- Comparing Sequences
- Similarities and differences
- Hierarchical clustering
- Evaluating the results
- Multidimensional Scaling
- Conclusion

Staphylococcus aureus is a bacterium that often lives on the skin or in the nose of a healthy person. Staph can cause a multitude of infections, from skin infections to more deadly infections such as pneumonia and meningitis. It can spread rapidly. Some strains are resistant to antibiotics (MRSA).

Typing Methods
Multilocus Sequence Typing (MLST) is a well-established typing method that looks at 7 house-keeping genes in staph. These are genes that are always turned on. Our method looks at just ONE gene: the spA gene.

The spA gene
The spA gene contains information for making protein A. Protein A in staph is a virulence factor. It inhibits white blood cells from ingesting and destroying the bacteria by acting as an immunological disguise.

Preprocessed DNA sequences of the spA gene
AAA GAG GAAGACAACAACAAGCCTGGT
AAA GAAGATGGCAACAAGCCTGGT
AAA GAAGACAACAAAAAACCTGGC
AAA GAAGATGGCAACAAACCTGGT
AAA GAAGACGGCAACAAGCCTGGT
AAA GAAGATGGCAACAAGCCTGGT
X1 K1 A1 O1 M1 Q1
The spA DNA sequences can be preprocessed into a sequence of repeats, or cassettes. Instead of dealing with the long DNA sequences, we use the shorter preprocessed spa sequences, for example X1-K1-A1-O1-M1-Q1. Note that the first cassette has 27 bp; the others have 24 bp.

Labeled data
- The MLST allelic profile is provided for each sequence.
- 194 sequences are labeled with their MLST type.
(Table: spa sequences alongside their MLST labels.)

Comparing spa sequences
T1-J1-M1-G1-M1-K1
T1-K1-B1-M1-D1-M1-G1-M1-K1
T1-M1-B1-M1-D1-M1-G1-M1-K1
T1-M1-D1-M1-G1-M1-M1-K1
U1-J1-F1-K1-P1-E1
T1-J1-F1-K1-B1-P1-E1
U1-J1-G1-F1-M1-B1
These 'preprocessed' sequences are highly conserved. How can we generate numbers from sequences that reflect the subtle differences and/or similarities between them?

Comparing spa sequences
- Global alignment
- Affine alignment
- BCGS: Best common gap-weighted subsequence
- Weighting the sequence ends (B and E)
Using these methods, each spa sequence can be represented as a vector of similarity scores between itself and all the other sequences.

Global alignment
Costs: Gap = 1, Mismatch = 1
C L O U D Y D A Y
G * O * * A W A Y
Distance: d = 5, Similarity: s = 2
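A unit-cost global alignment of this kind is commonly computed with the Needleman-Wunsch dynamic program. The following is a minimal sketch rather than the authors' implementation: the function name `global_alignment_distance` is hypothetical, and treating each cassette (e.g. `T1`) as a single token is an assumption made for illustration. The two input sequences are taken from the "Comparing spa sequences" slide.

```python
def global_alignment_distance(a, b, gap=1, mismatch=1):
    """Minimum total cost of globally aligning token sequences a and b."""
    n, m = len(a), len(b)
    # dist[i][j] holds the cheapest alignment of a[:i] with b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        dist[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub,  # match or mismatch
                dist[i - 1][j] + gap,      # gap in b
                dist[i][j - 1] + gap,      # gap in a
            )
    return dist[n][m]


# Cassettes are split on "-" so that each repeat is one symbol.
s1 = "T1-J1-M1-G1-M1-K1".split("-")
s2 = "T1-K1-B1-M1-D1-M1-G1-M1-K1".split("-")
print(global_alignment_distance(s1, s2))
```

Working on cassette tokens rather than raw characters keeps a two-character code such as `M1` from being scored as two separate symbols.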

Affine gap alignment
Costs: Gap Initialization = 2, Gap = 1, Mismatch = 1
Global:
U1 J1 G1 F1 B1 B1 B1 B1 P1 B1
T1 J1 *  *  B1 B1 B1 *  *  D
Distance = 8, Similarity = 4
Affine:
U1 J1 G1 F1 B1 B1 B1 B1 P1 B1
T1 J1 *  *  *  *  B1 B1 B1 D
Distance = 7, Similarity = 3

BCGS: Best Common Gap-weighted Subsequence
P A R T Y H A R D
P A N T * * * R Y
Common subsequences are: S1 = A, T, R; S2 = AT; S3 = TR; S4 = ATR
Gap-weighted scores: choose a weight $0 < \lambda \le 1$
$S_1 = 1 \cdot \lambda^0 = 1$, $S_2 = 2\lambda$, $S_3 = 2\lambda^3$, $S_4 = 3\lambda^4$

If $\lambda = 1$, then S4 is the optimal choice.
If $\lambda = 0.9$, the scores are 1, 1.8, 1.46 and 1.97 respectively.
If $\lambda = 0.8$, the scores are 1, 1.6, 1.02 and 1.23 respectively.
S1 = A, T, R; S2 = AT; S3 = TR; S4 = ATR
$S_1 = 1 \cdot \lambda^0 = 1$, $S_2 = 2\lambda$, $S_3 = 2\lambda^3$, $S_4 = 3\lambda^4$
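The scores on these two slides follow the pattern length × λ^gaps, where the exponent is the number of alignment columns skipped inside the subsequence. The sketch below only scores the four candidates listed on the slide; it does not search for common subsequences, and the names `gap_weighted_score`, `candidates` and `best_candidate` are hypothetical.

```python
def gap_weighted_score(length, gaps, lam):
    """Score a common subsequence as length * lam**gaps, with 0 < lam <= 1."""
    if not 0 < lam <= 1:
        raise ValueError("lam must satisfy 0 < lam <= 1")
    return length * lam ** gaps


# (length, gaps) for each candidate from the PARTYHARD / PANTRY example.
candidates = {"A": (1, 0), "AT": (2, 1), "TR": (2, 3), "ATR": (3, 4)}


def best_candidate(lam):
    """Return the candidate subsequence with the highest gap-weighted score."""
    return max(candidates, key=lambda s: gap_weighted_score(*candidates[s], lam))


for lam in (1.0, 0.9, 0.8):
    scores = {s: round(gap_weighted_score(n, g, lam), 2)
              for s, (n, g) in candidates.items()}
    print(lam, scores, best_candidate(lam))
```

Lowering λ penalizes gaps more heavily, which is why the preferred subsequence shifts from ATR to the shorter but more compact AT at λ = 0.8.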

Normalizing the similarity scores
The similarity scores M are normalized as follows: $M_{\text{norm}} = M / \sqrt{n_1 n_2}$, where $n_1$ and $n_2$ are the sequence lengths.
Example:
C L O U D Y D A Y
G * O * * A W A Y
Similarity = 3, Normalized similarity = 3/√(7*4) = 0.57
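The arithmetic in this example can be checked with a few lines of code. This is an illustrative sketch; the function name `normalized_similarity` is hypothetical, and the inputs are the numbers quoted on the slide.

```python
import math


def normalized_similarity(m, n1, n2):
    """Divide a raw similarity m by the geometric mean of the two lengths."""
    return m / math.sqrt(n1 * n2)


# Values from the slide: similarity 3, lengths 7 and 4.
print(round(normalized_similarity(3, 7, 4), 2))
```

Dividing by the geometric mean of the lengths keeps long sequences from scoring higher simply because they offer more positions to match.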

B and E
The cassettes at the beginning (B) and end (E) of a sequence are highly conserved within spa families. These cassettes are compared separately, scored as a match (1) or mismatch (0), and weighted. The remaining cassettes form the middle (M).
Let B and E each have a weight of 20% in the overall score:
Sim score = 0.2*B + 0.6*M + 0.2*E
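The weighted score can be sketched as below. The function name `weighted_similarity` and the parameter `middle_sim` are hypothetical; `middle_sim` stands for the normalized similarity of the middle cassettes, assumed to be computed separately by one of the alignment methods.

```python
def weighted_similarity(seq1, seq2, middle_sim, w_b=0.2, w_e=0.2):
    """Combine begin/end cassette matches with the middle similarity."""
    b = 1.0 if seq1[0] == seq2[0] else 0.0      # first cassettes match?
    e = 1.0 if seq1[-1] == seq2[-1] else 0.0    # last cassettes match?
    w_m = 1.0 - w_b - w_e                       # remaining weight goes to M
    return w_b * b + w_m * middle_sim + w_e * e


s1 = "T1-J1-M1-G1-M1-K1".split("-")
s2 = "T1-K1-B1-M1-D1-M1-G1-M1-K1".split("-")
# middle_sim = 0.5 is a made-up value used only to show the arithmetic.
print(weighted_similarity(s1, s2, middle_sim=0.5))
```

With the default weights this reproduces 0.2*B + 0.6*M + 0.2*E; the weights are left as parameters because the training step tunes B and E.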

Similarities → Distances
Normalized similarity scores can be transformed to distances as follows:
$D(s_1, s_2) = 1 - \mathrm{sim}(s_1, s_2)$
Spa sequence → vector of distances between that sequence and every other sequence in the dataset.
The set of spa sequences is now represented by a (normalized) distance matrix.
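Building the distance matrix from a normalized similarity is a direct application of D = 1 − sim. In the sketch below, `distance_matrix` is a hypothetical helper and `toy_sim` is a stand-in similarity (shared cassettes over all cassettes) used only so the example runs; it is not one of the alignment methods from the slides.

```python
def distance_matrix(seqs, sim):
    """D[i][j] = 1 - sim(seqs[i], seqs[j]) for a similarity scaled to [0, 1]."""
    return [[1.0 - sim(a, b) for b in seqs] for a in seqs]


def toy_sim(a, b):
    # Stand-in only: fraction of distinct cassettes shared by a and b.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


seqs = [s.split("-") for s in (
    "U1-J1-F1-K1-P1-E1",
    "T1-J1-F1-K1-B1-P1-E1",
    "U1-J1-G1-F1-M1-B1",
)]
D = distance_matrix(seqs, toy_sim)
for row in D:
    print([round(d, 3) for d in row])
```

Row i of `D` is the vector of distances that represents sequence i in the clustering step.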

Hierarchical Clustering
- Uses a distance matrix
- Iteratively 'merges' the two nearest items/clusters
- Cutoff c: this determines the number of clusters to be formed
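The merging procedure can be sketched in plain Python as follows. Two assumptions are made here that the slides do not state: cluster-to-cluster distance is taken as the average pairwise distance (average linkage), and the cutoff c is read as the largest distance at which a merge is still allowed. The function names and the toy matrix are hypothetical.

```python
def average_linkage(dist, c1, c2):
    """Mean pairwise distance between members of clusters c1 and c2."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def hierarchical_clustering(dist, cutoff):
    """Merge the two nearest clusters until the nearest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(dist))]   # start with singletons
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(dist, clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break                                # nothing close enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy distance matrix: items 0 and 1 are close, as are items 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(hierarchical_clustering(dist, cutoff=0.5))
```

A small cutoff leaves many small clusters and a large one merges everything, which is how c controls the number of clusters.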

Training and Testing
- Split the data into two: a TRAINING set and a TEST set
- Build a model on the Training set by choosing optimal B, E and c parameters
- Assign the Test data to the nearest clusters
- Evaluate the results
- Repeat multiple times for validation

Assigning Test sequences to the Training clusters
We define the distance between a point and a cluster to be the mean of the distances between that point and the members of the cluster.
- IF the distance between a test point and the nearest cluster exceeds an outlier threshold t, the test point is defined to be an outlier (a novel strain of the bacteria).
- ELSE the test point is assigned to the nearest cluster.
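The assignment rule above can be sketched as a single function. The name `assign_to_cluster` is hypothetical, and the threshold `t` is treated here as an absolute distance; the results slides quote the threshold in standard deviations, and how that maps to a distance is not stated in the transcript.

```python
def assign_to_cluster(dists_to_train, clusters, t):
    """Return the index of the nearest cluster, or None for an outlier.

    dists_to_train[i] is the distance from the test point to training item i;
    clusters is a list of lists of training-item indices.
    """
    # Point-to-cluster distance = mean distance to the cluster's members.
    means = [sum(dists_to_train[i] for i in c) / len(c) for c in clusters]
    nearest = min(range(len(clusters)), key=means.__getitem__)
    if means[nearest] > t:
        return None          # too far from every cluster: flag as novel
    return nearest


clusters = [[0, 1], [2, 3]]
print(assign_to_cluster([0.1, 0.3, 0.8, 0.9], clusters, t=0.5))
print(assign_to_cluster([0.7, 0.9, 0.8, 0.9], clusters, t=0.5))
```

Returning `None` for outliers keeps novel strains out of the accuracy calculation unless the caller chooses to count them.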

Evaluation
- Compare our clusters to the groups defined by the MLST labels via the Jaccard coefficient
- Split our data into a Training and Testing set multiple times and measure the consistency of the clusters formed via a Stability score
- Measure the Accuracy of our spa groups by comparing them to the MLST groups

Jaccard coefficient
(Figure: clustering S compared with clustering M.)
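The formula on this slide did not survive in the transcript. A common way to compare two clusterings with a Jaccard coefficient is the pair-counting form: the number of item pairs grouped together in both clusterings, divided by the number grouped together in at least one. The sketch below assumes that definition; the function name `jaccard` and the toy labels are hypothetical.

```python
from itertools import combinations


def jaccard(labels_s, labels_m):
    """Pair-counting Jaccard coefficient between two clusterings.

    labels_s[i] and labels_m[i] are the cluster labels of item i.
    """
    both = either = 0
    for i, j in combinations(range(len(labels_s)), 2):
        same_s = labels_s[i] == labels_s[j]
        same_m = labels_m[i] == labels_m[j]
        both += same_s and same_m      # pair is together in S and in M
        either += same_s or same_m     # pair is together in S or in M
    return both / either if either else 1.0


S = [0, 0, 1, 1]   # toy clustering S
M = [0, 0, 0, 1]   # toy clustering M
print(jaccard(S, M))
```

Because only co-membership of pairs is compared, the cluster labels themselves do not need to agree between the two clusterings.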

Stability
The stability is measured over the n Training and Testing iterations. It is defined to be the mean of the Jaccard scores measured pairwise between the spa clusterings obtained at each iteration.
(Figure: spa clusterings 1, 2 and 3 from iterations 1, 2, 3, ..., with pairwise Jaccard scores J1, J2, J3.)
Stability = mean(J1, J2, J3)
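The stability score is a mean over all pairs of clusterings. The sketch below assumes that each clustering is given as a list of labels over the same items and that the Jaccard score is the pair-counting form; neither detail is spelled out in the transcript, and the names `pair_jaccard` and `stability` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pair_jaccard(a, b):
    """Pair-counting Jaccard coefficient between two label lists."""
    both = either = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability(clusterings):
    """Mean Jaccard score over every pair of clusterings."""
    return mean([pair_jaccard(a, b) for a, b in combinations(clusterings, 2)])


# Three toy clusterings, as if obtained from three iterations.
runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
print(stability(runs))
```

A value near 1 means the training splits keep producing the same groups.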

Accuracy
The MLST label assigned to a spa group is the label of the MLST group with which the spa group has the largest intersection. The accuracy for that spa group is defined to be the percentage of correctly labeled points in the group. The overall accuracy of a spa clustering is defined to be the percentage of correctly labeled points over all groups.
(Figure: a spa group overlaid on an MLST group; Accuracy = 8/11.)
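The accuracy definition amounts to a majority vote inside each spa group. The sketch below is illustrative: the function name `clustering_accuracy` is hypothetical, and the toy data (one spa group of 11 points, 8 of which share an MLST label) is made up to reproduce the 8/11 shown on the slide.

```python
from collections import Counter


def clustering_accuracy(spa_groups, mlst_labels):
    """Fraction of points whose MLST label matches their spa group's majority."""
    correct = 0
    for group in set(spa_groups):
        members = [m for g, m in zip(spa_groups, mlst_labels) if g == group]
        # The group takes the MLST label it overlaps with most.
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(spa_groups)


spa = [0] * 11                   # a single spa group with 11 points
mlst = ["A"] * 8 + ["B"] * 3     # 8 points share the majority MLST label
print(clustering_accuracy(spa, mlst))
```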

Results: Jaccard scores (40 iters, outlier threshold = 1.5 sd)

Results: Stability scores (40 iters, outlier threshold = 1.5 sd)

Results: Accuracy scores (40 iters, outlier threshold = 1.5 sd)

Results: Outlier detection (40 iters, outlier threshold = 1.5 sd)

Results: Varying the Outlier threshold (10 iters, test set size = 30%)

Multidimensional Scaling (MDS)
MDS translates a distance matrix into a set of coordinates such that the distances between the points are approximately equal to the dissimilarities.
Picture taken from Forrest W. Young's paper 'Multidimensional Scaling'

MDS with our distances

MDS – a closer look

Conclusion and future work
- The spa clustering method can refine groups in ways that MLST cannot
- BCGS worked best
- MDS on our spa distances clearly draws out the clusters
Future research
- More data, compare to other typing methods
- Use BCGS on other data types
- Different distance measures
- Different ways of assigning test points to clusters
- Better ways of finding the optimal parameters than a grid search

References
- spa Typing Method for Discriminating among Staphylococcus aureus Isolates: Implications for Use of a Single Marker to Detect Genetic Micro- and Macrovariation. Larry Koreen, Srinivas Ramaswamy, Edward Graviss, Steven Naidich, James Musser and Barry Kreiswirth.
- Evaluation of Protein A Gene Polymorphic Region DNA Sequencing for Typing of Staphylococcus aureus Strains. B. Shopsin, M. Gomes, S.O. Montgomery, D.H. Smith, M. Waddington, D.E. Dodge, D.A. Bost, M. Riehman, S. Naidich and B. Kreiswirth.
- Introduction to Computational Molecular Biology. Joao Setubal and Joao Meidanis.
- Kernel Methods for Pattern Analysis. John Shawe-Taylor and Nello Cristianini.
- Framework for Kernel Regularization with Application to Protein Clustering. Fan Lu, Sunduz Keles, Stephen J. Wright and Grace Wahba.

This work is published in IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 4, Issue 4, Oct.-Dec Page(s):
Thanks! Questions?