Published by Τίμων Παπαϊωάννου. Modified over 6 years ago.
1
Discovery of Prokaryotic Relationships through Latent Structure
Natalya Muzinich
Advisors: Dr. John Paolillo, Dr. Sun Kim
2
Importance of Studying Bacterial Relationships
- Identification of regions
  - distinguishing pathogenic from non-pathogenic strains
  - important for general cell functioning
    - e.g., minimal gene set
    - e.g., non-coding RNA
- Better understanding of evolution
  - lost genes
  - rearranged genes
  - congruent evolution
11/10/2018
3
Common Computational Methodology in Relating Biological Organisms
- Alignment
  - BLAST
- Phylogenies
  - Maximum likelihood
  - Neighbor joining
  - Bayesian
  - Maximum parsimony
4
Problems with Existing Methods
- Based on a subset of available genetic information
  - Potentially leads to inaccurate results
- Regions that do not code for proteins are typically excluded from analysis
  - They have been shown to be important: “for decades the functions of RNA in the cells was grossly underestimated. It is difficult to assess the number of noncoding RNAs yet to be discovered. Unlike protein-coding genes…, RNA-coding genes are much more difficult to find within the genomic sequences.” (Szymanski et al., 2003)
5
Problems with Existing Methods
- Computationally expensive
- Problems
  - Sequence length for BLAST
  - Number of genomes for many methods of phylogenetic tree construction
    - e.g., runtime is in O(n!) for maximum parsimony
  - Only a very limited number of genomes can be classified
    - e.g., around 10 with maximum parsimony
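To give a feel for the factorial-type growth behind the limit of about 10 genomes: an exhaustive tree search has to consider every unrooted bifurcating tree, and for n taxa there are (2n-5)!! of them. That count is a standard combinatorial result and is not stated on the slide; the function name below is illustrative only.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees on n_taxa labeled leaves: (2n-5)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n <= 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 10, 20):
    print(n, num_unrooted_trees(n))
```

With 10 taxa there are already about two million candidate trees, and each additional taxon multiplies the count by a larger odd number.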
6
Project Goals
- Utilize information in the whole genomic sequence
- Scalability
  - Be able to compare a large number of genomes quickly
7
Proposal: Divide-and-Conquer
- Divide a complete genomic sequence into overlapping subsequences
- Construct a genomic representation from the subsequences (cf. shotgun sequencing)
8
Representation Construction
- Select a subsequence length (7)
- Count all occurrences of every possible subsequence ($4^7 = 16384$) within a unit of genetic information
- Pair each subsequence with its frequency of occurrence in a named vector
- Collect all vectors into a matrix
- Units can be chromosomes, plasmids, or complete genomes, depending on the purpose of the analysis
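The construction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Patdist/C pipeline used in the project: the function names, the ACGT-only alphabet, and the choice to ignore windows containing other symbols (such as N) are all assumptions.

```python
from collections import Counter
from itertools import product

K = 7              # subsequence length from the slide
ALPHABET = "ACGT"  # assumption: only unambiguous bases are counted


def kmer_vector(sequence: str, k: int = K) -> dict:
    """Count overlapping k-mers and return a named vector over all 4**k k-mers."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed ordering so that every unit is described by the same names.
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {kmer: counts.get(kmer, 0) for kmer in all_kmers}


def build_matrix(units: dict, k: int = K):
    """Rows are k-mers, columns are units (chromosomes, plasmids, ...)."""
    names = list(units)
    vectors = [kmer_vector(units[name], k) for name in names]
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    rows = [[vec[kmer] for vec in vectors] for kmer in kmers]
    return kmers, names, rows


kmers, names, rows = build_matrix({"unit_a": "ACGTACGTAC", "unit_b": "AAAAAAAA"})
print(len(kmers), names, rows[0])  # first row belongs to "AAAAAAA"
```

Because the windows overlap, a sequence of length L contributes L - 6 heptamers, so the column sums track the sequence lengths.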
9
Data Matrix
10
Genomes
- Represented by the same subsequences
- Differ in the distribution of the subsequences
11
Proposal: Latent Structure Modeling
- Reasons behind correlations?
- A statistical method that captures correlations in the observed data through unobserved (latent) variables
- Principal Component Analysis (PCA)
- Singular value decomposition (SVD) detects these correlations
12
SVD
- Requires data arranged in a matrix
- Values are correlations between the observed data (rows) and their representation (columns)
- In this project, rows are heptamers and columns are genomic units (chromosomes and plasmids)
13
SVD
- Decomposes any matrix into a product of three matrices: $X = U S V^{T}$
- Derives a system of coordinates in a high-dimensional space for
  - the observed data (matrix $U$)
  - its representation (matrix $V^{T}$)
- $S$ measures variance along each dimension
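A small numerical check of the decomposition, assuming NumPy is available. The random 6 x 4 matrix is a toy stand-in for the heptamer-by-unit matrix, not project data, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # toy stand-in: 6 "heptamers" x 4 "genomic units"

# Thin SVD: U is 6x4, s holds the singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The product of the three factors reproduces X.
X_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(X, X_rebuilt))

# Share of the total sum of squares carried by each dimension.
share = s**2 / np.sum(s**2)
print(share)
```

NumPy returns the singular values in descending order, so the leading dimensions carry the largest share. Reading that share as variance assumes the data were centered beforehand.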
14
Measuring Similarity: Coordinate System
- Allows vectors to be used for measuring distance among the data points
- Distance measures
  - Euclidean
  - cosine similarity
- In this project the data are normalized, so Euclidean distance works well
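One way to see why normalization makes the choice of measure less critical: for vectors scaled to unit length, squared Euclidean distance equals 2(1 - cosine similarity), so both measures rank pairs identically. The sketch below illustrates that special case. Unit-length scaling is an assumption made here for illustration; the slides describe row Z-scores, for which the identity does not hold exactly.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def unit(a):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a]


u, v = unit([3.0, 4.0]), unit([4.0, 3.0])
print(euclidean(u, v) ** 2, 2 * (1 - cosine_similarity(u, v)))
```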
15
Proposal: Hierarchical Clustering
- Input: factor scores computed from the SVD results, giving the matrix $F = (\sqrt{S}\, V^{T})^{T}$
- Method: Ward's
  - Minimizes variance within each cluster
  - Uses the error sum-of-squares criterion (ESS)
  - Principal components minimize ESS
  - Clustering correlates with PCA results
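The factor scores and Ward's merge criterion can be sketched as follows, assuming NumPy. This shows only the ESS bookkeeping behind a single merge decision, not a full clustering run (which would repeatedly merge the pair of clusters with the smallest cost). The function names and the toy matrix are illustrative assumptions.

```python
import numpy as np


def factor_scores(X):
    """F = (sqrt(S) V^T)^T: one row of scores per column (genomic unit) of X."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (np.diag(np.sqrt(s)) @ Vt).T


def ess(points):
    """Error sum of squares of one cluster around its centroid."""
    return float(np.sum((points - points.mean(axis=0)) ** 2))


def ward_cost(a, b):
    """Increase in total ESS caused by merging clusters a and b."""
    na, nb = len(a), len(b)
    gap = a.mean(axis=0) - b.mean(axis=0)
    return na * nb / (na + nb) * float(gap @ gap)


rng = np.random.default_rng(1)
X = rng.random((8, 5))   # toy matrix: 8 rows (heptamers) x 5 columns (units)
F = factor_scores(X)     # shape (5, 5): one row per unit

a, b = F[:2], F[2:]
# The closed-form cost equals the directly computed rise in ESS.
print(ward_cost(a, b), ess(np.vstack([a, b])) - ess(a) - ess(b))
```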
16
Data and Related Tools
- Data
  - 58 bacterial genomes from NCBI GenBank
  - 83 units in total (including plasmids and second chromosomes)
  - log-transformed
  - normalized (by row, Z-scores)
- Patdist program of Dr. Kim
  - produces a list of the heptamers at every position in a genome
- C program
  - pairs subsequences with their frequencies
- R programming language for statistical analysis
17
Related Work
G.W. Stuart and M.W. Berry, Comprehensive Whole Genome Bacterial Phylogeny Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space, J. of Bioinformatics and Computational Biology 1(3) (2003), pp
18
Stuart and Berry (S&B) vs. Current Approach
- S&B represented genomes as tetrads of peptides
- Numrows(S&B) = 160000 ($20^4$); Numrows(current) = 16384 ($4^7$)
- S&B ≈ 10 × current
- S&B built a phylogenetic tree using the PHYLIP v3.6 package
19
Singular Values
20
Classification – NOT a Phylogeny
21
Principal Component Dimensions 1 & 2
22
Principal Component Dimensions 2 & 4
23
Principal Component Dimensions 3 & 4
24
Principal Component Dimensions 5 & 6
25
Principal Component Dimensions 7 & 8
26
Principal Component Dimensions 8 & 9
27
Results
- 7 major groupings
  - Most plasmids
  - Mycobacteria
  - Rhizobiales
  - Enterobacteriales
  - Archaea
  - Gamma-Epsilon
  - Bacilli and Mycoplasma
28
Plasmids
29
Mycobacteria, Rhizobiales and Enterobacteria
30
Archaea
31
Gamma-Epsilon
32
Bacilli and Mycoplasma
33
Future Work
- Analysis and interpretation of heptamer clusters
- Establishing the relationship between heptamer and genetic clusters
- Principal components
34
Conclusion
- Flexible method: can utilize
  - all available genetic information
  - a subset of it
  - a choice of the unit of representation
- Scalable
35
References
- G.W. Stuart and M.W. Berry. Comprehensive Whole Genome Bacterial Phylogeny Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space. J. of Bioinformatics and Computational Biology 1(3) (2003), pp
- Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis". In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp , Kluwer: Norwell, MA (2003).
- Basilevsky, A. Statistical factor analysis and related methods: theory and applications. New York: J. Wiley (1994).
- Ward, Joe H. Hierarchical Grouping to optimize an objective function. Journal of American Statistical Association, 58(301) (1963),
- Rogozin et al. Congruent evolution of different classes of non-coding DNA in prokaryotic genomes. Nucleic Acids Research, 2002, Vol. 30, No. 19 pp
- Szymanski et al. Noncoding regulatory RNAs database. Nucleic Acids Research, 2003, Vol. 31, No. 1 pp
36
Acknowledgements
- Dr. John Paolillo
- Dr. Sun Kim
- Dr. Haixu Tang
- Informatics colleagues
37
Any Questions?
38
“A three-way genome comparison of the CFT073, enterohemorrhagic E. coli EDL933, and laboratory strain MG1655 reveals that, amazingly, only 39.2% of their combined (non-redundant) set of proteins actually are common to all three strains.” (N. Moran, 2004)