Discovery of Prokaryotic Relationships through Latent Structure

Natalya Muzinich
Advisors: Dr. John Paolillo, Dr. Sun Kim

Importance of Studying Bacterial Relationships
Identification of regions
  distinguishing pathogenic from non-pathogenic strains
  important for general cell functioning
    Ex. Minimal gene set
    Ex. Non-coding RNA
Better understanding of evolution
  lost genes
  re-arranged genes
  congruent evolution
11/10/2018

Common Computational Methodology in Relating Biological Organisms
Alignment
  BLAST
Phylogenies
  Maximum likelihood
  Neighbor joining
  Bayesian
  Maximum parsimony

Problems with Existing Methods
Based on a subset of available genetic information
  Potentially leads to inaccurate results
Regions non-coding for proteins are typically excluded from analysis
  They are shown to be important: “for decades the functions of RNA in the cells was grossly underestimated. It is difficult to assess the number of noncoding RNAs yet to be discovered. Unlike protein-coding genes…, RNA-coding genes are much more difficult to find within the genomic sequences.” (Szymanski et al., 2003)

Problems with Existing Methods
Computationally expensive
Problems
  Sequence length for BLAST
  Number of genomes for many methods of phylogenetic tree construction
    ex. runtime is in O(n!) for maximum parsimony, where n is the number of genomes
Very limited number of genomes can be classified
  ex. around 10 with maximum parsimony
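To give a feel for why an exhaustive tree search stalls at around ten genomes, the sketch below counts the unrooted bifurcating tree topologies that such a search would have to score. It uses the standard count (2n - 5)!! for n >= 3 labeled taxa. That formula is not taken from the slides, which only state factorial growth, and the function name is illustrative.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees on n_taxa labeled leaves.

    Standard count: (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5), valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("the formula applies to 3 or more taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (5, 10, 20, 58):
    print(n, count_unrooted_trees(n))
```

For 10 taxa the count is already 2,027,025 topologies, and it grows faster than exponentially from there.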

Project Goals
Utilize information in the whole genomic sequence
Scalability
  Be able to compare a large number of genomes quickly

Proposal: Divide-and-Conquer
Divide a complete genomic sequence into overlapping subsequences
Construct a genomic representation from the subsequences (cf. shotgun sequencing)

Representation Construction
Select subsequence length (7)
Count all occurrences of every possible (4^7 = 16384) subsequence within a unit of genetic information
Pair each subsequence with the frequency of its occurrence and arrange the pairs into a named vector
Collect all vectors into a matrix
Units can be chromosomes, plasmids, or complete genomes, depending on the purposes of the analysis
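The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Patdist or C programs used in the project: the function names and the toy sequences are hypothetical, and windows containing characters other than A, C, G, T are simply left out of the vector.

```python
from collections import Counter
from itertools import product

K = 7  # subsequence length used in the slides
ALPHABET = "ACGT"


def kmer_vector(sequence: str, k: int = K) -> dict[str, int]:
    """Count every overlapping k-mer; k-mers absent from the sequence get 0."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed ordering over all 4**k possible k-mers so that vectors line up.
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {kmer: counts.get(kmer, 0) for kmer in all_kmers}


def build_matrix(units: dict[str, str], k: int = K):
    """Rows are k-mers, columns are genomic units (chromosomes, plasmids, ...)."""
    names = list(units)
    vectors = [kmer_vector(units[name], k) for name in names]
    kmers = list(vectors[0])
    matrix = [[vec[kmer] for vec in vectors] for kmer in kmers]
    return kmers, names, matrix


# Toy sequences standing in for real genomic units (hypothetical data).
toy_units = {"unit_a": "ACGTACGTACGTAC", "unit_b": "TTTTTTTTTTGGGG"}
kmers, names, matrix = build_matrix(toy_units)
print(len(kmers), "rows x", len(names), "columns")
```

Keeping every possible heptamer as a row, even when its count is zero, is what lets all units share the same 16384 rows.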

Data Matrix

Genomes
Represented by the same subsequences
Differ in the distribution of the subsequences

Proposal: Latent Structure Modeling
Reasons behind correlations?
Statistical method capturing correlations in the observed data through unobserved (latent) variables
  Principal Component Analysis (PCA)
Singular value decomposition (SVD) detects these correlations

SVD
Requires data arranged in a matrix
Values are correlations between the observed data (rows) and their representation (columns)
In this project rows are heptamers, columns are genomic units (chromosomes and plasmids)

SVD
Decomposes any matrix into a product of three matrices: X = U S V^T
Derives the system of coordinates in a high-dimensional space for
  the observed data (matrix U)
  its representation (matrix V^T)
S measures variance along each dimension
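The decomposition can be tried on a small matrix with NumPy, as in the hedged sketch below. The project itself used R; Python and the random toy matrix are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in: 6 "heptamer" rows, 4 "genomic unit" columns.
X = rng.normal(size=(6, 4))

# full_matrices=False gives the compact form: U is 6x4, s has 4 values, Vt is 4x4.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Multiplying the three factors back together recovers X up to rounding.
X_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(X, X_rebuilt))

# Singular values are returned sorted from largest to smallest.
# Their squares give each dimension's share of the total sum of squares.
share = s**2 / np.sum(s**2)
print(share)
```

NumPy returns the diagonal of S as the vector `s` and returns V already transposed, which is why the factor is named `Vt` here.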

Measuring Similarity: Coordinate System
Allows vectors to be used to measure distances among the data points
Distance measures
  Euclidean
  cosine similarity
In this project the data is normalized, and Euclidean distance works well in this case
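The two measures named above behave differently when vectors differ only in length, which the following small sketch illustrates. The vectors are made-up coordinates, not project data.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


p = [1.0, 2.0, 2.0]
q = [2.0, 4.0, 4.0]  # same direction as p, twice the length

print(euclidean_distance(p, q))  # 3.0
print(cosine_similarity(p, q))   # 1.0
```

Cosine similarity ignores the difference in length, while Euclidean distance reports it. Once the data has been normalized, scale differences of this kind are largely removed, which is consistent with Euclidean distance working well here.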

Proposal: Hierarchical Clustering
Input: factor scores computed from the results of SVD, giving the matrix F = (√S * V^T)^T
Method: Ward’s
  Minimizes variance within each cluster
  Uses the error sum-of-squares (ESS) criterion
  Principal components minimize ESS
  Clustering correlates with PCA results
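A minimal sketch of this step, assuming NumPy and SciPy rather than the R code used in the project: factor scores are formed as F = (√S * V^T)^T and passed to Ward linkage. The toy matrix is constructed so that two groups of columns exist by design, and the choice of two clusters is made only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Toy matrix: 50 rows (stand-ins for heptamers), 6 columns (stand-ins for units).
# Columns 0-2 and 3-5 are drawn around different means, so two groups exist by design.
X = np.hstack([
    rng.normal(0.0, 0.1, size=(50, 3)),
    rng.normal(5.0, 0.1, size=(50, 3)),
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Factor scores as on the slide: F = (sqrt(S) V^T)^T, one row per genomic unit.
F = (np.diag(np.sqrt(s)) @ Vt).T

# Ward linkage merges, at each step, the pair of clusters whose union
# least increases the within-cluster sum of squares.
Z = linkage(F, method="ward")

# Cut the tree into two clusters (an arbitrary choice for this toy example).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The rows of `Z` describe the merge order and merge heights, which is the information a dendrogram plot would display.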

Data and Related Tools
Data
  58 bacterial genomes from NCBI GenBank
  83 genomic units in total (including plasmids and second chromosomes)
  log-transformed
  normalized (by row, Z-scores)
Patdist program of Dr. Kim
  a list of the heptamers at every position in a genome
C program
  subsequences paired with their frequencies
R programming language for statistical analysis
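The two preprocessing steps listed under Data, log transformation and row-wise Z-scores, might look like the sketch below. Two details are assumptions, since the slides do not specify them: zero counts are handled with log(1 + count), and the population standard deviation is used for scaling.

```python
import math
import statistics


def log_transform(matrix):
    # log(1 + count) keeps zero counts finite. The "+ 1" is an assumption;
    # the slides do not say how zeros were handled.
    return [[math.log1p(value) for value in row] for row in matrix]


def zscore_rows(matrix):
    """Standardize each row to mean 0 and population standard deviation 1."""
    scaled = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)
        if sd == 0:
            # A constant row carries no contrast between units; map it to zeros.
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([(value - mean) / sd for value in row])
    return scaled


# Toy heptamer-by-unit counts (hypothetical data).
counts = [[0, 3, 12], [5, 5, 5], [40, 2, 9]]
normalized = zscore_rows(log_transform(counts))
for row in normalized:
    print([round(value, 3) for value in row])
```

Standardizing by row puts every heptamer on the same scale, so frequent and rare heptamers contribute comparably to the later decomposition.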

Related Work
G.W. Stuart and M.W. Berry, Comprehensive Whole Genome Bacterial Phylogeny Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space, J. of Bioinformatics and Computational Biology 1(3) (2003), pp

Stuart and Berry (S&B) vs. Current Approach
Represented genomes as tetrads of peptides
Numrows(S&B) = 160000
Numrows(current) = 16384
  S&B ~ 10 * current
Built phylogenetic tree using the PHYLIP v3.6 package

Singular Values

Classification – NOT a Phylogeny

Principal Component Dimensions 1 & 2

Principal Component Dimensions 7 & 8

Results
7 major groupings
  Most plasmids
  Mycobacteria
  Rhizobiales
  Enterobacteriales
  Archaea
  Gamma-Epsilon
  Bacilli and Mycoplasma

Plasmids

Mycobacteria, Rhizobiales and Enterobacteria

Archaea

Gamma-Epsilon

Bacilli and Mycoplasma

Future Work
Analysis and interpretation of heptamer clusters
Establishing the relationship between heptamer and genetic clusters
Principal Components

Conclusion
Flexible method
  can utilize all available genetic information or a subset of it
  choice of the unit of representation
Scalable

References
G.W. Stuart and M.W. Berry. Comprehensive Whole Genome Bacterial Phylogeny Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space. J. of Bioinformatics and Computational Biology 1(3) (2003), pp
Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis". In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp , Kluwer: Norwell, MA (2003).
Basilevsky, A. Statistical factor analysis and related methods: theory and applications. New York: J. Wiley (1994).
Ward, Joe H. Hierarchical Grouping to optimize an objective function. Journal of American Statistical Association, 58(301) (1963),
Rogozin et al. Congruent evolution of different classes of non-coding DNA in prokaryotic genomes. Nucleic Acids Research, 2002, Vol. 30, No. 19 pp
Szymanski et al. Noncoding regulatory RNAs database. Nucleic Acids Research, 2003, Vol. 31, No. 1 pp

Acknowledgements
Dr. John Paolillo
Dr. Sun Kim
Dr. Haixu Tang
Informatics Colleagues

Any Questions?

“A three-way genome comparison of the CFT073, enterohemorrhagic E. coli EDL933, and laboratory strain MG1655 reveals that, amazingly, only 39.2% of their combined (non-redundant) set of proteins actually are common to all three strains.” (N. Moran, 2004)
