Discovery of Prokaryotic Relationships through Latent Structure


1 Discovery of Prokaryotic Relationships through Latent Structure
Natalya Muzinich. Advisors: Dr. John Paolillo, Dr. Sun Kim

2 Importance of Studying Bacterial Relationships
Identification of regions: distinguishing pathogenic from non-pathogenic strains; important for general cell functioning (e.g., minimal gene set, non-coding RNA). Better understanding of evolution: lost genes, re-arranged genes, congruent evolution. 11/10/2018

3 Common Computational Methodology in Relating Biological Organisms
Alignment: BLAST. Phylogenies: maximum likelihood, neighbor joining, Bayesian, maximum parsimony.

4 Problems with Existing Methods
Based on a subset of the available genetic information, which potentially leads to inaccurate results. Regions not coding for proteins are typically excluded from analysis, yet they are shown to be important: “for decades the functions of RNA in the cells was grossly underestimated. It is difficult to assess the number of noncoding RNAs yet to be discovered. Unlike protein-coding genes…, RNA-coding genes are much more difficult to find within the genomic sequences.” (Szymanski et al., 2003)

5 Problems with Existing Methods
Computationally expensive. Problems: sequence length for BLAST; number of genomes for many methods of phylogenetic tree construction (e.g., runtime is in O(n!) for maximum parsimony). Only a very limited number of genomes can be classified (e.g., around 10 with maximum parsimony).
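To give a feel for why exhaustive tree search restricts the number of genomes, the sketch below counts the unrooted binary tree topologies for n labeled taxa using the standard (2n-5)!! formula. It is an illustration added here rather than something taken from the slides, and the function name is hypothetical.

```python
def count_unrooted_topologies(n_taxa: int) -> int:
    """Number of distinct unrooted binary trees with n_taxa labeled leaves: (2n-5)!!."""
    if n_taxa < 3:
        return 1  # zero, one, or two leaves admit only a single topology
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, count_unrooted_topologies(n))
```

The count grows super-exponentially, so any method that scores every topology becomes impractical after a handful of taxa.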

6 Project Goals
Utilize information in the whole genomic sequence. Scalability: be able to compare a large number of genomes quickly.

7 Proposal: Divide-and-Conquer
Divide a complete genomic sequence into overlapping subsequences. Construct a genomic representation from the subsequences (cf. shotgun sequencing).

8 Representation Construction
Select a subsequence length (7). Count all occurrences of every possible ($4^7 = 16384$) subsequence within a unit of genetic information. Arrange the pairing of the subsequences with the frequency of their occurrence into a named vector. Collect all vectors into a matrix. Units can be chromosomes, plasmids, or complete genomes, depending on the purposes of the analysis.
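The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Patdist or C programs used in the project: the function names are hypothetical, only the given strand is counted, and windows containing characters other than A, C, G, T are skipped (both are assumptions).

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_count_vector(sequence: str, k: int = 7) -> dict[str, int]:
    """Count every overlapping k-mer; all 4**k possible k-mers appear as keys."""
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # assumption: skip windows with ambiguous bases such as N
            counts[kmer] += 1
    return counts


def build_matrix(units: dict[str, str], k: int = 7):
    """Rows are k-mers, columns are genomic units (chromosomes, plasmids, ...)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    names = list(units)
    vectors = {name: kmer_count_vector(units[name], k) for name in names}
    matrix = [[vectors[name][kmer] for name in names] for kmer in kmers]
    return kmers, names, matrix
```

With the default k = 7 each vector has 16384 entries, one per possible heptamer, so every genomic unit is described by the same set of rows regardless of its length.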

9 Data Matrix

10 Genomes
Represented by the same subsequences. Differ in the distribution of the subsequences.

11 Proposal: Latent Structure Modeling
Reasons behind correlations? A statistical method capturing correlations in the observed data through unobserved (latent) variables: Principal Component Analysis (PCA). Singular value decomposition (SVD) detects these correlations.

12 SVD
Requires data arranged in a matrix. Values are correlations between the observed data (rows) and their representation (columns). In this project rows are heptamers and columns are genomic units (chromosomes and plasmids).

13 SVD
Decomposes any matrix into a product of three matrices: $X = U S V^T$. Derives the system of coordinates in a high-dimensional space for the observed data (matrix $U$) and its representation (matrix $V^T$). $S$ measures variance along each dimension.
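As a small illustration of the decomposition, the sketch below runs NumPy's `np.linalg.svd` on a random toy matrix standing in for the heptamer-by-unit matrix. The toy data and variable names are assumptions; the project itself used R.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in: 16 rows ("heptamers") by 5 columns ("genomic units").
X = rng.normal(size=(16, 5))

# Thin SVD: U is 16x5, s holds the 5 singular values, Vt is 5x5.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)

# The three factors multiply back to the original matrix.
X_rebuilt = U @ S @ Vt
assert np.allclose(X, X_rebuilt)

# Share of the total sum of squares carried by each dimension; for centered
# data this is the share of variance.
explained = s**2 / np.sum(s**2)
print(np.round(explained, 3))
```

NumPy returns the singular values in descending order, so the leading dimensions are the ones carrying the most structure.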

14 Measuring Similarity: Coordinate System
Allows the use of vectors for measuring distance among the data points. Distance measures: Euclidean, cosine similarity. In this project the data are normalized, so Euclidean distance works well in this case.
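For reference, the two measures named above can be written out directly. This is a plain-Python sketch with hypothetical function names, not code from the project.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Euclidean distance is sensitive to vector magnitude while cosine similarity is not, which is why the scaling of the data matters when choosing between them.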

15 Proposal: Hierarchical Clustering
Input: factor scores computed from the results of SVD to derive a matrix $F = (\sqrt{S}\, V^T)^T$. Method: Ward's. Minimizes variance within each cluster. Uses the error sum-of-squares (ESS) criterion. Principal components minimize ESS. Clustering correlates with PCA results.
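A minimal sketch of this step, assuming NumPy and SciPy rather than the R tooling used in the project: factor scores are formed as $F = (\sqrt{S}\, V^T)^T$ and passed to SciPy's Ward linkage. The toy matrix and the choice of two clusters are assumptions made only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Toy data: 50 rows (stand-in for heptamers) by 6 columns (stand-in for
# genomic units). Columns 0-2 and 3-5 are drawn around different means.
X = np.hstack([
    rng.normal(0.0, 1.0, size=(50, 3)),
    rng.normal(10.0, 1.0, size=(50, 3)),
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Factor scores: one row per genomic unit, one column per dimension.
F = (np.sqrt(np.diag(s)) @ Vt).T

# Ward's method merges, at each step, the pair of clusters whose union gives
# the smallest increase in the within-cluster error sum of squares.
Z = linkage(F, method="ward")

# Cut the tree into (at most) two clusters; the number is an assumption.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the full tree rather than cutting it at a fixed number of clusters.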

16 Data and Related Tools
Data: 58 bacterial genomes from NCBI GenBank; 83 units in total (including plasmids and second chromosomes); log-transformed; normalized (by row, Z-scores). Patdist program of Dr. Kim: a list of the heptamers at every position in a genome. C program: subsequences paired with their frequencies. R programming language for statistical analysis.

17 Related Work G.W. Stuart and M.W. Berry, Comprehensive Whole Genome Bacterial Phylogeny Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space, J. of Bioinformatics and Computational Biology 1(3) (2003), pp

18 Stuart and Berry (S&B) vs. Current Approach
S&B represented genomes as tetrads of peptides. Numrows(S&B) = 160000; Numrows(current) = 16384; S&B ≈ 10 × current. S&B built a phylogenetic tree using the PHYLIP v3.6 package.
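The row counts can be checked by hand, assuming the tetrads are drawn from the 20 standard amino acids and the heptamers from the 4 nucleotides:

```latex
\[
20^{4} = 160000, \qquad 4^{7} = 16384, \qquad
\frac{20^{4}}{4^{7}} = \frac{160000}{16384} \approx 9.77 \approx 10 .
\]
```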

19 Singular Values

20 Classification – NOT a Phylogeny

21 Principal Component Dimensions 1 & 2

22 Principal Component Dimensions 2 & 4

23 Principal Component Dimensions 3 & 4

24 Principal Component Dimensions 5 & 6

25 Principal Component Dimensions 7 & 8

26 Principal Component Dimensions 8 & 9
27 Results
7 major groupings: most plasmids; Mycobacteria; Rhizobiales; Enterobacteriales; Archaea; Gamma-Epsilon; Bacilli and Mycoplasma.

28 Plasmids

29 Mycobacteria, Rhizobiales and Enterobacteria

30 Archaea

31 Gamma-Epsilon

32 Bacilli and Mycoplasma

33 Future Work
Analysis and interpretation of heptamer clusters. Establishing the relationship between heptamer and genetic clusters. Principal components.

34 Conclusion
Flexible method: can utilize all available genetic information or a subset of it; choice of the unit of representation. Scalable.

35 References
G.W. Stuart and M.W. Berry. Comprehensive Whole Genome Bacterial Phylogeny Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space. J. of Bioinformatics and Computational Biology 1(3) (2003), pp
Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis". In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp , Kluwer: Norwell, MA (2003).
Basilevsky, A. Statistical factor analysis and related methods: theory and applications. New York: J. Wiley (1994).
Ward, Joe H. Hierarchical Grouping to optimize an objective function. Journal of American Statistical Association, 58(301) (1963),
Rogozin et al. Congruent evolution of different classes of non-coding DNA in prokaryotic genomes. Nucleic Acids Research, 2002, Vol. 30, No. 19, pp
Szymanski et al. Noncoding regulatory RNAs database. Nucleic Acids Research, 2003, Vol. 31, No. 1, pp

36 Acknowledgements
Dr. John Paolillo, Dr. Sun Kim, Dr. Haixu Tang, Informatics colleagues

37 Any Questions?

38 “A three-way genome comparison of the CFT073, enterohemorrhagic E. coli EDL933, and laboratory strain MG1655 reveals that, amazingly, only 39.2% of their combined (non-redundant) set of proteins actually are common to all three strains.” (N. Moran, 2004)

