Seojin Bang
The goals of this review paper are:
To address problems and computational solutions that arise in the analysis of omics data.
To highlight fundamental algorithmic ideas that serve as a launching point for extracting biological insights from omics data.
This review focuses on three important areas.
PART 1. Processing, storage, and retrieval of high-throughput sequencing data
PART 2. Data mining for transcriptomics
PART 3. Integrative interactomics
R.Q. Wu et al. J DENT RES 2010;90:
PART 1 [Figure: sequencing workflow. An Original Sequence is fragmented and sequenced into short reads. The reads are then either assembled (Assembly) into an Assembled Sequence, or aligned (Alignment / Read Mapping) against a Reference genome to give an Aligned Sequence.]
PART 1 Genome Assembly
Problem: There are too many possible pairs of short sequences (reads) to compare them all for overlaps.
Solution: Use graph-based approaches such as the de Bruijn graph.
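To make the de Bruijn graph idea concrete, here is a minimal sketch in Python. It assumes error-free reads and a toy sequence with no repeated (k-1)-mers. The function names `de_bruijn_graph` and `walk` and the example reads are illustrative choices, not taken from the review. Each read is cut into k-mers, and each k-mer becomes an edge from its prefix to its suffix, so overlaps are found without comparing reads pairwise.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk(graph, start):
    """Follow unambiguous edges from `start` and spell out a contig."""
    contig, node, seen = start, start, {start}
    while node in graph and len(graph[node]) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle instead of looping forever
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads, given in arbitrary order.
reads = ["CGTGTA", "TAGCCG", "TGTAGT", "GCCGTG"]
graph = de_bruijn_graph(reads, k=5)
contig = walk(graph, "TAGC")
print(contig)
```

Real assemblers also have to handle sequencing errors, repeats, and both strands, which this sketch ignores.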
PART 1 Read Mapping
Problem: Mapping reads takes a huge amount of running time, and storing the reference genome requires a large amount of storage.
Solution: Use the FM-index technique, which is a hybrid of the BWT (Burrows-Wheeler transform) and the suffix array.
[Figure panels: BWT; Suffix Array]
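The combination of BWT and suffix array can be sketched as follows. This is a toy illustration, assuming exact matching only and a naive suffix sort. Production read mappers use far more compact and faster constructions. The helper names (`build_fm_index`, `find`) and the example reference are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, BWT, and the two lookup tables."""
    text += "$"  # sentinel, sorts before A, C, G, T
    # Naive suffix array: start positions of suffixes in sorted order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(set(text)):
        first[ch] = total
        total += text.count(ch)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (c == ch)
    return sa, bwt, first, occ


def find(pattern, index):
    """Backward search: return sorted start positions of exact matches."""
    sa, bwt, first, occ = index
    lo, hi = 0, len(bwt)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


reference = "TAGCCGTGTAGT"  # hypothetical reference fragment
index = build_fm_index(reference)
print(find("TAG", index))
print(find("GT", index))
```

The search cost depends on the read length rather than on the reference length, which is why this kind of index suits mapping many short reads against one large genome.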
PART 1 Large-scale genome sequence compressed storage and search
Problem: Previous compressive techniques require the data to be decompressed before computational analysis. As genomic libraries grow larger, any computational analysis that runs on the full library takes a long time.
Solution: Data are compressed in such a way that they can be searched efficiently and accurately without decompressing them first.
[Figure: Flow chart of CaBLAST]
PART 2 Data mining for transcriptomics
Identifying cell-specific expression signals
Problem: Heterogeneity of cell types may confound gene expression analysis.
Solution: Use a linear mixed model to identify the expression profile of each cell type from the overall expression signals.
[Figure: overall expression signals separated into cell-type-specific signals]
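The separation of an overall signal into cell-type-specific signals can be illustrated with a deliberately simplified sketch. It assumes only two cell types, known mixing proportions for every sample, and ordinary least squares for a single gene. It is not a full linear mixed model (there are no random effects). The function name and the numbers are hypothetical.

```python
def solve_two_cell_types(mixed, proportions):
    """Least-squares estimate of one gene's expression in two cell types.

    mixed:        measured expression of the gene in each mixed sample
    proportions:  (p, q) fractions of cell type 1 and 2 in each sample
    Model assumed here: mixed[s] = p[s] * a + q[s] * b + noise.
    """
    spp = sum(p * p for p, _ in proportions)
    spq = sum(p * q for p, q in proportions)
    sqq = sum(q * q for _, q in proportions)
    spy = sum(p * y for (p, _), y in zip(proportions, mixed))
    sqy = sum(q * y for (_, q), y in zip(proportions, mixed))
    # Solve the 2x2 normal equations in closed form.
    det = spp * sqq - spq * spq
    a = (sqq * spy - spq * sqy) / det
    b = (spp * sqy - spq * spy) / det
    return a, b


# Hypothetical noise-free data: true levels are 10.0 (type 1) and 2.0 (type 2).
proportions = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1)]
mixed = [p * 10.0 + q * 2.0 for p, q in proportions]
a, b = solve_two_cell_types(mixed, proportions)
print(round(a, 6), round(b, 6))
```

With noisy data and more cell types the same idea applies, but the estimates need at least as many samples with distinct proportions as there are cell types.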
Identifying regulatory genes and modules in a disease-based analysis
Problem: How can we construct a gene regulatory network so that the network is sparsely structured?
Solution:
1. Remove non-significant correlations between pairs of genes using a predefined threshold or a penalized method such as the lasso.
2. Find a gene set of minimum size whose expression profiles linearly fit the given genes of interest (SPARCLE).
[Figure: Subnetwork of the breast cancer gene regulatory network for the biological process cell cycle. Emmert-Streib et al. Front. Genet. 2014]
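The thresholding variant of the first approach above might look like the following sketch. It assumes Pearson correlation as the association measure and a hypothetical cutoff of 0.8. The lasso and SPARCLE are not shown. Gene names and expression values are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def sparse_network(expression, threshold):
    """Keep only gene pairs whose absolute correlation reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


# Hypothetical expression profiles across five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # closely tracks geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly related to both
}
edges = sparse_network(expression, threshold=0.8)
print(edges)
```

A fixed cutoff is the simplest way to get sparsity. Penalized regression chooses the retained edges jointly instead of one pair at a time.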
Identifying gene expression alterations in disease
Problem: How can we distinguish passenger genes from driver genes of a cancer within a copy number variation region?
Solution: CONEXIC integrates copy number variation and gene expression data from tumor samples to identify driving mutations and the processes they influence.
Problem: Genetic alterations can differ between patients with the same disease, but they often involve common pathways.
Solution: PARADIGM and PARADIGM-SHIFT analyze cancer transcriptomic profiling data sets at the pathway level, because genetic alterations in different patients often involve common pathways.
PART 3 Integrative interactomics
Analysis of heterogeneous genomic data sets
Networks or interactomes are commonly represented as graphs. We can define subnetworks (modules) as we did for protein-protein and regulatory interaction networks.
Problem: How can we find modules that are specific to conditions of interest?
Solution: Represent the data as a graph. Node: gene, RNA, protein or metabolite. Edge: known interactions among them.
Interactome analysis of disease data sets
Although the genes underlying a disease may differ among individuals, pathways are likely to be shared, and thus proteins associated with the same disease tend to interact.
Problem: How can we test the modularity of genes that are putatively associated with a specific disease?
Solution: Assess the significance of each module by comparing its score with the scores computed on randomized networks.
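A randomization test of this kind can be sketched as below. As a simplifying assumption, the sketch scores a module by the number of interactions among its genes and compares it with random gene sets of the same size drawn from the same network. This is a simpler stand-in for the randomized networks mentioned above (for example, degree-preserving rewiring). The toy network, the function names, and the number of randomizations are all hypothetical.

```python
import random


def edges_within(module, edges):
    """Count interactions whose two endpoints both lie in the module."""
    module = set(module)
    return sum(1 for a, b in edges if a in module and b in module)


def module_p_value(module, nodes, edges, n_random=1000, seed=0):
    """Empirical p-value: how often a random gene set is as dense as the module."""
    rng = random.Random(seed)
    observed = edges_within(module, edges)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(nodes, len(module))
        if edges_within(sample, edges) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_random + 1)


# Toy interactome: 20 genes, a fully connected 4-gene module plus a chain.
nodes = [f"g{i}" for i in range(20)]
module = nodes[:4]
edges = [(a, b) for i, a in enumerate(module) for b in module[i + 1:]]
edges += [(nodes[i], nodes[i + 1]) for i in range(3, 19)]

observed, p_value = module_p_value(module, nodes, edges)
print(observed, p_value)
```

A small p-value indicates that the putative disease genes interact with each other more than same-sized random gene sets do.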
Conclusion and Future Prospects
We addressed problems and computational solutions that arise in the analysis of omics data.
Compressive techniques for next-generation sequencing read data sets and their quality scores remain a major challenge.
Transcriptomic data are shifting from microarrays to next-generation sequencing, so we will also need to develop transcriptomic analysis methods that handle the new form of data.
Much future work in integrative interactomics will focus on characterizing the differences that distinguish individuals and cells from one another.