Knowledge Engine for Genomics (KnowEnG):
Cloud-based Environment for Scalable Analyses of Genomic Signatures
Charles Blatti, Postdoctoral Research Associate
KnowEnG Center of Excellence in Big Data Computing, University of Illinois at Urbana-Champaign
November 2016
Hello, my name is Charles Blatti. I am a postdoctoral researcher in KnowEnG from the University of Illinois. I am here today to present our Knowledge Engine for Genomics platform.

Genomic Data Analysis Using Prior Knowledge in a Scalable Cloud
[Slide diagram: User Spreadsheet (Genes x Samples; RNA-seq, somatic mutations, etc.); Analysis Pipelines (Subtype Stratification: Network Smoothing, NMF Clustering, Consensus Clustering); Knowledge Network (physical interactions, co-expression, pathways, biological processes, text mining, etc.); User Interface]
The primary purpose of the KnowEnG platform is to enable analysis of genomic data using prior knowledge on a scalable cloud infrastructure. We aim to build a tool that can serve the medical or biological researcher who has genomic profiles from multiple patients or samples. This user spreadsheet may contain RNA-seq or somatic mutation data, for example. KnowEnG provides these users with the latest machine learning and graph mining analysis tools in the form of common bioinformatics pipelines. For example, our sample clustering pipeline can be used for cancer patient subtype stratification using non-negative matrix factorization. The results of all the pipelines can be explored through an intuitive user interface. For each of our pipelines, we incorporate novel analysis methods that explicitly use prior biological knowledge. We collect and standardize gene and protein interactions and annotations from many public databases. We call this data the Knowledge Network and make it available for computation through the cloud. The analysis pipelines then incorporate the data of the Knowledge Network; in this example, our subtype stratification starts with a network smoothing step.

Genomic Data Analysis Using Prior Knowledge in a Scalable Cloud
[Slide diagram repeated from the previous slide]
We have worked to deploy the KnowEnG pipelines on scalable cloud architectures so that researchers without their own IT departments are able to run large-scale analyses on big user and public datasets. First, tools are placed in Docker containers for easy deployment on cloud resources. Then, all pipelines are structured to maximize independent, parallel computation. When a user requires a large run, we can scale up the size of the compute cloud to ensure that it finishes in a timely manner.

Gene Prioritization Research Application
[Slide text: Amin Emad, "Knowledge-Guided Prioritization of Genes Determinant of Drug Resistance Using ProGENI", Research Highlights Session 2]
Now I would like to focus on one interesting research application and how to perform this analysis in our KnowEnG environment. This research was conducted by Amin Emad in collaboration with the Mayo Clinic, and he will be presenting it in more detail in tomorrow’s Research Highlights session. The main goal of Amin’s research was to identify genes whose mRNA expression levels are related to the variation of drug sensitivity in different cell lines. To recreate Amin’s research, we log into our KnowEnG account and are taken to the platform’s home page. From there we see our previous analysis runs and can jump to the available research pipelines. We are interested in performing gene prioritization using the Knowledge Network.

Genomics of Drug Sensitivity in Cancer Data
[Slide text: GDSC Data. Basal Expression Data: 600 cell lines, 13,000 genes. Drug Sensitivity Data: 600 cell lines, 139 drugs.]
The user spreadsheets for our analysis are drug response datasets from the GDSC. The genomic spreadsheet contains the basal transcription levels for about 600 human cell lines. The response spreadsheet contains drug response values for those cell lines for 139 cytotoxic treatments. We can upload these into the system to begin the analysis.

Gene Prioritization Measure
First we must choose the gene ranking measure to be used by the prioritization pipeline. Even if we were not planning to use the prior information in the Knowledge Network, the standard pipeline with the Absolute Pearson Correlation selected would be similar to the one used in the recent Rees et al. study, and with the Elastic Net selected it would be similar to the one in Barretina et al.
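As an illustration of the Absolute Pearson Correlation measure mentioned here, the following is a minimal sketch that ranks genes by the absolute correlation between their expression and a drug response vector. The function names, gene labels, and toy values are assumptions made for this example; they are not taken from GDSC or from the KnowEnG code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def rank_genes(expression, response):
    """Return gene names sorted by |Pearson r| with the response, best first.

    expression: dict mapping gene -> list of values, one per cell line
    response:   list of drug response values in the same cell-line order
    """
    scores = {g: abs(pearson(v, response)) for g, v in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): four cell lines, three genes.
response = [0.1, 0.2, 0.3, 0.4]
expression = {
    "GENE_A": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated, |r| = 1
    "GENE_B": [2.0, 1.0, 4.0, 3.0],  # moderately correlated, r = 0.6
    "GENE_C": [5.0, 5.0, 5.0, 5.0],  # constant, scored as 0
}
ranking = rank_genes(expression, response)
```

Taking the absolute value means that genes associated with either higher or lower sensitivity are ranked together, which is why the anti-correlated gene comes first in the toy ranking.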

Incorporating the Knowledge Network
[Slide diagram: ProGENI: Network Transform, Prioritization Measure, Network Ranking]
But since we are interested in incorporating the prior knowledge in the Knowledge Network, we use ProGENI, a novel gene prioritization method that Amin developed to do exactly that. ProGENI starts by transforming the gene expression data based on the gene neighborhoods in the Knowledge Network and ends with a Knowledge Network-based ranking of all genes. In his paper, Amin only considered the direct, physical interactions that are experimentally measured and cataloged by the STRING consortium. However, this is just one of many types of gene interactions that are available for analysis in the Knowledge Network.

Robust Prioritization by Resampling
[Slide diagram: ProGENI: Network Transform, Prioritization Measure, Network Ranking]
The gene prioritization pipeline is designed to return a robust ranking of gene features by sampling the dataset many times and finding a consensus ranking. This computationally intensive but trivially parallelizable addition to the pipeline is accommodated by KnowEnG’s scalable cloud infrastructure, so the robust rankings can be calculated in a reasonable time.
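The resampling idea can be sketched as follows. The talk does not say how the individual rankings are combined, so this example assumes a Borda-style aggregation (sum of rank positions over rounds) and subsampling without replacement; the scoring function `abs_covariance`, the parameter names, and the toy data are likewise hypothetical stand-ins.

```python
import random

def abs_covariance(values, response):
    """Toy scoring measure (an assumption): |covariance| with the response."""
    n = len(values)
    mean_v = sum(values) / n
    mean_r = sum(response) / n
    return abs(sum((v - mean_v) * (r - mean_r)
                   for v, r in zip(values, response)) / n)

def rank_once(expression, response, sample_idx, score_fn):
    """Rank genes (best first) using only the sampled cell lines."""
    sub_response = [response[i] for i in sample_idx]
    scores = {
        gene: score_fn([vals[i] for i in sample_idx], sub_response)
        for gene, vals in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def consensus_ranking(expression, response, n_rounds=100, frac=0.8,
                      score_fn=abs_covariance, seed=0):
    """Aggregate rankings from many subsamples by summed rank position."""
    rng = random.Random(seed)
    n = len(response)
    k = max(2, int(frac * n))
    rank_sum = {gene: 0 for gene in expression}
    for _ in range(n_rounds):
        idx = rng.sample(range(n), k)  # each round is independent of the others
        for pos, gene in enumerate(rank_once(expression, response, idx, score_fn)):
            rank_sum[gene] += pos
    return sorted(rank_sum, key=rank_sum.get)  # lowest summed rank first

# Toy data (hypothetical): six cell lines, three genes.
response = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
expression = {
    "STRONG": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "WEAK": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
    "NOISE": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}
consensus = consensus_ranking(expression, response)
```

Because no round depends on any other, the loop body is the unit of work that could be farmed out to separate workers, which is the sense in which the step is trivially parallelizable.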

Running the Pipeline
[Slide labels: Docker Containers; Scheduled by Chronos; Managed by Mesos; Synced to Cloud Storage]
Finally, the user is taken to a page to review their configuration of the pipeline and submit the job. At this point, the workflow is deployed as a series of containerized tasks, which are managed by the Chronos job scheduler framework to run on the compute cluster managed by Apache Mesos. Each task can read and write user and public data from cloud storage. When the results are available, the user is able to access the visualizations produced by the interface.
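To make the idea of a workflow deployed as a series of dependent tasks concrete, here is a small sketch that groups tasks into batches that could run in parallel. It uses only Python's standard `graphlib` module and does not call Chronos or Mesos; the task names and the dependency layout are hypothetical, since the talk does not list the actual KnowEnG tasks.

```python
from graphlib import TopologicalSorter

def parallel_batches(dependencies):
    """Group tasks into batches whose members have no unmet dependencies.

    dependencies: dict mapping task -> collection of tasks it must wait for
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the workflow is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch as finished
    return batches

# Hypothetical workflow: three independent resampling tasks feed one
# consensus step, which feeds the visualization step.
workflow = {
    "clean_spreadsheets": [],
    "resample_1": ["clean_spreadsheets"],
    "resample_2": ["clean_spreadsheets"],
    "resample_3": ["clean_spreadsheets"],
    "consensus_ranking": ["resample_1", "resample_2", "resample_3"],
    "render_heatmap": ["consensus_ranking"],
}
batches = parallel_batches(workflow)
```

In a real deployment a scheduler would launch each task as a container once its parents finish; the batches here only show which tasks are free to run side by side.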

Visualizing the Results
The output visualization of our Gene Prioritization example is a heatmap. The columns are the drugs tested in the GDSC and the rows are the top related genes. The highlighted example demonstrates an instance where a gene was scored as important to a drug with ProGENI but would have been missed in the non-network approach. The interface allows the user to filter the drug columns and the gene rows and reorder them by several criteria. We now have a set of important genes for each cytotoxic treatment, and we want to learn more about them.

Porting Results to Gene Set Characterization
To do this, we save the results of gene prioritization and send them to our second KnowEnG pipeline, Gene Set Characterization. The pipeline launcher for Gene Set Characterization follows the same pattern, and our first step is to select our saved drug response gene sets as our user spreadsheet.

Choosing Public Gene Sets
[Slide labels: Standard GSC Method; Popular Webtools; Annotation Gene Sets; Characteristic Gene Sets; Experimental Gene Sets]
The standard method for gene set characterization is to perform enrichment tests against many collections of public gene sets. This functionality is made available through popular web tools like DAVID and Enrichr. We can select many different collections of public gene sets in the Knowledge Network from many different sources, or use the default collections from Gene Ontology and KEGG pathways.
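The talk does not name the enrichment statistic, so the sketch below assumes one common choice: a one-sided hypergeometric test, which asks how likely an overlap at least as large as the observed one would be by chance. The function name and the toy gene identifiers are hypothetical.

```python
from math import comb

def hypergeom_enrichment(universe, public_set, query_set):
    """One-sided hypergeometric p-value for the overlap of two gene sets.

    universe:   set of all genes considered
    public_set: annotated gene set (e.g. one pathway)
    query_set:  user gene set (e.g. top genes for one drug)
    Returns P(overlap >= observed) when the query is drawn at random.
    """
    public = public_set & universe
    query = query_set & universe
    n_universe = len(universe)
    n_public = len(public)
    n_query = len(query)
    observed = len(public & query)
    total = comb(n_universe, n_query)
    tail = sum(
        comb(n_public, i) * comb(n_universe - n_public, n_query - i)
        for i in range(observed, min(n_public, n_query) + 1)
    )
    return tail / total

# Toy data (hypothetical): 20 genes, a 5-gene public set, a 4-gene query
# that shares 3 genes with the public set.
universe = {f"g{i}" for i in range(20)}
public_set = {"g0", "g1", "g2", "g3", "g4"}
query_set = {"g0", "g1", "g2", "g10"}
p_value = hypergeom_enrichment(universe, public_set, query_set)
```

In practice the test is repeated for every public gene set in a collection, so the resulting p-values normally need a multiple-testing correction before they are interpreted.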

Incorporating the Knowledge Network
[Slide labels: Heterogeneous Edge Types: GO_edge, KEGG_edge, HumanNet_edge]
Again, in addition to using the standard methods, we will use a network ranking approach called DRaWR to find important public gene sets related to our drug response gene sets. This method integrates multiple types of knowledge into a single heterogeneous network and then uses a random walk algorithm to rank the nodes that represent public gene sets by their proximity to the given drug response genes.
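The following is a generic random walk with restart on a small network that mixes gene-to-gene and gene-to-gene-set edges. It is meant only to illustrate the kind of proximity ranking described here, not DRaWR's actual procedure; the edge list, the per-type weights in `EDGE_TYPE_WEIGHT`, the gene-set node names, and the restart probability are all assumptions made for the example.

```python
from collections import defaultdict

def random_walk_with_restart(edges, query, restart_prob=0.5,
                             max_iter=200, tol=1e-12):
    """Visiting probabilities of a walk that keeps restarting at `query`.

    edges: iterable of (node_u, node_v, weight) for an undirected graph
    query: nodes the walker jumps back to with probability `restart_prob`
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w
    nodes = list(adj)
    seeds = {u for u in query if u in adj}
    if not seeds:
        raise ValueError("no query node is present in the network")
    degree = {u: sum(adj[u].values()) for u in nodes}
    restart = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    prob = dict(restart)
    for _ in range(max_iter):
        nxt = {u: restart_prob * restart[u] for u in nodes}
        for u in nodes:
            for v, w in adj[u].items():
                # move along an edge in proportion to its share of u's weight
                nxt[v] += (1.0 - restart_prob) * prob[u] * w / degree[u]
        change = sum(abs(nxt[u] - prob[u]) for u in nodes)
        prob = nxt
        if change < tol:
            break
    return prob

# Toy heterogeneous network (illustrative edges, not from the Knowledge Network).
EDGE_TYPE_WEIGHT = {"GO_edge": 1.0, "KEGG_edge": 1.0, "HumanNet_edge": 1.0}
typed_edges = [
    ("TP53", "GO_apoptosis", "GO_edge"),
    ("BAX", "GO_apoptosis", "GO_edge"),
    ("EGFR", "KEGG_ErbB_signaling", "KEGG_edge"),
    ("ERBB2", "KEGG_ErbB_signaling", "KEGG_edge"),
    ("TP53", "BAX", "HumanNet_edge"),
    ("EGFR", "ERBB2", "HumanNet_edge"),
    ("BAX", "EGFR", "HumanNet_edge"),
]
edges = [(u, v, EDGE_TYPE_WEIGHT[t]) for u, v, t in typed_edges]
scores = random_walk_with_restart(edges, {"TP53", "BAX"})
gene_set_nodes = ["GO_apoptosis", "KEGG_ErbB_signaling"]
ranked_sets = sorted(gene_set_nodes, key=scores.get, reverse=True)
```

With the query genes TP53 and BAX, the gene-set node directly attached to both of them receives more of the walker's probability mass than the one reachable only through other genes, so it is ranked first.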

Visualizing the GSC Results
We submit the pipeline, it is deployed on the compute cloud, and we can see the results when it is completed. Here again, we are given a heatmap, where now the columns are public gene sets and the rows are our drug response related sets. We can again investigate a single result or filter and save the results shown in the interface.

Sample Clustering / Subtype Stratification
As I mentioned in the overview, the third pipeline in KnowEnG is our sample clustering pipeline, which is based on the work of Hofree et al. Given a user spreadsheet of patient somatic mutations, we try to find genomic subtypes that relate to important phenotypic outcomes like survival or treatment response. Like all pipelines, sample clustering can be run in simple or in Knowledge Network-aware modes.
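The consensus clustering step of such a pipeline can be sketched independently of how each individual clustering is produced. The example below assumes that several clustering runs on resampled data have already assigned a label to each sample (or `None` when the sample was left out of that run) and computes how often each pair of samples lands in the same cluster. The run labels are made-up values, and the function name is hypothetical.

```python
def consensus_matrix(runs, n_samples):
    """Fraction of runs in which each pair of samples shares a cluster.

    runs: list of label lists, one per clustering run; labels[i] is the
          cluster of sample i, or None if sample i was not in that run
    Pairs that never appear together in any run get a value of 0.0.
    """
    together = [[0] * n_samples for _ in range(n_samples)]
    counted = [[0] * n_samples for _ in range(n_samples)]
    for labels in runs:
        for i in range(n_samples):
            if labels[i] is None:
                continue
            for j in range(n_samples):
                if labels[j] is None:
                    continue
                counted[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [
        [together[i][j] / counted[i][j] if counted[i][j] else 0.0
         for j in range(n_samples)]
        for i in range(n_samples)
    ]

# Hypothetical labels from three clustering runs over four samples.
# Cluster ids are arbitrary within a run; only co-membership matters.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, None],  # sample 3 was not drawn in this run
    [0, 1, 1, 1],
]
matrix = consensus_matrix(runs, 4)
```

Pairs with values near 1 are consistently grouped together across runs, so a final clustering of this matrix gives subtypes that are less sensitive to any single resample.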

Upcoming Features
Integration with Other Clouds: Import user spreadsheets directly from other cloud-based datasets like TCGA and LINCS
New Workflows:
Gene Regulatory Networks: Model interactions between transcripts and transcription factors
Text Mining: Find genes most specifically related to different disease terminology
Phenotype Prediction: Create models that predict phenotypic outcomes from genomic data
Finally, we are working towards integrating our platform with large cloud datasets like the TCGA and LINCS data that would serve as user spreadsheets for advanced genomic analysis. We are developing additional pipelines for KnowEnG that focus on the common tasks of building gene regulatory networks, ranking genes for diseases based on text mining approaches, and creating models to predict phenotypic outcomes from genotypic data.

KnowEnG Development Team
Thank You! Come see our demo at Poster #75
Research and Design: Saurabh Sinha, Colleen Bushell, Matt Berry, Lisa Gatzke, Amin Emad, Charles Blatti, and Sheng Wang
Pipelines and Infrastructure: Nahil Sobh, Dan Lanier, Milt Epstein, Xi Chen, Suyang Chen, Jing Ge, Pramod Rizal, Omar Sobh, Aidan Epstein, Corey Post
If you are interested in KnowEnG or have any questions, please stop by our demo today or tomorrow at poster #75. Thank you for your attention.
