Knowledge Engine for Genomics (KnowEnG):

Presentation transcript:

Knowledge Engine for Genomics (KnowEnG): Cloud-based Environment for Scalable Analyses of Genomic Signatures. Charles Blatti, Postdoctoral Research Associate, KnowEnG Center of Excellence in Big Data Computing, University of Illinois at Urbana-Champaign. November 2016. Hello, my name is Charles Blatti. I am a postdoctoral researcher with KnowEnG at the University of Illinois. I am here today to present our Knowledge Engine for Genomics platform.

Genomic Data Analysis Using Prior Knowledge in a Scalable Cloud. User Spreadsheet (genes by samples): RNA-seq, somatic mutations, etc. Knowledge Network: physical interactions, co-expression, pathways, biological processes, text mining, etc. Analysis Pipelines: Subtype Stratification (Network Smoothing, NMF Clustering, Consensus Clustering). User Interface. The primary purpose of the KnowEnG platform is to enable analysis of genomic data using prior knowledge on a scalable cloud infrastructure. We aim to build a tool that can serve the medical or biological researcher who has genomic profiles from multiple patients or samples. This user spreadsheet may contain RNA-seq or somatic mutation data, for example. KnowEnG provides these users with the latest machine learning and graph mining analysis tools in the form of common bioinformatics pipelines. For example, our sample clustering pipeline can be used for cancer patient subtype stratification using non-negative matrix factorization. The results of all the pipelines can be explored through an intuitive user interface. For each of our pipelines, we incorporate novel analysis methods that explicitly use prior biological knowledge. We collect and standardize gene and protein interactions and annotations from many public databases. We call this data the Knowledge Network and make it available for computation through the cloud. The analysis pipelines then incorporate the data of the Knowledge Network; in this example, our subtype stratification starts with a network smoothing step.

Genomic Data Analysis Using Prior Knowledge in a Scalable Cloud. We have worked to deploy the KnowEnG pipelines on scalable cloud architectures, so that researchers without their own IT departments are able to run large-scale analyses on big user and public datasets. First, tools are placed in Docker containers for easy deployment on cloud resources. Then, all pipelines are structured to maximize independent, parallel computation. When a user requires a large run, we can scale up the size of the compute cloud to ensure that it finishes in a timely manner.

Gene Prioritization Research Application. Amin Emad, "Knowledge-Guided Prioritization of Genes Determinant of Drug Resistance Using ProGENI," Research Highlights Session 2. Now I would like to focus on one interesting research application and how to perform this analysis in our KnowEnG environment. This research was conducted by Amin Emad in collaboration with the Mayo Clinic, and he will be presenting it in more detail in tomorrow’s Research Highlights session. The main goal of Amin’s research was to identify genes whose mRNA expression levels are related to the variation of drug sensitivity in different cell lines. To recreate Amin’s research, we log into our KnowEnG account and are taken to the platform’s home page. From there we see our previous analysis runs and can jump to the available research pipelines. We are interested in performing gene prioritization using the Knowledge Network.

Genomics of Drug Sensitivity in Cancer (GDSC) Data. Basal expression data: 600 cell lines, 13,000 genes. Drug sensitivity data: 600 cell lines, 139 drugs. The user spreadsheets for our analysis are drug response datasets from the GDSC. The genomic spreadsheet contains the basal transcription levels for about 600 human cell lines. The response spreadsheet contains drug response values for those cell lines for 139 cytotoxic treatments. We can upload these into the system to begin the analysis.

Gene Prioritization Measure. First we must choose the gene ranking measure to be used by the prioritization pipeline. Even if we were not planning to use the prior information in the Knowledge Network, the standard pipeline with the Absolute Pearson Correlation selected would be similar to the one used in the recent Rees et al. study, and with the Elastic Net selected it would be similar to the one in Barretina et al.
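To make the absolute Pearson correlation measure concrete, here is a minimal sketch of ranking genes by how strongly their expression tracks a drug response across cell lines. The function names and the toy data are hypothetical and are not taken from the KnowEnG code base; the sketch only illustrates the non-network baseline measure named above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_genes_by_abs_pearson(expression, response):
    """Return (gene, score) pairs sorted from most to least correlated.

    expression: dict mapping gene name -> expression values, one per cell line
    response:   drug response values for the same cell lines, in the same order
    """
    scores = {gene: abs(pearson(values, response))
              for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): three genes measured in four cell lines.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],  # rises with the response
    "GENE_B": [4.0, 3.0, 2.0, 1.0],  # falls as the response rises
    "GENE_C": [1.0, 3.0, 1.0, 3.0],  # only weakly related
}
response = [0.1, 0.2, 0.3, 0.4]
ranking = rank_genes_by_abs_pearson(expression, response)
```

Because the absolute value is taken, GENE_A and GENE_B both score close to 1 even though one is positively and the other negatively associated with the response, while GENE_C lands at the bottom of the ranking.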

Incorporating the Knowledge Network. ProGENI: Network Transform, Prioritization Measure, Network Ranking. But since we are interested in incorporating the prior knowledge in the Knowledge Network, Amin developed ProGENI, a novel method for gene prioritization that does so. ProGENI starts by transforming the gene expression data based on the gene neighborhoods in the Knowledge Network and ends with a Knowledge Network-based ranking of all genes. In his paper, Amin only considered the direct, physical interactions that are experimentally measured and cataloged by the STRING consortium. However, this is just one of many types of gene interactions that are available for analysis in the Knowledge Network.

Robust Prioritization by Resampling. ProGENI: Network Transform, Prioritization Measure, Network Ranking. The gene prioritization pipeline is designed to return a robust ranking of gene features by sampling the dataset many times and finding a consensus ranking. This addition to the pipeline is computationally intensive but trivially parallelizable, so KnowEnG’s scalable cloud infrastructure can accommodate it and calculate the robust rankings in reasonable time.
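The resampling idea can be sketched as follows. This is only an illustration under stated assumptions: the talk does not say how samples are drawn or how rankings are combined, so the sketch assumes subsampling of cell lines without replacement, absolute Pearson correlation as the per-round measure, and Borda-style rank summation as the consensus rule. All names are hypothetical.

```python
import random
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation (0.0 if either list is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov / sqrt(var_x * var_y))


def consensus_ranking(expression, response, n_rounds=50, fraction=0.8, seed=0):
    """Rank genes on many random subsets of samples and combine the rankings.

    Assumption: each round draws a subset of samples without replacement,
    and rankings are combined by summing Borda points (top gene gets the most).
    """
    rng = random.Random(seed)
    genes = list(expression)
    n_samples = len(response)
    subset_size = min(n_samples, max(2, int(fraction * n_samples)))
    borda = {gene: 0 for gene in genes}
    for _ in range(n_rounds):  # every round is independent of the others
        idx = rng.sample(range(n_samples), subset_size)
        sub_response = [response[i] for i in idx]
        scores = {gene: abs_pearson([expression[gene][i] for i in idx], sub_response)
                  for gene in genes}
        ordered = sorted(genes, key=scores.get, reverse=True)
        for position, gene in enumerate(ordered):
            borda[gene] += len(genes) - position
    return sorted(genes, key=borda.get, reverse=True)


# Toy data (hypothetical): six cell lines, three genes.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # tracks the response exactly
    "GENE_B": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],  # tracks it with some noise
    "GENE_N": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],  # mostly noise
}
response = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
consensus = consensus_ranking(expression, response)
```

Since the rounds do not depend on one another, each one could be dispatched as a separate task, which is what makes this step easy to spread across a compute cluster.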

Running the Pipeline. Docker containers, scheduled by Chronos, managed by Mesos, synced to cloud storage. Finally, the user is taken to a page to review their configuration of the pipeline and submit the job. At this point, the workflow is deployed as a series of containerized tasks, which are scheduled by the Chronos job scheduler framework to run on the compute cluster managed by Apache Mesos. Each task can read and write user and public data from cloud storage. When the results are available, the user is able to access the visualizations produced by the interface.

Visualizing the Results. The output visualization of our Gene Prioritization example is a heatmap. The columns are the drugs tested in the GDSC and the rows are the top related genes. The highlighted example demonstrates an instance where a gene was scored as important to a drug with ProGENI, but would have been missed in the non-network approach. The interface allows the user to filter the drug columns and the gene rows and reorder them by several criteria. We now have a set of important genes for each cytotoxic treatment and we want to learn more about them.

Porting Results to Gene Set Characterization. To do this, we save the results of gene prioritization and send them to our second KnowEnG pipeline, Gene Set Characterization. The pipeline launcher for Gene Set Characterization follows the same pattern, and our first step is to select our saved drug response gene sets as our user spreadsheet.

Choosing Public Gene Sets. Standard GSC Method. Popular Webtools. Annotation Gene Sets, Characteristic Gene Sets, Experimental Gene Sets. The standard method for doing gene set characterization is to perform an enrichment test against many collections of public gene sets. This functionality is made available through popular webtools like DAVID and Enrichr. We can select many different collections of public gene sets in the Knowledge Network from many different sources, or use the default collections from Gene Ontology and KEGG pathways.
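As an illustration of the standard enrichment test, the sketch below computes a one-sided hypergeometric p-value for the overlap between a query gene set and each public gene set. The talk does not specify which test statistic the pipeline uses, so the hypergeometric test is an assumption here, and the function names and toy gene sets are hypothetical. No multiple-testing correction is applied in this sketch.

```python
from math import comb


def hypergeom_tail(universe_size, set_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn without replacement
    from a universe in which set_size genes belong to the public gene set."""
    total = comb(universe_size, query_size)
    upper = min(set_size, query_size)
    tail = sum(comb(set_size, k) * comb(universe_size - set_size, query_size - k)
               for k in range(overlap, upper + 1))
    return tail / total


def enrich(query, gene_sets, universe):
    """Return (set name, overlap size, p-value) tuples, most significant first."""
    universe = set(universe)
    query = set(query) & universe
    results = []
    for name, members in gene_sets.items():
        members = set(members) & universe
        overlap = len(query & members)
        p_value = hypergeom_tail(len(universe), len(members), len(query), overlap)
        results.append((name, overlap, p_value))
    return sorted(results, key=lambda row: row[2])


# Toy data (hypothetical): 20 genes, two public gene sets of five genes each.
universe = [f"g{i}" for i in range(20)]
gene_sets = {
    "SET_A": ["g0", "g1", "g2", "g3", "g4"],
    "SET_B": ["g5", "g6", "g7", "g8", "g9"],
}
query = ["g0", "g1", "g2", "g3", "g10"]
results = enrich(query, gene_sets, universe)
```

Here four of the five query genes fall in SET_A and none in SET_B, so SET_A comes out first with a small p-value while SET_B receives a p-value of 1.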

Incorporating the Knowledge Network. Heterogeneous edge types: GO_edge, KEGG_edge, HumanNet_edge. Again, in addition to using the standard methods, we will use a network ranking approach called DRaWR to find important public gene sets related to our drug response gene sets. This method integrates multiple types of knowledge into a single heterogeneous network and then uses a random walk algorithm to rank the nodes that represent public gene sets by their proximity to the given drug response genes.
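The general idea of ranking gene set nodes by random-walk proximity to a query can be sketched with a plain random walk with restart on a small mixed graph. This is a simplified illustration, not the DRaWR implementation: it treats all edge types as unweighted and undirected, and the node names, edge list, and parameter values are hypothetical.

```python
def random_walk_with_restart(edges, restart_nodes, restart_prob=0.5, n_iter=100):
    """Approximate stationary visiting probabilities of a walker that, at each
    step, jumps back to a random query node with probability restart_prob and
    otherwise moves to a uniformly chosen neighbor of its current node."""
    neighbors = {}
    for u, v in edges:  # undirected, unweighted edges (a simplification)
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = sorted(neighbors)
    restart = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0)
               for n in nodes}
    prob = dict(restart)
    for _ in range(n_iter):
        nxt = {n: restart_prob * restart[n] for n in nodes}
        for u in nodes:
            share = (1.0 - restart_prob) * prob[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        prob = nxt
    return prob


# Toy heterogeneous network (hypothetical): gene-to-gene-set annotation edges
# of two types plus one gene-gene interaction edge.
edges = [
    ("g1", "GO:setA"), ("g2", "GO:setA"),      # GO-style annotation edges
    ("g3", "KEGG:setB"), ("g4", "KEGG:setB"),  # KEGG-style annotation edges
    ("g2", "g3"),                              # gene-gene interaction edge
]
query_genes = {"g1", "g2"}  # stands in for a drug response gene set
scores = random_walk_with_restart(edges, query_genes)

# Rank only the nodes that represent public gene sets.
set_nodes = ["GO:setA", "KEGG:setB"]
ranked_sets = sorted(set_nodes, key=scores.get, reverse=True)
```

In this toy graph the walker restarts at g1 and g2, both of which are annotated to GO:setA, so that set node accumulates more probability than KEGG:setB, which is only reachable through the g2 to g3 interaction.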

Visualizing the GSC Results. We submit the pipeline, it is deployed on the compute cloud, and we can see the results when it is completed. Here again, we are given a heatmap, where now the columns are public gene sets and the rows are our drug response related sets. Again, we can investigate a single result or filter and save the results shown in the interface.

Sample Clustering / Subtype Stratification. As I mentioned in the overview, the third pipeline in KnowEnG is our sample clustering pipeline, which is based on the work in Hofree et al. Given a user spreadsheet of patient somatic mutations, we try to find genomic subtypes that relate to important phenotypic outcomes like survival or treatment response. Like all pipelines, sample clustering can be run in simple or in Knowledge Network-aware modes.

Upcoming Features. Integration with Other Clouds: import user spreadsheets directly from other cloud-based datasets like TCGA and LINCS. New Workflows: Gene Regulatory Networks (model interactions between transcripts and transcription factors); Text Mining (find genes most specifically related to different disease terminology); Phenotype Prediction (create a model that predicts phenotypic outcomes from genomic data). Finally, we are working towards integrating our platform with large cloud datasets like the TCGA and LINCS data that would serve as user spreadsheets for advanced genomic analysis. We are developing additional pipelines for KnowEnG that focus on the common tasks of building gene regulatory networks, ranking genes for diseases based on text mining approaches, and creating models to predict phenotypic outcomes from genotypic data.

KnowEnG Development Team. Thank You! Come see our demo at Poster #75. If you are interested in KnowEnG or have any questions, please stop by our demo today or tomorrow at poster #75. Thank you for your attention. Research and Design: Saurabh Sinha, Colleen Bushell, Matt Berry, Lisa Gatzke, Amin Emad, Charles Blatti, and Sheng Wang. Pipelines and Infrastructure: Nahil Sobh, Dan Lanier, Milt Epstein, Xi Chen, Suyang Chen, Jing Ge, Pramod Rizal, Omar Sobh, Aidan Epstein, and Corey Post.