www.unil.ch/cbg (http://www2.unil.ch/cbg/index.php?title=UNIL_MSc_course:_%22Case_studies_in_bioinformatics_2017%22)


Sven David Giovanni

“… published bioinformatics analyses will be reexamined critically in a hands-on fashion.”
“… quite common that Master or PhD projects build on existing work.”
“… provide written report on one of the modules”
“this course puts emphasis on developing the analysis and programming skills”

Module 1: Is the hourglass model for gene expression really supported by the experimental data?

Module 3: Binding specificity in protein interactions
Biological systems are extremely complex, with thousands of proteins and other molecules in every cell. A key property of these proteins is their ability to bind their cognate partners with high specificity, despite the sea of other molecules surrounding them. Understanding how different proteins recognize their partners is therefore fundamental to a complete view of cellular mechanisms.
[Figure labels: proteins, lipids, RNA, ions, DNA, metabolites]

What you will learn (Module 4):
The problem: Why are certain alterations in cancer mutually exclusive? And why is it interesting to identify them?
The approach: How do we identify mutually exclusive alterations?
The computational challenge: How do our assumptions on “what is expected” (null model design) affect our results?

Module 4: Significant mutual exclusivity between alterations in cancer
[Figure labels: KRAS, HRAS, NRAS, BRAF, cancer]
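To make the null-model question for Module 4 concrete, here is a minimal sketch of a permutation test for mutual exclusivity between two alterations. It is not the method used in the course: the null model (shuffling one gene's alteration labels uniformly across samples), the function names, and the toy 0/1 vectors are all assumptions chosen for illustration. A different null model, for example one that preserves the mutation load of each sample, can give different p-values.

```python
import random

def co_occurrence(a, b):
    """Count samples in which both alterations are present."""
    return sum(1 for x, y in zip(a, b) if x and y)

def mutual_exclusivity_pvalue(a, b, n_perm=2000, seed=0):
    """One-sided permutation p-value: how often does a random assignment
    of b's alterations overlap with a as little as (or less than) observed?"""
    rng = random.Random(seed)
    observed = co_occurrence(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # null model: labels of b are exchangeable
        if co_occurrence(a, shuffled) <= observed:
            hits += 1
    # add-one correction so the p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Toy data (made up): 20 samples, each gene altered in 8, never together.
gene_a = [1] * 8 + [0] * 12
gene_b = [0] * 8 + [1] * 8 + [0] * 4

p_value = mutual_exclusivity_pvalue(gene_a, gene_b)
print(f"co-occurrences: {co_occurrence(gene_a, gene_b)}, p = {p_value:.4f}")
```

A small p-value here only says the overlap is lower than expected under this particular null model.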

What you will learn (Module 3):
How to model protein-protein interactions and binding specificity (probabilistic model).
How to cluster proteins based on different properties (sequence similarity / binding specificity).
How to predict new protein interactions using the sequence of proteins.
Here we will see how one can model the binding specificity of some proteins using only sequence information. We will also see how we can use these models to classify proteins, not only based on their sequence but also based on their functional properties. Finally we will briefly discuss how we can use this information to predict new protein-protein interactions.
[Figure: example peptide sequence ESFLTWL]
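One simple probabilistic model of binding specificity built from sequence alone is a position weight matrix (PWM). The sketch below is an assumption about what such a model could look like, not the model taught in Module 3. Apart from ESFLTWL, which appears on the slide, the peptides are made up, and the pseudocount and the uniform background frequency are arbitrary choices.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pwm(peptides, pseudocount=1.0):
    """Per-position amino acid probabilities from equal-length peptides."""
    length = len(peptides[0])
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        pwm.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return pwm

def log_odds_score(pwm, peptide, background=1 / 20):
    """Sum of log2(probability / background) over all positions."""
    return sum(math.log2(pwm[i][aa] / background) for i, aa in enumerate(peptide))

# Hypothetical set of peptides assumed to bind the same protein domain.
binders = ["ESFLTWL", "ESFLSWL", "DSFLTWL"]
pwm = build_pwm(binders)

score_binder = log_odds_score(pwm, "ESFLTWL")
score_other = log_odds_score(pwm, "AAAAAAA")
print(f"ESFLTWL: {score_binder:.2f}, AAAAAAA: {score_other:.2f}")
```

Scoring new peptides against the PWMs of several proteins is one way to compare proteins by binding specificity rather than by sequence similarity.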

Module 1: Is the hourglass model for gene expression really supported by the experimental data?

Let’s get going!
Task 1: Get the relevant data
  Gene expression data (GSE24616)
  Age index (aka ps = phylostrata)
Task 2: Compute TAI (transcriptome age index)
  Convert expression data from probes to genes
  Match gene IDs across the 2 datasets
  Compute weighted sum
Task 3: Reproduce Figure 1a
Task 4: Critical re-analysis

Find data on the web
Look for a Python package for the Gene Expression Omnibus (GEO) database
Age index??

Import data and extract relevant information
Steps:
  Install and import Python packages
  Download the experimental data used in the paper
  Save it under a variable name (for example gse)
An example can be found at: https://geoparse.readthedocs.io/en/latest/Analyse_hsa-miR-124a-3p_transfection_time-course.html
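The steps above can be sketched as follows. This assumes the third-party GEOparse package is installed (pip install GEOparse) and that network access is available. The helper names load_series and summarize are made up for illustration; GEOparse.get_GEO and the gsms and gpls attributes come from GEOparse itself.

```python
def load_series(accession="GSE24616", destdir="./"):
    """Download (or reuse a cached copy of) a GEO series and parse it."""
    import GEOparse  # imported here so the module loads without the package
    return GEOparse.get_GEO(geo=accession, destdir=destdir)

def summarize(gse):
    """Count the samples (gsms) and platforms (gpls) in a parsed series."""
    return {"n_samples": len(gse.gsms), "n_platforms": len(gse.gpls)}

# Usage (downloads the series on first call):
# gse = load_series()
# print(summarize(gse))
```

The download can take a while the first time. With destdir set, the downloaded file is kept on disk and reused on later calls.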

Look at the gse data
PLATFORM (gpls) and SAMPLE (gsms) of the GEOparse object:
  Type of data
  Information inside: Metadata? Columns? Table?
Useful command: head()

>>> gse.gsms['GSM607008']
>>> gse.gsms['GSM607008'].head()
SAMPLE GSM607008
 - Metadata:
!Sample_title = adult_1y2m_male_rep2
!Sample_geo_accession = GSM607008
!Sample_characteristics_ch1 = strain: wild type
!Sample_characteristics_ch1 = developmental stage: adult
!Sample_characteristics_ch1 = developmental timing: 1y2m
 - Columns:
          description
ID_REF
VALUE     processed Cy3 signal intensity
 - Table:
   Index  ID_REF         VALUE
       0       1  20457.390000
       1       2      2.546218
       2       3      2.938484
       3       4      9.200152
       4       5      2.611924

Sample information extraction
Info needed to reproduce the figure:
  Sample name
  Stage
  Time
  Gender (only female and mixed are used)
Hint: Look into the metadata: characteristics_ch1, name

Steps:
  Create an empty dictionary
  Using a for loop, fill the dictionary with the information for each sample
Useful commands: append() split() strip()
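A minimal sketch of this loop, run on plain dictionaries that mimic the metadata shown for GSM607008 so that it works without downloading anything. With real data you would loop over gse.gsms.items() and read gsm.metadata instead. The second sample is made up, and taking the gender from the sample title is an assumption based on the one title shown above.

```python
# Toy stand-in for {name: gsm.metadata for name, gsm in gse.gsms.items()}.
samples = {
    "GSM607008": {
        "title": ["adult_1y2m_male_rep2"],
        "characteristics_ch1": [
            "strain: wild type",
            "developmental stage: adult",
            "developmental timing: 1y2m",
        ],
    },
    "GSM_EXAMPLE": {  # hypothetical sample, not a real accession
        "title": ["embryo_10h_mixed_rep1"],
        "characteristics_ch1": [
            "strain: wild type",
            "developmental stage: embryo",
            "developmental timing: 10h",
        ],
    },
}

SEXES = {"male", "female", "mixed"}
info = {"sample": [], "stage": [], "time": [], "sex": []}

for name, metadata in samples.items():
    # Turn "key: value" strings into a small lookup table.
    fields = {}
    for entry in metadata["characteristics_ch1"]:
        key, value = entry.split(":", 1)
        fields[key.strip()] = value.strip()
    # Assumption: the title contains exactly one of male/female/mixed.
    tokens = metadata["title"][0].split("_")
    sex = next((tok for tok in tokens if tok in SEXES), None)

    info["sample"].append(name)
    info["stage"].append(fields.get("developmental stage"))
    info["time"].append(fields.get("developmental timing"))
    info["sex"].append(sex)

print(info)
```

A dictionary of equal-length lists like this can be passed directly to pandas.DataFrame to get one row per sample.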

Expression data
Extract information for only one sample
  Example: gsm = gse.gsms['GSM607008']
Extract information for all samples
  Hint: GEOparse documentation, GEOparse example

expression_data = gse.pivot_samples('VALUE')

Match age index and expression data
Look at the data and find the information that allows matching the two datasets
  Hint: rename the rows in both datasets
Match the datasets:
  Useful commands: join() groupby() last()
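The matching step could look like the sketch below, which uses small made-up tables in place of the real ones. The column names (gene, ps), the probe-to-gene mapping, and all values are assumptions. In the real data the mapping would come from the platform annotation and the age index from the phylostrata file.

```python
import pandas as pd

# Expression matrix as returned by pivot_samples: probes x samples (toy values).
expression = pd.DataFrame(
    {"GSM_A": [10.0, 12.0, 3.0], "GSM_B": [11.0, 13.0, 4.0]},
    index=pd.Index([1, 2, 3], name="ID_REF"),
)

# Hypothetical probe -> gene annotation, indexed like the expression matrix.
probe_to_gene = pd.Series(
    ["geneX", "geneX", "geneY"],
    index=pd.Index([1, 2, 3], name="ID_REF"),
    name="gene",
)

# Hypothetical age index (phylostratum per gene); geneY has no entry.
age_index = pd.DataFrame({"ps": [1, 12]}, index=pd.Index(["geneX", "geneZ"], name="gene"))

# Step 1: attach the gene ID to every probe (join on the shared index).
annotated = expression.join(probe_to_gene)

# Step 2: keep only probes whose gene has a phylostratum (inner join on gene).
matched = annotated.join(age_index, on="gene", how="inner")
print(matched)
```

Using an inner join silently drops genes that are missing from either dataset, so it is worth checking how many rows survive.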

Select data
One gene can have multiple probes
  For multiple probes, take the mean
  Useful commands: groupby() mean()
Only use mixed and female samples
  Hint: create a data frame of the characteristics and select the mixed and female samples
Create a new column classifying the timeline of the development
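Both selections can be sketched with toy data as follows. The sample names, the sex column, and all values are made up, and the real characteristics table may be organised differently.

```python
import pandas as pd

# Probe-level expression with a gene column (toy values).
probe_level = pd.DataFrame(
    {
        "gene": ["geneX", "geneX", "geneY"],
        "S1": [10.0, 12.0, 3.0],
        "S2": [11.0, 13.0, 4.0],
        "S3": [9.0, 11.0, 5.0],
    }
)

# One row per gene: average all probes that map to the same gene.
gene_level = probe_level.groupby("gene").mean()

# Hypothetical sample characteristics, indexed by sample name.
characteristics = pd.DataFrame(
    {"sex": ["male", "female", "mixed"]},
    index=["S1", "S2", "S3"],
)

# Keep only the female and mixed samples, then subset the expression columns.
keep = characteristics[characteristics["sex"].isin(["female", "mixed"])]
selected = gene_level[keep.index]
print(selected)
```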

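Code hint (probably needed):
#### sort time
# unique() keeps the order of first appearance, so the numbering follows
# developmental order only if the rows are already in that order
char_df_mixed['timing_number'] = 0
time_stamps = char_df_mixed.time.unique()
for i in range(len(time_stamps)):
    char_df_mixed.loc[char_df_mixed.time == time_stamps[i], 'timing_number'] = i + 1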

Select samples with the same time point and average them
Useful commands to select the sample IDs: reset_index() groupby() apply(lambda x: np.array(x))
Useful command to calculate the mean: for loop
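One possible way to average replicates of the same time point is sketched here with toy data. The table layout (samples as the index of the characteristics table, a timing_number column) follows the hints above, but the names and values are assumptions.

```python
import pandas as pd

# Gene-level expression: genes x samples (toy values).
gene_level = pd.DataFrame(
    {"S1": [1.0, 2.0], "S2": [3.0, 4.0], "S3": [10.0, 20.0]},
    index=["geneX", "geneY"],
)

# Hypothetical characteristics: S1 and S2 are replicates of time point 1.
char_df = pd.DataFrame(
    {"timing_number": [1, 1, 2]},
    index=pd.Index(["S1", "S2", "S3"], name="sample"),
)

# Collect the sample IDs that belong to each time point.
sample_ids = char_df.reset_index().groupby("timing_number")["sample"].apply(list)

# For each time point, average the expression over its samples.
stage_means = pd.DataFrame(
    {timing: gene_level[ids].mean(axis=1) for timing, ids in sample_ids.items()}
)
print(stage_means)
```

The result has one column per time point, which is the shape needed for the TAI calculation.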

Calculate the TAI (transcriptome age index)
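The TAI of a stage s is the expression-weighted mean of the phylostrata: TAI_s = sum_i(ps_i * e_is) / sum_i(e_is), where ps_i is the phylostratum of gene i and e_is its expression at stage s. A minimal sketch with made-up numbers follows. The function name and the toy data are assumptions, and whether to use raw or log-transformed expression is left open, since that choice is part of the critical re-analysis.

```python
import pandas as pd

def transcriptome_age_index(expression, phylostrata):
    """Expression-weighted mean phylostratum for each column (stage).

    expression: DataFrame, genes x stages
    phylostrata: Series, phylostratum per gene
    """
    common = expression.index.intersection(phylostrata.index)
    expr = expression.loc[common]
    ps = phylostrata.loc[common]
    weighted = expr.mul(ps, axis=0)  # multiply each gene's row by its ps
    return weighted.sum(axis=0) / expr.sum(axis=0)

# Toy example: two genes, two stages.
expression = pd.DataFrame(
    {"t1": [1.0, 3.0], "t2": [2.0, 2.0]},
    index=["geneX", "geneY"],
)
phylostrata = pd.Series({"geneX": 1, "geneY": 3})

tai = transcriptome_age_index(expression, phylostrata)
# t1: (1*1 + 3*3) / (1 + 3) = 2.5
# t2: (1*2 + 3*2) / (2 + 2) = 2.0
print(tai)
```

A lower TAI means the transcriptome at that stage is dominated by evolutionarily older genes.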