EE150a – Genomic Signal and Information Processing Seminar series –lectures on first 3 meetings, followed by students presentations –statistical signal.


1 EE150a – Genomic Signal and Information Processing Seminar series –lectures on first 3 meetings, followed by student presentations –statistical signal processing basics –background reading for each meeting Location: Moore 080 (except today) List of papers with links: www.its.caltech.edu/~hvikalo/gsip.html –minor modifications of the list are likely Contact: Haris Vikalo, Moore 125 –Phone: 395-4184 –E-mail: hvikalo@caltech.edu

2 Check the website occasionally for updates and a growing list of research-related links Today’s handouts: –basic course info and a list of papers –R. Karp’s “Mathematical Challenges from Genomics and Molecular Biology” –sign-up sheet Next time: Prof. Vaidyanathan’s lecture on “Signal Processing Problems in Genomics” In two weeks: lecture on DNA microarray technology and novel estimation techniques of gene expression levels Today: introduction with brief overview of the topics for presentation

3 Central Dogma of Molecular Biology Flow of information in a cell: [Due to Francis Crick. It has recently been realized that the dogma requires modifications; more about that later in the course.] Recent development of high-throughput technologies that study the above flow –requires interdisciplinary effort –dealing with a huge amount of information

4 DNA Structure genetics.gsk.com/graphics/dna-big.gif Four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T) Bindings: –A with T (weaker), C with G (stronger) Forms a double helix – nucleotides within each strand are linked via sugar-phosphate bonds (strong), the two strands are linked via hydrogen bonds (weak) Genes are the parts of DNA that encode proteins: –…AACTCGCATCGAACTCTAAGTC…
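The pairing rules on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the hydrogen-bond counts (two per A-T pair, three per C-G pair) are standard textbook values that explain the "weaker" and "stronger" labels above.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds per base pair: 2 for A-T, 3 for C-G (textbook values).
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def reverse_complement(strand):
    """Return the opposite strand, written in its own reading direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bonds(strand):
    """Count the hydrogen bonds holding a strand to its complement."""
    return sum(H_BONDS[base] for base in strand)


# The fragment shown on the slide.
fragment = "AACTCGCATCGAACTCTAAGTC"
print(reverse_complement(fragment))
print(hydrogen_bonds(fragment))
```

Taking the reverse complement twice returns the original strand, which is a quick sanity check on the pairing table.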

5 Sidenote: Sequence Alignment Perhaps the most fundamental operation in bioinformatics –used to decide if two genes or proteins are related by function, structure, or evolutionary history –can identify patterns of conservation and variability Performs pairwise matching between characters of each sequence One place where it is useful: SNP (single-nucleotide polymorphism) detection –SNPs may indicate disease development (myocardial diseases, arthritis, etc. have been associated with SNPs) Sequence alignment is the first student presentation topic in the series (HMM, dynamic programming, Bayesian methods)
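As a rough illustration of the dynamic programming approach mentioned above, here is a minimal sketch of global pairwise alignment in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for this example, not values from the lecture.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of substitute, delete, insert.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = needleman_wunsch("ACGT", "ACCT")
print(total, row_a, row_b)
```

With these scores, "ACGT" and "ACCT" align without gaps and differ at a single position, which is the kind of single-base difference an SNP search looks for.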

6 Details of the information flow Replication of DNA –{A,C,G,T} to {A,C,G,T} Transcription of DNA to mRNA –{A,C,G,T} to {A,C,G,U} Translation of mRNA to proteins –{A,C,G,U} to {20 amino acids} http://www-stat.stanford.edu/~susan/courses/s166/central.gif
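The two alphabet changes above can be sketched as follows. This is a simplified illustration that assumes the input is the coding (sense) strand, that translation starts at the first base, and that the standard genetic code applies; the function names are chosen for this example only.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 RNA codons (U instead of T) to its amino acid.
CODON_TABLE = {
    "".join(codon).replace("T", "U"): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def transcribe(coding_strand):
    """DNA coding strand over {A,C,G,T} to mRNA over {A,C,G,U}."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read mRNA three bases at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```

The table has 64 codons but only 20 distinct amino acids plus the stop signal, so the translation step is many-to-one.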

7 Genes can be turned on and off

8 Microarray Technology A medium for matching known and unknown DNA samples based on hybridization (base-pairing) Two major applications –identification of a sequence (gene or gene mutation) –determination of expression level (abundance) of genes Enables massively parallel gene expression studies Two types of molecules take part in the experiments: –probes, orderly arranged on an array –targets, the unknown samples to be detected

9 Types of Microarrays “Traditionally”, there are two formats: –probe cDNA immobilized to a solid surface using robot spotting and exposed to a set of targets, and –an array of oligonucleotide probes synthesized on chip (via, e.g., photolithography) Targets are typically fluorescently labeled cDNA molecules obtained from mRNA samples –hybridize to their complementary probes –image readout

10 http://pcf1.chembio.ntnu.no/~bka/images/MicroArrays.jpg Illustration: DNA microarray

11 Sample Microarray Readout

12 Some Design Issues Hybridization is the binding of a target to its perfect complement However, when a probe differs from a target's complement by a small number of bases, it still may bind This non-specific binding (cross-hybridization) is a source of measurement noise In special cases (e.g., arrays for gene detection), the designer has a lot of control over the landscape of the probes on the array Second topic for presentations considers a combinatorial design of such arrays [How to deal with cross-hybridization on arrays used for expression level measurements is the topic of the third lecture.]
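A toy model of cross-hybridization is to count the positions at which a probe differs from the perfect complement of a target. The sketch below does this for equal-length sequences. The function names and the mismatch threshold are hypothetical, and real binding depends on thermodynamics that this simple count ignores.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the sequence that base-pairs perfectly with seq."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mismatches(probe, target):
    """Count positions where probe differs from the target's perfect complement."""
    perfect = reverse_complement(target)
    if len(probe) != len(perfect):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(p != q for p, q in zip(probe, perfect))


def may_bind(probe, target, max_mismatches=1):
    """Toy rule: binding is possible when few bases mismatch.

    max_mismatches is an assumed, illustrative threshold.
    """
    return mismatches(probe, target) <= max_mismatches


target = "ACGTAC"
for probe in ("GTACGT", "GTACGA", "CTACGA"):
    print(probe, mismatches(probe, target), may_bind(probe, target))
```

A probe with zero mismatches models specific hybridization; a probe that still passes the threshold with one or more mismatches models the cross-hybridization noise described on the slide.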

13 Clustering Gene Expression Profiles Microarrays measure expression levels of thousands of genes simultaneously For instance, we might take samples at different times during a biological process Cluster data in the expression level space –relatedness in biological function often implies similarity in expression behavior (and vice versa) –similar expression behavior indicates co-expression Clustering of expression level data is one of the topics (traditional statistical methods but also graph-theoretic approach, information-theoretic approach, etc.)

14 Example of Clustering http://www.genomatix.de/gif/node43_documentation.gif Rows: various gene expression levels Columns: Time progression So-called hierarchical clustering
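For readers unfamiliar with hierarchical clustering, the following is a minimal agglomerative sketch in plain Python. Each row is one gene's expression levels over time, as in the figure. The choice of Euclidean distance and average linkage, the function names, and the small data set are all assumptions made for illustration; the lecture does not specify them.

```python
import math


def euclidean(p, q):
    """Distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clustering(profiles, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up data: rows are genes, columns are time points.
profiles = [
    [1.0, 2.0, 3.0],  # gene 0: rising
    [1.1, 2.1, 2.9],  # gene 1: rising
    [3.0, 2.0, 1.0],  # gene 2: falling
    [2.9, 1.9, 1.2],  # gene 3: falling
]
print(hierarchical_clustering(profiles, 2))
```

On this toy data the two rising genes end up in one cluster and the two falling genes in the other. Recording the order of merges, rather than stopping at a fixed number of clusters, would give the tree usually drawn beside such heat maps.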

15 Co-regulated genes Co-expressed genes may be co-regulated –a combination of transcription factors (activating or repressing proteins) regulates genes jointly Finding binding sites (control regions) of co-regulated genes is another topic HMM, probabilistic methods (EM, Gibbs sampling)

16 Genetic Regulatory Networks Proteins take part in the gene regulation –feedback loop in the Central Dogma information flow Thus to fully understand gene regulation, we need to consider interactions –DNA, RNA, proteins, small molecules Requires network formalism –directed graphs, Boolean networks, Bayesian networks, differential equations etc. Explore some of these models in gene regulation context
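Of the formalisms listed above, a Boolean network is the simplest to sketch: each gene is either on or off, and its next state is a logical function of the current states. The three-gene network below is entirely hypothetical and uses synchronous updates, which is one common modeling choice among several.

```python
def step(state, rules):
    """Apply every update rule to the current state at the same time."""
    return {gene: rule(state) for gene, rule in rules.items()}


def simulate(state, rules, steps):
    """Return the list of states visited, starting with the initial one."""
    trajectory = [dict(state)]
    for _ in range(steps):
        state = step(state, rules)
        trajectory.append(state)
    return trajectory


# Hypothetical network: C represses A, A activates B, A and B jointly activate C.
rules = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] and s["B"],
}

start = {"A": True, "B": False, "C": False}
for t, state in enumerate(simulate(start, rules, 5)):
    print(t, state)
```

Because C feeds back onto A, this toy network never settles: it returns to its starting state after five updates and then repeats, a small example of the feedback loop mentioned on the slide.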

17 An Illustration of a Regulatory Network

18 Protein Translation/Folding [Should time permit.] Sequence-structure relationship will play a very important role in the postgenomic era –potentially great impact on genetics and pharmaceutical chemistry, protein design –diseases such as Alzheimer’s are believed to be related to protein misfolding Computationally very hard –parallel, distributed computing

19 Genomic data fusion Consider the problem of classification of a protein and assume that we know: –original gene sequence encoding the protein –gene expression levels –some of the protein-protein interactions Question: how to combine various types of data to classify the protein The last (right now…) topic of the seminar will be data fusion of the various genomic data listed above –efficient statistical learning algorithm based on convex optimization

20 Summary Trying to understand gene regulation Recent technologies revolutionized research –huge amount of data Multidisciplinary; identify opportunities Challenging problems, quite important: –understanding information processes at the genetic level gives insights about phenotypic effects (disease) –some of the ultimate goals are molecular diagnostics and creating personalized drugs

