1
The University of Kansas. EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006.
2
Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2 8/21/2006 Introduction Administrative Register for 3 hours of credit
3
Me: Luke Huan, assistant professor in Electrical Engineering & Computer Science. Homepage: http://people.eecs.ku.edu/~jhuan/ Office: 2304 Eaton Hall. Email: jhuan@eecs.ku.edu Office hours: 10:00 – 11:00 am, Monday and Wednesday.
4
My Lecture Style: I tend to talk fast, especially when excited. Class materials are highly interdisciplinary. Use your questions to slow me down: ask for clarification, for a strange phrase to be repeated, or for jargon to be explained. “If in doubt, speak it out.”
5
You: introduce yourself. Who you are; what department you are in; why you are taking the course.
6
Outline for Today: What is mining biological data? What is this course about? Course home page. Course references. Paper presentation. Final project. Grading. Forward class reviewing.
7
What is Mining Biological Data? Goal: understanding the structure of biological data through patterns, descriptive models, and predictive models. Challenges: What is the nature of the data? What are the computational tasks? How do we break a task into a group of computational components? How do we evaluate the computational results? Applications: experimental design and hypothesis generation; synthesis of novel proteins; drug design; …
8
What is this Course About? Learning: problems in mining biological data; available techniques and their pros and cons; how to combine techniques; enough insight to avoid pitfalls. Practicing: presenting recent papers on a selected topic; working on a project that may involve a domain expert, a driving biological problem, and the development of new data mining techniques.
9
Class Information. Class homepage: http://people.eecs.ku.edu/~jhuan/fall06.html Meeting time: 9:00 – 9:45, Monday, Wednesday, Friday. Meeting place: Eaton Hall 2001. Prerequisite: none.
10
Textbook & References. Textbook: none. References: (1) Data Mining: Concepts and Techniques, by Han and Kamber, Morgan Kaufmann, 2001 (ISBN: 1-55860-489-8). (2) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman, Springer, 2001 (ISBN: 0-387-95284-5). (3) Bioinformatics: Genes, Proteins, and Computers, edited by Christine Orengo, David Jones, and Janet Thornton, Bios Scientific Publishers, 2003 (ISBN: 1-85996-0545).
11
Paper Presentation. One per student, on one or more research papers. A list of recommendations will be posted on the class webpage a week from now; you may also pick your own (upon approval). Three parts: review the goal of the paper(s); discuss the research challenges; present the techniques and comment on their pros and cons. Questions and comments from the audience. Extra credit for active participants in class discussions. Order of presentation: first come, first pick. Please send in your choice of paper by September 1st.
12
Final Project (due Nov. 27th). One project. I will post some suggestions on the class website; I am soliciting projects from researchers on campus. You are welcome to propose your own; discuss it with me before you start. Checkpoints: proposal with title and goal (due Sep. 8th); background and related work (due Sep. 29th); outline of approach (due Oct. 20th); implementation and evaluation (due Nov. 10th); class demo (due Nov. 27th).
13
Grading scheme: no homework; no exam. Paper presentation and discussion: 45%. Project: 45%. Attendance and participation: 10%.
14
Forward Class Reviewing. This is an overview, not content. Don’t worry if you do not understand some of the words; that’s why you want to take this class. It gives an idea of what is coming. The order of presentation might be shuffled to accommodate everyone’s schedule. Topics may be adjusted as the class progresses.
15
Week 1: Pattern Mining. Frequent patterns: finding regularities in data. Frequent patterns (sets of items) are ones that occur frequently in a data set. Can we automatically profile customers? What products are often purchased together?
Customer shopping baskets:
ID   Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
One hypothesis: {a, c} → {m}
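The hypothesis {a, c} → {m} can be checked against the five baskets on this slide by counting. The following is a minimal Python sketch and not part of the original slides: the helper names `support` and `confidence` are illustrative, and item "I" in basket 100 is assumed to be a lowercase "i".

```python
# Shopping baskets from the slide (assumption: item "I" in basket 100 is "i").
baskets = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(lhs, rhs):
    """Among baskets containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs) / support(lhs)


rule_support = support({"a", "c", "m"})
rule_confidence = confidence({"a", "c"}, {"m"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Under these assumptions, {a, c} and {a, c, m} both occur in baskets 100, 200, and 500, so the rule has support 3/5 and confidence 3/3.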
16
Week 2: Advanced Pattern Mining. Reducing the number of patterns. Maximal patterns and closed patterns. Constraint-based mining. Patterns with concept hierarchy. Patterns in quantitative data. Correlation vs. association.
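Closed and maximal patterns shrink the output of frequent pattern mining. Under the usual definitions, a frequent itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent at all. The brute-force Python sketch below illustrates the two definitions on a hypothetical four-transaction data set; the data, the function names, and the threshold are assumptions made for illustration, and the enumeration is only practical for tiny inputs.

```python
from itertools import combinations

# Hypothetical toy transactions (not from the slides).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]


def frequent_itemsets(transactions, min_count):
    """Map each itemset that occurs in at least `min_count` transactions to its count."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_count:
                result[itemset] = count
    return result


def closed_and_maximal(freq):
    """Split frequent itemsets into the closed ones and the maximal ones."""
    closed, maximal = set(), set()
    for itemset, count in freq.items():
        # A superset with equal support is itself frequent, so it suffices
        # to look at frequent proper supersets only.
        supersets = [other for other in freq if itemset < other]
        if not supersets:
            maximal.add(itemset)
        if all(freq[other] < count for other in supersets):
            closed.add(itemset)
    return closed, maximal


freq = frequent_itemsets(transactions, min_count=2)
closed, maximal = closed_and_maximal(freq)
print(len(freq), "frequent,", len(closed), "closed,", len(maximal), "maximal")
```

Every maximal itemset is also closed, so the maximal set is the most compact summary, while the closed set additionally preserves the support of every frequent itemset.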
17
Week 3: Mining Microarray Data
Gene   CH1I  CH1B  CH1D  CH2I  CH2B
CTFC3  4392  284   4108  280   228
VPS8   401   281   120   275   298
EFB1   318   280   37    277   215
SSA1   401   292   109   580   238
FUN14  2857  285   2576  271   226
SP07   228   290   48    285   224
MDM10  538   272   266   277   236
CYS3   322   288   41    278   219
DEP1   312   272   40    273   232
NTG1   329   296   33    274   228
From: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9, 3273-3297.
18
Week 4: Patterns in Sequences, Trees, and Graphs. [Figure: three labeled graphs G1, G2, and G3 (nodes p1 to p5, q1 to q3, and s1 to s4; node labels a, b, c, d; edge labels x, y) and the subgraph patterns P1 to P6 found in them, each annotated with its frequency, f = 3/3 or f = 2/3.]
19
Week 5: Pattern Discovery in Biomolecules. Protein: a sequence built from 20 amino acids that adopts a stable 3D structure, which can be measured experimentally. [Figure: a chain of amino acids (Lys, Gly, Leu, Val, Ala, His) with an atom legend (oxygen, nitrogen, carbon, sulfur), and four renderings of a protein structure: ribbon, space filling, cartoon, and surface.]
20
Week 6: Descriptive Models. Group objects into clusters: objects in the same cluster are similar; objects in different clusters are dissimilar. Unsupervised learning: no predefined classes. [Figure: Cluster 1, Cluster 2, and outliers.]
21
Week 7: Subspace Clustering. [Table: ratings of Movie 1 to Movie 7 by Viewer 1 to Viewer 5, shown twice on the slide. Not every viewer rated every movie, and the column positions were lost in extraction; the surviving rating digits per row are Viewer 1: 12435, Viewer 2: 4671, Viewer 3: 23463, Viewer 4: 3457, Viewer 5: 5534.]
22
Week 7: Subspace Clustering
23
Week 8: Mining Microarray (II). Apply subspace clustering to microarray analysis: find groups of genes that are co-regulated; may integrate data from protein sequences and functional descriptions of genes. Apply subgraph mining to microarray analysis.
24
Week 9: Predictive Models. Two-class version: using “training data” from Class +1 and Class -1, develop a “rule” for assigning new data to a class. Slides from J.S. Marron in Statistics at UNC.
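One of the simplest such rules assigns a new point to the class whose training mean is closer. The Python sketch below illustrates that idea only; it is not taken from the slides or from Marron's material, and the training points and function names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))


def train(class_plus, class_minus):
    """The 'rule' is fully described by the two class centroids."""
    return centroid(class_plus), centroid(class_minus)


def classify(model, x):
    """Return +1 or -1 depending on which centroid is nearer to x."""
    c_plus, c_minus = model
    d_plus = sum((a - b) ** 2 for a, b in zip(x, c_plus))
    d_minus = sum((a - b) ** 2 for a, b in zip(x, c_minus))
    return +1 if d_plus <= d_minus else -1


# Hypothetical training data for the two classes.
class_plus = [(2.0, 2.0), (3.0, 3.0)]
class_minus = [(-2.0, -2.0), (-3.0, -1.0)]

model = train(class_plus, class_minus)
print(classify(model, (1.0, 1.0)), classify(model, (-1.0, -2.0)))
```

Ties are broken in favor of Class +1 here, which is an arbitrary choice.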
25
Week 10: Classification Algorithms and Applications. Decision trees. Fisher’s linear discriminant method. Kernel methods.
26
Week 11: Text Mining, Gene Ontology, Data Management. Ontology seeks to describe or posit the basic categories and relationships of being or existence to define entities and types of entities within its framework. Ontology can be said to study conceptions of reality (Wikipedia). GO is a database of terms for genes. Terms are connected as a directed acyclic graph. Levels represent the specificity of the terms (not normalized). GO contains three different sub-ontologies: molecular function, biological process, and cellular component.
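Because terms form a directed acyclic graph rather than a tree, a term can have several parents, and a gene annotated with a term is implicitly associated with all of that term's ancestors. The Python sketch below shows one way to collect those ancestors; the term names are hypothetical placeholders, not real GO identifiers, and the `parents` mapping is an assumed in-memory representation.

```python
# Hypothetical mini-ontology: each term maps to its direct parents.
# "specific_term" has two parents, which a tree could not represent.
parents = {
    "root": [],
    "general_term_1": ["root"],
    "general_term_2": ["root"],
    "specific_term": ["general_term_1", "general_term_2"],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("specific_term", parents)))
```

The `seen` set keeps shared ancestors such as the root from being visited more than once.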
27
Week 12: Systems Biology & Proteomics. [Figure: part of the biological system in a cell at the molecular level. Source: http://www.ircs.upenn.edu/modeling2001/] A proteome is the set of all proteins in an organism.
28
Week 13: Analyzing Biological Networks. Biological networks pose serious challenges and opportunities for data mining research in computer science: large volume of data; heterogeneous data types. [Figure: protein-protein interaction in yeast. Gary D. Bader & Christopher W.V. Hogue, Nature Biotechnology 20, 991 - 997 (2002).] [Figure: Growth of Known Structures in Protein Data Bank (PDB); axes: year vs. # of structures, with 35,000 labeled.]
29
Week 14: Bio-Data Integration. Data are collected from many different sources. Each piece of data describes part of a complicated (and not directly observable) biological process. Combine the data to achieve better understanding and better prediction.
30
Weeks 15 and 16: Project Presentation. Check what you have learned from the class. Celebrate the hard work!
31
Further References. Data mining. Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. Journals: Data Mining and Knowledge Discovery, IEEE-TKDD. Bioinformatics. Conferences: ISMB, RECOMB, PSB, CSB, BIBE, etc. Journals: Bioinformatics, J. of Computational Biology, etc.
32
Further References. AI & machine learning. Conferences: ICML, AAAI, IJCAI, etc. Journals: Machine Learning, Artificial Intelligence, etc. Statistics. Conferences: Joint Statistical Meetings, etc. Journals: Annals of Statistics, etc. Database systems. Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, etc. Journals: ACM-TODS, IEEE-TKDE, etc. Visualization. Conference proceedings: IEEE Visualization, ACM-SIGGRAPH, etc. Journals: IEEE Trans. Visualization and Computer Graphics, etc.
33
Mining Protein Structure Space. http://www.nigms.nih.gov/psi/ [Figures: Growth of the Protein Folds in the Structural Classification of Proteins Database (SCOP), year vs. # of folds; Growth of Known Structures in Protein Data Bank (PDB), year vs. # of structures.]