7/10/07 - SEDE'07 1 DATA MINING APPLICATIONS Margaret H. Dunham Southern Methodist University Dallas, Texas 75275 This material is based.

Presentation transcript:

7/10/07 - SEDE'07 1 DATA MINING APPLICATIONS
Margaret H. Dunham, Southern Methodist University, Dallas, Texas
This material is based in part upon work supported by the National Science Foundation.
Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside.

7/10/07 - SEDE'07 2 The 2000 ozone hole over the Antarctic, as seen by EP/TOMS

7/10/07 - SEDE'07 3 OBJECTIVE Explore some of the applications of data mining techniques.

7/10/07 - SEDE'07 4 Data Mining Applications Outline
- Introduction – Data Mining Overview
  - Classification (Prediction, Forecasting)
  - Clustering
  - Association Rules (Link Analysis)
- Applications
  - Fraud Detection & Illegal Activities
  - Facial Recognition
  - Cheating & Plagiarism
  - Bioinformatics
- Conclusions

7/10/07 - SEDE'07 5 Data Mining Overview
- Finding hidden information in a database
- Fit data to a model
- You must know what you are looking for
- You must know how to look for it

7/10/07 - SEDE'07 6 “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
Description, Behavior, Associations
Classification, Clustering, Link Analysis
(Profiling) (Similarity)
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

7/10/07 - SEDE'07 7 Classification Applications
- Teachers classify students’ grades as A, B, C, D, or F.
- Letter Recognition
- Handwriting Recognition
- Phishing: viewArticleBasic&taxonomyName=cybercrime_hacking&articleId= &taxonomyId=82
- Pluto

7/10/07 - SEDE'07 8 Classification Example
Given a collection of annotated data (in this case, five instances of katydids and five of grasshoppers), decide what type of insect the unlabeled example is.
[Figure: labeled examples of Grasshoppers and Katydids] (c) Eamonn Keogh

7/10/07 - SEDE'07 9 [Scatter plot of Grasshoppers and Katydids; axes: Antenna Length, Abdomen Length] (c) Eamonn Keogh
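
The insect example above can be sketched as a small nearest-neighbor classifier. This is only an illustration, not the method used in the slides: the measurements in TRAINING are invented, and the names classify and k are hypothetical.

```python
import math

# Hypothetical (abdomen_length, antenna_length) measurements.
# These values are made up for illustration; they are not from the slides.
TRAINING = [
    ((1.5, 2.0), "Grasshopper"),
    ((3.0, 1.5), "Grasshopper"),
    ((4.5, 2.5), "Grasshopper"),
    ((6.0, 3.0), "Grasshopper"),
    ((7.0, 2.0), "Grasshopper"),
    ((2.0, 6.5), "Katydid"),
    ((3.5, 7.0), "Katydid"),
    ((5.0, 8.5), "Katydid"),
    ((6.5, 7.5), "Katydid"),
    ((8.0, 9.0), "Katydid"),
]


def classify(sample, training=TRAINING, k=3):
    """Label a sample by majority vote among its k nearest labeled examples."""
    # Sort the labeled examples by Euclidean distance to the sample.
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    # Return the label with the most votes among the k neighbors.
    return max(votes, key=votes.get)


print(classify((5.0, 8.0)))  # long antennae
print(classify((4.0, 2.0)))  # short antennae
```

With this made-up data, antenna length alone separates the two classes, which is the point the scatter plot makes visually.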

7/10/07 - SEDE'07 10 Clustering Applications
- Targeted Marketing
- Determining Gene Functionality
- Identifying Species
- Clustering vs. Classification
  - No prior knowledge
  - Number of clusters
  - Meaning of clusters
- Unsupervised learning

7/10/07 - SEDE'07 12 What is Similarity? (c) Eamonn Keogh

7/10/07 - SEDE'07 13 Association Rules Applications
- People who buy diapers also buy beer
- If gene A is highly expressed in this disease, then gene B is also expressed
- Relationships between people
- www.amazon.com
- Book Stores
- Department Stores
- Advertising
- Product Placement
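
A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. The slide does not define these measures, so the sketch below assumes the standard definitions; the transactions and function names are invented for illustration.

```python
# Hypothetical market-basket data; not from the slides.
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


print(support({"diapers", "beer"}, TRANSACTIONS))       # 2 of 5 baskets
print(confidence({"diapers"}, {"beer"}, TRANSACTIONS))  # 2 of the 3 diaper baskets
```

The same two functions apply unchanged to the gene-expression bullet if each "transaction" is the set of genes highly expressed in one sample.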

7/10/07 - SEDE'07 14 Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall. DILBERT reprinted by permission of United Feature Syndicate, Inc.

7/10/07 - SEDE'07 17 Fraud Detection
- Identify fraudulent behavior
- Used extensively in the financial, law enforcement, health care, and other sectors
- SPSS: ection.htm
- Neural Technologies

7/10/07 - SEDE'07 18 Law Enforcement
- Identify suspect behavior and relationships
- I2 Inc.
  - Investigative analytic/visualization software
- Social Network Analysis – analyze patterns of relationships
- Relationships: personal, religious, operational, etc.

7/10/07 - SEDE'07 19 Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network” Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.

7/10/07 - SEDE'07 21 How Stuff Works, “Facial Recognition,” fworks.com/facial-recognition1.htm

7/10/07 - SEDE'07 22 Facial Recognition
- Based upon features of the face
- Convert face to a feature vector
- Less invasive than other biometric techniques
- http://computer.howstuffworks.com/facial-recognition.htm
- SIMS: ucts.aspx

7/10/07 - SEDE'07 23 (c) Eamonn Keogh,

7/10/07 - SEDE'07 25 Cheating on Multiple Choice Tests
- Similarity between tests is based on the number of common wrong answers.
  (George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol. 27, no. 7, 2000)
- The number of common correct answers is often ignored.
- H-H Index (D.N. Harpp, J.J. Hogan, and J.S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol. 73, no. 4):
  H-H = (Number of exact answers in common) / (Number of different answers)
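
The two measures on this slide can be sketched as follows. This is a reading of the slide, not the published procedure: it assumes "exact answers in common" means questions where both students chose the same wrong option, and "different answers" means questions where the two students disagree. The answer strings and function names are hypothetical.

```python
def common_wrong_answers(a, b, key):
    """Count questions where both students gave the same wrong answer."""
    return sum(1 for x, y, k in zip(a, b, key) if x == y and x != k)


def differing_answers(a, b):
    """Count questions where the two students answered differently."""
    return sum(1 for x, y in zip(a, b) if x != y)


def hh_index(a, b, key):
    """Ratio of shared wrong answers to differing answers (assumed reading of H-H)."""
    diff = differing_answers(a, b)
    common = common_wrong_answers(a, b, key)
    if diff == 0:
        # Identical answer sheets: the ratio is unbounded if any shared error exists.
        return float("inf") if common else 0.0
    return common / diff


# Hypothetical 10-question test; one character per answer.
KEY = "ABCDABCDAB"
STUDENT_1 = "ABCAACDDBB"
STUDENT_2 = "ABCAACDDAC"

print(common_wrong_answers(STUDENT_1, STUDENT_2, KEY))  # 3
print(differing_answers(STUDENT_1, STUDENT_2))          # 2
print(hh_index(STUDENT_1, STUDENT_2, KEY))              # 1.5
```

A large ratio means the pair shares many identical mistakes while disagreeing on few questions, which is the pattern the index is meant to flag.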

7/10/07 - SEDE'07 26 Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

7/10/07 - SEDE'07 27 No/Little Cheating. Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

7/10/07 - SEDE'07 28 Rampant Cheating. Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

7/10/07 - SEDE'07 30 DNA
- Basic building blocks of organisms
- Located in nucleus of cells
- Composed of 4 nucleotides
- Two strands bound together

7/10/07 - SEDE'07 31 Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
chapter 6; Gene Prediction
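
The slide's example can be reproduced in a few lines. The sketch below assumes the standard genetic code and treats transcription as the simple T-to-U substitution shown on the slide; the function names are hypothetical.

```python
# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """DNA coding strand -> RNA, modeled as replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """RNA -> one-letter amino acid string; '*' marks a stop codon."""
    dna = rna.upper().replace("U", "T")
    # Read complete, non-overlapping codons from the start of the sequence.
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The sequence on the slide was evidently chosen so that its seven codons spell the word PEPTIDE in one-letter amino acid codes.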

7/10/07 - SEDE'07 32 miRNA
- Short (20-25 nt) sequence of noncoding RNA
- Known since 1993, but significance not widely appreciated until 2001
- Impact / prevent translation of mRNA
- Generally reduce protein levels without impacting mRNA levels (animal cells)
- Functions
  - Cause some cancers
  - Guide embryo development
  - Regulate cell differentiation
  - Associated with HIV
  - …

7/10/07 - SEDE'07 33 Questions
- If each cell in an organism contains the same DNA:
  - How does each cell behave differently?
  - Why do cells behave differently during childhood?
  - What causes some cells to act differently, such as during disease?
- DNA contains many genes, but only a few are being transcribed. Why?
- One answer: miRNA

7/10/07 - SEDE'07 35 Human Genome
- Scientists originally thought there would be about 100,000 genes
- Appear to be about 20,000
- WHY?
- Almost identical to that of chimps. What makes the difference?
- Visualization from UCR: dnaQT.mov
- Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

7/10/07 - SEDE'07 36 RNAi – Nobel Prize in Medicine 2006
- Double-stranded RNA
- Short interfering RNA (siRNA, ~20-25 nt)
- RNA-Induced Silencing Complex
- Binds to mRNA
- Cuts RNA
siRNA may be artificially added to the cell!
Image source: Advanced Information, Image 3

7/10/07 - SEDE'07 37 Computer Science & Bioinformatics
- Algorithms
- Data Structures
- Improving efficiency
- Data Mining
- Biologists don’t usually understand or even appreciate what Computer Science can do
- Issues:
  - Scalability
  - Fuzzy
- We will look at:
  - Microarray Clustering
  - TCGR

7/10/07 - SEDE'07 38 Affymetrix GeneChip® Array

7/10/07 - SEDE'07 39 Microarray Data Analysis
- Each probe location is associated with a gene
- Measure the amount of mRNA
- Color indicates degree of gene expression
- Compare different samples (normal/disease)
- Track same sample over time
- Questions
  - Which genes are related to this disease?
  - Which genes behave in a similar manner?
  - What is the function of a gene?
- Clustering
  - Hierarchical
  - K-means
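
To make "which genes behave in a similar manner?" concrete, here is a minimal K-means sketch over expression profiles. The expression values and gene names are invented, and seeding the centroids with the first k profiles is a simplifying assumption made for reproducibility; it is not a description of how any microarray package works.

```python
import math

# Hypothetical expression levels per gene across four samples
# (two normal, two disease). Values are made up for illustration.
EXPRESSION = {
    "geneA": [1.0, 1.2, 5.0, 5.3],
    "geneB": [5.1, 4.8, 1.1, 0.9],
    "geneC": [0.9, 1.1, 4.8, 5.1],
    "geneD": [4.9, 5.2, 1.0, 1.2],
}


def kmeans(profiles, k, iterations=20):
    """Cluster profiles into k groups; returns (assignment, centroids)."""
    # Assumption: seed centroids with the first k profiles (deterministic).
    centroids = [list(p) for p in profiles[:k]]
    assignment = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


genes = list(EXPRESSION)
labels, _ = kmeans([EXPRESSION[g] for g in genes], k=2)
for gene, label in zip(genes, labels):
    print(gene, "-> cluster", label)
```

In this toy data, geneA and geneC rise in the disease samples while geneB and geneD fall, so the two pairs land in separate clusters.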

7/10/07 - SEDE'07 40 Microarray Data - Clustering. “Gene expression profiling identifies clinically relevant subtypes of prostate cancer,” Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, January 20, 2004

7/10/07 - SEDE'07 41 miRNA Research Issues
- Predict / find miRNA in genomic sequence
- Predict miRNA targets
- Identify miRNA functions

7/10/07 - SEDE'07 42 Temporal CGR (TCGR)
- 2D array
  - Each row represents counts for a particular window in the sequence
    - First row: first window
    - Last row: last window
    - We start successive windows at the next character location
  - Each column represents the counts for the associated pattern in that window
    - Initially we have assumed the order of patterns is alphabetic
  - Size of TCGR depends on sequence length and subpattern length
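
The array described above can be sketched as a sliding-window count matrix. The slide does not specify every detail, so this sketch assumes fixed-size windows that advance one character at a time and overlapping pattern occurrences within each window; the function name tcgr and its parameters are hypothetical.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Return (patterns, rows): one row of pattern counts per window."""
    # Columns: every pattern of the given length, in alphabetic order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Rows: successive windows, each starting at the next character.
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        counts = [0] * len(patterns)
        # Count overlapping occurrences of each pattern inside the window.
        for i in range(len(chunk) - pattern_len + 1):
            sub = chunk[i:i + pattern_len]
            if sub in column:
                counts[column[sub]] += 1
        rows.append(counts)
    return patterns, rows


patterns, rows = tcgr("ACGUAC", window=5, pattern_len=1)
print(patterns)  # ['A', 'C', 'G', 'U']
for row in rows:
    print(row)   # [2, 1, 1, 1] then [1, 2, 1, 1]
```

Under these assumptions the matrix has (sequence length - window + 1) rows and 4^pattern_len columns, which matches the slide's remark that its size depends on sequence length and subpattern length.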

7/10/07 - SEDE'07 43 TCGR Example (cont’d) TCGRs for Sub-patterns of length 1, 2, and 3

7/10/07 - SEDE'07 44 TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure panels: All Mature, Mus musculus, Homo sapiens, C. elegans; labeled patterns: ACG, CGC, GCG, UCG]

7/10/07 - SEDE'07 45 TCGRs for Xue Training Data
[Figure panels: POSITIVE, NEGATIVE]
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.

7/10/07 - SEDE'07 46 TCGRs for Xue Test Data
[Figure panels: POSITIVE, NEGATIVE]

7/10/07 - SEDE'07 48 Conclusions
- Not magic
- Doesn’t work for all applications
- Stock Market Prediction
- Issues
  - Privacy
  - Data
- Here are some infamous examples of failed data mining applications

7/10/07 - SEDE'07 50 Dallas Morning News, October 7, 2005

7/10/07 - SEDE'07 52 BIG BROTHER?
- Total Information Awareness
- Terror Watch List
  - 511_8047_tc_210.htm
- CAPPS
