7/10/07 - SEDE'07 1 DATA MINING APPLICATIONS
Margaret H. Dunham, Southern Methodist University, Dallas, Texas
This material is based in part upon work supported by the National Science Foundation under Grant No
Some slides used by permission from Dr Eamonn Keogh, University of California Riverside
7/10/07 - SEDE'07 2 The 2000 ozone hole over the Antarctic, as seen by EPTOMS
7/10/07 - SEDE'07 3 OBJECTIVE Explore some of the applications of data mining techniques.
7/10/07 - SEDE'07 4 Data Mining Applications Outline
- Introduction – Data Mining Overview
  - Classification (Prediction, Forecasting)
  - Clustering
  - Association Rules (Link Analysis)
- Applications
  - Fraud Detection & Illegal Activities
  - Facial Recognition
  - Cheating & Plagiarism
  - Bioinformatics
- Conclusions
7/10/07 - SEDE'07 5 Data Mining Overview
- Finding hidden information in a database
- Fit data to a model
- You must know what you are looking for
- You must know how to look for it
7/10/07 - SEDE'07 6 “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
- Description: Classification (Profiling)
- Behavior: Clustering (Similarity)
- Associations: Link Analysis
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
7/10/07 - SEDE'07 7 Classification Applications
- Teachers classify students’ grades as A, B, C, D, or F.
- Letter Recognition
- Handwriting Recognition
- Phishing: viewArticleBasic&taxonomyName=cybercrime_hacking&articleId= &taxonomyId=82
- Pluto:
7/10/07 - SEDE'07 8 Classification Example
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
Figure labels: Grasshoppers, Katydids. (c) Eamonn Keogh
7/10/07 - SEDE'07 9 [Scatter plot of Antenna Length and Abdomen Length for Grasshoppers and Katydids] (c) Eamonn Keogh
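The slides do not name a specific classifier for the insect example. As one possible illustration, the sketch below applies a k-nearest-neighbor vote to the two plotted features. The measurements, the `TRAINING` list, the `classify` function, and the choice of k=3 are all assumptions made for this example, not values taken from the figure.

```python
import math
from collections import Counter

# Hypothetical (antenna length, abdomen length) measurements.
# These numbers are invented for illustration; they are not read off the slide.
TRAINING = [
    ((2.0, 6.0), "Grasshopper"),
    ((2.5, 7.0), "Grasshopper"),
    ((3.0, 8.0), "Grasshopper"),
    ((1.5, 5.5), "Grasshopper"),
    ((3.5, 7.5), "Grasshopper"),
    ((7.0, 4.0), "Katydid"),
    ((8.0, 5.0), "Katydid"),
    ((9.0, 4.5), "Katydid"),
    ((7.5, 6.0), "Katydid"),
    ((8.5, 3.5), "Katydid"),
]


def classify(sample, labeled, k=3):
    """Label `sample` by majority vote among its k nearest labeled neighbors."""
    # Sort the labeled instances by Euclidean distance to the sample.
    nearest = sorted(labeled, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify((8.0, 4.5), TRAINING))
    print(classify((2.2, 6.8), TRAINING))
```

With real data, the features would usually be rescaled first so that neither measurement dominates the distance.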
7/10/07 - SEDE'07 10 Clustering Applications
- Targeted Marketing
- Determining Gene Functionality
- Identifying Species
- Clustering vs. Classification
  - No prior knowledge
  - Number of clusters
  - Meaning of clusters
- Unsupervised learning
7/10/07 - SEDE'07 12 What is Similarity? (c) Eamonn Keogh
7/10/07 - SEDE'07 13 Association Rules Applications
- People who buy diapers also buy beer
- If gene A is highly expressed in this disease then gene B is also expressed
- Relationships between people
- www.amazon.com
- Book Stores
- Department Stores
- Advertising
- Product Placement
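A rule such as "people who buy diapers also buy beer" is commonly scored by its support and confidence. The sketch below computes both over a handful of made-up market baskets. The basket contents and the helper names `count`, `support`, and `confidence` are assumptions for illustration only.

```python
# Hypothetical market baskets, invented for this example.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"diapers", "beer", "milk"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain `itemset`."""
    return count(itemset, baskets) / len(baskets)


def confidence(lhs, rhs, baskets):
    """Of the baskets containing `lhs`, the fraction that also contain `rhs`."""
    lhs_count = count(lhs, baskets)
    if lhs_count == 0:
        raise ValueError("left-hand side never occurs in the baskets")
    return count(lhs | rhs, baskets) / lhs_count


if __name__ == "__main__":
    print("support:", support({"diapers", "beer"}, BASKETS))
    print("confidence:", confidence({"diapers"}, {"beer"}, BASKETS))
```

Here three of the five baskets contain both items, and three of the four diaper baskets also contain beer.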
7/10/07 - SEDE'07 14 Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall,
DILBERT reprinted by permission of United Feature Syndicate, Inc.
7/10/07 - SEDE'07 17 Fraud Detection
- Identify fraudulent behavior
- Used extensively in the financial, law enforcement, health care, and other sectors
- http://
- SPSS: ection.htm
- Neural Technologies:
7/10/07 - SEDE'07 18 Law Enforcement
- Identify suspect behavior and relationships
- I2 Inc.
  - Investigative analytic/visualization software
- Social Network Analysis – Analyze patterns of relationships
- Relationships: personal, religious, operational, etc.
7/10/07 - SEDE'07 19 Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network” Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.
7/10/07 - SEDE'07 21 How Stuff Works, “Facial Recognition,” fworks.com/facial-recognition1.htm
7/10/07 - SEDE'07 22 Facial Recognition
- Based upon features in face
- Convert face to a feature vector
- Less invasive than other biometric techniques
- http://
- http://computer.howstuffworks.com/facial-recognition.htm
- SIMS: ucts.aspx
7/10/07 - SEDE'07 23 (c) Eamonn Keogh,
7/10/07 - SEDE'07 25 Cheating on Multiple Choice Tests
- Similarity between tests based on number of common wrong answers.
- (George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol 27, no 7, 2000, pp )
- The number of common correct answers is often ignored.
- H-H Index (D.N. Harpp, J.J. Hogan, and J.S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol 73, no 4, pp ):
  H-H = (Number of exact answers in common) / (Number of different answers)
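The H-H index can be sketched as a ratio computed over two answer sheets. The version below reads "exact answers in common" as identical wrong answers, which is an assumption based on the slide's note that common correct answers are often ignored. The function name `hh_index` and the sample answer strings are hypothetical.

```python
import math


def hh_index(answers_a, answers_b, key):
    """Ratio of identical wrong answers to differing answers for two test takers.

    Assumption: only answers that both students got wrong in the same way count
    toward the numerator; shared correct answers are ignored.
    """
    if not (len(answers_a) == len(answers_b) == len(key)):
        raise ValueError("answer sheets and key must have the same length")
    triples = list(zip(answers_a, answers_b, key))
    common_wrong = sum(1 for a, b, k in triples if a == b and a != k)
    different = sum(1 for a, b, _ in triples if a != b)
    if different == 0:
        # Identical sheets: the ratio is undefined, so report infinity.
        return math.inf
    return common_wrong / different


if __name__ == "__main__":
    key = "ABCDABCDAB"
    student_1 = "ABCDBBCAAC"
    student_2 = "ABCDBBCAAD"
    # Two shared wrong answers and one differing answer.
    print(hh_index(student_1, student_2, key))
```

A high value only flags a pair for closer review. What threshold counts as suspicious is not given on the slide and is left open here.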
7/10/07 - SEDE'07 26 Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
7/10/07 - SEDE'07 27 No/Little Cheating Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
7/10/07 - SEDE'07 28 Rampant Cheating Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
7/10/07 - SEDE'07 30 DNA
- Basic building blocks of organisms
- Located in nucleus of cells
- Composed of 4 nucleotides
- Two strands bound together
d=63
7/10/07 - SEDE'07 31 Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
chapter 6; Gene Prediction
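The two steps of the central dogma can be traced on the slide's own sequence. The sketch below treats the DNA string as the coding strand (so transcription is a T-to-U substitution) and uses a deliberately partial codon table that covers only the codons in this example. The names `CODON_TABLE`, `transcribe`, and `translate` are introduced for illustration.

```python
# Partial codon table: only the codons that occur in the slide's sequence.
# A real translation would need all 64 codons of the standard genetic code.
CODON_TABLE = {
    "CCU": "P",  # proline
    "CCA": "P",  # proline
    "GAG": "E",  # glutamic acid
    "GAA": "E",  # glutamic acid
    "ACU": "T",  # threonine
    "AUU": "I",  # isoleucine
    "GAU": "D",  # aspartic acid
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read the mRNA three bases at a time and map each codon to an amino acid."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    # Raises KeyError for any codon missing from the partial table.
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    dna = "CCTGAGCCAACTATTGATGAA"
    rna = transcribe(dna)
    print(rna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(rna))  # PEPTIDE
```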
7/10/07 - SEDE'07 32 miRNA
- Short (20-25 nt) sequence of noncoding RNA
- Known since 1993 but significance not widely appreciated until 2001
- Impact / Prevent translation of mRNA
- Generally reduce protein levels without impacting mRNA levels (animal cells)
- Functions
  - Causes some cancers
  - Guide embryo development
  - Regulate cell differentiation
  - Associated with HIV
  - …
7/10/07 - SEDE'07 33 Questions
- If each cell in an organism contains the same DNA –
  - How does each cell behave differently?
  - Why do cells behave differently during childhood?
  - What causes some cells to act differently – such as during disease?
- DNA contains many genes, but only a few are being transcribed – why?
- One answer - miRNA
7/10/07 - SEDE'07 35 Human Genome
- Scientists originally thought there would be about 100,000 genes
- Appear to be about 20,000
- WHY?
- Almost identical to that of chimps. What makes the difference?
- Visualization from UCR dnaQT.mov
- Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
7/10/07 - SEDE'07 36 RNAi – Nobel Prize in Medicine 2006
- Double stranded RNA
- Short Interfering RNA (~20-25 nt)
- RNA-Induced Silencing Complex
- Binds to mRNA
- Cuts RNA
- siRNA may be artificially added to cell!
Image source: Advanced Information, Image 3
7/10/07 - SEDE'07 37 Computer Science & Bioinformatics
- Algorithms
- Data Structures
- Improving efficiency
- Data Mining
- Biologists don’t usually understand or even appreciate what Computer Science can do
- Issues:
  - Scalability
  - Fuzzy
- We will look at:
  - Microarray Clustering
  - TCGR
7/10/07 - SEDE'07 38 Affymetrix GeneChip ® Array
7/10/07 - SEDE'07 39 Microarray Data Analysis
- Each probe location associated with gene
- Measure the amount of mRNA
- Color indicates degree of gene expression
- Compare different samples (normal/disease)
- Track same sample over time
- Questions
  - Which genes are related to this disease?
  - Which genes behave in a similar manner?
  - What is the function of a gene?
- Clustering
  - Hierarchical
  - K-means
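To make the K-means option concrete, the sketch below groups genes whose expression profiles behave similarly across samples. The gene names, expression values, and the deterministic "first k points" initialization are assumptions chosen to keep the example small and repeatable. Practical tools typically use random or smarter seeding and work on much larger matrices.

```python
import math

# Hypothetical expression levels for four genes across three samples.
PROFILES = {
    "geneA": (0.1, 0.2, 0.1),
    "geneB": (0.9, 1.0, 0.8),
    "geneC": (0.2, 0.1, 0.2),
    "geneD": (1.0, 0.9, 1.1),
}


def kmeans(points, k, iterations=20):
    """Plain K-means: alternate between assigning points and updating centroids."""
    # Assumption: seed the centroids with the first k points for repeatability.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    genes = list(PROFILES)
    labels, _ = kmeans([PROFILES[g] for g in genes], k=2)
    for gene, label in zip(genes, labels):
        print(gene, "-> cluster", label)
```

In this toy data the low-expression genes (geneA, geneC) land in one cluster and the high-expression genes (geneB, geneD) in the other.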
7/10/07 - SEDE'07 40 Microarray Data - Clustering "Gene expression profiling identifies clinically relevant subtypes of prostate cancer" Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, , January 20, 2004
7/10/07 - SEDE'07 41 miRNA Research Issues
- Predict / Find miRNA in genomic sequence
- Predict miRNA targets
- Identify miRNA functions
7/10/07 - SEDE'07 42 Temporal CGR (TCGR)
- 2D Array
  - Each row represents counts for a particular window in the sequence
    - First row: first window
    - Last row: last window
    - We start successive windows at the next character location
  - Each column represents the counts for the associated pattern in that window
    - Initially we have assumed the order of patterns is alphabetic
  - Size of TCGR depends on sequence length and subpattern length
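The row-and-column layout described above can be sketched as a sliding-window count matrix. This is one reading of the slide, not the authors' implementation: the function name `tcgr`, the RNA alphabet `ACGU`, and the choice to count overlapping subpatterns inside each window are assumptions.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Build a window-by-pattern count matrix for `sequence`.

    Row i covers the window starting at character i; successive windows start
    at the next character. Columns follow alphabetic order of the patterns.
    """
    if pattern_len > window:
        raise ValueError("pattern length cannot exceed window size")
    # All patterns of the given length, in alphabetic order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {pattern: index for index, pattern in enumerate(patterns)}
    rows = []
    for start in range(len(sequence) - window + 1):
        current = sequence[start:start + window]
        counts = [0] * len(patterns)
        # Assumption: count every overlapping subpattern inside the window.
        for i in range(window - pattern_len + 1):
            sub = current[i:i + pattern_len]
            if sub in column:  # skip characters outside the alphabet
                counts[column[sub]] += 1
        rows.append(counts)
    return patterns, rows


if __name__ == "__main__":
    patterns, matrix = tcgr("ACGUACG", window=5, pattern_len=3)
    print(len(matrix), "rows x", len(patterns), "columns")
    print("ACG count in first window:", matrix[0][patterns.index("ACG")])
```

Under these assumptions the matrix has `len(sequence) - window + 1` rows and `4 ** pattern_len` columns, which matches the slide's point that the size depends on sequence length and subpattern length.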
7/10/07 - SEDE'07 43 TCGR Example (cont’d) TCGRs for Sub-patterns of length 1, 2, and 3
7/10/07 - SEDE'07 44 TCGR – Mature miRNA (Window=5; Pattern=3)
Panels: All Mature, Mus Musculus, Homo Sapiens, C Elegans
Pattern labels: ACG, CGC, GCG, UCG
7/10/07 - SEDE'07 45 TCGRs for Xue Training Data
Panels: POSITIVE, NEGATIVE
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
7/10/07 - SEDE'07 46 TCGRs for Xue Test Data
Panels: POSITIVE, NEGATIVE
7/10/07 - SEDE'07 48 Conclusions
- Not magic
- Doesn’t work for all applications
- Stock Market Prediction
- Issues
  - Privacy
  - Data
- Here are some infamous examples of failed data mining applications
7/10/07 - SEDE'07 50 Dallas Morning News October 7, 2005
7/10/07 - SEDE'07 52 BIG BROTHER?
- Total Information Awareness
- Terror Watch List
  - 511_8047_tc_210.htm
- CAPPS