ADVENTURES IN DATA MINING
Margaret H. Dunham
Southern Methodist University, Dallas, Texas
2/25/13 - Union University
ACM Distinguished Speakers Program
This material is based in part upon work supported by the National Science Foundation under Grant No and NIH Grant No.1R21HG A1
Some slides used by permission from Dr Eamonn Keogh, University of California Riverside

The 2000 ozone hole over the Antarctic, seen by EPTOMS

Data Mining Outline
- Introduction
- Techniques
  - Classification
  - Clustering
  - Association Rules
- Examples
Explore some interesting data mining applications

Introduction
- Data is growing at a phenomenal rate
- Users expect more sophisticated information
- How?
UNCOVER HIDDEN INFORMATION: DATA MINING

But it isn’t Magic
- You must know what you are looking for
- You must know how to look for it
Suppose you knew that a specific cave had gold:
- What would you look for?
- How would you look for it?
- Might need an expert miner

CLASSIFICATION
- Assign data into predefined groups or classes.

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
Description      Behavior      Associations
Classification   Clustering    Link Analysis
(Profiling)      (Similarity)
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

Classification Ex: Grading
x >= 90: A
x < 90:
  x >= 80: B
  x < 80:
    x >= 70: C
    x < 70:
      x >= 60: D
      x < 60: F

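The grading tree can be sketched in a few lines of Python. This is only an illustration, assuming each split uses a single cutoff (90, 80, 70, 60) and that scores below the last cutoff receive an F; the function name `grade` and the sample scores are hypothetical and not part of the slides.

```python
def grade(x: float) -> str:
    """Walk the grading decision tree from the top split down."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:  # assumed cutoff for the last split
        return "D"
    return "F"


# Hypothetical scores, one per class, used only to exercise each branch.
scores = [95, 83, 70, 61, 42]
grades = [grade(s) for s in scores]
print(grades)  # ['A', 'B', 'C', 'D', 'F']
```

Each test in the chain corresponds to one internal node of the tree, so the classes are predefined and a new score is assigned to exactly one of them.
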
Grasshoppers / Katydids
Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
(c) Eamonn Keogh

Columns: Insect ID | Abdomen Length | Antennae Length | Insect Class
Insect Class column: Grasshopper, Katydid, Grasshopper, Grasshopper, Katydid, Grasshopper, Katydid, Grasshopper, Katydid, Katydid
Previously unseen instance: ???????
The classification problem can now be expressed as: given a training database, predict the class label of a previously unseen instance.
(c) Eamonn Keogh

[Figure: Antenna Length vs. Abdomen Length, Grasshoppers and Katydids]
(c) Eamonn Keogh

How Stuff Works, “Facial Recognition,” fworks.com/facial-recognition1.htm

Facial Recognition
(c) Eamonn Keogh

Handwriting Recognition: George Washington Manuscript
(c) Eamonn Keogh

Rare Event Detection

Dallas Morning News, October 7, 2005

Classification Performance
- True Positive
- True Negative
- False Positive
- False Negative
© Prentice Hall

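The four outcomes can be counted directly from actual and predicted labels. The sketch below is a minimal illustration for a two-class problem; the function name `confusion_counts` and the label lists are hypothetical examples, not data from the talk.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count the four outcomes for a two-class (True/False) classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1  # positive, predicted positive
        elif not a and not p:
            counts["TN"] += 1  # negative, predicted negative
        elif p:
            counts["FP"] += 1  # negative, predicted positive
        else:
            counts["FN"] += 1  # positive, predicted negative
    return counts


# Hypothetical labels: True means the instance belongs to the class.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(dict(counts), round(accuracy, 3))
```

For rare events, accuracy alone can mislead, since a classifier that always predicts "negative" scores well while producing only false negatives; that is why the four counts are reported separately.
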
Behavior Based Classification/Prediction
- Credit Card Fraud Detection
- Credit Score
- Home Mortgage Approval

CLUSTERING
- Partition data into previously undefined groups.

What is Similarity?
(c) Eamonn Keogh

Two Types of Clustering
- Hierarchical
- Partitional
(c) Eamonn Keogh

Hierarchical Clustering Example: Iris Data Set
Setosa / Versicolor / Virginica
The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7.
Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland

ASSOCIATION RULES / LINK ANALYSIS
- Find relationships between data

ASSOCIATION RULES EXAMPLES
- People who buy diapers also buy beer
- If gene A is highly expressed in this disease then gene B is also expressed
- Relationships between people
- Book Stores
- Department Stores
- Advertising
- Product Placement
- http:// Topics/dp/ /ref=sr_1_1?ie=UTF8&s=books&qid= &sr=1-1

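A rule such as "diapers, therefore beer" is usually scored with two standard measures that the slide does not name: support (the fraction of transactions containing all the items) and confidence (how often the right-hand side appears when the left-hand side does). The sketch below assumes a tiny, made-up list of shopping baskets; the helper names and the data are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never applies in this data
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(round(rule_support, 2), round(rule_confidence, 2))  # 0.4 0.67
```

In this toy data the rule holds in two of the three baskets that contain diapers, so it has support 0.4 and confidence of about 0.67.
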
Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall.
DILBERT reprinted by permission of United Feature Syndicate, Inc.

Data Mining Outline
- Introduction
- Techniques
- Examples
  - Vision Mining
  - Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
  - Bioinformatics

Vision Mining
- License Plate Recognition
  - Red Light Cameras
  - Toll Booths
- Computer Vision
  - ects/CS/vision/shape/vid/

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

No/Little Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Rampant Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.

Arnet Miner
- http://arnetminer.org/

DNA
- Basic building blocks of organisms
- Located in nucleus of cells
- Composed of 4 nucleotides
- Two strands bound together

Central Dogma: DNA -> RNA -> Protein
- DNA: CCTGAGCCAACTATTGATGAA
- transcription
- RNA: CCUGAGCCAACUAUUGAUGAA
- translation
- Protein: Amino Acid
chapter 6; Gene Prediction

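The two steps can be traced on the slide's own sequence. This is a simplified sketch: it treats the DNA string as the coding strand, so transcription reduces to replacing T with U, and it uses a partial codon table containing only the codons that occur in this sequence (taken from the standard genetic code). The function names are hypothetical.

```python
# Partial codon table: only the codons present in the slide's sequence,
# mapped to one-letter amino acid codes from the standard genetic code.
CODON_TO_AMINO_ACID = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamic acid
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartic acid
}


def transcribe(dna: str) -> str:
    """DNA coding strand -> mRNA: thymine (T) becomes uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """mRNA -> amino acid string, reading complete codons of three bases."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TO_AMINO_ACID[c] for c in codons)


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
protein = translate(rna)
print(rna)      # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

A full translation would need all 64 codons, including start and stop codons; the partial table raises a `KeyError` on any codon outside this example.
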
Human Genome
- Scientists originally thought there would be about 100,000 genes
- Appear to be about 20,000
- WHY?
- Almost identical to that of chimps. What makes the difference?
- Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

RNAi – Nobel Prize in Medicine
- Double stranded RNA
- Short Interfering RNA (~20-25 nt)
- RNA-Induced Silencing Complex
- Binds to mRNA
- Cuts RNA
siRNA may be artificially added to cell!
Image source: Advanced Information, Image 3

miRNA
- Short (20-25 nt) sequence of noncoding RNA
- Known since 1993 but significance not widely appreciated until 2001
- Impact / prevent translation of mRNA
- Generally reduce protein levels without impacting mRNA levels (animal cells)
- Functions
  - Cause some cancers
  - Guide embryo development
  - Regulate cell differentiation
  - Associated with HIV
  - …

TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure panels: All Mature, Mus Musculus, Homo Sapiens, C Elegans; pattern labels: ACG, CGC, GCG, UCG]

TCGRs for Xue Training Data
[Figure panels: POSITIVE, NEGATIVE]
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

Affymetrix GeneChip® Array

BIG BROTHER?
- Total Information Awareness
  - ce
- Terror Watch List
  - 005/tc _8047_tc_210.htm
  - rror_watch/
  - watch.html
- CAPPS

My DM Toolbelt
- C, C++
- Perl, Ruby
- Weka
- R, SAS
- Excel, XLMiner
- Vi, Word, …
- Grep, sed, …
