Published by Diana Jenkins. Modified over 9 years ago.
2
ADVENTURES IN DATA MINING
Margaret H. Dunham, Southern Methodist University, Dallas, Texas 75275, mhd@lyle.smu.edu
2/25/13 - Union University; ACM Distinguished Speakers Program
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841 and NIH Grant No. 1R21HG005912-01A1.
Some slides used by permission from Dr. Eamonn Keogh, University of California Riverside, eamonn@cs.ucr.edu.
3
The 2000 ozone hole over the Antarctic, seen by EPTOMS. http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
4
Data Mining Outline
- Introduction
- Techniques
  - Classification
  - Clustering
  - Association Rules
- Examples
Explore some interesting data mining applications
5
Introduction
- Data is growing at a phenomenal rate
- Users expect more sophisticated information
- How?
UNCOVER HIDDEN INFORMATION: DATA MINING
6
But it isn’t Magic
- You must know what you are looking for
- You must know how to look for it
Suppose you knew that a specific cave had gold:
- What would you look for?
- How would you look for it?
- Might need an expert miner
7
CLASSIFICATION
- Assign data to predefined groups or classes.
8
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
[Diagram labels: Description, Behavior, Associations; Classification, Clustering, Link Analysis; (Profiling), (Similarity)]
9
Classification Ex: Grading
[Decision tree on score x]
- x >= 90: A
- x < 90 and x >= 80: B
- x < 80 and x >= 70: C
- x < 70 and x >= 60: D
- x < 60: F
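The grading tree can be written as a chain of threshold tests. This is a minimal sketch, not code from the talk: the function name `grade` is made up for illustration, and it assumes the lowest split sits at 60 on both branches.

```python
def grade(score: float) -> str:
    """Walk the decision tree from the top split down to a leaf."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:  # assumed cutoff for the last split
        return "D"
    return "F"


if __name__ == "__main__":
    for s in (95, 85, 72, 60, 41):
        print(s, grade(s))
```

Each test only runs when all the tests above it have failed, which is exactly the path a score takes down the tree.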
10
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. [Figure: Grasshoppers and Katydids] (c) Eamonn Keogh, eamonn@cs.ucr.edu
11
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????

The classification problem can now be expressed as: given a training database, predict the class label of a previously unseen instance. Here the previously unseen instance is insect 11. (c) Eamonn Keogh, eamonn@cs.ucr.edu
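One way to make the prediction concrete is a k-nearest-neighbor vote over the ten labeled insects. The slide does not name a classifier, so nearest neighbor is swapped in here purely as an illustration; `knn_predict` and the choice `k=3` are assumptions, while the measurements come from the slide's table.

```python
import math
from collections import Counter

# Training rows from the slide: (abdomen length, antennae length, class).
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def knn_predict(abdomen, antennae, k=3):
    """Majority vote among the k training rows closest in Euclidean distance."""
    by_distance = sorted(
        TRAINING,
        key=lambda row: math.dist((abdomen, antennae), row[:2]),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Insect 11 from the slide: abdomen 5.1, antennae 7.0.
    print(knn_predict(5.1, 7.0))
```

Under these assumptions the three closest rows are insects 7, 5, and 1, so the vote for insect 11 goes to Katydid. A different classifier or distance measure could of course decide differently.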
12
[Figure: scatter plot of Grasshoppers and Katydids; axes Antenna Length and Abdomen Length, each from 1 to 10] (c) Eamonn Keogh, eamonn@cs.ucr.edu
13
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm
14
Facial Recognition (c) Eamonn Keogh, eamonn@cs.ucr.edu
15
Handwriting Recognition: George Washington Manuscript [Figure: plot with horizontal axis 0 to 450 and vertical axis 0 to 1] (c) Eamonn Keogh, eamonn@cs.ucr.edu
16
Rare Event Detection
18
Dallas Morning News, October 7, 2005
19
Classification Performance: True Positive, True Negative, False Positive, False Negative. © Prentice Hall
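The four outcomes can be tallied by comparing each predicted label with the actual one. The sketch below is an illustration only: `confusion_counts` is a made-up helper, and the "fraud"/"ok" labels are hypothetical data, not results from the talk.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class of interest."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label agrees.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical labels, made up for this example.
    actual = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
    predicted = ["fraud", "ok", "fraud", "ok", "ok", "ok"]
    print(confusion_counts(actual, predicted, positive="fraud"))
```

In rare event detection the false negatives are usually the costly cell, which is why a single accuracy figure can mislead.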
20
Behavior Based Classification/Prediction
- Credit Card Fraud Detection
- Credit Score
- Home Mortgage Approval
21
CLUSTERING
- Partition data into previously undefined groups.
22
http://149.170.199.144/multivar/ca.htm
23
What is Similarity? (c) Eamonn Keogh, eamonn@cs.ucr.edu
24
Two Types of Clustering: Hierarchical, Partitional (c) Eamonn Keogh, eamonn@cs.ucr.edu
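As an illustration of the partitional type, here is a small k-means sketch. The slide does not name an algorithm, so k-means is swapped in as one common partitional method; the function name `kmeans`, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for this example.

```python
import math


def kmeans(points, k, iterations=10):
    """Lloyd-style k-means; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignment, centroids


if __name__ == "__main__":
    # Hypothetical 2-D points forming two well-separated groups.
    points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
              (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
    labels, centers = kmeans(points, k=2)
    print(labels)
    print(centers)
```

Unlike classification, no class labels go in: the groups are defined only by which points end up near each other. A hierarchical method would instead build a tree of nested groups.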
25
Hierarchical Clustering Example: Iris Data Set (Setosa, Versicolor, Virginica)
The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.
Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster
26
ASSOCIATION RULES / LINK ANALYSIS
- Find relationships between data
27
ASSOCIATION RULES EXAMPLES
- People who buy diapers also buy beer
- If gene A is highly expressed in this disease then gene B is also expressed
- Relationships between people
- Book Stores
- Department Stores
- Advertising
- Product Placement
- http://www.amazon.com/Data-Mining-Introductory-Advanced-Topics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
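A rule such as "diapers, therefore beer" is usually scored with two standard measures, support and confidence, which the slide does not spell out. The sketch below computes both on a handful of hypothetical baskets; the helper names and the basket contents are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


if __name__ == "__main__":
    # Hypothetical market baskets, not real sales data.
    baskets = [
        {"diapers", "beer", "chips"},
        {"diapers", "beer"},
        {"diapers", "milk"},
        {"beer", "chips"},
        {"milk", "bread"},
    ]
    print(support(baskets, {"diapers", "beer"}))
    print(confidence(baskets, {"diapers"}, {"beer"}))
```

With these baskets, two of five contain both items and two of the three diaper baskets also contain beer. A store would act on a rule only when both numbers clear thresholds it chooses.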
28
Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003. DILBERT reprinted by permission of United Feature Syndicate, Inc.
29
Data Mining Outline
- Introduction
- Techniques
- Examples
  - Vision Mining
  - Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
  - Bioinformatics
30
Vision Mining
- License Plate Recognition
  - Red Light Cameras
  - Toll Booths
  - http://www.licenseplaterecognition.com/
- Computer Vision
  - http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/
31
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
32
No/Little Cheating. Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
33
Rampant Cheating. Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
34
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.
35
Arnet Miner
- http://arnetminer.org/
36
DNA
- Basic building blocks of organisms
- Located in nucleus of cells
- Composed of 4 nucleotides
- Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
37
Central Dogma: DNA -> RNA -> Protein
- DNA: CCTGAGCCAACTATTGATGAA
- transcription
- RNA: CCUGAGCCAACUAUUGAUGAA
- translation
- Protein: Amino Acid
www.bioalgorithms.info; chapter 6; Gene Prediction
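The two steps on this slide can be traced on its own example sequence. This is a rough sketch: `transcribe` and `translate` are made-up helper names, transcription is simplified to rewriting the displayed DNA strand with U in place of T (as the slide's RNA line does), and the codon table holds only the seven entries of the standard genetic code that this sequence needs.

```python
# Subset of the standard genetic code: only the codons this example needs.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamic acid
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartic acid
}


def transcribe(dna: str) -> str:
    """Write the displayed DNA strand as mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = (rna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE[c] for c in codons)


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(rna))  # PEPTIDE
```

The seven codons spell out the one-letter amino acid codes P, E, P, T, I, D, E. A full translator would carry all 64 codons and handle start and stop signals.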
38
Human Genome
- Scientists originally thought there would be about 100,000 genes
- There appear to be about 20,000
- WHY?
- Almost identical to that of chimps. What makes the difference?
- Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
39
RNAi – Nobel Prize in Medicine 2006
- Double stranded RNA
- Short Interfering RNA (~20-25 nt)
- RNA-Induced Silencing Complex
- Binds to mRNA
- Cuts RNA
- siRNA may be artificially added to cell!
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
40
miRNA
- Short (20-25 nt) sequence of noncoding RNA
- Known since 1993 but significance not widely appreciated until 2001
- Impact / Prevent translation of mRNA
- Generally reduce protein levels without impacting mRNA levels (animal cells)
- Functions
  - Causes some cancers
  - Guide embryo development
  - Regulate cell differentiation
  - Associated with HIV
  - …
41
TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure panels: All Mature, Mus Musculus, Homo Sapiens, C Elegans; pattern labels: ACG, CGC, GCG, UCG]
42
TCGRs for Xue Training Data
[Figure panels: POSITIVE, NEGATIVE]
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
43
Affymetrix GeneChip® Array http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
44
BIG BROTHER?
- Total Information Awareness
  - http://en.wikipedia.org/wiki/Information_Awareness_Office
- Terror Watch List
  - http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  - http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  - http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html
- CAPPS
  - http://en.wikipedia.org/wiki/CAPPS
45
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
47
My DM Toolbelt
- C, C++
- Perl, Ruby
- Weka
- R, SAS
- Excel, XLMiner
- Vi, Word, …
- Grep, sed, …