Published by Felicity Knight. Modified over 9 years ago.
1
Text Mining: Three Cases
2
2 Outline
– Federalist Papers
– SVDPDF
– VAERS
http://zlin.ba.ttu.edu/sassrc.rar
http://zlin.ba.ttu.edu/DMTM9.rar
3
3 Federalist Papers
4
4
5
5 Who wrote The Federalist Papers: Hamilton or Madison?
STYLOMETRY: Uniquely identify an author based on the distribution of words in a document.
6
6 About the Data
Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to try to convince the citizens of the state of New York to ratify the new constitution of the United States. These essays are collectively called The Federalist Papers. Copies of the papers in a variety of formats can be found at http://www.yale.edu/lawweb/avalon/federal/fed.htm or http://www.constitution.org/fed/federa00.htm.
Of the 85 essays, 51 are attributed to Hamilton, 15 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 11 remaining essays can be attributed only to Hamilton or Madison. Mosteller and Wallace (1964) used Bayesian statistical techniques to provide evidence that Madison wrote all 11 of the essays of unknown authorship. (The essays in question are numbers 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, and 63.)
7
7 Corpus
The Federalist Papers corpus is a collection of 85 essays.
Terms and Tokens
The Federalist Papers taken as a whole contain over 190,000 tokens (word occurrences) and approximately 8,800 unique terms (distinct words).
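The difference between tokens and unique terms can be sketched in a few lines of Python. This is only an illustration: the sample sentence and the `tokenize` helper are assumptions, not the corpus or the tokenizer used for the counts above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical stand-in text, not an excerpt from the corpus.
sample = "The people of the state must decide. The people decide the question."

tokens = tokenize(sample)
term_counts = Counter(tokens)

n_tokens = len(tokens)       # every occurrence counts
n_terms = len(term_counts)   # each distinct word counts once

print(n_tokens, n_terms)     # 12 tokens, 7 unique terms
print(term_counts.most_common(3))
```

A real tokenizer would also deal with punctuation, numbers, and stemming, so counts from this sketch would not match the figures reported for the corpus.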
8
8 The Federalist Papers Diagram
EM Clustering
Logistic Regression
TARGET: 1 – Madison; 0 – Hamilton; missing – unknown
9
9 Federalist Papers Clusters
            Hamilton  Madison  Unknown
Cluster 1         24        1        0
Cluster 2         27       14       11
These clusters were obtained using numeric inputs derived from text mining. No author information was employed. Of interest is the fact that EM clustering placed all of the unknown essays into the same cluster that contains 14 of the 15 Madison essays.
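The cluster-by-author counts can be checked with a small cross-tabulation. The sketch below rebuilds the (cluster, author) pairs from the counts reported on the slide; it does not run EM clustering itself, and the helper name `share_in_cluster` is an illustrative assumption.

```python
from collections import Counter

# (cluster, author) pairs rebuilt from the counts reported on the slide.
assignments = (
    [(1, "Hamilton")] * 24 + [(1, "Madison")] * 1
    + [(2, "Hamilton")] * 27 + [(2, "Madison")] * 14 + [(2, "Unknown")] * 11
)

crosstab = Counter(assignments)
authors = ["Hamilton", "Madison", "Unknown"]

def share_in_cluster(author, cluster):
    """Fraction of an author's essays that fall into the given cluster."""
    total = sum(crosstab[(c, author)] for c in (1, 2))
    return crosstab[(cluster, author)] / total

for cluster in (1, 2):
    row = {a: crosstab[(cluster, a)] for a in authors}
    print(f"Cluster {cluster}: {row}")

madison_share = share_in_cluster("Madison", 2)    # 14 of 15
unknown_share = share_in_cluster("Unknown", 2)    # 11 of 11
hamilton_share = share_in_cluster("Hamilton", 2)  # 27 of 51
```

The same counts show that slightly more than half of the Hamilton essays also land in Cluster 2, so the clustering by itself is suggestive rather than conclusive.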
10
10 Logistic Regression Classification of The Federalist Papers
11
11 Text Mining Results
Text mining matches the results of Mosteller and Wallace. The predictions in the second column from the right show the strength of the decision. The record with a predicted value of 0.709119 corresponds to essay 56, so the model gives this essay the weakest association with Madison of all of the unknown essays. Essay 63, with a predicted value of 0.999691, has the strongest association with Madison. All of the essays in question have a stronger association with Madison than with Hamilton, hence the classification into the Madison category.
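How a predicted value turns into a classification can be sketched as follows. Only the two predicted values quoted in the text are used; the 0.5 cutoff and the `classify` helper are assumptions for illustration, not settings confirmed by the presentation.

```python
# Predicted probabilities that the target is 1 (Madison). Only the two values
# quoted in the text are included; the other nine essays are not listed there.
predicted = {56: 0.709119, 63: 0.999691}

THRESHOLD = 0.5  # assumed cutoff; the presentation does not state one

def classify(p, threshold=THRESHOLD):
    """Assign Madison when the predicted probability exceeds the cutoff."""
    return "Madison" if p > threshold else "Hamilton"

labels = {essay: classify(p) for essay, p in predicted.items()}
weakest = min(predicted, key=predicted.get)
strongest = max(predicted, key=predicted.get)

print(labels)              # both essays classified as Madison
print(weakest, strongest)  # 56 63
```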
12
12 Characteristics of a Document
A document consists of
– letters
– words
– sentences
– paragraphs
– punctuation
– possible structural items: chapters, sections.
The elements of a document can be
– counted (for example, the number of characters, words, or sentences)
– summarized (for example, mean, median, or kurtosis).
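As a minimal sketch of counting and summarizing document elements, the Python below computes a few counts and word-length summaries. The sample text, the `describe` helper, and the simple splitting rules are assumptions; kurtosis is computed here as population excess kurtosis, which may differ from the definition a given tool uses.

```python
import re
import statistics

def describe(text):
    """Count basic elements and summarize word lengths (needs at least one word)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(w) for w in words]
    mean = statistics.fmean(lengths)
    # Second and fourth central moments, for population excess kurtosis.
    m2 = sum((x - mean) ** 2 for x in lengths) / len(lengths)
    m4 = sum((x - mean) ** 4 for x in lengths) / len(lengths)
    kurtosis = m4 / m2 ** 2 - 3 if m2 > 0 else float("nan")
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "mean_word_length": mean,
        "median_word_length": statistics.median(lengths),
        "kurtosis_word_length": kurtosis,
    }

# Hypothetical sample text.
sample = "We the people decide. Do we agree? Yes."
summary = describe(sample)
print(summary)
```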
13
13 Comparing Two Documents
Measures compared for Doc 1 and Doc 2: word size, sentence size, paragraph size, word frequency, sentence frequency, paragraph frequency.
14
14
15
15 Contingency Table Comparing Essay 1 to Essay 37 continued...
16
16 Contingency Table Comparing Essay 1 to Essay 37
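One way to build and test such a contingency table is sketched below in Python. The slide content does not say which words or categories the table uses, so the vocabulary, the two short texts, and the helper names are all hypothetical. The counts are far too small for a valid test, and the code only illustrates the arithmetic of the Pearson chi-square statistic.

```python
import re
from collections import Counter

def word_counts(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[w] for w in vocabulary]

def chi_square(table):
    """Pearson chi-square statistic for a table of counts (rows = documents)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical vocabulary and texts, not taken from Essay 1 or Essay 37.
vocabulary = ["upon", "whilst", "the"]
doc_a = "upon the whole the plan rests upon the states"
doc_b = "whilst the states act the union waits whilst the people decide"

table = [word_counts(doc_a, vocabulary), word_counts(doc_b, vocabulary)]
statistic = chi_square(table)
print(table, statistic)
```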
17
17 Text Miner Static Analysis
18
18 Text Miner Interactive Analysis
19
19 SVDPDF
20
20 SAS Education Course Descriptions
The data represents a collection of 130 course summaries obtained from http://support.sas.com. The original 130 files were PDF files stored in one location on an HTTP server. A SAS DATA step was used to read the files from the server and write them to a local directory. The TMFILTER macro was used to process the PDF files and store the results as a text field in 130 document records in a SAS data set. The final SAS data set was modified to accommodate this demonstration and can be found in DMTM9.SASPDF.
21
21 Static Analysis with SAS Text Miner
22
22 Text Miner Settings
23
23 Interactive Results
24
24 Applications of Concept Lists
A company can have specific conceptual goals. For example, are customers concerned about
– brand integrity
– quality
– price
– features, styles, and selection
– availability
– customer support?
25
25 Market Research for Quality
What terms are most similar to the term “quality”?
–Find Similar
–Filter
What documents address quality?
–Filter on synonyms and similar terms
–Find similar documents
What secondary concepts reflect information on quality?
–SVD coefficients
–Concept links
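The idea behind finding terms similar to “quality” can be sketched with cosine similarity between term vectors. This is a simplification: it uses raw term-by-document counts rather than the SVD-reduced representation the slide mentions, and the documents and helper names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical customer comments.
docs = [
    "quality and reliability matter more than price",
    "poor quality and poor reliability",
    "price and selection were fine",
    "support was slow",
]

def term_vectors(documents):
    """Map each term to its vector of counts across the documents."""
    vectors = {}
    for i, doc in enumerate(documents):
        for term, n in Counter(re.findall(r"[a-z]+", doc.lower())).items():
            vectors.setdefault(term, [0] * len(documents))[i] = n
    return vectors

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(term, vectors, top=3):
    """Rank the other terms by cosine similarity to the given term."""
    target = vectors[term]
    scores = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != term]
    return sorted(scores, key=lambda s: (-s[0], s[1]))[:top]

vectors = term_vectors(docs)
similar = most_similar("quality", vectors)
print(similar)
```

Terms that occur in exactly the same documents as “quality” score highest. Common words such as “and” also score well, which is one reason a stop list is normally applied first.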
26
26 VAERS
27
27 VAERS
The Vaccine Adverse Event Reporting System (VAERS) was created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC) to receive reports about adverse events that might be associated with vaccines.
No prescription drug or biological product, such as a vaccine, is completely free from side effects. Vaccines protect many people from dangerous illnesses, but vaccines, like drugs, can cause side effects, a small percentage of which may be serious.
VAERS is used to continually monitor reports to determine whether any vaccine or vaccine lot has a higher than expected rate of events.
Department of Health and Human Services, Public Health Service
28
28 VAERS
Data was obtained from http://www.vaers.org/. Data was downloaded in September 2002 as a series of CSV files. A SAS DATA step was used to read and process the data.
The original data had 131,464 observations and 59 variables. Cleaning and screening reduced the data set to 48,523 observations and 44 variables.
The data set has 6 text variables. The original data had 21, but 15 were sparsely populated.
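The screening step can be pictured with a short Python sketch that drops sparsely populated columns and rows with no narrative text. The presentation used a SAS DATA step and does not give its rules, so the 0.5 fill-rate cutoff, the `NOTES` column, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical miniature extract; the rows are invented and NOTES is a
# made-up column standing in for a sparsely populated text variable.
raw = io.StringIO(
    "ID,SYMPTOM_TEXT,SYM01,NOTES\n"
    "1,fever after vaccine,FEVER,\n"
    "2,headache and rash,HEADACHE,\n"
    "3,,RASH,follow-up pending\n"
    "4,swelling at site,EDEMA,\n"
)

rows = list(csv.DictReader(raw))

def fill_rate(records, column):
    """Fraction of records with a non-blank value in the column."""
    filled = sum(1 for r in records if (r[column] or "").strip())
    return filled / len(records)

MIN_FILL = 0.5  # assumed cutoff for "sparsely populated"

# Keep well-populated columns, then keep rows that have narrative text.
kept = [c for c in rows[0] if fill_rate(rows, c) >= MIN_FILL]
screened = [{c: r[c] for c in kept} for r in rows if r["SYMPTOM_TEXT"].strip()]

print(kept)
print(len(screened))
```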
29
29
30
30 VAERS Sample Entries
15 mon. male w/ hx of recurrent ear infections & measles in Feb. 89'. 5Apr89 was given MMR. Within 24 hrs /p vaccine, parents noted hearing deficit, confirmed by physician exam.
DEAF
Urticaria, wheezy, & periorbital edema which abated /p administration of subcut. epinephrine, Bendryl IV, Solumendrol IV
ASTHMA
Pt experienced chicken pox from head to toe subsequent to receiving one dose of varicella virus vaccine live.
INFECT
31
31 VAERS Text Fields
SYMPTOM_TEXT : Full text description of the adverse reaction entered by a medical professional
SYM01 : Brief description of primary symptom
SYM02-SYM05 : Additional symptoms in decreasing importance
32
32 VAERS Initial Diagram
33
33 Equivalent Terms for Patient
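Treating several spellings as one term can be sketched as a lookup applied after tokenizing. The synonym list and the `normalize` helper below are assumptions for illustration, not the equivalent-term list shown on the slide. The sample entry is one of the VAERS entries quoted earlier in this deck.

```python
import re

# Assumed equivalents; the list actually used in the analysis is not reproduced here.
EQUIVALENTS = {"pt": "patient", "pts": "patient", "patients": "patient"}

def normalize(text):
    """Tokenize the text and replace each token with its canonical term, if any."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [EQUIVALENTS.get(t, t) for t in tokens]

entry = "Pt experienced chicken pox from head to toe"
normalized = normalize(entry)
print(normalized)
```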
34
34 Property Panel for VAERS Text Miner Analysis
35
35 Interactive Results
36
36 Clusters Window Why only one term when five were requested? …
37
37 Cases with Fever Last 16 entries out of 98
38
38 Headache Terms
39
39 Headache Documents
40
40 Terms Most Similar to Headache
41
41 Documents Most Similar to Headache
42
42 First 11 out of 65 Documents Filtered by Headache Terms
43
43 VAERS Predictive Modeling Diagram
44
44 Logistic Regression Model Effects Plot
45
45 Logistic Regression Lift Plot
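The quantity behind a lift plot can be sketched as the response rate among the highest-scored cases divided by the overall response rate. The scores and outcomes below are hypothetical, and `lift_at` is an illustrative helper rather than the calculation used to draw the plot.

```python
def lift_at(scored, fraction):
    """Lift = response rate in the top `fraction` of scores / overall response rate.

    Assumes at least one positive outcome, so the overall rate is not zero.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall_rate = sum(y for _, y in ranked) / len(ranked)
    return top_rate / overall_rate

# Hypothetical (predicted score, observed outcome) pairs.
scored = [
    (0.90, 1), (0.80, 1), (0.70, 0), (0.60, 1), (0.50, 0),
    (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0), (0.05, 0),
]

lift_top_20 = lift_at(scored, 0.2)
print(lift_top_20)  # 2.5: the top 20% respond at 2.5 times the overall rate
```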