
1 Text Mining: Three Cases

2 Outline: Federalist Papers, SVDPDF, VAERS. http://zlin.ba.ttu.edu/sassrc.rar http://zlin.ba.ttu.edu/DMTM9.rar

3 Federalist Papers

4

5 Who wrote The Federalist Papers? Hamilton or Madison? STYLOMETRY: Uniquely identify an author based on the distribution of words in a document.

6 About the Data: Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to try to convince the citizens of the state of New York to ratify the new Constitution of the United States. These essays are collectively called The Federalist Papers. Copies of the papers in a variety of formats can be found at http://www.yale.edu/lawweb/avalon/federal/fed.htm or http://www.constitution.org/fed/federa00.htm. Of the 85 essays, 51 are attributed to Hamilton, 15 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 11 remaining essays were written by either Hamilton or Madison, but which of the two is unknown. Mosteller and Wallace (1964) used Bayesian statistical techniques to provide evidence that Madison wrote all 11 of the essays of unknown authorship. (The essays in question are numbers 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, and 63.)

7 Corpus: The Federalist Papers corpus is a collection of 85 essays. Terms and Tokens: The Federalist Papers taken as a whole contain over 190,000 tokens and approximately 8,800 unique tokens.
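
The token and unique-token counts on this slide can be illustrated with a small Python sketch. It is not the SAS Text Miner tokenizer: the regular expression, the function names, and the two sample sentences are assumptions made for illustration, so counts on the real corpus would differ somewhat from the figures quoted above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def corpus_counts(documents):
    """Return (total tokens, number of unique tokens, Counter of frequencies)."""
    freq = Counter()
    for doc in documents:
        freq.update(tokenize(doc))
    return sum(freq.values()), len(freq), freq

# Two made-up sentences standing in for the 85 essays.
docs = [
    "The people are the source of power.",
    "Power is held by the people of the state.",
]
n_tokens, n_unique, freq = corpus_counts(docs)
print(n_tokens, n_unique, freq.most_common(3))
```

Every occurrence of a word counts as a token, while repeated words count only once toward the unique total.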

8 The Federalist Papers Diagram: EM Clustering, Logistic Regression. TARGET: 1 = Madison; 0 = Hamilton; missing = unknown

9 Federalist Papers Clusters
Cluster     Hamilton  Madison  Unknown
Cluster 1         24        1        0
Cluster 2         27       14       11
These clusters were obtained using numeric inputs derived from text mining. No author information was employed. Of interest is the fact that EM clustering placed all of the unknown essays into the same cluster that contains 14 of the 15 Madison essays.
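
The slide does not show how EM clustering works internally. As a rough illustration only, the sketch below fits a two-component Gaussian mixture to a single numeric input with the EM algorithm in plain Python. The SAS implementation, its inputs (several text-derived dimensions), and its settings are not reproduced here; the function names, the starting values, and the eight data values are all assumptions.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_clusters(xs, iterations=50):
    """Fit a two-component 1-D Gaussian mixture by EM."""
    # Start the means at the extremes, with a shared variance and equal weights.
    means = [min(xs), max(xs)]
    centre = sum(xs) / len(xs)
    spread = sum((x - centre) ** 2 for x in xs) / len(xs)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each cluster for each point.
        resp = []
        for x in xs:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
    return means, variances, weights, resp

# Made-up values of one text-derived numeric input for eight documents.
xs = [0.2, 0.3, 0.1, 0.4, 3.1, 2.9, 3.3, 3.0]
means, variances, weights, resp = em_two_clusters(xs)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
print(labels)
```

As on the slide, no author labels enter the fit; cluster membership comes only from the numeric input.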

10 Logistic Regression Classification of The Federalist Papers

11 Text Mining Results: Text mining matches the results of Mosteller and Wallace. The predictions in the second column from the right show the strength of the decision. The record with a predicted value of 0.709119 corresponds to essay 56, so the model finds that this essay has the weakest association with Madison of all of the unknown essays. Essay 63, with a predicted value of 0.999691, has the strongest association with Madison. All of the essays in question have a stronger association with Madison than with Hamilton, hence the classification into the Madison category.
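
A minimal Python sketch of the scoring idea follows: fit a logistic regression on essays with a known author (target 1 for Madison, 0 for Hamilton), then score the essays whose target is missing. It is not the SAS model from the slides. The two input features, the essay names, every data value, the learning rate, and the epoch count are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features):
    """Predicted probability that the target is 1 (Madison)."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)

def fit_logistic(rows, targets, learning_rate=0.5, epochs=2000):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, target in zip(rows, targets):
            error = predict(weights, features) - target
            gradient[0] += error
            for j, x in enumerate(features):
                gradient[j + 1] += error * x
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights

# Hypothetical essays: (name, two made-up numeric inputs, target).
# Target 1 = Madison, 0 = Hamilton, None = unknown author.
essays = [
    ("H1", [0.9, 0.1], 0), ("H2", [0.8, 0.2], 0), ("H3", [0.7, 0.3], 0),
    ("M1", [0.1, 0.9], 1), ("M2", [0.2, 0.8], 1), ("M3", [0.3, 0.7], 1),
    ("U1", [0.25, 0.75], None), ("U2", [0.4, 0.6], None),
]
labeled = [(f, t) for _, f, t in essays if t is not None]
weights = fit_logistic([f for f, _ in labeled], [t for _, t in labeled])
scores = {name: predict(weights, f) for name, f, t in essays if t is None}
for name, p in sorted(scores.items(), key=lambda item: item[1]):
    print(name, round(p, 3), "Madison" if p > 0.5 else "Hamilton")
```

Sorting by predicted value puts the weakest association with Madison first, which is how the slide singles out essay 56.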

12 Characteristics of a Document: A document consists of letters, words, sentences, paragraphs, punctuation, and possible structural items (chapters, sections). The elements of a document can be counted (for example, the number of characters, words, or sentences) or summarized (for example, mean, median, or kurtosis).
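
Counting and summarizing document elements can be sketched in Python as follows. The splitting rules (what counts as a word or a sentence), the function name, and the sample text are assumptions made for illustration; kurtosis is left out to keep the sketch short.

```python
import re
import statistics

WORD = r"[A-Za-z']+"

def document_profile(text):
    """Count characters, words, and sentences, and summarize their sizes."""
    words = re.findall(WORD, text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_lengths = [len(w) for w in words]
    sentence_lengths = [len(re.findall(WORD, s)) for s in sentences]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "mean_word_length": statistics.mean(word_lengths),
        "median_word_length": statistics.median(word_lengths),
        "mean_sentence_length": statistics.mean(sentence_lengths),
    }

# Made-up sample text.
profile = document_profile("We the people. We hold these views! Do you?")
print(profile)
```

Running the same function on two documents gives two profiles that can be compared element by element.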

13 Comparing Two Documents: Doc 1 and Doc 2 are compared on word size, sentence size, paragraph size, word frequency, sentence frequency, and paragraph frequency.

14

15 Contingency Table Comparing Essay 1 to Essay 37 (continued...)

16 Contingency Table Comparing Essay 1 to Essay 37

17 Text Miner Static Analysis

18 Text Miner Interactive Analysis

19 SVDPDF

20 SAS Education Course Descriptions: The data represents a collection of 130 course summaries obtained from http://support.sas.com. The original 130 files were PDF files stored in one location on an HTTP server. A SAS DATA step was used to read the files from the server and write them to a local directory. The TMFILTER macro was used to process the PDF files and store the results as a text field in 130 document records in a SAS data set. The final SAS data set was modified to accommodate this demonstration and can be found in DMTM9.SASPDF.

21 Static Analysis with SAS Text Miner

22 Text Miner Settings

23 Interactive Results

24 Applications of Concept Lists: A company can have specific conceptual goals. For example, are customers concerned about brand integrity; quality; price; features, styles, and selection; availability; or customer support?

25 Market Research for Quality: What terms are most similar to the term “quality”? (Find Similar; Filter.) What documents address quality? (Filter on synonyms and similar terms; find similar documents.) What secondary concepts reflect information on quality? (SVD coefficients; concept links.)
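
One way to picture a "terms most similar to quality" query is cosine similarity between rows of a term-by-document count matrix. The Python sketch below is a stand-in under that assumption, not the Find Similar feature of SAS Text Miner: the function names and the four sample documents are made up, and no SVD reduction is applied.

```python
import math
import re
from collections import Counter

def term_vectors(documents):
    """Map each term to its vector of counts across the documents."""
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in documents]
    vocabulary = set().union(*counts)
    return {term: [c[term] for c in counts] for term in vocabulary}

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(term, vectors, top=3):
    """Return the `top` (similarity, term) pairs closest to `term`."""
    target = vectors[term]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != term]
    return sorted(scored, reverse=True)[:top]

# Made-up customer comments.
docs = [
    "quality and reliability matter most",
    "reliability drives quality ratings",
    "price and selection matter",
    "low price wide selection",
]
vectors = term_vectors(docs)
print(most_similar("quality", vectors))
```

Terms that appear in the same documents as "quality" score near 1, while terms that never co-occur with it score 0.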

26 VAERS

27 VAERS: The Vaccine Adverse Event Reporting System (VAERS) was created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC) to receive reports about adverse events that might be associated with vaccines. No prescription drug or biological product, such as a vaccine, is completely free from side effects. Vaccines protect many people from dangerous illnesses, but vaccines, like drugs, can cause side effects, a small percentage of which may be serious. VAERS is used to continually monitor reports to determine whether any vaccine or vaccine lot has a higher than expected rate of events. (Source: Department of Health and Human Services, Public Health Service)

28 VAERS Data: Data was obtained from http://www.vaers.org/ and downloaded in September 2002 as a series of CSV files. A SAS DATA step was used to read and process the data. The original data had 131,464 observations and 59 variables. Cleaning and screening reduced the data set to 48,523 observations and 44 variables. The original data had 21 text variables, but 15 were sparsely populated, so the cleaned data set has 6.
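
The slides do not give the screening rules, so the Python sketch below shows only one plausible step: dropping columns that are sparsely populated. The 50% threshold, the function names, the OTHER_TEXT column, and the four sample rows are assumptions; the original work used a SAS DATA step on the real CSV files.

```python
import csv
import io

def populated_fraction(rows, column):
    """Share of rows in which the column holds a non-blank value."""
    filled = sum(1 for row in rows if (row.get(column) or "").strip())
    return filled / len(rows)

def drop_sparse_columns(rows, min_fraction=0.5):
    """Keep only columns populated in at least min_fraction of the rows."""
    columns = list(rows[0].keys())
    keep = [c for c in columns if populated_fraction(rows, c) >= min_fraction]
    return [{c: row[c] for c in keep} for row in rows], keep

# Made-up rows; OTHER_TEXT is a hypothetical sparsely populated text field.
raw = """ID,SYMPTOM_TEXT,SYM01,OTHER_TEXT
1,fever after vaccination,FEVER,
2,headache and rash,HEADACHE,
3,swelling at injection site,EDEMA,see chart
4,,FEVER,
"""
rows = list(csv.DictReader(io.StringIO(raw)))
cleaned, kept = drop_sparse_columns(rows)
print(kept)
```

Here OTHER_TEXT is filled in only one row out of four, so it falls below the assumed threshold and is dropped.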

29

30 VAERS Sample Entries
15 mon. male w/ hx of recurrent ear infections & measles in Feb. 89'. 5Apr89 was given MMR. Within 24 hrs /p vaccine, parents noted hearing deficit, confirmed by physician exam. -> DEAF
Urticaria, wheezy, & periorbital edema which abated /p administration of subcut. epinephrine, Bendryl IV, Solumendrol IV -> ASTHMA
Pt experienced chicken pox from head to toe subsequent to receiving one dose of varicella virus vaccine live. -> INFECT

31 VAERS Text Fields: SYMPTOM_TEXT: full text description of the adverse reaction entered by a medical professional. SYM01: brief description of primary symptom. SYM02-SYM05: additional symptoms in decreasing importance.

32 VAERS Initial Diagram

33 Equivalent Terms for Patient

34 Property Panel for VAERS Text Miner Analysis

35 Interactive Results

36 Clusters Window: Why only one term when five were requested? …

37 Cases with Fever: Last 16 entries out of 98

38 Headache Terms

39 Headache Documents

40 Terms Most Similar to Headache

41 Documents Most Similar to Headache

42 First 11 out of 65 Documents Filtered by Headache Terms

43 VAERS Predictive Modeling Diagram

44 Logistic Regression Model Effects Plot

45 Logistic Regression Lift Plot

