Hoyle paper 019-31 SUGI 31 Text Mining SAS-L Topics Larry Hoyle, Policy Research Institute, University of Kansas.

1 Hoyle paper 019-31 SUGI 31 Text Mining SAS-L Topics Larry Hoyle, Policy Research Institute, University of Kansas

2 SAS-L topics
Read each weekly topic list from http://www.listserv.uga.edu/archives/sas-l.html
Parse topic, HTMLdecode
Strip "Re: "

    /* strip variations of re: */
    topicRE = prxparse('/^ *[Rr][Ee] *: *(.*)/');
    if prxmatch(topicRE, topic) then do;
        topic = prxposn(topicRE, 1, topic);
    end;

Proc SQL to aggregate topic counts across weeks
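The stripping and aggregation steps on this slide might be combined roughly as follows. This is a minimal sketch rather than the author's code: the data set `weekly_topics`, its variables `week`, `topic`, and `n_msgs`, and the three sample rows are all hypothetical stand-ins for the parsed weekly lists.

```sas
/* Hypothetical input: one row per topic line per weekly archive page */
data weekly_topics;
    infile datalines dsd dlm='|' truncover;
    length week $5 topic $200;
    input week $ topic $ n_msgs;
    datalines;
0501a|Re: proc sql question|4
0501b|RE : proc sql question|3
0501b|macro quoting|2
;
run;

/* Strip a leading "Re:" in any letter case, with optional blanks */
data cleaned;
    set weekly_topics;
    retain topicRE;
    if _n_ = 1 then topicRE = prxparse('/^ *re *: *(.*)/i');
    if prxmatch(topicRE, topic) then topic = prxposn(topicRE, 1, topic);
    drop topicRE;
run;

/* Merge threads that span several weeks and total their messages */
proc sql;
    create table thread_counts as
    select topic,
           count(distinct week) as n_weeks,
           sum(n_msgs)          as total_msgs
    from cleaned
    group by topic
    order by total_msgs desc;
quit;
```

Like the code on the slide, this removes only one leading "Re:" per topic, so a subject such as "Re: Re: ..." would need a second pass or a repeated group in the pattern.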

3 SAS-L 2005
35,324 thread/topic lines in the HTML files
7,081 threads after merging across weeks and a little cleaning

4 SAS-L Top Threads in Number of Messages

5 Text Miner on the SAS-L topics

10 Largest clusters

11 Smaller Clusters

12 Message Content

13 Web scraping with tmfilter

    options noxwait;

    %macro aweek(week=0501a);
        /* create per-week folders for the raw and the filtered posts */
        x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week";
        x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredposts\&week";
        libname sugi31 'C:\ddrive\projects\sugs\sugi31\SASLBOF\datasets';

        /* fetch the weekly index page and follow its links one level deep */
        %tmfilter(
            dataset=sugi31.SL&week.,
            dir=C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week,
            destdir=C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredPosts\&week,
            URL=http://listserv.uga.edu/cgi-bin/wa?A1=ind&week.%NRSTR(&L=sas-l),
            depth=1,
            links=sugi31.SL&week.L,
            norestrict=' ',
            numchars=2000)
    %mend aweek;

    %aweek(week=0501a);
    %aweek(week=0501b);
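The slide calls `%aweek` once per week code. For a longer run of weeks, a small driver macro could loop over a blank-separated list instead. This is only a sketch: `%allweeks` and its `weeks=` parameter are hypothetical names, and the `%aweek` defined here is a stand-in that just writes to the log so the example runs on its own.

```sas
/* Stand-in for the %aweek macro on the slide; omit it when the real one is defined */
%macro aweek(week=);
    %put NOTE: would scrape week &week;
%mend aweek;

/* Hypothetical driver: call %aweek once for each code in a blank-separated list */
%macro allweeks(weeks=);
    %local i week;
    %let i = 1;
    %do %while(%length(%scan(&weeks, &i, %str( ))) > 0);
        %let week = %scan(&weeks, &i, %str( ));
        %aweek(week=&week);
        %let i = %eval(&i + 1);
    %end;
%mend allweeks;

%allweeks(weeks=0501a 0501b);
```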

14 Parse date and sender

15 Using a 10% sample of message text

16 Using a 10% sample of message text
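The slides do not say how the 10% sample was drawn. One common way to do it in SAS, assuming SAS/STAT is available, is simple random sampling with PROC SURVEYSELECT. In the sketch below the data set `work.posts`, the variable `msg_id`, and the seed value are hypothetical.

```sas
/* Hypothetical input: one row per message (1,000 dummy rows here) */
data work.posts;
    do msg_id = 1 to 1000;
        output;
    end;
run;

proc surveyselect data=work.posts
                  out=work.posts10
                  method=srs       /* simple random sampling without replacement */
                  samprate=0.10    /* sampling rate of 10% */
                  seed=31;         /* fixed seed so the sample can be reproduced */
run;
```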

17 Filter out too common terms, listserv

18 Filter out too common terms, listserv

