Tools for Unstructured Text


1 Tools for Unstructured Text
NICAR 2012. Loretta Auvil, UIUC; Chase Davis, CIR. February 2012

2 Hottest Analytics

3 Overview

4 Sites that List Tools
DiRT (Digital Research Tools); KDnuggets; text-processing.com
Discussion groups: Text Analytics on LinkedIn; Visual Analytics on LinkedIn

5 Bad News
There is no open and freely available tool that is going to solve all your problems!

6 Good News
A variety of tools can help, but they must be used in combination to accomplish the goal!

7 Natural Language Processing (NLP)
Tokenization; part-of-speech tagging; stemming; stop word removal; other transformations
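Three of the steps listed on this slide (tokenization, stop word removal, stemming) can be sketched in plain Python. This is a minimal illustration and not the API of NLTK or any other tool named in this deck: the stop word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper, far simpler than a real stemmer. Part-of-speech tagging needs a trained model and is left out.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common English suffixes (hypothetical helper, not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The reporters are reading the leaked documents"))
# → ['reporter', 'read', 'leak', 'document']
```

The order of the steps matters: stop words are removed before stemming so that the list can hold ordinary word forms.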

8 NLP Tools
NLTK (http://www.nltk.org/, http://text-processing.com); OpenNLP; Stanford CoreNLP; Mallet; GATE; LingPipe

9 Entity Extraction
Finding entities such as people, locations, and times. Some tools let you add your own entities (with seed terms).
Tools: OpenNLP; Stanford CoreNLP; OpenCalais; GATE
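The idea of adding your own entities with seed terms can be illustrated with a simple dictionary lookup (a gazetteer). The sketch below is an assumption about how such matching might look in its most basic form. The tools listed on this slide use trained statistical models and work differently. The seed lists and the `find_entities` helper are hypothetical.

```python
import re

# Hypothetical seed terms, grouped by entity type.
SEEDS = {
    "PERSON": ["Loretta Auvil", "Chase Davis"],
    "LOCATION": ["New York", "Illinois"],
}


def find_entities(text, seeds=SEEDS):
    """Return (start, matched_text, entity_type) tuples sorted by position."""
    found = []
    for entity_type, terms in seeds.items():
        for term in terms:
            # Match the seed term only on word boundaries.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text):
                found.append((match.start(), match.group(), entity_type))
    return sorted(found)


text = "Chase Davis spoke in New York about text tools."
for start, name, entity_type in find_entities(text):
    print(start, name, entity_type)
```

A lookup like this only finds names it already knows, which is why seed terms are usually a supplement to a trained extractor and not a replacement for one.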

10 Journalism Application
Structuring unstructured data; social networks of entities; clustering; plotting data on a timeline; plotting locations on a map

11 Information Extraction
Automatically identifies and extracts binary relationships from English sentences.
TextRunner; NACTEM, MEDIE; ReVerb

12 Information Extraction: ReVerb

13 Question and Answer
Parses the question to determine what type of information needs to be returned, then leverages approaches such as information extraction to retrieve the results.

14 Journalism Application
Engagement! Help users find facts relevant to their situations.

15 Document Classification
Starts with a training set. Predicts which class a document belongs to. Leverages pure data mining approaches such as Naïve Bayes, decision trees, and neural networks.
Tools: NLTK; Mallet; Weka; Rapid Miner; GATE; Meandre
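To make the "training set, then predict a class" workflow concrete, here is a minimal multinomial Naïve Bayes sketch in plain Python. It is an illustration under stated assumptions and not the implementation found in NLTK, Mallet, Weka, or the other tools listed: it uses add-one smoothing, ignores words never seen in training, and the four training documents are invented.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (list_of_tokens, label) pairs."""
    class_counts = Counter()            # documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(tokens, model):
    """Return the class with the highest log-probability for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training set for illustration only.
training = [
    ("bank loan interest rate".split(), "banking"),
    ("mortgage bank regulation".split(), "banking"),
    ("river pollution clean water".split(), "environment"),
    ("emissions air pollution".split(), "environment"),
]
model = train(training)
print(predict("water pollution rules".split(), model))
# → environment
```

Working in log-probabilities avoids numeric underflow when documents are long, and the add-one smoothing keeps a single unseen word from zeroing out a class.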

16 Journalism Application
IBM's ManyBills project identifies the topic of each section in a Congressional bill in order to find outliers. For example, if a Congressman proposes a bill about the environment, but it has a section deep down about banking regulation, ManyBills would identify that section as an outlier and highlight it.

17 Document Similarity/Clustering
TF-IDF (term frequency × inverse document frequency)
Overview project (AP)
Tools: GATE; Rapid Miner
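The TF-IDF weighting named on this slide can be sketched in a few lines of plain Python, followed by cosine similarity to compare documents. This is one common variant (relative term frequency times the natural log of N over document frequency). Libraries differ in smoothing and normalization, so treat the exact formula as an assumption. The three sample "bills" are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict each."""
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented documents for illustration only.
docs = [
    "the bill regulates bank lending".split(),
    "the bill regulates bank fees".split(),
    "the bill protects river wetlands".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Words that occur in every document ("the", "bill") get a weight of zero, so the first and third documents share no weighted terms and score 0.0, while the first two score higher because they share the rarer words "regulates" and "bank".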

18 Journalism Application
Identifying copycat legislation from year to year; clustering documents to find trends

19 Topic Modeling
Exploratory approach that finds patterns by finding words that frequently occur together. A document can have multiple topics. Words can exist in multiple topics.
Tools: Mallet uses LDA (latent Dirichlet allocation). Other implementations as well…
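For readers curious what an LDA implementation does internally, the following is a toy collapsed Gibbs sampler in plain Python. It is a sketch under assumptions, not Mallet's implementation: the hyperparameters `alpha` and `beta`, the iteration count, and the sample documents are all illustrative choices, and real tools add optimizations and convergence checks that are omitted here.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much the document and the word favor it.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Invented documents for illustration only.
docs = [
    "bank loan interest bank rate".split(),
    "loan rate bank mortgage".split(),
    "river water pollution river".split(),
    "water pollution wetlands river".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

Each row of `doc_topic` shows how a document's tokens are spread across topics, which reflects the point that a document can have multiple topics. The same word can also hold counts in more than one entry of `topic_word`. Results depend on the random seed, so no particular output is claimed here.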

20 Topic Exploration
Topical Guide
Tmve (Topic Model Visualization Engine)

21 Topical Guide

22 Tmve

23 Journalism Application
Reporting tool for making sense of a corpus. Isolating topics lets the user focus only on the documents in a corpus that are relevant. There is clear potential for more data visualization.

24 Automatic Summarization
Identifies sentences from among the documents. Identifies common information conveyed across all the documents and then reformulates new sentences expressing that information. Aims to combine the main themes with completeness, readability, and conciseness. There are lots of algorithms, but not really software tools that you can download and run on your own collection. Meandre implements a HITS algorithm that identifies sentences but does not reformulate them.
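One way a HITS-style approach to picking sentences might work is sketched below: sentences act as hubs, words act as authorities, and scores are passed back and forth until the most central sentences stand out. This is an assumed formulation for illustration and not Meandre's actual component. The sentence splitter is naive, and the helper names are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def rank_sentences(sentences, iterations=20):
    """Score sentences (hubs) and words (authorities) on a bipartite graph."""
    bags = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    words = set().union(*bags)
    sent_score = [1.0] * len(sentences)
    for _ in range(iterations):
        # A word is important if important sentences contain it.
        word_score = {w: 0.0 for w in words}
        for score, bag in zip(sent_score, bags):
            for w in bag:
                word_score[w] += score
        # A sentence is important if it contains important words.
        sent_score = [sum(word_score[w] for w in bag) for bag in bags]
        # Normalize so the scores do not grow without bound.
        norm = math.sqrt(sum(s * s for s in sent_score)) or 1.0
        sent_score = [s / norm for s in sent_score]
    return sent_score


def summarize(text, k=1):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


text = (
    "The council approved the budget. "
    "The budget funds road repairs and the council praised it. "
    "Rain is expected tomorrow."
)
print(summarize(text, k=1))
# → ['The budget funds road repairs and the council praised it.']
```

Like the approach described on the slide, this only selects existing sentences and does not rewrite them. It also favors longer sentences and counts stop words, so a practical version would likely filter stop words first.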

25 Journalism Application
Newsblaster: summarizing all the news on the web. Every night, the system crawls a series of Web sites, downloads articles, groups them together into "clusters" about the same topic, and summarizes each cluster.
Ultimate Research Assistant

26 Sentiment Analysis
NLTK
APIs: AlchemyAPI; Open Dover; Lexalytics; Saplo
Meandre (concept tracking)
Sentiment Analysis Symposium, May 8, 2012, in New York
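As a rough picture of the simplest kind of sentiment analysis, the sketch below scores text against a word list and flips polarity after a negation. It does not reproduce NLTK or any of the APIs listed on this slide, which typically rely on trained classifiers or much larger lexicons. The mini-lexicon and negation list are invented for illustration.

```python
import re

# Hypothetical mini-lexicon; real lexicons contain thousands of scored words.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum word polarities, flipping the sign of a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


for sample in ["I love this great plan", "This is not good", "A terrible, bad idea"]:
    print(sample, "->", sentiment_score(sample))
# → I love this great plan -> 4
# → This is not good -> -1
# → A terrible, bad idea -> -3
```

A positive total suggests positive tone and a negative total suggests negative tone. Sarcasm, context, and domain-specific vocabulary are beyond a word list like this one.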

27 Journalism Application
Tracking Twitter sentiment about political candidates; comparing the tone of political statements over time or between candidates

28 Analysis Frameworks
Meandre; DocumentCloud; Rapid Miner (http://rapid-i.com/); Weka

29 Meandre Workbench
Web-based UI. Components and flows are retrieved from the server. Additional locations of components and flows can be added to the server. Create a flow using a graphical drag-and-drop interface. Change property values. Execute the flow.
Components; Flows; Locations

30 Meandre Services from Firefox Plugin
Readability Analysis; Date Entity to Simile Timeline; Network Analysis; Tag Cloud Analysis; Location Entity to Google Map; Automatic Summarization
Example: Zotero, SEASR, Protovis, Google Maps, Simile

31 Topic Modeling
Uses Mallet topic modeling to cluster nouns from almost 4000 documents from the 19th century, with 10 segments per document. The example clusters the Bible and shows 8 topics with at most 200 keywords per topic.

32 Concept Mapping Sentiment Analysis
Six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)

33 Correlation Analysis
Corrected OCR errors with spellchecking

