Using text mining on State of the Union addresses to gain political insights. Presentation by: ABHISHEK KAMAT, ABHISHEK MADHUSUDHAN, SUYAMEENDRA WADKI
Introduction: Data mining means mining data to find interesting patterns, useful insights, and relationships, for example in customer data. Text mining aims at finding useful insights in datasets made up of text. Examples: sentiment analysis; Google (search engine); Facebook and Instagram (hashtags). This project applies text mining to State of the Union addresses to gain political insights and presents the findings (trends and issues) on interactive dashboards.
Big Problem: Text mining involves writing programs that analyze text data to retrieve something useful from it. Approaches: (1) Bag of Words, which uses the entire collection of words that constitute the text (their counts, ignoring order), for example to determine sentiment; (2) TF-IDF, which takes each word's frequency relative to the total word count of the document and weights it by how rare the word is across all documents, with stop words excluded.
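The two approaches on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: the stop-word list, the function names, and the example sentences are assumptions made for the sketch, and the TF-IDF formula used (term frequency times the log of inverse document frequency) is one common formulation among several.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "our", "we"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def bag_of_words(text):
    """Bag of words: map each word to its count, ignoring word order."""
    return Counter(tokenize(text))

def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = count of the word / total words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    # Iterating a Counter yields each distinct word once per document.
    doc_freq = Counter(word for bag in bags for word in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores

# Made-up example sentences, not real speech text.
docs = [
    "The economy is strong and the economy is growing",
    "Security of the border and security of our nation",
]
scores = tf_idf(docs)
```

A word that appears in every document gets an idf of zero, so it drops out of the ranking even if it is frequent; this is the property that distinguishes TF-IDF from a raw word count.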
Small Problem: Use text mining algorithms to extract political insights from the State of the Union addresses of every president since 1790, and present these insights and trends on interactive dashboards. Find a correlation between the most frequent words in the State of the Union addresses and the trends in the issues facing our country. Emphasis on a particular word in a speech implies some important trend or issue in that year.
Proposed Solution: There is no ready-made dataset that we can use, so we write a Python scraper with the Beautiful Soup library to scrape the State of the Union website, and then clean the data. Hadoop's MapReduce platform divides the entire data into key-value pairs and determines the frequency of each word per year. We use this information to deduce the trend of topics in that year's State of the Union address.
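The per-year word count can be illustrated with an in-process imitation of the map, shuffle, and reduce phases. This is only a sketch under stated assumptions: it does not run on Hadoop, the scraping step is left out, and the function names, the (year, word) key layout, and the sample text are all hypothetical.

```python
import re
from collections import defaultdict
from itertools import groupby

def clean(text):
    """Lowercase and keep only alphabetic tokens (a simple cleaning step)."""
    return re.findall(r"[a-z]+", text.lower())

def mapper(year, speech):
    """Map phase: emit ((year, word), 1) for every word in a speech."""
    for word in clean(speech):
        yield (year, word), 1

def reducer(key, counts):
    """Reduce phase: sum all counts emitted for one (year, word) key."""
    return key, sum(counts)

def word_frequency_per_year(speeches):
    """Run map, shuffle/sort, and reduce over (year, text) pairs."""
    pairs = [kv for year, text in speeches for kv in mapper(year, text)]
    pairs.sort(key=lambda kv: kv[0])  # stands in for Hadoop's shuffle and sort
    result = defaultdict(dict)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        (year, word), total = reducer(key, (count for _, count in group))
        result[year][word] = total
    return dict(result)

# Placeholder text, not actual speech content.
speeches = [
    (2001, "Security and freedom. Security first."),
    (2002, "Freedom and economy."),
]
freq = word_frequency_per_year(speeches)
```

Keying on the (year, word) pair means the reducer needs no knowledge of years at all; it only sums the values that arrive for one key, which is the shape a MapReduce job expects.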
Proposed Solution - Dashboards: Store the mined data in a database and then present it on various dashboards. We plan to use D3.js or Chart.js. Dashboards we plan to implement: changes in trends between two presidents who served consecutively; changes in trends over a single president's entire term; major trends over a period of time.
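As an illustration of the first dashboard idea, the sketch below compares word shares for two consecutive presidents and serializes the largest shifts as JSON, a format that charting libraries such as D3.js or Chart.js can load. The function names, the output field names, and the counts are assumptions made for this example, not data or design decisions from the project.

```python
import json

def relative_freq(counts):
    """Convert raw word counts to shares of the total word count."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def trend_shift(counts_a, counts_b, top_n=3):
    """Words whose share of the text changed most from president A to B."""
    fa, fb = relative_freq(counts_a), relative_freq(counts_b)
    words = set(fa) | set(fb)
    diffs = {w: fb.get(w, 0.0) - fa.get(w, 0.0) for w in words}
    # Largest absolute change first; ties broken alphabetically.
    ranked = sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return [{"word": w, "change": round(d, 4)} for w, d in ranked[:top_n]]

# Made-up counts for two hypothetical consecutive presidents.
president_a = {"economy": 6, "peace": 3, "trade": 1}
president_b = {"economy": 2, "security": 6, "trade": 2}

shifts = trend_shift(president_a, president_b)
payload = json.dumps(shifts)  # string a dashboard front end could fetch
```

Comparing shares rather than raw counts keeps the result independent of speech length, which matters when one president's addresses are much longer than another's.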
Data analysis and experimental work plan to evaluate the proposed solution: There is no dedicated training and test set. We measure effectiveness by comparing the results of our model with major events in history. Examples: the 9/11 attack of 2001; Mr. Donald J. Trump's speeches, where we expect to see trends related to borders, security, wall, Mexicans, Muslims, etc. These comparisons show how well the dashboards reflect such events.
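One way to make this comparison concrete is to check whether a term's share of the text peaks close to the year of a known event. The sketch below assumes per-year word counts are already available; the helper names, the one-year tolerance window, and the counts are invented for illustration and are not results from the project.

```python
def peak_year(term, freq_by_year):
    """Year in which `term` has its highest share of that year's words."""
    def share(year):
        counts = freq_by_year[year]
        total = sum(counts.values())
        return counts.get(term, 0) / total if total else 0.0
    # Sorting first makes ties resolve to the earliest year.
    return max(sorted(freq_by_year), key=share)

def matches_event(term, event_year, freq_by_year, window=1):
    """True if the term peaks within `window` years of the known event."""
    return abs(peak_year(term, freq_by_year) - event_year) <= window

# Made-up counts for illustration only; not results from the project.
freq_by_year = {
    2000: {"economy": 8, "terror": 1, "schools": 6},
    2002: {"economy": 4, "terror": 9, "schools": 2},
    2004: {"economy": 5, "terror": 5, "schools": 5},
}

terror_ok = matches_event("terror", 2001, freq_by_year)
```

A check like this only says that a peak and an event coincide in time; it does not show that the event caused the change in wording.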
Related Work: Dimensions and features. "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization". Dimensions and features. "Beyond TFIDF Weighting for Text Categorization in the Vector Space Model". How to weigh the word? "An Improved Feature Space for Sentiment Analysis". Congressional bill approvals.
Related Work (Contd…): "Stemming and its effects on TFIDF Ranking". Why stemming? Word isolation. "Refinement of TF-IDF Schemes for Web Pages using their Hyperlinked Neighboring Pages". Better classification. "An improved TF-IDF approach for text classification". Confidence, language independent.
Conclusion: How would text mining algorithms help in extracting political insights? Who would use them? Journalists, politicians.