Making Sense of Large Volumes of Unstructured Responses K. M. P. N. Jayathilaka Department of Statistics University of Colombo.

1 Making Sense of Large Volumes of Unstructured Email Responses K. M. P. N. Jayathilaka Department of Statistics University of Colombo

2 Outline
• Introduction
• Objectives
• Implementation
• Results
• Conclusions

3 Introduction
• Big Data Analytics
• Topic Modeling
• Sentiment Analysis

4 Big Data Analytics
• In the modern world, big data analytics has become a key component of decision making and of identifying hidden patterns in data
• Big data analytics concepts provide approaches for analyzing large data sets

5 Topic Modeling
• Topic modeling is a statistical technique used to discover "topics" in a collection of unstructured text documents
• A topic is a key idea used to represent the content of documents
• The Latent Dirichlet Allocation (LDA) topic model is frequently used for text analytics on a set of documents
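For reference, the generative process usually associated with (smoothed) LDA can be written as below. The symbols are notation assumed here, not taken from the slides: K topics, D documents, Dirichlet hyperparameters \alpha and \beta, per-document topic proportions \theta_d, per-topic word distributions \phi_k, and topic assignment z_{d,n} for the n-th word w_{d,n} of document d.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Fitting the model means inferring \theta and \phi from the observed words; the per-topic word weights reported in a topic table are entries of \phi_k.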

6 Sentiment Analysis
• Sentiment analysis is a technique used to classify the polarity of text data
• The polarity of text can be classified as positive, negative, or neutral

7 Sentiment Analysis
• The lexicon-based approach is an unsupervised technique for sentiment analysis
• The lexicon-based technique takes less time to classify text and does not require a training set
Diagram (approaches to sentiment analysis):
Sentiment Analysis
  Machine Learning Approach
    Supervised Learning
    Unsupervised Learning
  Lexicon-Based Approach
    Dictionary-Based Approach
    Corpus-Based Approach
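A minimal sketch of why a lexicon-based classifier needs no training set: polarity comes from looking words up in a fixed word list. The tiny `POSITIVE` and `NEGATIVE` sets and the function name `classify_polarity` are hypothetical illustrations, not the lexicons or code used in the presentation.

```python
# Toy lexicon (assumption): a real lexicon would contain thousands of entries.
POSITIVE = {"support", "good", "free", "fair", "open"}
NEGATIVE = {"block", "bad", "unfair", "slow", "throttle"}


def classify_polarity(tokens):
    """Classify a tokenized response by counting lexicon hits.

    Each positive word adds 1 and each negative word subtracts 1;
    the sign of the total decides the label.
    """
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_polarity("i support a free and open internet".split()))
print(classify_polarity("providers should not throttle or block sites".split()))
```

This simple count ignores negation ("not block") and word intensity, which fuller lexicon-based methods typically handle.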

8 Objectives
• Making sense of large collections of unstructured email responses
  o To understand the nature of the data in order to decide how to analyze it
  o To handle the large volume of data
  o To identify the key ideas of Indian Internet users about Net Neutrality using a topic model
  o To classify the polarity of responses using sentiment analysis

9 Implementation
• Data
• Data Preprocessing
• Topic Modeling
• Sentiment Analysis

10 Data
• The data set included about 500,000 email submissions received in response to a debate on Net Neutrality in India
• There were two types of responses
  o Separate answers to each of the twenty questions asked by the Telecom Regulatory Authority of India
  o General comments from Internet users in India

11 Examples of Data
[Figure: fragment of a question-based response]

12 Examples of Data
[Figure: example of a comment]

13 Data Preprocessing
• Extracting plain text from HTML files and cleaning the data
• Extracting the answers to the questions
• Data formatting
  o Stop word removal
  o Lemmatization
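The preprocessing steps above can be sketched with the Python standard library alone. This is an illustration under assumptions, not the presenter's code: `STOP_WORDS` and `LEMMAS` are tiny hypothetical stand-ins for a full stop word list and a real lemmatizer (such as those shipped with NLTK), and `html_to_text` and `preprocess` are names invented here.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical toy resources; real runs would use full-size equivalents.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "be", "should"}
LEMMAS = {"services": "service", "providers": "provider", "users": "user"}


def preprocess(html):
    """Extract text, lowercase, tokenize, drop stop words, map words to lemmas."""
    tokens = re.findall(r"[a-z]+", html_to_text(html).lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


sample = "<html><body><p>Internet services should be neutral for all users.</p></body></html>"
print(preprocess(sample))
```

The dictionary lookup only imitates lemmatization for the listed words; a dictionary-backed lemmatizer would generalize to unseen word forms.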

14 Technology Used for Data Preprocessing and Analysis
• Python
• R
• Natural Language Toolkit (NLTK)

15 Topic Modeling
• Determining the number of topics
  o The number of topics was determined by examining the topic models fitted for each question
• Evaluating the generated topics
• Visualizing the topics
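As an illustration of fitting topic models with different numbers of topics and inspecting the result, here is a small pure-Python LDA fitted by collapsed Gibbs sampling. The slides do not say which inference method or library was used, so the sampler, the function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for this sketch.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns one list per topic of (word, weight) pairs sorted by
    descending weight; the weights of each topic sum to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]      # topic counts per document
    topic_word = [[0] * V for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                     # total words per topic
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    topics = []
    for k in range(num_topics):
        denom = topic_total[k] + V * beta
        dist = [(vocab[v], (topic_word[k][v] + beta) / denom) for v in range(V)]
        topics.append(sorted(dist, key=lambda pair: -pair[1]))
    return topics


# Toy corpus (assumption); the real input would be the preprocessed answers.
docs = [
    "law act telecom regulation law".split(),
    "telecom law consultation act".split(),
    "speed net neutrality speed internet".split(),
    "net neutrality internet speed".split(),
]
for k in (2, 3):
    print(f"--- {k} topics ---")
    for topic in fit_lda(docs, num_topics=k):
        print([(w, round(p, 3)) for w, p in topic[:4]])
```

Comparing the top words for each candidate number of topics is a manual check on interpretability; it does not replace quantitative measures such as held-out likelihood or topic coherence.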

16 Sentiment Analysis
• Sentiment analysis using the Multi-Perspective Question Answering (MPQA) lexicon
• Sentiment analysis using the SentiWordNet lexicon

17 Results Topic Modeling
LDA Topic Model
Topic 1                 Topic 2                 Topic 3
Weight  Word            Weight  Word            Weight  Word
0.024   law             0.036   increase        0.009   speed
0.019   time            0.024   account         0.008   neutrality
0.016   act             0.024   lead            0.008   net
0.015   telecom         0.024   cost            0.007   ott
0.012   digital         0.013   complexity      0.006   penetration
0.012   consultation    0.013   financial       0.006   establish
0.012   application     0.013   accessible      0.006   evolving
0.011   information     0.012   degradation     0.005   high
0.010   subject         0.011   involve         0.004   made
0.010   indian          0.010   essential       0.004   early

18 Results Topic Modeling
[Figure: word cloud]

19 Results Sentiment Analysis
• The lexicon-based approach to sentiment detection reveals that most of the responses are positive

20 Conclusions
• Analytics on this large data set identified the general ideas of Indian Internet users about Net Neutrality
• The topics generated by the LDA model focus mainly on Internet problems and Net Neutrality
• Key issues included regulation of OTT players, Internet security and privacy, and the speed of Internet services
• The results of the lexicon-based approach to sentiment detection reveal that most of the responses are positive
