Making Sense of Large Volumes of Unstructured Responses K. M. P. N. Jayathilaka Department of Statistics University of Colombo.

Making Sense of Large Volumes of Unstructured Email Responses
K. M. P. N. Jayathilaka
Department of Statistics, University of Colombo

Outline
• Introduction
• Objectives
• Implementation
• Results
• Conclusions

Introduction
• Big Data Analytics
• Topic Modeling
• Sentiment Analysis

Big Data Analytics
• Big data analytics has become a key component of decision making and of identifying hidden patterns in data
• The concepts of big data analytics suggest ways to analyze large data sets

Topic Modeling
• Topic modeling is a statistical technique used to discover "topics" in a collection of unstructured text documents
• A topic is a key idea that represents the content of documents
• The Latent Dirichlet Allocation (LDA) topic model is frequently used for text analytics on a set of documents
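To make the idea of an LDA topic concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the implementation used in this study; the function names `fit_lda` and `top_words`, the hyperparameter values, and the toy documents in the usage note are all assumptions made for illustration. Each fitted topic is a probability distribution over the vocabulary, so its highest-weighted words can be listed as (word, weight) pairs.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative sketch).

    docs is a list of token lists. Returns one dict per topic that maps
    each vocabulary word to its probability under that topic.
    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}

    # Count tables updated during sampling.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to the conditional weight.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Convert the final counts into smoothed topic-word probabilities.
    topics = []
    for k in range(num_topics):
        denom = topic_total[k] + vocab_size * beta
        topics.append(
            {vocab[v]: (topic_word[k][v] + beta) / denom for v in range(vocab_size)}
        )
    return topics


def top_words(topic, n=3):
    """Return the n highest-weighted (word, weight) pairs of one topic."""
    return sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

As a usage example, calling `fit_lda` on a few tokenized toy documents, such as `[["net", "neutrality", "internet"], ["law", "act", "telecom"]]`, with `num_topics=2` returns two word distributions, and `top_words(topics[0])` lists the leading words of the first one. For a corpus of realistic size, an optimized library would be the more practical choice than this pure-Python sampler.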

Sentiment Analysis
• Sentiment analysis is a technique used to classify the polarity of text data
• The polarity of text data can be positive, negative, or neutral

Sentiment Analysis
• The lexicon-based approach is an unsupervised technique for sentiment analysis
• The lexicon-based technique takes less time for classification and does not require a training set
[Diagram: Sentiment Analysis divides into the Machine Learning Approach (Supervised Learning, Unsupervised Learning) and the Lexicon-Based Approach (Dictionary-Based Approach, Corpus-Based Approach)]
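The lexicon-based idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `LEXICON` below is hypothetical and stands in for a real lexicon resource, and the function name `classify_polarity` is not taken from this study. The sketch counts positive and negative lexicon words in a response and labels it by the sign of the difference, so no training set is needed.

```python
import re

# Hypothetical miniature lexicon; a real resource would have thousands of entries.
LEXICON = {
    "good": "positive",
    "support": "positive",
    "free": "positive",
    "fair": "positive",
    "bad": "negative",
    "unfair": "negative",
    "block": "negative",
    "slow": "negative",
}


def classify_polarity(text, lexicon=LEXICON):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    labels = [lexicon[t] for t in tokens if t in lexicon]
    score = labels.count("positive") - labels.count("negative")
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `classify_polarity("I support a free and fair Internet")` finds three positive words and no negative ones, so it returns `"positive"`. The sketch ignores negation and word strength, which a fuller lexicon-based method would need to handle.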

Objectives
• Making sense of large collections of unstructured email responses
  o To identify the nature of the data in order to decide how to analyze it
  o To handle the large volume of the data set
  o To identify key ideas of Indian Internet users about Net Neutrality using a topic model
  o To classify the polarity of responses using sentiment analysis

Implementation
• Data
• Data Preprocessing
• Topic Modeling
• Sentiment Analysis

Data
• The data set included about 500,000 email submissions received in response to a debate on Net Neutrality in India
• There were two types of responses
  o Separate answers to each of the twenty questions asked by the Telecom Regulatory Authority of India
  o General comments from Internet users in India

Examples of Data
[Figure: fragment of a question-based response]

Examples of Data
[Figure: example of a comment]

Data Preprocessing
• Extracting plain text from HTML files and cleaning the data
• Extracting answers to the questions
• Data formatting
  o Stop word removal
  o Lemmatizing
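The preprocessing steps above can be sketched with the Python standard library alone. This is a simplified illustration, not the pipeline used in this study: `STOP_WORDS` is a hypothetical, deliberately tiny stop-word list, and the `LEMMAS` lookup table is a hypothetical stand-in for a real lemmatizer such as the one NLTK provides. The helper names `TextExtractor`, `html_to_text`, and `preprocess` are also assumptions.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "be", "for", "in", "is", "of", "the", "to"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"providers": "provider", "services": "service", "charging": "charge"}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def preprocess(html):
    """Extract text, lowercase and tokenize it, drop stop words, then lemmatize."""
    tokens = re.findall(r"[a-z]+", html_to_text(html).lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]
```

As a usage example, `preprocess("<p>The providers are charging for services.</p>")` returns `["provider", "charge", "service"]`. The resulting token lists are the kind of input that a topic model or a lexicon-based classifier would consume.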

Technology Used for Data Preprocessing and Analysis
• Python
• R
• Natural Language Toolkit (NLTK)

Topic Modeling
• Determining the number of topics
  o The number of topics was determined by examining the topic models fitted to each question
• Evaluating the generated topics
• Visualizing the topics

Sentiment Analysis
• Sentiment analysis using the Multi-Perspective Question Answering (MPQA) lexicon resource
• Sentiment analysis using the SentiWordNet lexicon resource

Results: Topic Modeling

Topic 1                 Topic 2                 Topic 3
Weight  Word            Weight  Word            Weight  Word
0.024   law             0.036   increase        0.009   speed
0.019   time            0.024   account         0.008   neutrality
0.016   act             0.024   lead            0.008   net
0.015   telecom         0.024   cost            0.007   ott
0.012   digital         0.013   complexity      0.006   penetration
0.012   consultation    0.013   financial       0.006   establish
0.012   application     0.013   accessible      0.006   evolving
0.011   information     0.012   degradation     0.005   high
0.010   subject         0.011   involve         0.004   made
0.010   indian          0.010   essential       0.004   early

LDA Topic Model

Results: Topic Modeling
[Figure: word cloud]

Results: Sentiment Analysis
• The lexicon-based approach to sentiment detection reveals that most of the responses are positive

Conclusions
• General ideas of Indian Internet users about Net Neutrality were identified by carrying out analytics on this large data set
• The topics generated by the LDA model focus mainly on Internet problems and Net Neutrality
• Key issues included regulation of OTT players, Internet security and privacy, and the speed of Internet services
• The results of the lexicon-based approach to sentiment detection reveal that most of the responses are positive

