Published by Elvin Price. Modified over 8 years ago.
1
Making Sense of Large Volumes of Unstructured Email Responses
K. M. P. N. Jayathilaka
Department of Statistics
University of Colombo
2
Outline
Introduction
Objectives
Implementation
Results
Conclusions
3
Introduction
Big Data Analytics
Topic Modeling
Sentiment Analysis
4
Big Data Analytics
Big data analytics has become a key component of decision making and of identifying hidden patterns in data
Big data analytics concepts suggest ways to approach the analysis of large data sets
5
Topic Modeling
Topic modeling is a statistical technique used to discover "topics" in a collection of unstructured text documents
A topic is a key idea that represents the content of the documents
The Latent Dirichlet Allocation (LDA) topic model is frequently used for text analytics on a set of documents
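As a rough illustration of how LDA assigns words to topics, the sketch below fits a small model with collapsed Gibbs sampling using only the Python standard library. It is not the implementation used in this work: the function names `fit_lda` and `top_words`, the hyperparameters `alpha` and `beta`, and the four sample documents are all assumptions made for the example.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # tokens per topic in each document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    # Resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, topic_total, doc_topic, vocab_size

def top_words(topic_word, topic_total, vocab_size, k, n=3, beta=0.01):
    """Return (weight, word) pairs for the n most probable words of topic k."""
    denom = topic_total[k] + vocab_size * beta
    return [(round((c + beta) / denom, 3), w) for w, c in topic_word[k].most_common(n)]

# Hypothetical, already-preprocessed responses (not taken from the data set).
docs = [
    ["net", "neutrality", "speed", "internet"],
    ["law", "telecom", "act", "regulation"],
    ["speed", "internet", "net", "access"],
    ["telecom", "law", "regulation", "act"],
]
topic_word, topic_total, doc_topic, vocab_size = fit_lda(docs, num_topics=2)
for k in range(2):
    print("Topic", k + 1, top_words(topic_word, topic_total, vocab_size, k))
```

Each topic is reported as a list of weighted words, which is the same form as a topic table of weights and words. In practice a library implementation would replace this sampler.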
6
Sentiment Analysis
Sentiment analysis is a technique used to classify the polarity of text data
The polarity of a text can be positive, negative, or neutral
7
Sentiment Analysis
The lexicon-based approach is an unsupervised technique for sentiment analysis
The lexicon-based technique takes less time for classification and does not require a training set
[Diagram: Sentiment Analysis divides into the Machine Learning Approach (Supervised Learning, Unsupervised Learning) and the Lexicon-Based Approach (Dictionary-Based Approach, Corpus-Based Approach)]
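A minimal sketch of the lexicon-based idea follows: count the words of a response that appear in a positive list and in a negative list, then compare the counts. No training set is involved. The two word lists and the function name `classify_polarity` are hypothetical and stand in for a published lexicon; the sketch ignores negation and word sense.

```python
import re

# Hypothetical toy lexicon; a real analysis would load a published resource.
POSITIVE = {"good", "support", "free", "fair", "open"}
NEGATIVE = {"bad", "block", "slow", "unfair", "expensive"}

def classify_polarity(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_polarity("I support a free and open internet"))
print(classify_polarity("Slow and expensive plans are unfair"))
```

Because classification is a dictionary lookup per word, the cost grows linearly with the number of tokens, which suits a large collection of responses.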
8
Objectives
Making sense of large collections of unstructured email responses
  o To understand the nature of the data in order to decide how to analyze it
  o To handle the large volume of data
  o To identify the key ideas of Indian Internet users about Net Neutrality using a topic model
  o To classify the polarity of the responses using sentiment analysis
9
Implementation
Data
Data Preprocessing
Topic Modeling
Sentiment Analysis
10
Data
The data set included about 500,000 email submissions received in response to a debate on Net Neutrality in India
There were two types of responses
  o Separate answers to each of the twenty questions asked by the Telecom Regulatory Authority of India
  o General comments from Internet users in India
11
Examples of Data
Fragment of a Question-Based Response
12
Examples of Data
Example of a Comment
13
Data Preprocessing
Extracting plain text from HTML files and cleaning the data
Extracting the answers to the questions
Data formatting
  o Stop word removal
  o Lemmatizing
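The preprocessing steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the pipeline used in this work: the stop-word set and the lemma table are tiny hypothetical stand-ins for the resources a toolkit such as NLTK provides, and the names `TextExtractor`, `html_to_text`, and `preprocess` are invented for the example.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

# Hypothetical stand-ins for a full stop-word list and a real lemmatizer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "for", "all", "be", "should"}
LEMMAS = {"users": "user", "services": "service", "operators": "operator", "treated": "treat"}

def preprocess(html):
    """Extract text, lowercase, tokenize, drop stop words, and lemmatize."""
    tokens = re.findall(r"[a-z]+", html_to_text(html).lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

sample = "<html><body><p>All users should be treated <b>equally</b>.</p><script>var x=1;</script></body></html>"
print(preprocess(sample))  # → ['user', 'treat', 'equally']
```

The resulting token lists are the input that a topic model or a lexicon-based classifier would consume.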
14
Technology Used for Data Preprocessing and Analysis
Python
R
Natural Language Toolkit (NLTK)
15
Topic Modeling
Determining the number of topics
  o The number of topics was determined by examining the topic models fitted for each question
Evaluating the generated topics
Visualizing the topics
16
Sentiment Analysis
Sentiment analysis using the Multi-Perspective Question Answering (MPQA) lexicon resource
Sentiment analysis using the SentiWordNet lexicon resource
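SentiWordNet attaches numeric positive and negative scores to word senses, so a response can be scored by combining those numbers rather than by simple counting. The sketch below shows one plausible way to do that, averaging the positive-minus-negative difference over the matched words. The score table, the threshold, and the names `polarity_score` and `label` are hypothetical; the entries are not copied from SentiWordNet, and the aggregation rule is an assumption rather than the method used in this work.

```python
import re

# Hypothetical (positive, negative) scores in the style of SentiWordNet entries.
TOY_SCORES = {
    "good": (0.75, 0.0),
    "fair": (0.5, 0.0),
    "equal": (0.375, 0.0),
    "slow": (0.0, 0.5),
    "unfair": (0.0, 0.75),
    "harm": (0.0, 0.625),
}

def polarity_score(text):
    """Average (positive - negative) over the words found in the score table."""
    tokens = re.findall(r"[a-z]+", text.lower())
    diffs = [TOY_SCORES[t][0] - TOY_SCORES[t][1] for t in tokens if t in TOY_SCORES]
    return sum(diffs) / len(diffs) if diffs else 0.0

def label(text, threshold=0.1):
    """Map the averaged score to a polarity label; threshold is an assumed cutoff."""
    score = polarity_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

print(label("Fair and equal access"), polarity_score("Fair and equal access"))
```

A lexicon that only records a prior polarity per word, as the MPQA subjectivity lexicon does, could be plugged into the same functions by mapping each polarity to a fixed pair of scores.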
17
Results
Topic Modeling
LDA Topic Model

Topic 1                 Topic 2                 Topic 3
Weight  Word            Weight  Word            Weight  Word
0.024   law             0.036   increase        0.009   speed
0.019   time            0.024   account         0.008   neutrality
0.016   act             0.024   lead            0.008   net
0.015   telecom         0.024   cost            0.007   ott
0.012   digital         0.013   complexity      0.006   penetration
0.012   consultation    0.013   financial       0.006   establish
0.012   application     0.013   accessible      0.006   evolving
0.011   information     0.012   degradation     0.005   high
0.010   subject         0.011   involve         0.004   made
0.010   indian          0.010   essential       0.004   early
18
Results
Topic Modeling
Word Cloud
19
Results
Sentiment Analysis
The lexicon-based approach to sentiment detection reveals that most of the responses are positive
20
Conclusions
The general ideas of Indian Internet users about Net Neutrality were identified by carrying out analytics on this large data set
The topics generated by the LDA model focus mainly on Internet problems and Net Neutrality
Key issues included the regulation of OTT players, Internet security and privacy, and the speed of Internet services
The results of the lexicon-based approach to sentiment detection reveal that most of the responses are positive