1
Enhancement of IITBombayX-Open edX
Mentor: Sukla Nag
Principal Investigator: Dr. D. B. Phatak
2
Big Data! The term is often used when speaking about petabytes and exabytes of data, much of which cannot be integrated easily. The three V's: extreme Volume, wide Variety, and the Velocity at which it must be processed.
3
Big data analytics is often associated with cloud computing because the analysis of large data sets in real-time requires a platform like Hadoop to store large data sets across a distributed cluster and MapReduce to coordinate, combine and process data from multiple sources.
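To make the MapReduce part of this concrete, here is a rough, locally-run Python sketch of the programming model; the sample lines and function names are purely illustrative and only mimic what Hadoop distributes across a cluster.

# Conceptual sketch of the MapReduce programming model in plain Python.
# In Hadoop the map and reduce functions run in parallel across the cluster;
# here they are simulated locally for illustration only.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Combine all values emitted for the same key.
    return (word, sum(counts))

lines = ["big data needs big clusters", "spark and hadoop process big data"]
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)      # "shuffle": group values by key
result = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(result)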
4
WHAT FALLS UNDER BIG DATA?
Big Data is broadly classified into three types:
Structured data: relational data.
Semi-structured data: XML data.
Unstructured data: Word, PDF, text, media logs.
Fields that come under the umbrella of big data include: black box data, social media data, power grid data, stock exchange data, transport data, and search engine data.
5
HOW IS BIG DATA ANALYSED?
Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics, data mining, text analytics and statistical analysis. Big Data is analyzed using tools such as Hadoop, YARN, MapReduce, Spark, Hive and Pig as well as NoSQL databases.
6
Technologies Used
Apache Hadoop: Hadoop is an open-source framework written in Java that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models.
Apache Hive: The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage.
Sqoop: Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
Apache Spark: Spark is a fast and general cluster computing system for big data. It provides high-level APIs in Scala, Java, and Python, and an optimized engine that supports general computation graphs for data analysis.
7
Technologies Used
Luigi: Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration and much more.
Django: Django is an advanced web framework written in Python.
Python: Python is a multiparadigm, general-purpose, interpreted, high-level programming language.
MySQL: MySQL is the most popular open-source relational SQL database management system.
8
Technologies Used
JavaScript: JavaScript is a powerful and popular language for programming on the web.
Apache Tomcat Server: Apache Tomcat, often referred to as Tomcat, is an open-source web server and servlet container developed by the Apache Software Foundation (ASF).
MongoDB: MongoDB is an open-source, document-oriented database designed with both scalability and developer agility in mind.
HTML: HyperText Markup Language, commonly referred to as HTML, is the language that describes the structure and the semantic content of a web document.
9
Event Log Parsing
10
What does Open edX do? Parse - Process - Repeat
It parses the same log files again and again for each task. It has different sub-parsers for different analysis modules.
11
What did we already have? Parse once, use many times.
It is completely hard-coded for each event. It filters out the event logs with undefined formats.
12
Limitations: A list of all event types is required.
It is less flexible for new event types. It doesn’t extract all the available information from the logs.
13
What have we done? We generalised the parser.
All the identified events, event types and their attributes are stored in a database. The event type of each log entry is searched for a match.
14
Format of EventName Table
15
Format of Event Type and Event Attributes Tables
Event Type Table Event Attributes Table
16
If there is a match, the log entry is processed further.
Otherwise, the new event type is detected and stored in a file. All the failed logs are stored in another file. New events are updated into the database, and all the failed logs are parsed again. The extracted information is used for analysis in Open edX Insights.
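A minimal Python sketch of this match-and-dispatch flow, assuming a hypothetical set of known event types and illustrative file names (the real parser reads the known event types from the database tables described earlier):

import json

# Illustrative assumptions: in the actual parser the known event types come
# from the Event Type table in the database, not a hard-coded set.
KNOWN_EVENT_TYPES = {"problem_check", "play_video", "forum_searched"}

def handle_log_line(line):
    try:
        entry = json.loads(line)                 # each tracking-log line is a JSON record
    except ValueError:
        with open("failed_logs.txt", "a") as f:  # unparseable: keep for a later re-parse
            f.write(line + "\n")
        return None
    event_type = entry.get("event_type", "")
    if event_type in KNOWN_EVENT_TYPES:
        return entry                             # match: processed further downstream
    with open("new_event_types.txt", "a") as f:  # no match: record the new event type
        f.write(event_type + "\n")               # so the database can be updated
    return None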
17
New Events detected
18
Discussion Forum related Events
The Open edX parser does not have any parsing mechanism for discussion forum related events. The modified parser is able to catch all the essential discussion forum events.
19
Some Discussion Forum events
forum_open, view_thread, upvote, unvote, forum_searched, reply, delete, update, user_follow, user_visited
20
Existing Problems: Some log entries have unexpected formats.
For example, some event names are as follows: /apple-touch-icon-precomposed.png, /robots.txt. Some of the event logs do not contain all the necessary information.
21
How the Analytics Pipeline Works
22
EdX Insights
EdX Insights makes information about courses available to course team members. EdX Insights provides data about enrollment, engagement, and performance to help you monitor how students are doing. edx-analytics-pipeline has to be run periodically to update the data available to Insights.
23
Tasks to update Insights
A few tasks are defined in the edx-analytics-pipeline to update Insights. Various tasks: AnswerDistributionWorkflow, ImportEnrollmentsIntoMysql, InsertToMysqlCourseEnrollByCountryWorkflow, CourseActivityWeeklyTask.
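For context on how such tasks are written, below is a minimal Luigi task sketch; the class name, date parameter, and output path are hypothetical and only illustrate the output/run structure that the real pipeline tasks follow.

import luigi

class CountDailyEnrollments(luigi.Task):
    # Hypothetical task: the real pipeline tasks (e.g. ImportEnrollmentsIntoMysql)
    # are more involved, but follow this same Task structure.
    date = luigi.DateParameter()

    def output(self):
        # Luigi skips the task if this target already exists.
        return luigi.LocalTarget("enrollments_%s.tsv" % self.date.isoformat())

    def run(self):
        with self.output().open("w") as out:
            out.write("course_id\tcount\n")   # placeholder result

if __name__ == "__main__":
    luigi.run()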
24
Running the pipeline
remote-task --host localhost --user ubuntu --remote-name analyticstack --skip-setup --wait ImportEnrollmentsIntoMysql --interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") --local-scheduler
sudo mysql
SELECT * FROM reports.course_enrollment_daily;
This gives the course enrollments over time by counting them in the event logs.
27
The current edx-analytics-pipeline uses Hadoop MapReduce for all its analytics.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. We implemented some of the tasks in Luigi-Spark. Spark uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk only when it's needed.
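As a rough illustration of that idea, the PySpark sketch below filters a log file into an RDD and persists it so a second action reuses the in-memory data; the input path and the filter condition are assumptions, not the pipeline's actual code.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-sketch")
# Load event logs into an RDD; nothing is computed until an action is called.
logs = sc.textFile("hdfs:///data/tracking.log")           # illustrative path
enrollment_events = logs.filter(lambda line: "enrollment" in line)
# Keep the filtered RDD in memory, spilling to disk only if it does not fit.
enrollment_events.persist(StorageLevel.MEMORY_AND_DISK)
print(enrollment_events.count())    # first action: triggers the computation
print(enrollment_events.count())    # second action: reuses the cached data
sc.stop()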
28
Spark over Hadoop MapReduce:
Advantages:
Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes this possible by reducing the number of reads/writes to disk; it stores intermediate processing data in memory.
Fault tolerance: Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), with a clever way of guaranteeing fault tolerance that minimizes network I/O. From the Spark academic paper: "RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition." This removes the need for replication to achieve fault tolerance.
Ease of use: Spark lets you quickly write applications in Java, Scala, or Python.
29
Disadvantages: Because Spark uses more RAM instead of network and disk I/O, it is relatively fast compared to Hadoop; but as it uses a large amount of RAM, it needs a dedicated high-end physical machine to produce effective results. Being in its early stages of development, it is still recovering from bugs.
30
Course Enrollment
This task calculates the change in the number of users who have enrolled in a particular course, by day.
Brief outline: The mapper first loads all the log files into an RDD, then extracts all the relevant information regarding course enrollment by parsing the logs. It returns the key (course_id, user_id) and the value (timestamp, action_value), where the timestamp is in ISO format and the action value is 1 or -1 depending on activation or deactivation.
31
Then we group these pairs by key and extract, for each user, the dates on which there was a change at the end of the day. Thus the key (course_id, date) and value (action_value) are returned. Now if we sum these values by key, we get the net change in the number of students enrolled in a particular course on a given date. Thus, we can analyse the change in enrollment activity of a course over a period of time.
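A condensed PySpark sketch of this flow is given below; the event-type names, JSON field layout, and input path are simplified assumptions about the tracking-log format, so this is an outline of the approach rather than the actual task.

import json
from pyspark import SparkContext

sc = SparkContext(appName="course-enrollment-sketch")

def to_enrollment_pair(line):
    # Map step: ((course_id, user_id), (timestamp, action_value)) or None.
    # The field names below are assumptions about the tracking-log format.
    entry = json.loads(line)
    if entry.get("event_type") not in ("edx.course.enrollment.activated",
                                       "edx.course.enrollment.deactivated"):
        return None
    action = 1 if entry["event_type"] == "edx.course.enrollment.activated" else -1
    event = entry["event"]
    return ((event["course_id"], event["user_id"]), (entry["time"], action))

pairs = (sc.textFile("hdfs:///data/tracking.log")        # illustrative path
           .map(to_enrollment_pair)
           .filter(lambda p: p is not None))

# For each (course, user, date), keep only the last action of that day, then sum
# the +1/-1 values per (course_id, date) to get the change in enrollment by day.
last_per_day = (pairs
    .map(lambda p: ((p[0][0], p[0][1], p[1][0][:10]), p[1]))   # add the date to the key
    .reduceByKey(lambda a, b: max(a, b)))                      # latest timestamp wins
daily_change = (last_per_day
    .map(lambda p: ((p[0][0], p[0][2]), p[1][1]))              # key: (course_id, date)
    .reduceByKey(lambda a, b: a + b))
print(daily_change.take(5))
sc.stop()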