LARGE DATA SETS FOR FREE Sharanya Thandra

TABLE OF CONTENTS
- Datasets: definition
- The Iris flower dataset
- Weka and the Weka tool
- Handling different datasets with Weka
- Techniques for managing large data sets: compression, indexing, summarization
- Large data sets with Python
- Results and conclusions (Python)
- References

What are datasets? A dataset (or data set) is a collection of data. It often corresponds to a single database table or a single statistical data matrix, where each column represents a variable and each row represents a given member of the dataset in question. (The singular form of data is datum.)

Dataset properties Characteristics that define a dataset's structure and properties include:
- the number and types of its attributes,
- statistical measures such as standard deviation and kurtosis, and
- the kinds of values: numbers (real or integer) or nominal data.
Datasets may also be generated by algorithms.

Classic datasets
- Iris flower data set
- Categorical data analysis
- Robust statistics
- Time series
- Extreme values
- Bayesian data analysis
- The BUPA liver data
- Anscombe's quartet

Iris flower dataset
- A multivariate dataset, introduced as an example of discriminant analysis.
- A typical test case for classification techniques in machine learning.
- A good example for explaining the difference between supervised and unsupervised techniques in data mining.
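
The slides below explore this with Weka; as a cross-check in Python (the language used later in this deck), here is a minimal sketch of Iris as a classification test case. It assumes scikit-learn is installed, which is not part of the original slides:

# Minimal sketch: Iris as a classification test case.
# Assumes scikit-learn; not part of the original slides.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()                    # 150 flowers, 4 features, 3 species
clf = DecisionTreeClassifier()        # a supervised learner
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
print("10-fold cross-validation accuracy: %.2f" % scores.mean())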

[Example image: the so-called "metro map" visualization of the Iris data set.]

Weka (Weh-kah) Weka is a popular suite of machine learning software written in Java. It contains a collection of tools and algorithms for data analysis and predictive modeling. The original version was a non-Java TCL/TK front end with preprocessing utilities written in C; the more recent version is fully Java-based. All of Weka's techniques assume that the data is available as a single flat file or relation, with each data point described by a fixed number of attributes.

Uses of Weka The original version was used mostly in agricultural domains. The latest version is used in many different application areas, particularly research and education, for data mining tasks such as preprocessing, clustering, classification, regression, and visualization.

Panels of Weka Weka's main interface is the Explorer, which offers the following panels:
Preprocess panel: imports data from a database or from a file such as a CSV file, and preprocesses it using filtering algorithms. Filters transform the data, for example by deleting instances and attributes according to specific criteria.

Panels of Weka
Classify panel: lets the user apply classification and regression algorithms ("classifiers") to the dataset, estimate the accuracy of the resulting predictive model, and visualize erroneous predictions and ROC curves.
Associate panel: provides access to association rule learners that attempt to identify important interrelationships between attributes in the data.

Panels of Weka
Cluster panel: gives access to Weka's clustering techniques, for example the k-means algorithm; it also provides an implementation of the expectation-maximization algorithm for learning a mixture of normal distributions.
Select attributes panel: algorithms for identifying the most predictive attributes in a dataset.
Visualize panel: provides a scatter plot matrix; individual plots can be enlarged and analyzed using various selection operators.

Weka tool Weka is written in Java and runs on any platform. It is available to users under the GNU General Public License. Advantages of using Weka:
- Freely available to all users.
- Highly portable, since it is fully implemented in Java and therefore runs on any platform.
- Provides a comprehensive collection of data preprocessing and modeling techniques.
- Easy to use, thanks to its graphical user interface.

Some links for information on Weka http://arxiv.org/ftp/arxiv/papers/1310/1310.4647.pdf The tool itself can be downloaded from: http://www.cs.waikato.ac.nz/~ml/weka/

Handling large datasets using Weka Data characteristics: if the data contains many zeros, we can use Weka's sparse data representation, which saves a lot of memory; every algorithm in Weka can take advantage of this saving to speed up computation. An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. It has two distinct sections: header information followed by data information.

Datasets with Weka The ARFF header section contains the relation declaration and the attribute declarations.
- The @relation declaration format: @relation <relation-name>
- The @attribute declaration format: @attribute <attribute-name> <datatype>
- Attributes may be numeric or nominal; nominal attributes are declared by listing their possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
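
For illustration, a minimal ARFF file for the Iris dataset discussed earlier might look like the sketch below (attribute names follow the copy of iris.arff shipped with Weka). In the sparse format mentioned on the previous slide, a data line stores only the non-zero entries, e.g. {0 5.1, 4 Iris-setosa}.

@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica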

Handling large data sets with Weka
- String attributes: @ATTRIBUTE LCC string
- Date attributes: @attribute <name> date [<date-format>]
- Relational attributes: @attribute <name> relational <further attribute definitions> @end <name>

Techniques for managing large datasets Because we deal with huge amounts of data, the memory required for every datum becomes a very important factor. Two features, compression and indexing, help decrease the amount of room a dataset needs; summarization is a third technique.

Techniques for handling large data sets It is highly important to know the data before applying any of these techniques. Ask yourself questions such as:
- How dense is the data?
- By which variables are users and applications likely to process or classify the data?
- How evenly distributed are the values of those variables?
- How is the data set being used, and how often do we refresh it?
Apart from knowing your data, always benchmark your results before you apply any of these techniques in production.

Techniques for handling large data sets Data set compression: a technique for "squeezing out" excess blanks and abbreviating repeated strings of values, which decreases the data set's size and therefore lowers the storage space it requires. Uncompressed data might look like this: ADAMS MIKE BARNET MARYBETH whereas the same example after compression is @ADAMS#@MIKE# @BARNET#@MARYBETH#
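
SAS's compression is its own scheme, but the payoff of squeezing out repeated blanks is easy to demonstrate with a generic compressor. A minimal Python sketch (zlib merely stands in for, and is not, the SAS algorithm):

import zlib

# Fixed-width records pad fields with blanks; long runs of
# repeated characters compress extremely well.
raw = b"ADAMS     MIKE      " * 1000
compressed = zlib.compress(raw)
print("raw bytes:       ", len(raw))
print("compressed bytes:", len(compressed))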

Compression One way to compress data is to use the COMPRESS=YES data set option whenever a new data set is created, for example:

data sasuser.new (compress=yes);
  set sasuser.master;
  /* ...other statements... */
run;

The data set's descriptor portion records that the data set is compressed.

Compression The data set's descriptor portion records the compression of the data set:

Observations:           466560
Variables:              12
Indexes:                0
Observation Length:     149
Deleted Observations:   0
Compressed:             Yes
Reuse Space:            No
Sorted:                 No

Data set indexing A dataset index is a data structure that specifies the location of observations based on the values of one or more key variables. Consider the following example:

December   1 (1, 2, 3, ...)   2 (...)
February   1 (22, 23, ...)    4 (...)

Here the numbers 1, 2 (for December) and 1, 4 (for February) indicate data set pages, and the numbers in parentheses are the relative observation numbers within each page that match each distinct key value.

Data set indexing The data set's descriptor portion records the indexes associated with it:

Observations:           466560
Variables:              12
Indexes:                1
Observation Length:     149
Deleted Observations:   0
Compressed:             NO
Reuse Space:            NO
Sorted:                 NO

Dataset indexing Indexing guidelines: create an index only when
- there is a fairly large number of distinct values for the variables to be indexed,
- the data is not randomly scattered throughout the dataset, and
- the data set is frequently queried and subset on the indexed values.
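
The same idea carries over to SQLite, which this deck uses later. A minimal sketch, assuming the stocks table from the later slides already exists (the index name idx_symbol is made up for illustration):

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# Index the key variable used for subsetting, so that queries
# with WHERE symbol = ... can avoid a full table scan.
c.execute("CREATE INDEX IF NOT EXISTS idx_symbol ON stocks (symbol)")
conn.commit()
conn.close()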

Summarization Questions to ask before summarizing:
- To what level of detail must the data be accessible?
- How often is the data refreshed?
- Is it possible to anticipate how best to summarize the data?
The most important element of efficiency is thoughtful and careful consideration of how the above techniques can best be used, together with thorough testing of the techniques before launching them into production.
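
As a hedged illustration of pre-summarizing, the stocks table from the later SQLite slides could be rolled up once and then queried at the coarser level (the table name stocks_summary is made up):

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# Materialize one row per symbol instead of one row per trade.
c.execute('''CREATE TABLE IF NOT EXISTS stocks_summary AS
             SELECT symbol, COUNT(*) AS trades, SUM(qty) AS total_qty
             FROM stocks
             GROUP BY symbol''')
conn.commit()
conn.close()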

Example of large data sets for free Million Song Dataset: the Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. Its purpose is to:
- encourage research on algorithms that scale to commercial sizes,
- provide a reference dataset for evaluating research,
- offer a shortcut alternative to creating a large dataset with APIs, and
- help new researchers get started in the MIR (music information retrieval) field.

Large data sets using Python

With Python Consider an example: the task is to screen a huge set of large data files in text format, with billions of entries each. We want a unified database structure that combines all the columns, representing different features, that are listed in the separate text files. The goal is a database that is extendable, and the workflow requires that one can efficiently pull entries that combine several features for further computation. This can be achieved with the sqlite3 Python module, which works with SQLite database structures.

SQLite3 module SQLite is a C library that provides a lightweight disk-based database; it does not require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. The sqlite3 module provides an SQL interface compliant with the DB-API 2.0 specification.

Getting started... To use the module, first create a Connection object that represents the database; here the data will be stored in the example.db file:

import sqlite3
conn = sqlite3.connect('example.db')

Once we have a connection, we can create a Cursor object and call its execute() method to perform SQL commands.

With Python

c = conn.cursor()

# Create table
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')

# Insert a row of data
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

# Save (commit) the changes
conn.commit()

# We can also close the connection if we are done with it.
# Just be sure any changes have been committed or they will be lost.
conn.close()

With Python The data we saved is persistent and can be used later:

import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()

Usually our SQL operations will need values from Python variables. We should not assemble queries using Python's string operations; doing so is insecure and makes the program vulnerable to SQL injection attacks (see http://xkcd.com/327/).

With Python

# Never do this -- insecure!
symbol = 'RHAT'
c.execute("SELECT * FROM stocks WHERE symbol = '%s'" % symbol)

# Do this instead
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print(c.fetchone())

# Larger example that inserts many records at a time
purchases = [('2006-03-28', 'BUY', 'IBM', 1000, 45.00),
             ('2006-04-05', 'BUY', 'MSFT', 1000, 72.00),
             ('2006-04-06', 'SELL', 'IBM', 500, 53.00)]
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', purchases)

With Python

import sqlite3

# create new db and make connection
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# create table
c.execute('''CREATE TABLE my_db
             (id TEXT, my_var1 TEXT, my_var2 INT)''')

# insert one row of data
c.execute("INSERT INTO my_db VALUES ('ID_2352532','YES', 4)")

# insert multiple lines of data
multi_lines = [('ID_2352533','YES', 1),
               ('ID_2352534','NO', 0),
               ('ID_2352535','YES', 3),
               ('ID_2352536','YES', 9),
               ('ID_2352537','YES', 10)]
c.executemany('INSERT INTO my_db VALUES (?,?,?)', multi_lines)

# save (commit) the changes
conn.commit()

# close connection
conn.close()

With Python

import sqlite3

# make connection to existing db
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# update field
t = ('NO', 'ID_2352533')
c.execute("UPDATE my_db SET my_var1=? WHERE id=?", t)
print("Total number of rows changed:", conn.total_changes)

# delete rows
t = ('NO',)
c.execute("DELETE FROM my_db WHERE my_var1=?", t)
print("Total number of rows deleted: ", conn.total_changes)

# add column
c.execute("ALTER TABLE my_db ADD COLUMN 'my_var3' TEXT")

# save changes
conn.commit()
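
The workflow described at the start of this section requires pulling entries that combine several features. A minimal sketch of such a query against the same table (the particular WHERE clause is made up for illustration; the cursor itself is iterable):

import sqlite3

conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# Pull all rows matching a combination of feature values.
t = ('YES', 2)
for row in c.execute('SELECT * FROM my_db WHERE my_var1=? AND my_var2>?', t):
    print(row)

conn.close()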

Benchmarking How fast is SQLite? For a simple speed comparison measuring CPU time, we can:
1. Read in the text file line by line with plain Python.
2. Read in the text file to create an SQLite database.
3. Query the whole database.

read_lines.py

import time

start_time = time.perf_counter()  # high-resolution timer

lines = 0
with open("feature1.txt", "rb") as fileobj:
    for line in fileobj:
        lines += 1

elapsed_time = time.perf_counter() - start_time
print("Time elapsed: {} seconds".format(elapsed_time))
print("Read {} lines".format(lines))
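
The script above covers only step 1. A hypothetical companion script for step 2, loading the same text file into an SQLite database, might look like this (the table name features and its one-column layout are assumptions):

import sqlite3
import time

start_time = time.perf_counter()

conn = sqlite3.connect('features.db')
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS features (value TEXT)")

# Stream the file line by line into the table.
with open("feature1.txt") as fileobj:
    c.executemany("INSERT INTO features VALUES (?)",
                  ((line.rstrip("\n"),) for line in fileobj))
conn.commit()
conn.close()

print("Time elapsed: {} seconds".format(time.perf_counter() - start_time))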

Results and conclusions:

Links for the references
- http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka
- http://en.wikipedia.org/wiki/Comma-separated_values
- http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf
- http://labrosa.ee.columbia.edu/millionsong/
- http://faculty.washington.edu/kenrice/sisg-adv/sisg-09.pdf
- http://sebastianraschka.com/Articles/sqlite3_database.html
- http://www.mssqltips.com/sqlservertip/1200/handling-large-sql-server-tables-with-data-partitioning/

Questions??