1
LARGE DATA SETS FOR FREE
Sharanya Thandra
2
TABLE OF CONTENTS Definition of datasets. The Iris flower dataset.
Weka and the Weka tool. Handling different datasets with Weka. Techniques for managing large data sets: compression, indexing, summarization. Large data sets with Python. Results and conclusions (Python). References.
3
What are datasets? A dataset or data set is a collection of data.
It typically corresponds to a single database table or a single statistical data matrix, where each column represents a variable and each row corresponds to a given member of the dataset in question. A single value in a dataset is known as a datum.
4
Dataset properties? Characteristics that define a dataset's structure and properties: the number and types of its attributes; statistical measures such as standard deviation and kurtosis; the kinds of values it contains, e.g. real numbers, integers, or nominal (categorical) data. Datasets may also be generated by algorithms.
5
Classic Datasets? Iris flower data set. Categorical data analysis.
Robust statistics. Time series. Extreme values. Bayesian data analysis. The BUPA liver data. Anscombe's quartet.
6
Iris Flower Dataset A multivariate data set, introduced as an example of discriminant analysis. A typical test case for classification techniques in machine learning. It is also a good example for explaining the difference between supervised and unsupervised techniques in data mining.
7
Example image: an example of the so-called "metro map" visualization for the Iris data set.
8
Weka (Weh-kah) Weka is a popular suite of machine learning software written in Java. It contains a collection of tools and algorithms for data analysis and predictive modeling. The original, non-Java version was a TCL/TK front end with preprocessing utilities written in C. The more recent version is fully Java-based. All of Weka's techniques assume that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes.
9
Uses of Weka The original version was used mostly in agricultural domains.
The latest version is used in many different application areas, particularly research and education, for data mining tasks such as preprocessing, clustering, classification, regression, and visualization.
10
Panels of Weka Weka's main interface is the Explorer.
The Explorer contains the following panels: Preprocess panel: imports data from a database or a CSV file and preprocesses it using filtering algorithms. Filters are used to transform the data and help in deleting instances and attributes according to specific criteria.
11
Panels of Weka Classify panel: lets the user apply classification and regression algorithms (classifiers) to the resulting dataset, estimate the accuracy of the predictive model, and visualize erroneous predictions and ROC curves. Associate panel: provides access to association rule learners that attempt to identify important interrelationships between attributes in the data.
12
Panels of Weka Cluster panel: gives access to the clustering techniques in Weka, for example the k-means algorithm, and provides an implementation of the expectation-maximization algorithm for learning a mixture of normal distributions. Select attributes panel: algorithms for identifying the most predictive attributes in a dataset. Visualize panel: provides a scatter plot matrix; individual plots can be enlarged and analyzed further using different selection operators.
13
Weka Tool Written in Java, so it runs on any platform.
Available to users under the GNU General Public License. Advantages of using Weka: freely available to all users; highly portable, since it is fully implemented in the Java programming language and runs on any platform; provides a comprehensive collection of data preprocessing and modeling techniques; easy to use through its graphical user interface.
14
Some links for more information on Weka.
The following link shows where to download the tool.
15
Handling large datasets using Weka
Data characteristics: if the data contains many zeros, we can use a sparse representation, which saves a lot of memory. Every algorithm in Weka can take advantage of this memory saving to speed up computation. An ARFF file (here, the developer version of the format) is an ASCII text file that describes a list of instances sharing a set of attributes. It has two sections: header information followed by data information.
16
Datasets with Weka The ARFF header section contains the relation declaration and the attribute declarations. Relation declaration format: @relation <relation-name>. Attribute declaration format: @attribute <attribute-name> <datatype>. The datatype can be numeric or nominal. Nominal attributes are declared as {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}.
17
Handling large data sets with Weka
String attributes: @attribute <name> string (for example, @attribute LCC string). Date attributes: @attribute <name> date [<date-format>]. Relational attributes: @attribute <name> relational <further attribute definitions> @end <name>. A small example header is sketched below.
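As a minimal sketch, not reproduced on the original slides, here is what a small ARFF file combining these declarations might look like (the attribute names follow the standard iris.arff distributed with Weka; the sparse row at the end is illustrative):
@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
In the sparse representation mentioned earlier, a data row lists only the non-zero values together with their zero-based attribute indices, e.g. {1 3.0, 4 Iris-setosa}; all omitted attributes are taken to be zero.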
18
Techniques for managing large datasets
As we deal with huge amounts of data, the memory required for every datum becomes a very important factor. Two features, compression and indexing, help decrease the amount of room needed for datasets. Summarization is a third technique.
19
Techniques for handling large data sets
It is highly important to know about the data before applying any of these techniques. Ask yourself questions such as: How dense is the data? On which variables are users and applications likely to process or classify the data? How evenly distributed are the values of those variables? How is the data set being used? How often do we refresh it? Apart from knowing your data, always benchmark your results before putting any of these techniques into use.
20
Techniques for handling large data sets
Data set compression: a technique for "squeezing out" excess blanks and abbreviating repeated strings of values to decrease the data set's size and therefore lower the storage space it requires. Uncompressed, the data looks like fixed-width records padded with blanks, e.g. ADAMS MIKE BARNET MARYBETH; in the same example after compression, the excess blanks are squeezed out.
21
Compression The way to compress data is to use the COMPRESS=YES data set option whenever a new data set is created, for example: data sasuser.new(compress=yes); set sasuser.master; ...other statements...; run; The data set's descriptor portion records that the data set is compressed.
22
Compression The data set's descriptor portion records that the data set is compressed:
Observations:
Variables:
Indexes:
Observation Length:
Deleted Observations: 0
Compressed: Yes
Reuse Space: No
Sorted: No
23
Data set indexing: a dataset index is a data structure that specifies the location of observations based on the values of one or more key variables. Consider the following example:
December: page 1 (observations 1, 2, 3, ...), page 2 (...)
February: page 1 (observations 22, 23, ...), page 4 (...)
In this example the numbers 1, 2 and 1, 4 indicate the data set pages, and the numbers in parentheses are the relative observation numbers within each page that match each distinct key value.
24
Data set Indexing The data set's descriptor portion records the indexes associated with the data set:
Observations:
Variables:
Indexes:
Observation Length:
Deleted Observations: 0
Compressed: NO
Reuse Space: NO
Sorted: NO
25
Dataset Indexing Indexing guidelines:
Index only when there is a fairly large number of distinct values for the variables to be indexed, the data is not randomly scattered throughout the dataset, and the data set is frequently queried and subset on the indexed values.
26
Summarization Questions to ask before summarization:
To what level of detail must the data be accessible? How often is the data refreshed? Is it possible to anticipate how best to summarize the data? The most important element of efficiency is thoughtful and careful consideration of how the above techniques can best be used, together with thorough testing of the techniques before launching them into production. A simple sketch of key-based summarization is shown below.
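As a minimal sketch of the idea, not taken from the original slides, a large text file can be summarized by pre-aggregating values per key so that only one row per distinct key needs to be stored downstream. The file name feature1.txt is reused from the benchmarking slide later on; its two-column, tab-separated layout is an assumption made here for illustration:
from collections import defaultdict

counts = defaultdict(int)
totals = defaultdict(float)

# feature1.txt is assumed to hold tab-separated lines of the form: key<TAB>value
with open("feature1.txt", "r") as fileobj:
    for line in fileobj:
        key, value = line.rstrip().split("\t")
        counts[key] += 1
        totals[key] += float(value)

# one output row per distinct key: key, row count, sum of values
for key in counts:
    print("{}\t{}\t{}".format(key, counts[key], totals[key]))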
27
Example of large data sets for free
Million Song Dataset: the Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary music tracks. Its purposes are to: encourage research on algorithms that scale to commercial sizes; provide a reference dataset for evaluating research; serve as a shortcut alternative to creating a large dataset with APIs; and help new researchers get started in the MIR field.
28
Large data sets using Python
29
With Python Consider an example: the task is to screen a huge set of large data files in text format with billions of entries each. We want a unified database structure that combines all the columns, representing different features, that are listed in the separate text files. The goal is a database that is extendable, and the workflow requires that one can efficiently pull entries combining features for further computation. This can be achieved with the sqlite3 Python module, which works with SQLite database structures.
30
SQLite3 Module SQLite is a C library that provides a lightweight disk-based database and does not require a separate server process. It allows accessing the database using a nonstandard variant of the SQL query language. The sqlite3 module provides an SQL interface compliant with the DB-API 2.0 specification.
31
Getting started... To use the module, first create a Connection object that represents the database; here the data will be stored in the example.db file:
import sqlite3
conn = sqlite3.connect('example.db')
Once we have it connected, we can create a Cursor object and call its execute() method to perform SQL commands.
32
With Python
c = conn.cursor()

# Create table
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')

# Insert a row of data
c.execute("INSERT INTO stocks VALUES (' ','BUY','RHAT',100,35.14)")

# Save (commit) the changes
conn.commit()

# We can also close the connection if we are done with it.
# Just be sure any changes have been committed or they will be lost.
conn.close()
33
With Python The data we saved is persistent and can be used later:
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
Note that our SQL operations will usually need values from Python variables. We should not assemble the query using Python's string operations; doing so is insecure because it makes the program vulnerable to SQL injection attacks.
34
With Python
# Never do this -- insecure!
symbol = 'RHAT'
c.execute("SELECT * FROM stocks WHERE symbol = '%s'" % symbol)

# Do this instead
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print c.fetchone()

# Larger example that inserts many records at a time
purchases = [(' ', 'BUY', 'IBM', 1000, 45.00),
             (' ', 'BUY', 'MSFT', 1000, 72.00),
             (' ', 'SELL', 'IBM', 500, 53.00),
            ]
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', purchases)
35
With Python
import sqlite3

# create new db and make connection
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# create table
c.execute('''CREATE TABLE my_db (id TEXT, my_var1 TEXT, my_var2 INT)''')

# insert one row of data
c.execute("INSERT INTO my_db VALUES ('ID_ ', 'YES', 4)")

# insert multiple lines of data
multi_lines = [('ID_ ', 'YES', 1),
               ('ID_ ', 'NO', 0),
               ('ID_ ', 'YES', 3),
               ('ID_ ', 'YES', 9),
               ('ID_ ', 'YES', 10)
              ]
c.executemany('INSERT INTO my_db VALUES (?,?,?)', multi_lines)

# save (commit) the changes
conn.commit()

# close connection
conn.close()
36
With Python
import sqlite3

# make connection to existing db
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# update field
t = ('NO', 'ID_ ', )
c.execute("UPDATE my_db SET my_var1=? WHERE id=?", t)
print "Total number of rows changed:", conn.total_changes

# delete rows
t = ('NO', )
c.execute("DELETE FROM my_db WHERE my_var1=?", t)
print "Total number of rows deleted: ", conn.total_changes

# add column
c.execute("ALTER TABLE my_db ADD COLUMN 'my_var3' TEXT")

# save changes
conn.commit()
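Tying back to the indexing technique from the earlier slides, the same SQLite database can be given an index on its key column so that lookups and subsets on that column no longer scan the whole table. This is a small sketch continuing the connection above; the index name id_idx is an arbitrary choice, not from the slides:
# create an index on the id column of my_db; queries that filter on id
# (e.g. WHERE id=?) can then use the index instead of a full table scan
c.execute("CREATE INDEX id_idx ON my_db (id)")
conn.commit()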
37
Benchmarking How fast is SQLite?
For some simple speed comparisons, here is an example that measures CPU time for three steps: read in the text file line by line with plain Python; read in the text file to create an SQLite database; query the whole database.
38
read_lines.py
import time

start_time = time.clock()

lines = 0
with open("feature1.txt", "rb") as fileobj:
    for line in fileobj:
        lines += 1

elapsed_time = time.clock() - start_time
print "Time elapsed: {} seconds".format(elapsed_time)
print "Read {} lines".format(lines)
39
Results and conclusions:
40
Links for the references
41
Questions??