DESIGN OF LARGE SCALE DATA ARCHIVAL AND RETRIEVAL SYSTEM FOR TRANSPORTATION SENSOR (WRITE-ONCE-READ-MANY TYPE) DATA
by Nirish Dhruv, Department of Computer Science
Advisor: Dr. Taek Kwon, Department of Electrical and Computer Engineering
Graduate Committee:
Dr. Donald Crouch, Department of Computer Science
Dr. Carolyn Crouch, Department of Computer Science
Dr. Taek Kwon, Department of Electrical and Computer Engineering
Background
ITS sensor networks produce huge amounts of data
Presently used for operational and monitoring purposes due to the huge size of the data
Examples: RWIS, WIM and traffic detector networks
Efficient archival/retrieval is needed for planning and research
Problem Statement
Present TMC Archive
– Flat, zip-compressed format
– Difficult to extract spatially correlated data
– Need for efficient archival / retrieval for spatially and/or temporally correlated data
Existing File Format and Archive
Unified Traffic Data Format (UTDF)
– ###.v30 file (2880 bytes): volume, one 1-byte value per record
– ###.o30 file (5760 bytes): occupancy, one 2-byte value per record
– Record times: 00:00:00, 00:00:30, 00:01:00, 00:01:30, 00:02:00, ..., 23:59:
– The ###.v30 and ###.o30 files for 4000 sensors are zipped into one yyyymmdd.traffic file
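The per-detector file layout above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the byte order and unsigned interpretation of the 2-byte occupancy values are assumptions, since the slide gives only the field widths.

```python
import struct

SLOTS_PER_DAY = 24 * 60 * 2  # one record every 30 seconds -> 2880

def slot_to_time(slot):
    """Return the hh:mm:ss start time of a 30-second record slot."""
    seconds = slot * 30
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

def decode_day(v30_bytes, o30_bytes, byte_order="<"):
    """Pair up one detector's volume and occupancy samples for a day.

    byte_order is an assumption: "<" (little-endian) or ">" (big-endian).
    Both fields are assumed to be unsigned integers.
    """
    if len(v30_bytes) != SLOTS_PER_DAY:
        raise ValueError("expected 2880 bytes of volume data")
    if len(o30_bytes) != 2 * SLOTS_PER_DAY:
        raise ValueError("expected 5760 bytes of occupancy data")
    volumes = struct.unpack(f"{SLOTS_PER_DAY}B", v30_bytes)
    occupancies = struct.unpack(f"{byte_order}{SLOTS_PER_DAY}H", o30_bytes)
    return [
        (slot_to_time(i), volumes[i], occupancies[i])
        for i in range(SLOTS_PER_DAY)
    ]
```

Because every record has a fixed width, the time of a sample follows from its position in the file, so no timestamps need to be stored.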
Review of Large Data Archive
Data Warehouse
– Inflow: Get data from various systems
– Upflow: Put data into a more compact form
– Downflow: Put the compact data form into archival storage
– Outflow: Output data to consumers as required
– Metaflow: Manage the warehouse itself
Why Data Warehouse?
Simplicity
Better Quality of Data
Fast Access
Platform Independent
Hierarchical Data Format (HDF)
File format and library for storing scientific data
Software includes I/O libraries and tools for analyzing, visualizing, and converting scientific data
Platform Independent
Common Data Format (CDF)
Self-describing data abstraction for the storage and manipulation of multi-dimensional data in a discipline-independent format
File format and a library
Transparent data compression
Platform Independent
API available in C, FORTRAN, Java, and Perl
Creating Traffic CDF
Traffic Archive -> C Program (.EXE) -> traffic.cdf
C Program built with the CDF 2.7 C API (DLL, Lib and cdf.h file)
Traffic Data Archive in CDF
Designing Data Structure for traffic data
Setting Dimensions
Setting Variances
Setting CDF variables, CDF data types, CDF attributes (meta-data), and compression algorithm
Data Organization
[Table: Record Number | Sensor ID | Time | Volume | Occupancy; row values not recoverable]
Variances Specification for traffic CDF
rVariables: Sensor ID | Time | Volume | Occupancy
Record Variance: TRUE | FALSE | TRUE
First Dimension Variance: FALSE | TRUE
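The effect of variance settings on storage can be illustrated with a small calculation. This sketch rests on assumptions: one record per sensor, a first dimension of 2880 time slots, and a per-variable variance assignment that is one plausible reading of the slide rather than a confirmed one. The helper name is hypothetical and no CDF library call is used.

```python
def stored_values(record_variance, dim_variance, n_records, dim_size):
    """Count physically stored values for one variable.

    A FALSE variance means the values along that axis are stored once
    and shared, instead of once per record or once per element.
    """
    records = n_records if record_variance else 1
    elements = dim_size if dim_variance else 1
    return records * elements

N_SENSORS = 4000      # one record per sensor (assumed layout)
SLOTS_PER_DAY = 2880  # 30-second samples per day

# Assumed (record variance, first-dimension variance) for each variable.
variances = {
    "Sensor ID": (True, False),   # one ID per record
    "Time": (False, True),        # one shared time axis for all records
    "Volume": (True, True),       # a full day of samples per sensor
    "Occupancy": (True, True),
}

counts = {
    name: stored_values(rv, dv, N_SENSORS, SLOTS_PER_DAY)
    for name, (rv, dv) in variances.items()
}
```

Under these assumptions the time axis costs 2880 values in total rather than 2880 per sensor, and the sensor ID costs one value per record.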
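CDF Compression Algorithms
Level | Uncompressed CDF | RLE | Huffman | Adaptive Huffman | GZIP
[level not recoverable] | [value not recoverable] MB | 45.4 MB | 36.3 MB | 29.8 MB | 18.7 MB
[second row: values not recoverable]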
Data Retrieval in CDF
CDF Archive (.cdf) + Station Definition -> C Program using CDF API (.EXE) -> Volume Count (.txt)
Data Archive in SQL Server
Traffic Data Archive (zipped Binary files) -> Visual Basic Interface -> Traffic Data Archive (SQL Server 2000)
Visual Basic Interface components:
– Dynazip ActiveX control
– ADODB Connection
– 32-bit ODBC (DSN)
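A relational layout for the same data might look like the sketch below. It is not the schema used in this work, which the slides do not show: the table and column names are hypothetical, and Python's built-in SQLite stands in for SQL Server 2000 so the example can run anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE traffic (
           sensor_id INTEGER NOT NULL,
           slot      INTEGER NOT NULL,  -- 30-second record index, 0..2879
           volume    INTEGER NOT NULL,
           occupancy INTEGER NOT NULL,
           PRIMARY KEY (sensor_id, slot)
       )"""
)

def load_sensor(conn, sensor_id, volumes, occupancies):
    """Insert one row per 30-second sample for a single sensor."""
    rows = [
        (sensor_id, slot, v, o)
        for slot, (v, o) in enumerate(zip(volumes, occupancies))
    ]
    conn.executemany("INSERT INTO traffic VALUES (?, ?, ?, ?)", rows)

# Toy data: two sensors with three samples each (made-up values).
load_sensor(conn, 3263, [4, 6, 5], [120, 180, 150])
load_sensor(conn, 3264, [2, 3, 1], [60, 90, 30])

row_count = conn.execute("SELECT COUNT(*) FROM traffic").fetchone()[0]

# With one row per sample, a full day for 4000 sensors needs this many rows.
rows_per_day = 4000 * 2880
```

A row-per-sample layout like this one would hold 11,520,000 rows for a single day, each carrying its own key and index entry, which is one plausible source of storage and load-time overhead.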
Retrieval Task
Station definitions:
– Station 1: 10069N, Detectors: 3263, 3264, 3265, 3266
– Station 2: 10069S
– Station 492: 17750W
Volume computation:
– Station 1: 3263(Vol) + 3264(Vol) + 3265(Vol) + 3266(Vol)
– Station 2: Volume Computation
– Station 492: Volume Computation
Text File:
– 10069N Total Vol
– 10069S Total Vol
– W Total Vol
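The station volume computation above can be sketched as follows. The function name and the sample volumes are made up for illustration, and the sketch assumes that "Total Vol" means the sum over all of a station's detectors and all samples in the day.

```python
def station_totals(station_defs, detector_volumes):
    """Sum the volumes of every detector that belongs to each station."""
    totals = {}
    for station, detectors in station_defs.items():
        totals[station] = sum(sum(detector_volumes[d]) for d in detectors)
    return totals

# Station definition from the slide: station 10069N and its four detectors.
station_defs = {"10069N": [3263, 3264, 3265, 3266]}

# Hypothetical per-detector volume samples (three records each).
detector_volumes = {
    3263: [4, 6, 5],
    3264: [2, 3, 1],
    3265: [0, 1, 0],
    3266: [7, 8, 9],
}

totals = station_totals(station_defs, detector_volumes)

# One output line per station, in the spirit of the text file on the slide.
lines = [f"{station} {total}" for station, total in totals.items()]
```

The task is spatially correlated: each station total needs data from several detectors, which in the flat zipped archive means extracting several separate files.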
Results on single day traffic data
 | Binary Uncompressed | CDF | RDBMS
Archival Time | N/A | 5 minutes | 6 hours
Size | 40 MB | 16.6 MB | 370 MB
Retrieval Time | N/A | 2 minutes | 2 hours
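The ratios implied by these results can be worked out directly. The figures are those reported on the slide; the variable names are illustrative.

```python
# Reported single-day results, with times converted to minutes.
archival_minutes = {"CDF": 5, "RDBMS": 6 * 60}
retrieval_minutes = {"CDF": 2, "RDBMS": 2 * 60}
size_mb = {"Binary (uncompressed)": 40, "CDF": 16.6, "RDBMS": 370}

# How many times slower or larger the RDBMS archive is than the CDF archive.
archival_ratio = archival_minutes["RDBMS"] / archival_minutes["CDF"]
retrieval_ratio = retrieval_minutes["RDBMS"] / retrieval_minutes["CDF"]
size_ratio = round(size_mb["RDBMS"] / size_mb["CDF"], 1)

# CDF archive size as a fraction of the uncompressed binary data.
cdf_fraction = round(size_mb["CDF"] / size_mb["Binary (uncompressed)"], 3)
```

On these figures the RDBMS archive took 72 times longer to build and 60 times longer to query, and was about 22.3 times larger, while the CDF archive was about 41.5% of the size of the uncompressed binary data.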
Conclusions
A transportation archive using CDF could be a better archive for the following reasons
– More data storage with almost no additional storage requirements
– Indexed data allowing random access
– Open standard, portable and free
– Can be used directly with many scientific visualization and analysis packages
Conclusions
RDBMS is less suitable for large-scale traffic data for the following reasons
– Large storage requirements due to overheads
– Retrieval is comparatively slow
– Initial investment is expensive
Future Work
Using XML with CDF for web
Scaling CDF
Adding more Features
– Variables and attributes