Large Object Datatypes
Agenda In this session, we will be discussing: Summary What Is Unstructured Data? Why Should I Place Unstructured Data In Sybase IQ? Understanding How Sybase IQ Stores The Data Sybase IQ Limitations for Unstructured Data Creating, Loading, and Extracting Unstructured Data New Functions Sybase IQ 12.6 Functionality Tuning Sybase IQ for Unstructured Data Summary Question And Answer April 20, 2019
What Is Unstructured Data? Freeform binary or textual data that has meaning to a user or an application, but not to the database Why is it unstructured? Format is not known to the RDBMS Cannot be indexed due to the unstructured nature Large object types Binary – Generally referred to as a BLOB (Binary Large Object) Text – Generally referred to as a CLOB (Character Large Object) Often referred to as a LOB or LOBs April 20, 2019
What Is Unstructured Data? Can be any type of data you desire Images Video Sound Scanned or electronic documents Applications Company specific data format It has to make sense to you, not the database! April 20, 2019
Why Should I Place Unstructured Data In Sybase IQ? Single location for application data The application can be written to access all structured and unstructured data using a single interface The DBAs have control over all data for the application, not just the structured data Filesystems are generally less efficient at handling large sequential files, causing more disk I/O In order to achieve performance from filesystems, additional software and training is necessary Requires tight coordination with system administrators Application can be taken down for non-database problems Filesystems going offline Storage filling up April 20, 2019
Why Should I Place Unstructured Data In Sybase IQ? Generally requires less storage Filesystems cannot naturally compress the datafiles IQ will compress the LOB data just as it compresses other data Each IQ data page is run through a compression algorithm to compress the contents of the page into one or more disk blocks Filesystems can have as much as 10% overhead Inode table Reserved space for the “root” account Reading from filesystem must be done sequentially The application can be designed to return portions of the data in parallel Sybase IQ loads are done in parallel April 20, 2019
Understanding How Sybase IQ Stores The Data Each object is stored in one or more data pages If an object needs a portion of a page, the entire page is marked as in use for that LOB Each page is then put through the IQ compression algorithms to squeeze as much storage savings as possible No other indexing or changes take place on the LOB data April 20, 2019
Sybase IQ Limitations for Unstructured Data There is no database limit on the number of LOB columns that can exist Limited only by the total number of tables and columns in a database A table can have an unlimited number of LOB columns Limited only by the total number of columns allowed on a table Each LOB columns is virtually unlimited in size Maximum LOB size is 4GB * page size Absolute maximum is 4GB * 512KB = 2 PetaBytes Most RDBMS engines are limited to 2-4 GB per entry April 20, 2019
Creating, Loading, and Extracting Unstructured Data Added new datatype “long binary” to support LOB data Create a table with LOB support create table blob_data( file_id int primary key ,filename char(64) ,ext char(6) null ,file_size unsigned bigint ,lobcol long binary null ) LOB Column April 20, 2019
Creating, Loading, and Extracting Unstructured Data New syntax added to support loading LOB data “BINARY FILE()” added to column specification Data for the binary file location must be delimited LOB data does not exist in primary data file Primary data file is a pointer to LOB data being loaded April 20, 2019
Creating, Loading, and Extracting Unstructured Data Load a table with a LOB column set temporary option load_memory_mb=50; go load table blob_data ( file_id ',' ,filename ',‘ ,ext ',‘ ,file_size ',' ,lobcol binary file ( ',' ) ,filler(1) -- change this to filler(2) for Windows Data ) from ‘blob_file.dat’ format ascii preview on quotes off escapes off commit LOB Column April 20, 2019
Creating, Loading, and Extracting Unstructured Data Sample blob_file.dat data 1,boston,jpg,/s1/loads/lobs/boston.jpg,1234, 2,map_of_concord,bmp,/s1/loads/maps/concord.bmp,321, 3,zero length test,NULL,,123, 4,null test,NULL,NULL,456, Notice that there is no LOB data in the file The IQ load engine will open the secondary files and load them into the appropriate rows Row 1: /s1/loads/lobs/boston.jpg Row 2: /s1/loads/maps/concord.bmp Rows 3 and 4 will be NULL April 20, 2019
Creating, Loading, and Extracting Unstructured Data More LOB loading notes… If the file cannot be opened for any reason a NULL is loaded instead and no error is thrown An error will be thrown if the column is created with “NOT NULL” LOB data can only be loaded from an ASCII primary data file (e.g. blob_file.dat, map.jpg) The ASCII file can be fixed width, however, the secondary file name column MUST have a delimiter after it April 20, 2019
Creating, Loading, and Extracting Unstructured Data Three methods to extracting LOB data Return the entire contents of the column Return a portion of the column Extract the column to disk To return the entire contents of the column to the client application select lobcol from my_table where file_id = 1 To return a portion of the column to the client application select byte_substr64( lobcol, 1, 1000 ) from my_table where file_id = 1 Will return bytes 1 through 1000 of the LOB select bfile ( ‘/tmp/my_lob.dat’, lobcol ) from my_table where file_id = 1 Will directly write the entire row/column to disk on the IQ server host April 20, 2019
New Functions BYTE_LENGTH64( long binary ) Returns the total bytes contained in the LOB column select byte_length64( lobcol ) from my_table where file_id = 1 BYTE_SUBSTR64( long binary, offset, length) Returns a portion/substring of the LOB data select byte_substr64( lobcol, 1, 1000 ) from my_table where file_id = 1 BFILE( filename, long binary ) Extract the contents of the LOB column to filename Each row should have a unique filename The filename can be any string manipulation If the filename is non-unique the contents will be overwritten select bfile( filename + ‘.’ + ext, lobcol ) from my_table where file_id = 1 April 20, 2019
Sybase IQ 12.6 Functionality New domain BLOB created for the “long binary” datatype New domain CLOB created for the “long varchar” datatype Single- and Multi-byte support added for the CLOB datatype New load table syntax Secondary_File_Error option specifies the desired error handling when an error occurred during opening/reading a secondary file Supported syntax UPDATE INSERT..VALUES INSERT..SELECT LOAD DELETE TRUNCATE SELECT..INTO INSERT..LOCATION SELECT April 20, 2019
Tuning Sybase IQ for Unstructured Data CORE_Options14 This sets the number of threads used to read an individual lob secondary file. Measurements have been inconsistent due to vast hardware differences and dependencies. A value much larger than 3 has shown to cause system time to be too high. Start with a value of 3 and tune from there. CORE_Options15 This sets the number of threads in the lob load team. A value of zero will use 1 thread per cpu, if available. A non-zero value of n sets the team size to n, if available. The lob load threads have to live within the other settings such as max threads per connection, max team size, etc. On a machine with many CPUs, this may need to be set manually. April 20, 2019
Tuning Sybase IQ for Unstructured Data FP_LOB_Workunit_MBSize This defaults to 100 MB and there is probably no reason to change it. It is the amount of data each thread reads in a single unit of work when loading lob data. Loading LOB data is very sensitive to I/O and CPUs Have sustained load rates of over 2 GB per second! That’s 7 TB an hour Had 72 CPUs and enough I/O controllers to handle the throughput Loading LOB data is not memory intensive A minimally configured IQ server (1-2 GB RAM total) is sufficient to load LOB data Load_Memory_MB is not used heavily for LOB data April 20, 2019
Large Objects (LOBs) Existing LONG BINARY columns created before 12.5ESD8 must be dropped before 12.6 installation LONG BINARY size limit is between 512TB and 2PB (depending upon IQ PAGE size) – that is per row! LONG VARCHAR now has the same “restrictions” as LONG BINARY ASE TEXT can be inserted into LONG VARCHAR A whole new manual entitled Large Object Management in Sybase IQ April 20, 2019
Summary A fast, compressible way to store unstructured data in an RDBMS Compression ratio will vary, but is guaranteed to be no larger than the original file without the O/S overhead Data can be written and read in parallel Remove the need to rely on resources outside the database administrators and developers Guaranteed performance since all resources are under the DBA’s control April 20, 2019
About things - End April 20, 2019