Database Systems
Carlos Ordonez
What is “Database systems” research?
Input? Large data sets, large files, relational tables
How? Fast external algorithms; RAM-efficient data structures at two storage levels
Efficiency? Desirable: O(n) I/O
Hardware? Small computer, single server, parallel DBMS server, parallel cluster; 1 disk, RAID
Infrastructure? DBMS, parallel system
Boring? Theory+programming
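To make "fast external algorithms" with O(n) I/O concrete, here is a minimal sketch, not taken from the slides: it scans a data set once, block by block, and keeps only a small summary in RAM (row count, per-column sum, per-column sum of squares). The function names, the block size and the CSV input are assumptions chosen for illustration.

```python
import csv
import io
from itertools import islice


def summarize(rows, block_size=1000):
    """Scan numeric rows once, one block at a time.

    Only n, the per-column sums and the per-column sums of squares stay
    in memory, so RAM use does not grow with the number of rows.
    """
    n = 0
    lin = None  # per-column sum of values
    sq = None   # per-column sum of squared values
    it = iter(rows)
    while True:
        block = list(islice(it, block_size))  # stands in for one I/O block
        if not block:
            break
        for row in block:
            x = [float(v) for v in row]
            if lin is None:
                lin = [0.0] * len(x)
                sq = [0.0] * len(x)
            n += 1
            for j, v in enumerate(x):
                lin[j] += v
                sq[j] += v * v
    return n, lin, sq


def mean_and_variance(n, lin, sq):
    """Derive per-column mean and population variance from the summary."""
    mean = [s / n for s in lin]
    var = [q / n - m * m for q, m in zip(sq, mean)]
    return mean, var


if __name__ == "__main__":
    # A tiny in-memory CSV stands in for a large file on disk.
    data = io.StringIO("1,10\n2,20\n3,30\n")
    n, lin, sq = summarize(csv.reader(data), block_size=2)
    print(n, mean_and_variance(n, lin, sq))
```

Every row is read exactly once, so the I/O cost is linear in the input size. The sum-of-squares formula for the variance is the simplest choice but can lose precision when values are large relative to their spread.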
Database systems research today
Transaction processing? Done
Efficient querying? Done
Fast external algorithms? Simple tasks
Parallel computation? Shared-nothing parallel DBMS is well proven, but many challenges remain (big data)
Exploiting new hardware? Difficult, low level
Analyzing? Most difficult: data mining, statistics
Future? Big data
DB Systems involves
Core CS research: Theory+Programming
Theory we use:
–Time complexity, I/O cost models
–Large data structures, especially external
–Relational model is here to stay
–Multivariate statistics, machine learning, discrete math
–Numerical methods: linear algebra, optimization
–Compilers: parsing/compiling/optimizing code; recursion
Programming (even some hacking):
–Systems in a broad sense
–Languages: C, C++ for efficiency, pointers, legacy systems code; Java, C# mainly for portability
–Numerical libraries like LAPACK, OS thread libraries
–DBMS APIs for SQL UDFs in C, C++, C#
Research topics
GOAL: Integrating statistical and machine learning algorithms with a DBMS (external algorithms, queries, UDFs)
Difference from machine learning algorithms: data size, external algorithms (small RAM), queries, low-level optimization, generally simpler models
Main topics by students:
–Zhibo Chen: OLAP cubes, parametric statistical tests, cube ops on flash memory
–Mario Navas, Naveen Mohanam: Singular Value Decomposition for PCA and ML Factor Analysis, data summarization on multicore CPUs
–Carlos Garcia-Alvarado: keyword search across docs and db, ranking, query recommendation
–Sasi Pitchaimalai: Bayesian classification, multithreaded summarization
–Wellington Cabrera: stochastic search variable selection on high dimensional data, SVD on high-d data
–David Matusevich: Hybrid EM and MCMC mixture models on large data sets, database transformations for data mining
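As a rough illustration of the goal (running a statistical computation inside the DBMS through queries and UDFs), the sketch below registers a user-defined aggregate and calls it from SQL. It is an assumption-laden stand-in: SQLite and Python's `sqlite3` module replace the C/C++/C# UDF APIs mentioned in the slides, and the names `VarianceUdf`, `var_pop_udf`, `points` and `class_variances` are hypothetical.

```python
import sqlite3


class VarianceUdf:
    """Hypothetical aggregate UDF: population variance from n, sum, sum of squares."""

    def __init__(self):
        self.n = 0
        self.s = 0.0
        self.q = 0.0

    def step(self, value):
        # Called once per row of the group; NULLs are skipped.
        if value is None:
            return
        self.n += 1
        self.s += value
        self.q += value * value

    def finalize(self):
        # Called once per group, after the last row.
        if self.n == 0:
            return None
        mean = self.s / self.n
        return self.q / self.n - mean * mean


def class_variances(rows):
    """Load (label, x) rows and summarize each label with one SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("var_pop_udf", 1, VarianceUdf)
    conn.execute("CREATE TABLE points (label TEXT, x REAL)")
    conn.executemany("INSERT INTO points VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT label, COUNT(*), AVG(x), var_pop_udf(x) "
        "FROM points GROUP BY label ORDER BY label"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(class_variances([("a", 1.0), ("a", 3.0), ("b", 5.0)]))
```

The data never leaves the database engine: the query scans the table once and the UDF keeps a constant-size state per group, which is the same pattern a compiled UDF would follow.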
Representative problems
OLAP cubes
Finding predictive association rules
Bayesian classification
Clustering, PCA and regression
Why is our database systems research “cool”?
Theory+Programming
Optimization, O(f(n)), systems (external data structures, discrete math, compiler, OS)
Goes from hardware-level stuff (multi-core, cache memory) to high-level query optimization in SQL
Database systems techniques are used in search engines like Google and Yahoo (and vice versa)
DBMS technology is used everywhere
Why join DBMS group?
Balance between theory (math) and programming
We target “DB systems” conferences (ACM SIGMOD) and “IR/DM” conferences (ACM CIKM: IR+DB+DM)
Mature and stable CS research area
Job/internship: many opportunities in DBMS and search engines; job security at any large company
Visit my web page, DBLP. Google “Ordonez SQL”