A review of Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation

Posted in reviews by Christopher R. Wirz on Wed Feb 07 2018



OLTP (On-line Transaction Processing) databases are designed for fast on-line transactions (INSERT, UPDATE, DELETE). OLAP (On-line Analytical Processing) deals with historical or archival data and relatively few transactions. Traditionally, these workloads have had different technical requirements, but through an innovative compressed columnar storage format, the memory footprint can be reduced while throughput is maintained or improved.

Note: This review is in the CS265 format and may reflect my personal opinion.

1. What is the problem?

This paper proposes a method to reduce the memory footprint of OLTP and OLAP database systems without reducing throughput (after all, memory operations are faster than disk operations). To accomplish this, a storage scheme is required that works well with both memory and disk. This means the compression has to be tuned for performance and the data has to be indexed, but in moderation, such that stored tuples remain readily available to transaction processing.

When you think of OLTP, you think of a row store. When you think of OLAP, you think of a column store. These are essentially orthogonal design patterns. Existing hybrid solutions pay little attention to resource consumption, particularly memory footprint.

2. Why is it important?

Using Just-In-Time compilation of queries (generating executable code) avoids the overhead of query interpretation that "vectorized" execution amortizes across vectors. This is due to compiler optimizations with respect to loops. Combined with a lightweight-compression column store format, the HyPer database can perform well on both OLTP and OLAP workloads. This means that the state and storage can be managed by the same system and still have high performance.

JIT compiling allows one to turn a higher-level understanding of the query into machine code, which helps eliminate branches at the machine-code level. Pre-compiling Vectorwise-style primitives allows for efficient execution across each vector.
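The distinction can be sketched in miniature. Below is a toy illustration (my own, not HyPer's actual code generator, which emits LLVM IR): an interpreted engine walks an expression tree for every tuple, while a "compiled" query is specialized into one function up front, removing the per-tuple dispatch.

```python
# Interpreted evaluation: every tuple walks the operator tree,
# paying a dispatch (branching) cost at each node.
def eval_tree(node, tuple_):
    op = node["op"]
    if op == "col":
        return tuple_[node["index"]]
    if op == "const":
        return node["value"]
    if op == "<":
        return eval_tree(node["lhs"], tuple_) < eval_tree(node["rhs"], tuple_)
    raise ValueError(op)

# "JIT" analogue: specialize the tree into closures once, so the
# per-tuple work is just the comparison itself.
def compile_tree(node):
    op = node["op"]
    if op == "col":
        i = node["index"]
        return lambda t: t[i]
    if op == "const":
        v = node["value"]
        return lambda t: v
    if op == "<":
        lhs, rhs = compile_tree(node["lhs"]), compile_tree(node["rhs"])
        return lambda t: lhs(t) < rhs(t)
    raise ValueError(op)

# Predicate: column 0 < 10
pred = {"op": "<", "lhs": {"op": "col", "index": 0},
        "rhs": {"op": "const", "value": 10}}
rows = [(5,), (12,), (9,)]
compiled = compile_tree(pred)
assert [eval_tree(pred, r) for r in rows] == [compiled(r) for r in rows] == [True, False, True]
```

The compiled form pays the specialization cost once per query rather than once per tuple, which is the trade-off JIT query compilation exploits.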

3. Why is it hard?

OLTP and OLAP operations have very different technical requirements down to the hardware level. For example, compression reduces the memory footprint and scan bandwidth, but adds decompression cost to reads and point queries. So, to keep access fast, high-transaction systems typically do not compress tuples. While single instruction, multiple data (SIMD) algorithms such as bit-packing have shown promise for early filtering during scans, they fall short on point-wise decompression.

There is a balance between bandwidth and computational power. The compression cannot be so expensive that it bottlenecks the movement of data. This balance depends on hardware, data, and compression algorithms - all of which are changing.

4. Why existing solutions do not work?

JIT compilation relies on CPU registers, while vectorization passes data through memory. However, vectorized scans can be pre-compiled and perform well with SIMD algorithms, which can be further optimized. Normally, these two approaches have these distinct benefits. The authors, however, use vectorized predicate evaluation that takes advantage of newer hardware features such as Advanced Vector Extensions (AVX) and Streaming SIMD Extensions (SSE). This can nearly halve the total number of instructions compared to existing solutions.
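The selection-vector pattern that SIMD accelerates can be sketched in plain Python (a minimal illustration of vector-at-a-time evaluation, with my own names; HyPer implements this with AVX/SSE intrinsics, not Python loops): each predicate runs over a whole vector and emits a list of qualifying positions, which the next predicate then refines.

```python
VECTOR_SIZE = 8  # tiny for illustration; the paper's HyPer uses 8192

def scan(column, vector_size=VECTOR_SIZE):
    """Yield (base position, vector) chunks of the column."""
    for start in range(0, len(column), vector_size):
        yield start, column[start:start + vector_size]

def apply_predicate(vector, base, predicate, sel=None):
    """Evaluate a predicate over a vector, optionally refining a
    previous selection vector of absolute positions."""
    positions = sel if sel is not None else range(base, base + len(vector))
    return [p for p in positions if predicate(vector[p - base])]

col = [3, 17, 8, 25, 4, 30, 12, 1, 19, 6]
matches = []
for base, vec in scan(col):
    sel = apply_predicate(vec, base, lambda v: v > 5)        # first predicate
    sel = apply_predicate(vec, base, lambda v: v < 20, sel)  # refines the selection
    matches.extend(sel)
assert matches == [1, 2, 6, 8, 9]  # positions of 17, 8, 12, 19, 6
```

Because the later predicates only touch positions that survived the earlier ones, the work shrinks as the selection tightens, which is where early filtering pays off.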

5. What is the core intuition for the solution?

The authors introduce a positional variant of Small Materialized Aggregates (SMAs), which they call PSMAs. Optimal compression techniques are chosen for each column in a block, and the lightweight SMA indexes record the per-block min and max, allowing a Data Block to be skipped entirely during a scan. If the block cannot be skipped, the PSMA lookup tables narrow the range of positions that actually has to be scanned.
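A much-simplified sketch of this skipping logic (my own toy construction: the lookup table here is keyed by whole values over a tiny domain, whereas real PSMAs index the leading non-zero byte of each value):

```python
def build_block(values):
    """Build a block with an SMA (min/max) and a PSMA-style
    positional table mapping each value to its first/last position."""
    sma = (min(values), max(values))
    psma = {}
    for pos, v in enumerate(values):
        first, last = psma.get(v, (pos, pos))
        psma[v] = (min(first, pos), max(last, pos))
    return {"values": values, "sma": sma, "psma": psma}

def scan_eq(block, needle):
    """Equality scan: skip via SMA if possible, else narrow via PSMA."""
    lo, hi = block["sma"]
    if needle < lo or needle > hi:
        return []                        # SMA: skip the whole block
    if needle not in block["psma"]:
        return []
    first, last = block["psma"][needle]  # PSMA: narrowed scan range
    return [p for p in range(first, last + 1) if block["values"][p] == needle]

block = build_block([7, 3, 7, 9, 3, 7])
assert scan_eq(block, 7) == [0, 2, 5]   # only positions 0..5 were examined
assert scan_eq(block, 42) == []          # skipped outright via min/max
```

The key property is that the scan touches at most the positions between the first and last occurrence, rather than the whole block.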

The interpreted vectorized scanning techniques conform the data to a common interface before it is handed to the JIT-compiled query pipeline. The interfaces and techniques are built into HyPer. Performing this specialization per tuple, rather than per vector, would create too many code paths.

6. Does the paper prove its claims?

Yes. The authors present experimental results that incrementally evaluate the contribution of each of the four main HyPer features, recorded across 22 different query experiments. The authors further discuss HyPer's adaptivity, evaluating how certain binary query predicates are not SARGable because the complexity of the resulting code paths creates high JIT compile times. This was a large challenge of JIT query compilation.

The authors should make an effort to post their code in a public repository such as GitHub.

7. What is the setup of analysis/experiments? is it sufficient?

The authors have developed a database system called HyPer. HyPer was originally built around a Just-In-Time (JIT) compiling query engine; this work adds a compressed column storage format the authors refer to as Data Blocks. HyPer is comparable to the Drill and Impala databases, so there are at least other published contenders to compare results against.

8. Are there any gaps in the logic/proof?

Overall, the authors made four main changes to the HyPer database.

  1. Change in the structure of the Data Blocks
  2. Change in Data Block compression and indexing required to support said compression
  3. Use and optimization of predicate evaluation on vectorized operations
  4. Integration of multiple storage layout combinations
The authors claim a speed-up on Data Blocks that can be scanned using search arguments (the ability to use search arguments on a Data Block is referred to as SARGable). While the HyPer system takes less than 1/3 the total time compared to vectorized scans, the realization seems to be that JIT query compilation is the greatest driver of performance. The four main changes discussed in this paper contribute only an extra 7.5%. So, the results are somewhat misleading in that they include a practice (JIT query compilation) that exists in other database systems.

As another observation, the vector size of HyPer is set to 8192 (2^13), but Figure 10 is the only experimental representation of this number, and its results do not give a compelling reason why the vector length is what it is. Figure 13 in Appendix A illustrates a possible motive for this size selection, but does not explicitly identify the mathematical correlation between vector size and cache size. There is a tradeoff between the amount of metadata and the usefulness of the block. The process of achieving this balance should be better discussed.
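One plausible motive (an assumption on my part, not a figure stated in the paper) is that a vector of this length exactly fills a Haswell core's L1 data cache when attributes are 4 bytes wide:

```python
# Back-of-envelope check: 8192 four-byte values occupy exactly 32 KB,
# which is the per-core L1 data cache size on Haswell.
VECTOR_LEN = 8192
ATTR_BYTES = 4          # e.g. a 32-bit integer column
L1D_BYTES = 32 * 1024   # Haswell L1d per core
assert VECTOR_LEN * ATTR_BYTES == L1D_BYTES
```

If that is the intended correlation, the paper would benefit from stating it explicitly, since the "right" vector length would then vary with cache size and attribute width.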

9. Describe at least one possible next step.

One possible step would be to tune vector size on different systems. Additionally, the authors could discuss support for user-defined functions and stored procedures - typical practices used to improve database transaction speeds by database administrators. Such discussions might lead the authors to expand the features' independent benefits to Map and Reduce operations.

The Haswell CPU series was released in 2013, but the 2016 Goldmont microarchitecture features improved branch prediction. This may allow more query arguments to be treated as SARGable, as the CPU is better able to absorb the complexity of the binary decisions involved. An investigation into this adaptivity may allow HyPer to take advantage of more modern processor microarchitectures.

The three types of compression are truncation, dictionary, and single value. Additional research could explore novel compression techniques specific to this application.
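A toy sketch of the three schemes (the selection heuristic and names here are my own; HyPer chooses the scheme per column per block based on the actual value distribution):

```python
def compress(values):
    """Pick one of three schemes: single-value when all entries are
    identical, dictionary when few distinct values, else truncation
    (offsets from the block minimum)."""
    if len(set(values)) == 1:
        return ("single", values[0], len(values))
    distinct = sorted(set(values))
    if len(distinct) < len(values) // 2:  # illustrative heuristic only
        codes = {v: i for i, v in enumerate(distinct)}
        return ("dict", distinct, [codes[v] for v in values])
    base = min(values)
    return ("trunc", base, [v - base for v in values])

def decompress(block):
    kind, meta, payload = block
    if kind == "single":
        return [meta] * payload            # one value, repeated
    if kind == "dict":
        return [meta[c] for c in payload]  # decode small codes
    return [meta + off for off in payload]  # add back the base

# Round-trip check on columns suited to each scheme.
for col in ([5, 5, 5], [100, 200, 100, 200, 100, 200], [1000, 1003, 1001]):
    assert decompress(compress(col)) == col
```

The payoff of per-column selection is that each block stores only what its own data needs: a constant column costs one value, a low-cardinality column costs small codes, and a dense numeric column costs narrow offsets.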

BibTex Citation

@inproceedings{lang2016data,
  title={Data blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation},
  author={Lang, Harald and M{\"u}hlbauer, Tobias and Funke, Florian and Boncz, Peter A and Neumann, Thomas and Kemper, Alfons},
  booktitle={Proceedings of the 2016 International Conference on Management of Data},
  pages={311--326},
  year={2016},
  organization={ACM}
}