A review of H2O: A Hands-free Adaptive Store

Neither a row-store nor a column-store architecture is universally good, because different workloads require different layouts.

Note: This review is in the format of the CS265 Discussion format and may reflect my personal opinion.

1. What is the problem?

The way data is stored dictates how it is accessed, and therefore limits performance. For example, row-stores are better for writes and column-stores are better for analytical reads, but neither is universally good. The problem is that even high-performance database management systems (DBMSs) struggle when presented with a variety of workloads. This should not be a surprise given a fixed data layout and a fixed execution engine.

2. Why is it important?

Support for multiple storage layouts means a database can serve multiple workloads better. Normally, databases are tuned to match the workload. As the data set grows, the queries can become more complicated. This stresses both the capabilities of the management engine and the DBA; H2O aims to remove the need for tuning through its design for adaptive access.

3. Why is it hard?

The shape of the data and of the queries matters. For example, a query that touches most of the columns of a wide table reads better from a row-store. However, a query that touches only a few columns is generally faster on a column-store.
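To make the trade-off concrete, here is a minimal sketch of the same toy table held in both layouts. The table, the attribute names, and the "values touched" counter are illustrative assumptions, not taken from the paper; the counter is only a rough stand-in for the data a scan has to pull through memory.

```python
# Toy table: 4 rows, 3 attributes (a, b, c). Values are made up.
rows = [
    (1, 10, 100),
    (2, 20, 200),
    (3, 30, 300),
    (4, 40, 400),
]
NUM_ATTRS = 3

# Row-store layout: one contiguous sequence, tuple after tuple.
row_store = [value for row in rows for value in row]

# Column-store layout: one contiguous sequence per attribute.
column_store = {
    "a": [r[0] for r in rows],
    "b": [r[1] for r in rows],
    "c": [r[2] for r in rows],
}


def sum_b_row_store(store):
    """Sum attribute b by scanning whole tuples; returns (total, values touched)."""
    total, touched = 0, 0
    for i, value in enumerate(store):
        touched += 1                 # every attribute of every tuple is scanned
        if i % NUM_ATTRS == 1:       # position 1 within each tuple is attribute b
            total += value
    return total, touched


def sum_b_column_store(store):
    """Sum attribute b by scanning only its column; returns (total, values touched)."""
    column = store["b"]
    return sum(column), len(column)


print(sum_b_row_store(row_store))        # (100, 12)
print(sum_b_column_store(column_store))  # (100, 4)
```

Both scans return the same sum, but the row layout walks all 12 values while the column layout walks 4. A query that needed a, b, and c together would erase that advantage, since the column-store would then have to stitch three separate sequences back into tuples.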

Both the storage layout and the execution strategy must be chosen to provide the right combination for optimal performance.

4. Why existing solutions do not work?

While some databases support multiple storage engines, each with its own access advantages, these storage engines cannot communicate with each other, because each execution engine works only on the data format it supports. Also, some workloads oscillate, and most adaptive databases do not account for that.

5. What is the core intuition for the solution?

First, H2O incorporates multiple storage layouts, abstracted behind a single engine. Second, it adaptively selects, during query processing, which layout is best for the query and the parts of the data it touches. The data is materialized in different patterns depending on the workload. Also, the storage and access patterns are tailored to the properties of the query.

H2O does not make any fixed decisions regarding layout or execution. Everything is adaptive with respect to the workload. Essentially, H2O distinguishes between the SELECT and WHERE clauses and provides access patterns based on each. A new layout is not generated until a query actually needs it.
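The lazy, workload-driven behaviour described above can be sketched as follows. This is a toy under stated assumptions, not H2O's actual algorithm or cost model: the class `ToyAdaptiveStore`, the `threshold` knob, and the layout labels are all hypothetical. Access patterns are counted per (SELECT, WHERE) pair to echo the distinction the paper draws, but the materialized column group is simply the union of both attribute sets, which is a simplification.

```python
from collections import Counter


class ToyAdaptiveStore:
    """Toy layout chooser; hypothetical, not H2O's real decision logic."""

    def __init__(self, all_attrs, threshold=2):
        self.all_attrs = frozenset(all_attrs)
        self.threshold = threshold   # hypothetical: repeats needed before materializing
        self.groups = set()          # column groups materialized so far
        self.seen = Counter()        # how often each (SELECT, WHERE) pattern occurred

    def plan(self, select_attrs, where_attrs):
        """Return (layout label, attribute set) chosen for one query."""
        select_set = frozenset(select_attrs)
        where_set = frozenset(where_attrs)
        needed = select_set | where_set
        if not needed <= self.all_attrs:
            raise ValueError("query references unknown attributes")

        # Prefer the narrowest existing group that covers the query.
        covering = [g for g in self.groups if needed <= g]
        if covering:
            return "column-group", min(covering, key=len)

        # Otherwise record the pattern; build a group only once a query
        # has asked for it often enough (lazy materialization).
        self.seen[(select_set, where_set)] += 1
        if self.seen[(select_set, where_set)] >= self.threshold:
            self.groups.add(needed)
            return "column-group (new)", needed
        return "single-columns", needed


store = ToyAdaptiveStore("abcdef")
for _ in range(3):
    layout, attrs = store.plan(select_attrs=["a", "b"], where_attrs=["c"])
    print(layout, sorted(attrs))
```

Running the same query three times moves it from single columns, to a group built as a side effect of the second query, to reuse of that group on the third. No layout exists before some query asks for it.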

6. Does the paper prove its claims?

First, the authors compare a state-of-the-art row-store database with a state-of-the-art column-store database. The results show that the row-store has a higher access time than the column-store until a large number of attributes are accessed. Further, the aggregation time is consistently lower for the column-store, most likely because the row-store organizes attributes into tuples and pages and because column-stores have lower interpretation overhead. Neither database is optimal throughout the entire experiment.

The authors show that H2O selects the access method with the best performance on both demonstrative and real-world workloads. This is accomplished by first showing that fixed data layout approaches do not handle workloads with high variety. Then, adaptive layouts and adaptive query execution are presented and shown to handle changing workloads. Later, it is shown that refining the layouts provides good access performance. Finally, H2O is compared to various databases with static data layouts.

7. What is the setup of analysis/experiments? Is it sufficient?

All experiments are conducted on a Sandy Bridge server with a dual-socket Intel(R) Xeon(R) CPU E5-2660 (8 cores per socket @ 2.20 GHz), equipped with 64 KB L1 cache and 256 KB L2 cache per core, 20 MB shared L3 cache, and 128 GB RAM, running Red Hat Enterprise Linux 6.3 (Santiago - 64bit) with kernel version 2.6.32. The server is equipped with a RAID-0 of seven 250 GB 7500 RPM SATA disks. H2O itself is written in C++.

The authors use both fine-tuned micro-benchmarks and the real-life SDSS workload from the SkyServer project. In the micro-benchmarks, a table of 100 million tuples with 150 attributes of random integer values was used, and a sequence of 100 queries was executed. The use of both tests is important (and sufficient) because the micro-benchmarks are designed to stress test the database, while the SkyServer project represents real-world data.
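For a sense of scale, the micro-benchmark table size can be estimated with a few lines of arithmetic. The 4-byte integer width is my assumption (the review does not state it), and GB here means 10^9 bytes.

```python
TUPLES = 100_000_000   # from the micro-benchmark description
ATTRIBUTES = 150       # from the micro-benchmark description
BYTES_PER_INT = 4      # assumption: 32-bit integers

table_bytes = TUPLES * ATTRIBUTES * BYTES_PER_INT
table_gb = table_bytes / 10**9                 # raw size of the whole table
column_gb = TUPLES * BYTES_PER_INT / 10**9     # raw size of a single column

print(f"whole table: {table_gb} GB, one column: {column_gb} GB")
```

Under that assumption the raw table is about 60 GB, which would fit in the server's 128 GB of RAM, and a single column is about 0.4 GB. A query touching 10 attributes would then scan roughly 4 GB in a column layout versus the full 60 GB in a pure row layout, which is the gap the layout experiments probe.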

8. Are there any gaps in the logic/proof?

Most relational schemas are stable: the number of columns rarely changes. Adaptive data organization may therefore pay off in only a small subset of real-world cases, and the authors do not address the settling time, that is, how long H2O takes to reach a stable layout. They do discuss the handling of oscillating workloads, under which the access patterns should progress slowly and converge to a well-performing design, but this convergence is not directly examined.

9. Describe at least one possible next step.

The authors should discuss the use of user-defined functions in query execution. This is a common DBA use-case, along with queries and read-only materialized views. Particularly, materialized views may push the data-layout selection to the far right for atomic normalized tables. It would be a meaningful study to see which balance or imbalance H2O selects.

BibTeX Citation

@inproceedings{alagiannis2014h2o,
  title={H2O: a hands-free adaptive store},
  author={Alagiannis, Ioannis and Idreos, Stratos and Ailamaki, Anastasia},
  booktitle={Proceedings of the 2014 ACM SIGMOD international conference on Management of data},
  pages={1103--1114},
  year={2014},
  organization={ACM}
}