A review of Skipping oriented Partitioning for Columnar Layouts

As the volumes of data in the average database system (DBS) continues to grow, modern database systems use mechanisms such as data skipping to improve performance.

Note: This review is in the format of the CS265 Discussion format and may reflect my personal opinion.

1. What is the problem?

Data skipping approaches are generally with respect for row-oriented systems in order to speed up analytical query performance. While column-oriented solutions exist as well, it is ultimately the workload that is the true driver of performance with respect to data skipping. Also, measuring the effectiveness of data skipping is a challenge, because to know if any data has been missed, all the data skipped has to be read when the effectiveness is measured.

Often, skipping is accomplished by adding metadata onto data blocks such that queries know whether or not to scan the block. The solution of Skipping-Oriented Partitioning offers much promise by storing query evaluations in feature vectors, however, the filter usage is highly skewed (some filters being used much much more than others). While data layouts can be designed to benefit query filters, this might not reflect the OLTP/OLAP priority and balance of the actual workload.

2. Why is it important?

For extremely large data sets, the data spans not only a disk but a series of nodes. It is unreasonable to scan all the data on each node when looking for a given tuple. Therefore, having a sense of where the data is and skipping the data that isn't needed will greatly decrease the analysis time.

3. Why is it hard?

In the process of partitioning, tuples must be reconstructed. The reconstruction process can incur a noteworthy amount of overhead. While there are combining row and column partitioning has skipping benefits, it must be considered against the tuple reconstruction trade-offs. Additionally, optimization models should be as close to closed-form as possible in order to minimize the overhead used in partition optimization.

4. Why existing solutions do not work?

Modern systems address this problem by creating metadata for the blocks of data. While this saves disk transfers and reduces the CPU work, it does not take into account the actual workload that will be performed. Systems that combine the predicates as features in the metadata have been shown to outperform the traditional approach. For wide tables and complex workloads, however, predicate features run into conflict when organized into a single horizontal partitioning scheme.

5. What is the core intuition for the solution?

The authors have developed a framework, which they call Generalized Skipping-Oriented Partitioning (GSOP), that takes into account row-based and column-based trade-offs. While previous solutions use one horizontal partitioning scheme to incorporate all features, GSOP allows for partitioning of columnar layouts such that each column can have a different horizontal partitioning scheme. This is similar to the concept of database cracking, in that tuples are no longer treated as the atomic unit. This reduces feature conflict and provides more fidelity to data skipping - a perspective not as realizable in row-only or column-only layouts.

To overcome the tuple reconstruction issue, GSOP has optimization to balance effective skipping and tuple reconstruction overhead. To do this, GSOP varies vertical (column) and horizontal (row) grouping, where previous techniques would group all the columns and then horizontally partition. In this approach, a model is used to estimate the effectiveness of data skipping. The authors refer to this is a "skipping-aware column grouping technique".

6. Does the paper prove its claims?

The authors use two common benchmarks (Big Data Benchmark, TPC-H) and real-world workloads (TPC-H, Sloan Digital Sky Surveys) to show that GSOP can reduce the amount of data scanned - and therefore improve query response over other modern techniques. In order to do this, GSOP is only applied on 10 million rows at a time - adding a control in that everything else is the same except for GSOP vs SOP. Using this approach, GSOP, on in some tests, performed an order of magnitude fewer reads when a scan is performed. This matches an early statement by the authors that "10% of the filters are used by 90% of the queries". On average, GSOP reduces the data read by 6.5x and improve the query response time by 3.3x over SOP.

7. What is the setup of analysis/experiments? is it sufficient?

SOP was prototyped using on Apache Spark and stored in Apache Parquet. The Spark SQL performance was compared in-place to SOP. This provides more controls than creating a system from scratch.

For the big data benchmark (to look at sensitivity analysis through benchmarking query response), the experiments were conducted on an Amazon EC2 m3.2xlarge instance with Intel Ivy Bridge Processors and 80 GB of RAM, and a 2TB SSD. For the TPC-H and SDSS workloads (in which scale and performance is important), the experiments were conducted on a Spark cluster of 9 Amazon EC2 i2.2xlarge instances, with 1 master and 8 slaves - each with Intel Ivy Bridge Processors, 61GB of RAM, and 2X 800G SSD. This is a sufficient hardware configuration for test as it represents a typical production environment.

8. Are there any gaps in the logic/proof?

The authors do not explain whether or not the data blocks are sorted, based on tuple IDs, when a data block is placed on physical storage. This would allow for both sequential reads and skipping within the block. Instead, the authors discuss merge-sort in memory, where a block that isn't skipped is sorted, and then merged with another block when evaluating a query. Since the authors discuss the use of Hadoop for [possible future] storage, it would make sense to optimize the read efficiency, but this is not discussed.

In each graph provided with respect to the objective function, the authors only evaluate one feature and the objective function. While this trend may continue across the design space, it would be clearer if they provided multi-variable analysis to give a better sense of where in the design space the cross section lies.

9. Describe at least one possible next step.

GSOP highly benefits OLAP database systems, but further research could approach a more balanced system. Since GSOP can handle both horizontal and vertical data, it might be able to keep its processed data as a materialized partition while a horizontally structured partition exists for writing. In the case OLAP transactions on normalized data, the authors could analyze the effectiveness of grouping similar data in larger, sorted column-oriented data blocks.

BibTex Citation

@article{sun2016skipping,
  title={Skipping-oriented partitioning for columnar layouts},
  author={Sun, Liwen and Franklin, Michael J and Wang, Jiannan and Wu, Eugene},
  journal={Proceedings of the VLDB Endowment},
  volume={10},
  number={4},
  pages={421--432},
  year={2016},
  publisher={VLDB Endowment}
}