A review of VectorH: Taking SQL-on-Hadoop to the Next Level

Posted in reviews by Christopher R. Wirz on Fri Feb 09 2018





Actian Vector in Hadoop (VectorH) is an SQL-on-Hadoop system built on Vectorwise, an engine designed for high performance in analytical database applications. The idea behind vectorization is that each operator works on a large block of values at once rather than interpreting the query one tuple at a time; combined with columnar storage, this keeps per-tuple overhead low.
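To make the contrast concrete, here is a minimal sketch (illustrative only, not VectorH code; the block size and column names are assumptions) of tuple-at-a-time versus block-at-a-time execution.

```python
# Minimal illustration of vectorized (block-at-a-time) execution versus
# tuple-at-a-time execution; sizes and column names are illustrative.
import numpy as np

VECTOR_SIZE = 1024  # process values in fixed-size blocks that fit in cache

def scan_tuple_at_a_time(prices, discounts):
    # One interpreted operator call per tuple: high per-tuple overhead.
    out = []
    for p, d in zip(prices, discounts):
        out.append(p * (1.0 - d))
    return out

def scan_vectorized(prices, discounts):
    # One operator call per block of VECTOR_SIZE values: the arithmetic runs
    # over whole arrays, amortizing interpretation overhead across the block.
    out = np.empty_like(prices)
    for i in range(0, len(prices), VECTOR_SIZE):
        block = slice(i, i + VECTOR_SIZE)
        out[block] = prices[block] * (1.0 - discounts[block])
    return out

prices = np.random.rand(10_000)
discounts = np.random.rand(10_000)
assert np.allclose(scan_tuple_at_a_time(prices, discounts),
                   scan_vectorized(prices, discounts))
```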

Note: This review follows the CS265 discussion format and may reflect my personal opinion.

1. What is the problem?

SQL-on-Hadoop systems struggle because they depend on long-running server processes. While YARN can help improve scheduling policies, it is not viable to submit each query to YARN as a separate job: a query should run on a thread inside the already-active server process, and going out-of-process increases latency.

The biggest problem is taking the Vectorwise system, which is designed for a single server, and converting it into VectorH, a shared-nothing product. The goal is to keep the vectorized processing model while scaling it out across a Hadoop cluster.

2. Why is it important?

SQL-on-Hadoop provides many of the benefits of an MPP (massively parallel processing) database, with Hadoop playing the role of the dedicated hardware in a traditional MPP system. Often we try to scale vertically to increase performance, but if Hadoop is the hardware, we can instead scale horizontally. An MPP database is optimized to run many operations in parallel across many processing units at a time, but those operations may be reads or writes; as a result, the sheer amount of data typically limits concurrency.

3. Why is it hard?

SQL-on-Hadoop runs on a cluster of nodes. This means that data must be distributed across all nodes, and the work must run in parallel both across nodes and across the cores within each node. Even in a regular MPP system, dividing tasks among resources is hard.

Not only is this challenge exacerbated across a cluster, it is also worsened by the many processes and Hadoop jobs competing for resources, which now include the network and additional disks. Larger systems have hundreds of nodes; even at 100 nodes with 20 cores each, that is 2,000 cores' worth of buffers, leading to an increased number of translation lookaside buffer (TLB) misses. Since the TLB caches translations of virtual memory to physical memory, each miss forces a much more time-consuming page walk. Further, all of this buffering adds up to many gigabytes of memory.

4. Why do existing solutions not work?

Previous solutions implement some or all of the following techniques:

  1. Column stores to reduce read/write volume and memory footprint
  2. Data compression (such as in "Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation") to reduce the memory footprint
  3. Data-skipping materialized indices (such as MinMax) that identify characteristics of the data before it is read into the query interpreter (a minimal sketch of the idea follows below)
  4. Use of hardware-optimized query logic

The concept of using Hadoop (or HDFS) as the foundation is relatively new and aids horizontal scalability and replication.
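To make the data-skipping idea concrete, here is a minimal sketch of MinMax-style skipping (illustrative only, with invented names and block sizes, not the paper's implementation): per-block minimum and maximum values let a scan skip blocks that cannot satisfy a predicate.

```python
# Hypothetical sketch of MinMax data skipping: per-block min/max metadata
# lets a scan skip blocks that cannot possibly contain matching values.
from dataclasses import dataclass
from typing import List

@dataclass
class BlockStats:
    min_val: int
    max_val: int
    values: List[int]  # the block's column data, kept inline for simplicity

def build_blocks(column, block_size=4):
    chunks = (column[i:i + block_size] for i in range(0, len(column), block_size))
    return [BlockStats(min(c), max(c), c) for c in chunks]

def scan_greater_than(blocks, threshold):
    hits = []
    for b in blocks:
        if b.max_val <= threshold:
            continue  # data skipping: this block is never read
        hits.extend(v for v in b.values if v > threshold)
    return hits

blocks = build_blocks([1, 2, 3, 4, 10, 11, 12, 13, 50, 51, 52, 53])
print(scan_greater_than(blocks, 20))  # only the last block is actually scanned
```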

However, in the context of many nodes working on HDFS, whose files are append-only, space can only be freed by fully rewriting a table. A combination of inserts and deletes therefore leads to wasted space, and repeated appends to a table cause many files to be read. VectorH addresses this by splitting data files horizontally into chunks of 1024 blocks, so that individual chunks can be rewritten or dropped without touching the rest of the table.
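The following toy sketch (a hypothetical layout, not VectorH's on-disk format; the chunk size is arbitrary) illustrates why chunking helps: appends only touch the last chunk, and deletes only force a rewrite of the chunks that contain matching rows.

```python
# Hypothetical illustration of a table split into fixed-size chunks.
# Deleting rows rewrites only the affected chunks, never the whole table.
CHUNK_ROWS = 4  # illustrative; VectorH uses chunks of 1024 blocks

class ChunkedTable:
    def __init__(self):
        self.chunks = [[]]  # each chunk holds at most CHUNK_ROWS rows

    def append(self, row):
        if len(self.chunks[-1]) >= CHUNK_ROWS:
            self.chunks.append([])  # start a new chunk instead of rewriting
        self.chunks[-1].append(row)

    def delete_where(self, predicate):
        rewritten = 0
        for i, chunk in enumerate(self.chunks):
            if any(predicate(r) for r in chunk):
                self.chunks[i] = [r for r in chunk if not predicate(r)]
                rewritten += 1
        return rewritten  # number of chunks that had to be rewritten

table = ChunkedTable()
for row in range(10):
    table.append(row)
print(table.delete_where(lambda r: r == 5))  # rewrites a single chunk
```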

5. What is the core intuition for the solution?

The authors achieve SQL on Hadoop by:

  1. Managing the block size of Hadoop to better support updates to data
  2. Instrumenting Hadoop to optimize read locality, determining where blocks are replicated by assigning a responsible datanode (a sketch of the idea appears below)
  3. Integrating YARN for workload management and horizontal scalability / elasticity

YARN is used as the resource manager, and VectorH is an implementation of Vectorwise that runs on Hadoop with YARN managing its resources.
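The placement idea can be sketched roughly as follows (purely illustrative Python; VectorH achieves this by instrumenting HDFS block placement, and the round-robin policy and function names here are assumptions).

```python
# Illustrative model of locality-aware placement: every column file of a
# partition is steered to the same "responsible" datanode, so the node that
# scans the partition reads all of its columns locally.
def assign_responsible_nodes(partitions, datanodes):
    """Map each partition to a datanode round-robin (an assumed toy policy)."""
    return {p: datanodes[i % len(datanodes)] for i, p in enumerate(partitions)}

def place_replicas(partition, columns, responsible, datanodes, replication=3):
    """Put the first replica of every column file on the responsible node."""
    others = [d for d in datanodes if d != responsible]
    return {f"{partition}/{col}": [responsible] + others[:replication - 1]
            for col in columns}

nodes = ["dn1", "dn2", "dn3", "dn4"]
owner = assign_responsible_nodes(["part0", "part1"], nodes)
print(place_replicas("part0", ["l_orderkey", "l_price"], owner["part0"], nodes))
```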

VectorH is the first SQL-on-Hadoop system to measure the performance and placement of HDFS blocks so that the distribution attempts to co-locate the column files of each partition, which can dramatically decrease read times on a distributed or dedicated system. Updates are handled through Positional Delta Trees (PDTs), while MinMax metadata on the stored blocks allows skipping during scans.

6. Does the paper prove its claims?

The paper reads more like a design summary than an application of the scientific method. In that sense, all the design decisions are purposefully described, and the authors do a good job of explaining why each one supports the goals of the paper. For example, Vectorwise's Positional Delta Trees (PDTs) store insert, delete, and update modifications in their leaves so that both the stable ID and the current ID of each tuple are available.
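Real PDTs are counted trees with a fairly involved merge algorithm; the following deliberately simplified stand-in (a flat delta buffer with invented names) only shows how differential updates are merged into a scan while both stable and current tuple positions stay visible.

```python
# Greatly simplified sketch of differential updates merged into a scan.
# Real PDTs are trees keyed by position; a few dictionaries stand in here,
# purely to show stable versus current tuple positions.
class SimpleDeltaBuffer:
    def __init__(self):
        self.deleted = set()   # stable positions marked as deleted
        self.updated = {}      # stable position -> new value
        self.inserted = {}     # stable position -> values inserted after it

    def delete(self, pos):
        self.deleted.add(pos)

    def update(self, pos, value):
        self.updated[pos] = value

    def insert_after(self, pos, value):
        self.inserted.setdefault(pos, []).append(value)

    def merge_scan(self, stable_column):
        """Yield (current_id, stable_id, value) with deltas applied on the fly."""
        current = 0
        for stable, value in enumerate(stable_column):
            if stable not in self.deleted:
                yield current, stable, self.updated.get(stable, value)
                current += 1
            for v in self.inserted.get(stable, []):
                yield current, None, v  # freshly inserted tuples have no stable id
                current += 1

buf = SimpleDeltaBuffer()
buf.delete(1)
buf.update(2, 99)
buf.insert_after(0, 42)
print(list(buf.merge_scan([10, 20, 30])))  # [(0, 0, 10), (1, None, 42), (2, 2, 99)]
```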

Major design decisions are justified early through comparative benchmarks. Figure 1 illustrates that VectorH exhibits good compression and good speed compared to the Parquet and ORC formats, which establishes a sense of VectorH's contribution to performance and validates the design roadmap laid out in the rest of the paper. Figure 7 (which is a table) shows that VectorH is around 90x faster than HAWQ, SparkSQL, Impala, and Hive on the TPC-H query set.

7. What is the setup of analysis/experiments? Is it sufficient?

The authors evaluated VectorH on a 10-node cluster and compared it against the latest versions of HAWQ, Hive, Impala, and SparkSQL. Each node has two Xeon E5-2690 v2 CPUs (20 real cores) and 256GB of RAM, is connected via a 10Gb Ethernet card, and has 24 600GB disks.

TPC-H is a decision support benchmark suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database were chosen to have broad industry-wide relevance.

8. Are there any gaps in the logic/proof?

In the introduction, the paper states that a clustered system has the added challenge of minimizing network congestion, and the authors use the Message Passing Interface (MPI) to address it. However, the experiments used 10Gb Ethernet cards, which would probably mask the early onset of network congestion.
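For context, MPI gives the long-running server processes low-latency point-to-point messaging between nodes. The snippet below is a minimal example using the mpi4py bindings, not VectorH's actual exchange operators, and the tag and payload are invented.

```python
# Minimal mpi4py example of point-to-point exchange between two ranks.
# Run with: mpiexec -n 2 python exchange_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    block = [1, 2, 3, 4]              # a block of column values to ship
    comm.send(block, dest=1, tag=11)  # send it to the peer rank
elif rank == 1:
    block = comm.recv(source=0, tag=11)
    print(f"rank 1 received {block}")
```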

9. Describe at least one possible next step.

The following data types are currently supported for vectorized execution in Vectorwise:

  • tinyint
  • smallint
  • int
  • bigint
  • boolean
  • float
  • double
  • decimal
  • date
  • timestamp
  • string

However, user-defined functions do not run as fast as native operations. The authors could address this with an adaptive planner that places user-defined function evaluation later in the plan (where possible) so that the faster vectorized operators run first; a rough sketch of the idea follows.
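The sketch below (hypothetical; not an existing VectorH planner feature, and the threshold is arbitrary) shows why the ordering matters: pushing a selective vectorized predicate ahead of a slow UDF shrinks the UDF's input.

```python
# Hypothetical illustration of placing an expensive UDF after a cheap,
# selective vectorized predicate, so the UDF touches far fewer tuples.
import numpy as np

def slow_udf(value):
    # Stand-in for a user-defined function that runs tuple-at-a-time.
    return value * 2

prices = np.random.default_rng(0).random(100_000)

# Plan A: UDF first, then filter -- the UDF touches every tuple.
plan_a = [v for v in (slow_udf(p) for p in prices) if v > 1.9]

# Plan B: vectorized filter first, then UDF -- the UDF touches ~5% of tuples.
selected = prices[prices > 0.95]        # vectorized and cheap
plan_b = [slow_udf(p) for p in selected]

assert sorted(plan_a) == sorted(plan_b)  # same result, far less UDF work
```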

Hadoop (and many of the other technologies used) is designed to run on commodity hardware - meaning cheap and easy to find. The authors of this paper used high-end hardware that will hide any bottleneck effects commodity hardware would expose. While their setup might be considered commodity seven years from now, they should show the effects of stress-testing TPC-H on a commodity cluster today.

Finally, the authors should spend more time analyzing the network traffic between nodes. Although it is a key consideration for cluster-based big data systems, it did not receive much attention in this paper.

BibTex Citation

@inproceedings{costea2016vectorh,
  title={VectorH: taking SQL-on-Hadoop to the next level},
  author={Costea, Andrei and Ionescu, Adrian and R{\u{a}}ducanu, Bogdan and Switakowski, Micha{\l} and B{\^a}rca, Cristian and Sompolski, Juliusz and {\L}uszczak, Alicja and Szafra{\'n}ski, Micha{\l} and De Nijs, Giel and Boncz, Peter},
  booktitle={Proceedings of the 2016 International Conference on Management of Data},
  pages={1105--1117},
  year={2016},
  organization={ACM}
}