A review of Morsel-Driven Parallelism

Posted in reviews by Christopher R. Wirz on Thu Dec 28 2017



As computer hardware continues to evolve, we can revisit tricks from the '70s to drive extra performance from our data systems. In this article, the paper "Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age" is reviewed and discussed.

Note: This review is in the CS265 format and may reflect my personal opinion.

1. What is the problem?

Within the architecture of many database systems, query execution does not take advantage of multiple cores. Even when it does, dividing the work evenly across cores is challenging because cores do not finish their shares of the work at the same time. Further, placing memory controllers on each chip leads to Non-Uniform Memory Access (NUMA): memory attached to a remote socket is slower to reach than local memory. This research parallelizes query execution while taking the RAM hierarchy into account.

2. Why is it important?

As mainstream servers may have 30 cores, equating to 120 hardware threads, allowing a query to scale elastically across threads allows for faster overall execution. Traditionally, a database optimizer attempts to create a plan that executes identical query segments across multiple threads: the data is divided into sections, each thread runs the query over its own section, and the results are re-connected afterward. This research changes the pipeline planning and dispatches small, constant-sized segments of tuples (morsels) while threads stay pinned to cores and, therefore, to their local sections of memory.

The proposed morsel-driven query processing operates on small fragments of data ("morsels") and can therefore be scheduled better across workers in a Non-Uniform Memory Access (NUMA) setting, where touching memory attached to another socket (for example, after a thread migrates to a core on a different socket) is slower than touching local memory. The dispatcher distributes work across threads at run time, making execution fully elastic.
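To make the idea concrete, below is a minimal sketch of such a dispatcher in C++. This is my own illustration under simplifying assumptions (one pipeline over one table, contiguous tuple ranges per NUMA node), not HyPer's actual code; the names Morsel, Dispatcher, and Node are invented for the example.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// A morsel is a small, fixed-size range of tuple indexes.
struct Morsel {
    std::size_t begin;
    std::size_t end;
};

class Dispatcher {
public:
    // One cursor per NUMA node; each node owns a contiguous range of tuples.
    Dispatcher(const std::vector<std::pair<std::size_t, std::size_t>>& node_ranges,
               std::size_t morsel_size)
        : morsel_size_(morsel_size) {
        for (const auto& r : node_ranges)
            nodes_.push_back(std::make_unique<Node>(r.first, r.second));
    }

    // A worker pinned to `node` asks for NUMA-local work first; if its node is
    // exhausted it steals a morsel from another node, trading locality for
    // keeping every core busy.
    std::optional<Morsel> next(std::size_t node) {
        for (std::size_t i = 0; i < nodes_.size(); ++i) {
            Node& n = *nodes_[(node + i) % nodes_.size()];
            std::size_t begin = n.cursor.fetch_add(morsel_size_);
            if (begin < n.end)
                return Morsel{begin, std::min(begin + morsel_size_, n.end)};
        }
        return std::nullopt;  // this pipeline is finished
    }

private:
    struct Node {
        Node(std::size_t b, std::size_t e) : cursor(b), end(e) {}
        std::atomic<std::size_t> cursor;  // next unprocessed tuple on this node
        std::size_t end;
    };
    std::vector<std::unique_ptr<Node>> nodes_;
    std::size_t morsel_size_;
};

A worker pinned to node n simply loops on next(n) until it returns nothing. Because each morsel is small, a newly added worker (or a higher-priority query) can join in after at most one morsel's worth of work, which is what makes the scheme elastic.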

3. Why is it hard?

One of the challenges is selecting the right morsel size for a given execution environment. If the size is too small or too large, execution time suffers: morsels that are too small create too much work for the dispatcher, while morsels that are too large strain the memory hierarchy and leave less room to balance load across workers.
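A quick back-of-the-envelope calculation illustrates the trade-off. The table size and candidate morsel sizes below are made-up numbers for illustration, not figures from the paper.

#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative numbers only: a 1-billion-tuple scan split into morsels
    // of three candidate sizes.
    const std::uint64_t table_tuples = 1'000'000'000;
    const std::uint64_t morsel_sizes[] = {1'000, 100'000, 10'000'000};

    for (std::uint64_t m : morsel_sizes) {
        // Every morsel costs one synchronized round-trip to the dispatcher.
        std::uint64_t dispatches = table_tuples / m;
        std::printf("morsel = %8llu tuples -> %9llu dispatcher calls\n",
                    static_cast<unsigned long long>(m),
                    static_cast<unsigned long long>(dispatches));
    }
    return 0;
}

The 1k-tuple setting pays 10,000 times more scheduling overhead than the 10M-tuple setting, while the 10M-tuple setting leaves far fewer opportunities to rebalance work among threads.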

4. Why existing solutions do not work?

In previous designs, such as Volcano-style parallelism, shared state is avoided: the data is split among independent plan instances and the degree of parallelism is fixed when the plan is created. Morsel-driven execution instead embraces shared state (for example, a single shared hash table during a join), which has the advantage that it can be accessed from any core, letting any worker pick up any morsel while maximizing NUMA-local execution.
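The sketch below shows, under my own simplifying assumptions (toy modulo bucketing, caller-allocated entries), the kind of shared structure this relies on: one hash table that all workers build concurrently by compare-and-swapping bucket head pointers, so no thread waits on a lock.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// One chained entry; `next` links entries whose keys hash to the same bucket.
struct Entry {
    std::uint64_t key;
    std::uint64_t payload;
    Entry* next;
};

class SharedHashTable {
public:
    explicit SharedHashTable(std::size_t bucket_count) : heads_(bucket_count) {
        for (auto& h : heads_) h.store(nullptr, std::memory_order_relaxed);
    }

    // Called concurrently by every worker during the build phase. Entries are
    // allocated by the caller, ideally from the inserting worker's local node.
    void insert(Entry* e) {
        auto& head = heads_[e->key % heads_.size()];   // toy bucketing
        Entry* old = head.load(std::memory_order_relaxed);
        do {
            e->next = old;  // link in front of the current chain
        } while (!head.compare_exchange_weak(old, e,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

private:
    std::vector<std::atomic<Entry*>> heads_;
};

Because the entries are allocated by whichever worker inserts them, the tuples the table points to stay largely NUMA-local to the morsels they came from, even though the table itself is shared by every core.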

5. What is the core intuition for the solution?

A scheduling mechanism (the dispatcher) establishes flexible parallel execution and can even change the degree of parallelism while a query is running. Threads are pinned to cores, and the dispatcher preferentially hands each thread morsels that reside in that core's local memory, so there is no loss of NUMA locality.
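As a minimal illustration of the pinning side (Linux-specific, using pthread_setaffinity_np; my own sketch, not the authors' code), each worker thread can be bound to one core so the OS never migrates it away from its NUMA node:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for pthread_setaffinity_np on Linux
#endif
#include <pthread.h>
#include <sched.h>             // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <vector>

// Bind a worker thread to one specific core so the OS never migrates it.
void pin_to_core(std::thread& t, unsigned core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    const unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c) {
        workers.emplace_back([] {
            // Worker loop: repeatedly ask the dispatcher for a (preferably
            // NUMA-local) morsel and run the current pipeline on it (omitted).
        });
        pin_to_core(workers.back(), c);  // worker c stays on core c
    }
    for (auto& t : workers) t.join();
    return 0;
}

Combined with a dispatcher that prefers NUMA-local morsels, pinning keeps each core working mostly on memory attached to its own socket.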

6. Does the paper prove its claims?

Yes. The authors clearly demonstrate higher speedups for full-fledged HyPer than for non-NUMA-aware HyPer, non-adaptive HyPer, and Vectorwise on Nehalem EX across 22 different queries. Additionally, these results show that full-fledged HyPer takes advantage of multiple threads better than the other systems, and that Vectorwise has problems with load balancing across threads.

7. What is the setup of analysis/experiments? is it sufficient?

The authors of this paper are the authors of the HyPer database and performed the experiments using that system. They test on both a 4-socket Nehalem EX system (Intel Xeon X7560 at 2.3 GHz) and a 4-socket Sandy Bridge EP system (Intel Xeon E5-4650L at 2.6 GHz-3.1 GHz); such systems are typical platforms for main-memory databases. The Sandy Bridge system has twice the per-node memory bandwidth, but memory accesses that cross CPU sockets are believed to incur higher latency.

8. Are there any gaps in the logic/proof?

Yes. The paper states that the morsel-driven method has advantages over the Volcano model without directly comparing the two approaches. Also, the tagged hash table is claimed to be better than using Bloom filters, but the only rationale cited is the size a separate Bloom filter would require.

While the authors show that query execution time decreases as the morsel size increases toward 10k tuples, the x-axis of that plot should be in powers of 2, not powers of 10; most of the other plots use a base-2 x-axis.

9. Describe at least one possible next step.

GPUs have many more cores than CPUs, so this approach should be tested against a version that runs the workers on GPU cores. Given the abstraction provided by CUDA libraries, no hardware-specific parameter tuning beyond the number of workers and the morsel size would be needed. A 2-D array of results (workers by morsel size) would be easy to visualize and would illustrate patterns in GPU processing of database scans.

The authors should also re-run the 22 queries with a Volcano-style execution framework. Volcano uses tuple-at-a-time operator execution, whereas HyPer compiles pipelines and schedules work in morsel-sized batches.

BibTeX Citation

@inproceedings{leis2014morsel,
  title={Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age},
  author={Leis, Viktor and Boncz, Peter and Kemper, Alfons and Neumann, Thomas},
  booktitle={Proceedings of the 2014 ACM SIGMOD international conference on Management of data},
  pages={743--754},
  year={2014},
  organization={ACM}
}