A review of Fast Scans on Key-Value Stores

Key-Value stores (KVS) are popular because they can scale horizontally with high throughputs and low latency. Key-Value stores are predictable in that put and get requests complete in constant time. While they are simple in design and layout, the cost comes from analytical queries and a lack of efficient ways to scan data.

Note: This review is in the CS265 format and may reflect my personal opinion.

1. What is the problem?

SQL-over-NoSQL is the concept of allowing analytical queries over a key-value store. The problem is allowing this architecture to support Online transaction processing (OLTP) and Online analytical processing (OLAP) workloads over the same KVS. OLAP and OLTP access patterns are different.

2. Why is it important?

The big advantage of the SQL-over-NoSQL architecture is that it is elastic (like a KVS). Scaling can occur both horizontally and vertically - adding more machines for processing and storage. Additionally, processing nodes could switch to storage nodes if the analytical workload relaxes. Typically, a KVS does not have acceptable analytical performance and analytical systems do not support near-real-time performance.

3. Why is it hard?

Supporting get/put and scan operations leads to conflicting goals in access patterns. To make scans fast, data needs to be closely located, but for putting and getting, performance is better if the data is distributed evenly - or more spread out. Consider the use of a hashtable for a key-value store: it is great for locating (get/put) data evenly within a data structure, but it requires all data to be read in order for a scan. While this is an extreme example, it illustrates that point that the goals and access patterns are conflicting.

Concurrency also adds the challenge of versioning. Returning the right record, depending on the beginning of the transaction, involves attention to timestamp. Finally, old entries, that have a non-current timestamp earlier than any process, must be garbage collected - and garbage collection can reduce performance.

4. Why existing solutions do not work?

Most KVS do not have such capabilities and those that do, cannot execute scans with acceptable performance. Kudu is a column-oriented KVS designed for both OLTP and OLAP workloads, but sub-second latency is still expected. This is not desirable for real-time performance.

5. What is the core intuition for the solution?

TellStore arranges records following the Partition Attributes Across Paradigm (PAX). By keeping the field size of a table fixed, the system only needs to know the location of record beginning. Given this information, the attribute position can be computed by its size. Fields with undetermined sizes are stored in a heap, directly after the rest of the attributes in row-oriented format.

6. Does the paper prove its claims?

The authors compare TellStore (the proposed solution) to Cassandra, RAMCloud, HBase, RocksDB, Kudu, and MemSQL. This first illustrates the advantages of TellStore of state-of-the-art solutions with comparable goals. From there, the authors support each design decision of TellStore, with proper consideration to trade-offs and analysis of alternatives.

7. What is the setup of analysis/experiments? is it sufficient?

The experiments were conducted on a small cluster of 12 machines. Each machine has 2 quad core Intel Xeon E5-2609 2.4 GHz processors), 128 GB DDR3-RAM and a 256 GB Samsung Pro SSD. The systems had a 10 GbE NIC. All the KVS we benchmarked are NUMA-unaware. This is a valid configuration as it represents a typical configuration used in industry.

Whenever the authors compared TellStore to other popular KVSs, they used the column variant. This approach ensured the greatest controls when comparing TellStore to other systems. The authors used TPC-H benchmarks, which represents real industry operations and data.

8. Are there any gaps in the logic/proof?

The authors do not discuss and experiment with the various tuning options of popular state-of-the-art KVS competitors. The role of a DBA in tuning the Data system is not considered - which may have had an impact on the comparison results. Also, series results with order of magnitude differences may be plotted with a y-axis log scale to show the trend of values closer-to-zero (such as the results in Section 7).

9. Describe at least one possible next step.

A next step might be to limit (or increase) the number of processing nodes such that performance can be measured as a ratio of processing to storage nodes. The timestamp field can be converted to a unsigned long for faster comparison when versioning during garbage collection. Another step would be to have a groups-of-columns layout variant of TellStore. Finally, TellStore should support replication to improve reliability - but configure the replica nodes for scan-only operations.

BibTex Citation

@article{pilman2017fast,
  title={Fast scans on key-value stores},
  author={Pilman, Markus and Bocksrocker, Kevin and Braun, Lucas and Marroqu{\'\i}n, Renato and Kossmann, Donald},
  journal={Proceedings of the VLDB Endowment},
  volume={10},
  number={11},
  pages={1526--1537},
  year={2017},
  publisher={VLDB Endowment}
}