A review of Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?

Posted in reviews by Christopher R. Wirz on Fri Feb 16 2018



Access Path Selection (APS) for filtering operations has been a key component of database systems. This paper compares sequential scans and index scans of columnar group storage.

Note: This review is in the format of the CS265 Discussion format and may reflect my personal opinion.

1. What is the problem?

When incorporating an access path selector, the optimizer needs to account for both hardware and workload. Workload is monitored with time - creating choices between a vectorized scan or secondary index scan. Also, scans need to be shared. If the underlying data structure changes, the access pattern needs to adapt as well.

2. Why is it important?

Adaptive access patterns can handle a variety of workloads. An index change, layout change, or hardware change can take place, and the decisions will adjust for optimum performance. The optimal scan boosting technique will be selected, and concurrent access is supported.

3. Why is it hard?

If not done right, optimization can become a bottleneck. When the data can be kept hot in memory, scan performance improvements, such as skipping and zone maps will drive lower response times. When queries are already fractions of a second, there might not be enough time to perform access path optimization.

4. Why existing solutions do not work?

Secondary indexes, typically in the form of B+-Trees are often seen in row-oriented systems. Further, access paths decided on thresholds based on hardware properties, system design, and data layout based on a static tuning for all future queries. Access paths are often tuned for memory, compression cna reduce that footprint, and vectorized processing reduces interpretation logic. Also, scans can be shared when attributes are processed by multiple queries.

Some modern analytical systems do not opt for secondary indexing due to the fact that there is enough memory available for all the data - removing the need for disk transactions.

5. What is the core intuition for the solution?

In addition to the query selectivity, the underlying hardware, and the system design, modern optimizers also need to take into account query concurrency. Hardware, queries, data layout, and available scan enhancements are modeled in the access pattern decision engine. Additionally, allowing multiple queries sharing a single memory scan is taken into consideration.

6. Does the paper prove its claims?

The authors show both analytically and experimentally that secondary indexes are still useful for concurrent queries and queries of low selectivity. In the analytical approach, the authors develop each equation to define the terms of the overall cost minimization. The ration Access Path Selection (APS) can be resolved to a >=1 (for a scan) or <1 (for a secondary index).

7. What is the setup of analysis/experiments? is it sufficient?

The main system is comprised of 4 2.0GHz 8 core Intel Xeon E7-4820 v2 Ivy Bridge processors with 16MB of L3 cache with 1TB of DDR3 (@1066MHz), and four 300GB 15K RPM disks configured in a RAID-5 array. The experiment uses a synthetic data set of uniformly distributed data. The size of the data set is varied between 100M and 500M tuples. Concurrency is varied from 1 to 512. Due to the size of the data and the size of the memory, the data can be held in the main memory. This setup is adequate to address the hypothesis that faster reading favors scans.

To show the influence of hardware, the authors use three Amazon EC2 instances:

  1. a general-purpose instance (m4.4xlarge) with 16 virtual cores (Intel Xeon E5-2676 v3) and 64GB of RAM
  2. a compute instance (c4.8xlarge) with 36 virtual cores (Intel Xeon E5-2666 v3) and 60GB of RAM
  3. a memory instance (r3.8xlarge) with 32 virtual cores (Intel Xeon E5-2670 v2) and 244GB of RAM
These instances are representative of what would be used in industry and are sufficient for these sets of experiments.

8. Are there any gaps in the logic/proof?

The authors do not apply constants for unit conversions in the equations of the analytical approach. Additionally, there was not much discussion regarding network and access selection.

9. Describe at least one possible next step.

The authors should next experiment with a clustered solution. This will allow their decision engine to account for network congestion as well. Also, it would represent a full solution at scale. It would be interesting to see if the same access path was selected at each node for the same query.

BibTex Citation

@inproceedings{kester2017access,
  title={Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?},
  author={Kester, Michael S and Athanassoulis, Manos and Idreos, Stratos},
  booktitle={Proceedings of the 2017 ACM International Conference on Management of Data},
  pages={715--730},
  year={2017},
  organization={ACM}
}