A review of MacroBase: Prioritizing Attention in Fast Data

Posted in reviews by Christopher R. Wirz on Fri Apr 13 2018



MacroBase is a data analytics engine that analyzes and aggregates important and unusual behavior, acting as a search engine for fast data.

Note: This review is in the format of the CS265 Discussion format and may reflect my personal opinion.

1. What is the problem?

While applications and sensors can generate big data in volume, most organizations report accessing less than 6% of the data they collect, and often only when root-cause analysis is taking place. Dataflow processing engines are available to developers, but they still require the developer to configure contextualizing behavior such as highlighting and grouping. This is the challenge of “fast data”: surfacing the most relevant results to the end user so that they can act quickly and adapt to change.

2. Why is it important?

MacroBase is designed to prioritize attention in analytics so that operational results are explained to the user. It performs both classification and explanation, which gives the end user a better sense of the properties of the data. MacroBase uses feature extraction and streaming classification and can operate unsupervised, but it also allows users to tune their queries with transforms and classification rules. This helps identify relationships across a wide network of systems in which telemetry is present; an example would be identifying a software version that leads to excessively high power consumption.

3. Why is it hard?

Classification is the ability to label data points; explanation is the ability to describe commonalities among points. To execute as a streaming analytics service, MacroBase must combine both operations. Normally, training a classifier takes time, but MacroBase is not afforded that luxury, so it has to train online and unsupervised. Conveniently, knowing commonalities among points assists in training. Explanation is also used to group results for the end user so that they are not overwhelmed.
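
The paper's explanation operator scores candidate attributes by relative risk ratio: how much more likely a point carrying an attribute is to be an outlier than a point without it. A minimal sketch of that score (the function name and example counts are mine, not the paper's):

```python
def risk_ratio(attr_outliers, attr_inliers, total_outliers, total_inliers):
    """Relative risk that a point carrying this attribute is an outlier,
    compared to points without the attribute. Scores well above 1 mark
    attributes that are good candidate explanations."""
    rest_outliers = total_outliers - attr_outliers
    rest_inliers = total_inliers - attr_inliers
    exposed = attr_outliers / max(attr_outliers + attr_inliers, 1)
    unexposed = rest_outliers / max(rest_outliers + rest_inliers, 1)
    return exposed / max(unexposed, 1e-9)

# Example: an attribute seen in 80 of 100 outliers but only 200 of 99,900
# inliers has a very high risk ratio and would be surfaced to the user.
print(risk_ratio(80, 200, 100, 99_900))
```

Attributes whose ratio clears a threshold become candidate explanations, which is how results are grouped without overwhelming the user.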

4. Why do existing solutions not work?

Previous solutions were not designed to meet rates of 2M events per second. MacroBase handles cases from datacenter operations telemetry to industrial monitoring. Systems that can sustain massive throughput tend to overwhelm the end user with the information they provide. The focus of MacroBase is extensibility, both in how it operates and where it operates, allowing a small number of operators to execute across multiple domains.

5. What is the core intuition for the solution?

MacroBase leverages a novel stream sampler, the Adaptable Damped Reservoir (ADR), which performs sampling over arbitrarily-sized, exponentially damped windows. The ADR incrementally trains classifiers and remains robust to extreme outliers. To explain results, MacroBase tracks correlations between attribute values and outlying behavior, and it leverages the fact that many streams contain repeated measurements (especially from devices running similar versions), compensating for streams of different sizes. These counts are maintained in an Amortized Maintenance Counter (AMC), which improves performance by highlighting the small set of attributes that matters most. Overall, this intuition leads to order-of-magnitude improvements at rates of nearly 2M events per second.
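
A minimal sketch of the damped-reservoir idea, assuming a Chao-style weighted reservoir with periodic decay; the paper's actual ADR algorithm and parameters may differ:

```python
import random

class AdaptableDampedReservoir:
    """Sketch of an exponentially damped reservoir: a fixed-size sample
    in which older points are progressively less likely to survive."""

    def __init__(self, capacity, decay=0.95):
        self.capacity = capacity
        self.decay = decay           # per-period damping factor (assumed)
        self.samples = []
        self.total_weight = 0.0

    def insert(self, item, weight=1.0):
        self.total_weight += weight
        if len(self.samples) < self.capacity:
            self.samples.append(item)
        elif random.random() < self.capacity * weight / self.total_weight:
            # Replace a uniformly random victim, keeping the sample
            # unbiased with respect to (damped) weight.
            self.samples[random.randrange(self.capacity)] = item

    def decay_tick(self):
        # Called once per decay period: shrinking the accumulated weight
        # makes subsequent insertions proportionally more likely to stick.
        self.total_weight *= self.decay
```

The AMC plays the analogous role for counting: a bounded set of heavy-hitter counters that is decayed and pruned on an amortized schedule rather than on every update.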

6. Does the paper prove its claims?

This paper is heavily focused on the features of MacroBase. From that perspective, the authors describe the architecture of the system: how it ingests data, transforms it, performs classification, aggregates and explains, and presents the result. MacroBase can operate on both streams and stored data and is extensible through the defined interfaces of each operator. It is worth noting that joins are performed when the data is ingested, which is also where it is measured. Transformation is not required for classification when rule-based models are used, and users can add supervised operators as well as pre-trained models. This level of flexibility gives MacroBase ample margin to meet its initial goals and claims. Later, the authors show throughput with respect to a stable number of items; all three implementations achieve at least 2M updates per second.
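
The architecture described above amounts to dataflow operators composed behind a uniform interface. A minimal sketch of that composition pattern (MacroBase itself is written in Java; these names are illustrative, not its actual API):

```python
from abc import ABC, abstractmethod
from typing import Iterable

class Operator(ABC):
    """One pipeline stage: ingest, transform, classify, or explain."""
    @abstractmethod
    def process(self, batch: Iterable[dict]) -> Iterable[dict]:
        ...

class Pipeline:
    def __init__(self, *operators: Operator):
        self.operators = operators

    def run(self, batch: Iterable[dict]) -> Iterable[dict]:
        # Each operator consumes the previous operator's output, so stages
        # can be swapped (e.g., a rule-based classifier for a trained one)
        # without touching the rest of the pipeline.
        for op in self.operators:
            batch = op.process(batch)
        return batch
```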

7. What is the setup of analysis/experiments? Is it sufficient?

The MacroBase prototype was run on a server with four Intel Xeon E5-4657L 2.40GHz CPUs (12 cores per CPU) and 1TB of RAM. The authors took measures to analyze steady-state results rather than load time, which isolates the effects of pipeline processing. To generate data, the authors simulated devices whose readings followed either an inlier or an outlier normal distribution. This is probably an area where the analysis could have been improved, as many real datasets are skewed, not just contaminated by outliers. As noise was injected into the data, the authors noted that MacroBase's Default Pipeline (MDP) lost outlier-classification accuracy around a 25% noise-to-signal ratio.
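
A rough sketch of that kind of synthetic workload (the distribution parameters and counts here are placeholders of mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(42)

# Inlier and outlier readings drawn from separate normal distributions;
# means, variances, and counts are illustrative placeholders.
inliers = rng.normal(loc=10.0, scale=1.0, size=100_000)
outliers = rng.normal(loc=25.0, scale=5.0, size=2_000)
stream = rng.permutation(np.concatenate([inliers, outliers]))
```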

The authors also showed that MacroBase can distinguish abnormally-behaving OLTP MySQL servers running TPC-C and TPC-E workloads. In this experiment, they used a cluster of 11 servers, one of which experienced degradation. This shows that MacroBase can accurately identify systems in need of recovery in an unsupervised setting.

8. Are there any gaps in the logic/proof?

The authors discuss normalization as a step to condition the data before it is used to train the model or produce classification results. While this is a good first step, many models perform better when the data is normally distributed. There are many ways to correct the skew of a dataset (for example, a log transform), and the paper does not discuss incorporating these techniques.
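
For example, a one-line log transform often turns a heavily right-skewed measurement into something approximately normal before it reaches the classifier (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed

# log1p pulls in the long right tail; the transformed values are
# approximately normal, which many density-based models prefer.
normalized = np.log1p(latencies)
```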

Along the lines of transformation is categorical encoding (“labelizing”). Labeled data generally needs to become orthogonal columns unless a decision-tree model is used. As decision trees are not discussed beyond a reference to another paper, the authors should have addressed encoding, or some other method of separating out discrete, mutually exclusive categorical data, as sketched below.
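
A minimal sketch of such an encoding step using pandas one-hot encoding (the column name and values are hypothetical):

```python
import pandas as pd

events = pd.DataFrame({"os_version": ["10.1", "10.2", "10.1", "10.3"]})

# One-hot encoding: each category becomes its own orthogonal 0/1 column,
# so no artificial ordering is imposed on the categories.
encoded = pd.get_dummies(events, columns=["os_version"])
```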

9. Describe at least one possible next step.

While the Internet of Things and real-time application telemetry are both valid use cases for MacroBase, a very interesting next step would be analysis of the financial markets. This data is very noisy and does not follow a well-behaved distribution. Financial data is also a time series, so forming a model is significantly more challenging.

BibTeX Citation

@inproceedings{bailis2017macrobase,
  title={{MacroBase}: Prioritizing attention in fast data},
  author={Bailis, Peter and Gan, Edward and Madden, Samuel and Narayanan, Deepak and Rong, Kexin and Suri, Sahaana},
  booktitle={Proceedings of the 2017 ACM International Conference on Management of Data},
  pages={541--556},
  year={2017},
  organization={ACM}
}