Published in: Datenbank-Spektrum 1/2023

Open Access | 07.12.2022 | Technical Contribution

Partition-based SIMD Processing and its Application to Columnar Database Systems

Authors: Juliana Hildebrandt, Johannes Pietrzyk, Alexander Krause, Dirk Habich, Wolfgang Lehner

Abstract

The Single Instruction Multiple Data (SIMD) paradigm has become a core principle for optimizing query processing in columnar database systems. Until now, only the LOAD/STORE instructions have been considered efficient enough to achieve the expected speedups, while avoiding GATHER/SCATTER has been regarded as almost imperative. However, the GATHER instruction offers a very flexible way to populate SIMD registers with data elements coming from non-consecutive memory locations. As we discuss in this article, the GATHER instruction can achieve the same performance as the LOAD instruction if applied properly. To enable this proper usage, we outline a novel access pattern allowing fine-grained, partition-based SIMD implementations. We then apply this partition-based SIMD processing to two representative examples from columnar database systems to experimentally demonstrate the applicability and efficiency of our new access pattern.

1 Introduction

Single Instruction Multiple Data (SIMD) is a well-known parallelism concept characterized by the fact that the same operation is simultaneously applied to multiple data elements within a single instruction [15]. Modern CPUs provide direct support for such SIMD capabilities via specific SIMD instruction set extensions, with the different versions of the Advanced Vector Extensions (AVX) being prominent representatives on Intel CPUs. SIMD extensions consist of two main building blocks: (i) SIMD registers, which are larger than traditional scalar registers, and (ii) SIMD instructions working on those SIMD registers. The set of SIMD instructions includes arithmetic operations, Boolean operators, logical and arithmetic shifts, and data type conversions. In addition, there are specific SIMD instructions to load data from main memory into SIMD registers and to write it back. On the one hand, data elements placed contiguously in memory are accessed by LOAD and STORE instructions; loading and writing data elements linearly into and from SIMD registers is usually considered the baseline for memory operations. On the other hand, GATHER/SCATTER instructions are the alternative for non-contiguous memory access, i.e., data elements distributed over the memory – and the common guideline is to avoid GATHER/SCATTER if possible due to significant performance penalties.
In the context of efficient query execution, SIMD has emerged as a core technique to improve single-thread query performance, especially in state-of-the-art in-memory column-stores [1, 3, 10, 29]. Here, SIMD is typically applied to isolated query operators [30, 38]. In particular, query operators requiring contiguous memory access (linear access), such as scans [12, 35] or integer (de)compression techniques [2, 10, 19], are well-investigated. These query operators usually achieve the desired SIMD performance gain. Moreover, query operators with non-contiguous memory access, such as hash joins [4–6, 23] or sorting [8, 26, 32], have also been explored. In this case, a slight performance gain is achieved, but the usage of GATHER and SCATTER is considered very expensive and the performance gains are significantly lower than for SIMDified operators with a linear access pattern. However, to the best of our knowledge, there is no clear understanding of the performance behavior of GATHER/SCATTER under fine-tuned circumstances. With this article1, we aim to close this gap.
Contribution and Outline: As the core contribution, we present the results of a systematic evaluation of the GATHER instruction on different Intel CPUs and derive guidelines on using GATHER efficiently. Within the evaluation, we particularly focus on strided access patterns, where data elements are accessed in an equidistant manner, i.e. populating an SIMD register with data from every \(n\)th element of a data array, with \(n\) as the stride size. Our evaluation results show that the performance of the GATHER instruction mainly depends on how the strided access is applied and on the stride size. As a surprising result, we will show that the GATHER instruction achieves quite similar performance to a LOAD instruction if used properly. The relevance of this finding lies in encouraging developers to use the GATHER instruction for fine-grained parallel, partition-based data access implementations. The discussion and presentation of our experimental findings are structured as follows: We start in Sect. 2 by presenting comprehensive results of our systematic evaluation. As we will show, GATHER can be quite efficient, but must be used correctly. Based on these results, we propose a novel block-strided access pattern as the foundation for a partition-based SIMD processing concept heavily relying on GATHER in Sect. 3. To show the applicability and efficiency of our partition-based SIMD processing concept, we compare a simple analytical query template and an integer compression algorithm with their corresponding state-of-the-art SIMD implementations using a linear access pattern in Sect. 4. Finally, we close the paper with related work in Sect. 5 and a summary in Sect. 6.

2 Gather Evaluation

Contrary to scalar processing, SIMD registers must be explicitly populated using a LOAD or a GATHER instruction. LOAD is used whenever a contiguous – also called linear – data access pattern is required. A linear access pattern requires that the accessed data elements are organized as a contiguous sequence, e.g., in an array, cf. Fig. 1a. GATHER is applied when a non-contiguous access pattern – data elements from non-consecutive memory locations – is needed. A special case of non-contiguous data access, but with a well-defined and predictable behavior, is the strided access pattern realizing an equidistant data access, i.e. there is a constant (but configurable) distance between accessed data elements in a contiguous sequence [21, 36]. This distance is called the stride size. Thus, GATHER generalizes the LOAD instruction, and the question is when and how the GATHER instruction can be used efficiently. To answer this question, we conducted a systematic evaluation.
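To make the difference concrete, the following sketch populates an AVX-512 register with eight uint64_t elements once linearly via LOAD and once equidistantly via GATHER. It is a minimal illustration; the pointer name and the stride parameter are our own assumptions and not taken from the evaluated code.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Linear access: eight consecutive elements data[0..7].
__m512i load_linear(const uint64_t* data) {
    return _mm512_loadu_si512(data);
}

// Strided access: elements data[0], data[stride], ..., data[7*stride].
__m512i load_strided(const uint64_t* data, long long stride) {
    __m512i indices = _mm512_set_epi64(7 * stride, 6 * stride, 5 * stride,
                                       4 * stride, 3 * stride, 2 * stride,
                                       stride, 0);
    return _mm512_i64gather_epi64(indices, data, /*scale=*/8);  // 8 Bytes per element index
}
```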

2.1 Evaluation Setup

We use different data access patterns for the aggregation-sum (AggSum) operation in our evaluation, because the performance of AggSum is mainly dominated by the performance of the employed loading instruction. AggSum iterates over an input array, executes the arithmetic operation (addition) per iteration, and finally writes the total sum (a single value) back to main memory. Besides the scalar variant, we also implemented two SIMD variants – linear and stride. The linear variant iterates over the data with a linear access pattern and utilizes the LOAD operation as shown in Fig. 1a, while the stride variant performs a strided access using the GATHER instruction.
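A minimal sketch of the linear AggSum variant for AVX-512 and uint64_t is shown below; it assumes that the number of elements is a multiple of eight and is an illustrative reconstruction rather than the exact evaluated code.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

uint64_t aggsum_linear(const uint64_t* data, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 8)        // LOAD 8 consecutive elements per iteration
        acc = _mm512_add_epi64(acc, _mm512_loadu_si512(data + i));
    uint64_t lanes[8];                       // horizontal reduction of the 8 partial sums
    _mm512_storeu_si512(lanes, acc);
    uint64_t sum = 0;
    for (uint64_t v : lanes) sum += v;
    return sum;
}
```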
When performing a strided access, GATHER requires (i) a base address, passed as a pointer to the head of the data, and (ii) a stride size. Based on that, the strided access can be performed in two styles, as shown in Fig. 1b and c. The traditional style, called stride-full, is shown in Fig. 1b: it loads data elements according to the stride size until the end of the array is reached. In our example in Fig. 1b, a stride size of two is used and thus only every second element is accessed during the first run. After this first run, not all data elements of the input array have been accessed. Thus, an additional strided access run starts at the head of the array, whereby the head is now shifted by one position. This is repeated until all data elements have been processed. In the depicted example in Fig. 1b, we require two runs to process all data elements. In general, an important property of this stride-full variant is that the number of runs required to process all data elements is equal to the stride size.
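The stride-full loop structure can be sketched as follows for AVX-512 and uint64_t, assuming the number of elements n is a multiple of eight times the stride size s; the horizontal reduction of the accumulator is identical to the linear sketch above.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Stride-full: each run gathers every s-th element until the end of the array,
// then the next run restarts at the head shifted by one position.
__m512i aggsum_stride_full(const uint64_t* data, size_t n, size_t s) {
    const long long ls = static_cast<long long>(s);
    const __m512i idx = _mm512_set_epi64(7 * ls, 6 * ls, 5 * ls, 4 * ls,
                                         3 * ls, 2 * ls, ls, 0);
    __m512i acc = _mm512_setzero_si512();
    for (size_t run = 0; run < s; ++run)                 // number of runs == stride size
        for (size_t i = run; i + 7 * s < n; i += 8 * s)  // 8 equidistant elements per GATHER
            acc = _mm512_add_epi64(acc, _mm512_i64gather_epi64(idx, data + i, 8));
    return acc;  // to be reduced horizontally as in the linear variant
}
```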
Besides this traditional style, the strided access can be slightly modified, as shown in Fig. 1c. Here, we propose to logically partition the input array into blocks, hence the name stride-block. In our example in Fig. 1c, we defined a block size of 8 with a stride size of two. That means, with an SIMD register size of four, we need two runs to process each block of size 8. Thus, the strided access is not performed over the whole array at once as in the stride-full variant; instead, additional runs within each block access the data elements that were skipped by the previous runs. Once a block is completely processed, the next block is considered. This new, adapted style should be more cache-friendly than stride-full, since a cache line gets fully processed after being fetched. An important property of this stride-block variant is that the SIMD register size in terms of the number of elements and the chosen stride size define the block size, as shown in our example in Fig. 1c.
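The corresponding stride-block loop differs only in that the s runs are confined to one block of 8·s elements before the next block is touched (again a sketch under the same assumptions as above):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Stride-block: s gather runs cover one block of 8*s elements completely
// before the next block is processed; assumes n is a multiple of 8*s.
__m512i aggsum_stride_block(const uint64_t* data, size_t n, size_t s) {
    const long long ls = static_cast<long long>(s);
    const __m512i idx = _mm512_set_epi64(7 * ls, 6 * ls, 5 * ls, 4 * ls,
                                         3 * ls, 2 * ls, ls, 0);
    __m512i acc = _mm512_setzero_si512();
    for (size_t b = 0; b < n; b += 8 * s)       // block after block
        for (size_t off = 0; off < s; ++off)    // s runs per block
            acc = _mm512_add_epi64(acc, _mm512_i64gather_epi64(idx, data + b + off, 8));
    return acc;  // to be reduced horizontally as in the linear variant
}
```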
All AggSum variants and styles – the styles are also called variants in the following – are implemented in C++ for the data types uint64_t, uint32_t, double, and float. Moreover, the SIMD variants of AggSum are explicitly SIMDified using (i) AVX2 intrinsics with 256-bit SIMD registers and (ii) AVX512 intrinsics with 512-bit SIMD registers. In this article, we restrict ourselves to the evaluation results for the data types uint64_t and uint32_t. The results for double and float are comparable and lead to the same findings.
Table 1  CPU specifications

        Xeon Phi 7250         Xeon Gold 6126        Xeon Gold 6240R
Arch.   MIC – KNL             Skylake               Cascade Lake
#CPUs   68                    12                    24
RAM     204 GB DDR4-2400      92 GB DDR4-2666       384 GB DDR4-2933
SIMD    SSE, AVX(2), AVX512 (all CPUs)
L1d     32 KiB; 8-way set associative (all CPUs)
L2      1 MiB; 16-way set associative (all CPUs)
L3      –                     19 MiB                35.75 MiB
All AggSum variants are evaluated in a single-threaded as well as a multi-threaded environment2. In the multi-threaded case (single CPU socket only), each thread is pinned to an individual core and processes a disjoint subset of the input array. The final aggregation of the partial sums is then performed single-threaded. All our AggSum variants were compiled using g++ (version 9.3.0) with the optimization flags -O3 -fno-tree-vectorize -mavx -mavx2 -mavx512f -mavx512cd. Moreover, we specified the corresponding compiler flag -march for each CPU. All experiments were run entirely in memory with an input array of 4 GiB containing randomly generated values and were repeated 10 times; we report the averaged result.
We conducted our evaluation on three different Intel CPUs as shown in Table 1. As highlighted in Table 1, all our CPUs feature k‑way set associative caches, as modern CPUs generally do. In a k-way set associative cache, the cache is divided into \(v\) sets and each set consists of \(k\) cache line blocks [34]. These cache line blocks of a set are placed in sequence one after another, as illustrated in Fig. 2. Moreover, each memory address maps to a specific set in a k-way set associative cache, but it can map to any of the \(k\) cache line blocks in the set [34]. The size of a cache line is typically 64 Byte. With a 32 KiB, 8‑way associative L1d cache, the L1d of all our CPUs contains 64 sets with 8 cache line blocks each, as shown in Fig. 2. Moreover, the interplay between L1d, L2, and L3 is hardware-optimized on CPUs. For example, the L1d cache on the Xeon Phi 7250 supports two simultaneous 512-bit reads and the L1 hardware prefetcher monitors memory patterns to generate data prefetch requests to the L2 cache to bring in cache lines in advance [34]. Since modern CPUs feature such sophisticated hardware data prefetching mechanisms targeting simple and regular memory access patterns to reduce memory access times [11], we do not use software-based prefetching. In all experiments, we use throughput in GiB/s as the evaluation metric; higher values are better.
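To make this mapping concrete – standard set-associative indexing, assuming 64-Byte cache lines – a 32 KiB, 8-way L1d cache has

$$v=\frac{32\,\text{KiB}}{64\,\text{B}\cdot 8}=64\quad\text{sets, and}\quad\text{set}(a)=\left\lfloor\frac{a}{64\,\text{B}}\right\rfloor\bmod 64$$

with \(a\) as the Byte address. Hence, two addresses that are a multiple of \(64\cdot 64\,\text{B}=4\,\text{KiB}\) apart fall into the same set – a situation that becomes relevant for page-sized strides below.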

2.2 Evaluation Results

The single-threaded evaluation results for the data types uint32_t and uint64_t are shown in Fig. 3. The diagrams are arranged in tabular form: each column represents one of the evaluated CPUs, while each row represents a combination of (i) the investigated SIMD extension AVX2/AVX512 and (ii) the data type. In each diagram, the stride size in terms of the number of data elements (powers of 2) is shown on the x‑axis and the throughput in GiB/s on the y‑axis. General outcomes across all experiments are that (i) the linear variants always achieve higher throughputs than the scalar variants and (ii) the throughput difference between scalar and linear varies depending on the CPU, SIMD extension, and data type. Moreover, while there is a throughput difference between uint32_t and uint64_t for the scalar variants, this difference is not visible for the linear variants. This clearly highlights the importance of SIMD for achieving the best throughput in all cases. Next, we discuss both stride variants per CPU separately, starting with the CPU whose architecture is the oldest in our evaluation setup.
The Xeon Phi 7250 CPU has the second-generation MIC architecture from Intel called Knights Landing (KNL) [34]. As we can see in Fig. 3, the linear variant works quite well in all cases. However, looking at the results of both stride variants, one might conclude at first glance that these variants are negligible. In many cases, the achieved throughput is much lower than for scalar, with stride-block being better than stride-full. In particular, with increasing stride sizes, the throughput of stride-full always decreases. Two effects are possibly responsible for this; we explain them for (AVX512; uint64_t). On the one hand, a cache line of 64 Bytes (512 bits) holds 8 64-bit values, but if the stride size is at least \(2^{3}=8\), then 8 different cache lines must be loaded per GATHER instruction to populate the 8 SIMD lanes of an AVX512 SIMD register. For the subsequent GATHER instruction, 8 new cache lines have to be fetched, and these are not adjacent to the previous ones. On the other hand, when the stride size (a large power of 2) is equivalent to the page size, successive cache line fetches possibly belong to the same cache set. For uint32_t, a stride size of \(2^{10}\) elements corresponds to the page size of 4 KiB, for uint64_t a stride size of \(2^{9}\). This effectively shrinks the size of the L1d cache from 32 KiB to just 8 cache line blocks, or 512 Bytes. Thus, all essential cache lines per GATHER instruction must always be fetched from slower parts of the memory hierarchy. This is called cache thrashing and consequently, stride-full cannot be very efficient.
The stride-block variant behaves similarly to stride-full for small stride sizes, but shows a striking and interesting exception for (AVX512; uint64_t) at large stride sizes. For small stride sizes, the throughputs of stride-block are slightly higher than those of stride-full because fetched cache lines are used several times to successively populate multiple SIMD registers. However, looking at (AVX512; uint64_t), the throughput increases dramatically for a stride size of \(2^{9}\) and the achieved throughput is close to that of the linear variant. The throughput even increases slightly for a stride size of \(2^{10}\) and then decreases again. As described above, a stride size of \(2^{9}\) for uint64_t corresponds to the page size, which potentially shrinks the L1d cache to 8 cache line blocks. However, these 8 cache line blocks are filled with 8 cache lines from 8 different pages, as highlighted in Fig. 1c. Then, stride-block linearly accesses the elements of all cache lines in parallel. That means, once the cache lines are fetched into L1d, the first GATHER fills an AVX512 SIMD register with all elements at position 0 of the 8 cache lines. The next GATHER uses all elements at position 1, and so on. When all elements of the 8 cache lines have been processed, the 8 adjacent cache lines within the 8 different pages are fetched and processed. The resulting access pattern corresponds to a linear access, but now, in contrast to the linear variant, 8 pages are accessed in parallel. This effect is diametrically opposed to stride-full and we see it in all result diagrams for the Xeon Phi 7250. In all cases, the throughput of stride-block increases when the stride size is equivalent to the page size. However, we only achieve throughput values quite similar to the linear variant for (AVX512; uint64_t). The difference between the stride sizes \(2^{9}\), \(2^{10}\), and \(2^{11}\) for stride-block on uint64_t data is that for a stride of \(2^{9}\) elements, 8 consecutive pages are processed, while with \(2^{10}\) and \(2^{11}\) one or three pages, respectively, are skipped in between. We may conclude that stride-block, when the stride size is equal to the page size or a small multiple of the page size, achieves high throughput values quite similar to linear.
The Xeon Gold 6126 CPU has an Intel Skylake architecture, which is newer than KNL but has the same L1d and L2 cache characteristics. In addition, this processor has an L3 cache of 19 MiB. Here, the general outcome is that the positive impression of the stride-block variant is reinforced. Now, we achieve high throughput values for stride-block not only for (AVX512; uint64_t), but also for (AVX2; uint32_t) and (AVX2; uint64_t). Again, for large stride sizes equivalent to the page size, or a small multiple of it, the effect explained above kicks in. An exception is the combination (AVX512; uint32_t), where an SIMD register holds 16 elements and therefore 16 cache lines from 16 different pages would have to reside in the L1d cache. Since all our CPUs feature only an 8‑way associative L1d cache, stride-block cannot achieve high throughput values for (AVX512; uint32_t), as confirmed in our diagrams. Even the newest Intel CPU architecture, Golden Cove, apparently only has a 12-way associative L1d cache (48 KiB), which would probably not be enough.
The Xeon Gold 6240R features an Intel Cascade Lake architecture, an optimization of Skylake and thus the newest platform among the examined processors. The results on this processor once again confirm our previous findings regarding stride-full and stride-block. In contrast to the previous CPUs, larger stride sizes are now also feasible for stride-block, probably due to the larger L3 cache.
We also performed all experiments in the multi-threaded environment and present representative results in Fig. 4. The diagrams are organized as described for Fig. 3 and we report the throughputs achieved with four cores. For other numbers of cores we obtained similar results. We do not include stride-full in these diagrams for the reasons described above. For stride-block, we see the same curve shapes on all CPUs as in the single-threaded environment, leading to the same outcome. As soon as the stride size (\(2^{10}\) for uint32_t or \(2^{9}\) for uint64_t) is equal to the page size, we achieve throughput values similar to the linear variant. For the Xeon Gold 6240R and the combination (AVX2; uint32_t), the throughput of stride-block is even slightly higher than that of the linear variant. Thus, our multi-threaded evaluation shows that stride-block scales comparably to the linear variant.

3 Partition-based SIMD

Our experimental evaluation has shown that SIMD registers can be populated with data elements from non-consecutive memory locations using the GATHER operation with (almost) the same performance as with data elements from consecutive memory locations using the LOAD operation, in single-threaded as well as multi-threaded environments. However, GATHER requires a proper access pattern to achieve its peak performance. According to the evaluation, our strided access pattern variant stride-block with the following properties \(P_{1}\) and \(P_{2}\) fits perfectly:
(\(P_{1}\))
The input data is logically partitioned into blocks, with their size in Bytes being calculated by
$$\text{block}_{\text{size}}=\text{page}_{\text{size}}\cdot k\cdot g$$
Here, \(k\) is the number of SIMD lanes of the underlying SIMD register according to the used data type, \(g\) is a page gap factor with \(g\geq 1\), and \(\text{page}_{\text{size}}\) is determined by the operating system (usually 4 KiB).
(\(P_{2}\))
The logical blocks are successively processed and the employed access pattern within each block is a strided access pattern with a stride size in Bytes of:
$$\text{stride}_{\text{size}}=\text{page}_{\text{size}}\cdot g;$$
With these properties, we denote the resulting access pattern as the block-strided access pattern; an example is shown in Fig. 5. In the illustrated example, the number of SIMD lanes \(k\) is four, so that the input data is logically partitioned into blocks consisting of four pages with a gap factor \(g=1\). Following our evaluation, the gap factor \(g\) is intended for CPU-conscious fine-tuning of this access pattern. If \(g=1\), as illustrated in Fig. 5, then a block consists of \(k\) consecutive pages. Each SIMD lane is assigned one page from the whole block and each lane is responsible for processing its assigned page. If \(g> 1\), then a block consists of \(g\cdot k\) consecutive pages and an SIMD lane is responsible for \(g\) pages, which are processed one after the other. In other words, \(g\) defines the gap between two accessed pages and the pages within each block are processed using a strided access with a stride size of \(g\).
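The following sketch derives the block-strided parameters from \(P_{1}\) and \(P_{2}\) for AVX-512 with \(k=8\) uint64_t lanes and an assumed page size of 4 KiB; it is illustrative and not the authors' exact code. The returned index vector anchors each SIMD lane to its own page(s); a block is then processed exactly like the stride-block loop of Sect. 2.1, with the stride fixed to \(\text{page}_{\text{size}}\cdot g\).

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

__m512i block_strided_indices(size_t g, size_t* block_elems, size_t* stride_elems) {
    const size_t page_size = 4096;                           // Bytes, set by the OS
    const size_t k = 8;                                      // SIMD lanes for uint64_t
    *stride_elems = (page_size * g) / sizeof(uint64_t);      // P2, in elements
    *block_elems  = (page_size * k * g) / sizeof(uint64_t);  // P1, in elements
    const long long s = static_cast<long long>(*stride_elems);
    // Lane l always reads at offset l * stride_elems within the current block.
    return _mm512_set_epi64(7 * s, 6 * s, 5 * s, 4 * s, 3 * s, 2 * s, s, 0);
}
```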
This access pattern enables a fine-grained, page-partitioned SIMD processing concept, as illustrated in Fig. 5, that heavily relies on GATHER. To efficiently utilize the increasing number of cores in scale-up hardware systems for data processing, the data-oriented architecture was proposed, which turned out to show superior scalability [17, 18, 22, 31]. The core concept of this data-oriented architecture is that all data objects are implicitly partitioned and disjoint partitions are exclusively accessed by the assigned worker thread, which is pinned to a specific core (hardware thread). In line with this architecture, our partition-based SIMD processing concept implicitly partitions data through our access pattern, and partitions – in this case pages – are assigned to SIMD lanes. Then, the pages are linearly processed in a distributed fashion by the SIMD lanes operating on their local pages. Thus, our partition-based SIMD processing concept is the logical continuation of the data-oriented architecture, which has so far only been considered for CPU cores.

4 Application Use Cases

To evaluate when and how our partition-based SIMD processing concept can be used in more complex scenarios, we applied it to two different use cases from the domain of columnar database systems.

4.1 Vectorized Query Processing

A very interesting use case for our partition-based SIMD approach is the vectorized query processing model of columnar database systems [7]. In this model, each query operator iteratively processes a vector of values [7]. The advantages are good instruction caching and CPU utilization while keeping the materialization costs low [7, 39]. However, choosing an appropriate vector size is not trivial [39]. Small vector sizes may lead to improved data cache utilization while increasing the number of instruction cache misses. Large vector sizes may substantially increase the materialization costs and hurt the data cache utilization. Since our partition-based SIMD processing concept aims at explicitly caching the data needed for the current and future processing of multiple pages, our concept may improve the performance of this processing model compared to a linear access.
To investigate that, we use a common analytical query calculating an aggregated sum of a column \(A\) for all tuples that satisfy a given predicate over a column \(B\). We implemented the necessary operators – namely a filter and an aggregation sum (AggSum) – using a linear access pattern, as done in the state of the art, and using our block-strided access pattern. Both operators consume a vector of data in columnar format. The filter operator initially broadcasts the predicate value into an SIMD register. Then, in every iteration, the filter transfers data from column \(B\) into an SIMD register and compares it to the predicate vector register. The transfer is done either using a load operation for the linear access pattern or a gather for the block-strided access pattern. AVX2 and AVX512 directly support comparing vector registers with the cmp operation. The result is transformed into a bitmask to reduce materialization costs. If the \(n\)th bit of the resulting bitmask is set to one, the \(n\)th element from the data SIMD register is valid for the specific filter. Since we are not storing any additional positional information alongside the bitmask, an operator that consumes this bitmask must decode the specific information implicitly. While this transformation is done on the fly when using AVX512, an efficient AVX2 implementation is more challenging, as highlighted in Fig. 6. For example, the _mm256_cmpeq_epi32 operation for AVX2 – comparing two SIMD registers containing 32-bit integers for equality – produces an SIMD register of the same size that contains, per lane, either a value where all bits are set to one, if the corresponding values of the two input registers were equal, or zero otherwise. To convert this output into a bitmask, the resulting SIMD register is reinterpreted as a single-precision floating-point SIMD register using the _mm256_castsi256_ps operation, which is done only for compilation purposes and does not incur any actual computation. Next, the most significant bit of every 32-bit element is wrapped up in a single word using the _mm256_movemask_ps operation. While this procedure seems expensive, our experiments showed that the resulting memory reduction of up to 32x more than compensates for the additional computational costs for memory-bound algorithms.
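A minimal sketch of this AVX2 filter step for 32-bit integers and an equality predicate is shown below; the function and parameter names are our own, and the loading step stands in for either the load- or the gather-based transfer.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Compare 8 values of column B against the predicate and return an 8-bit mask.
uint8_t filter_eq_avx2(const uint32_t* column_b, size_t i, uint32_t predicate) {
    __m256i pred = _mm256_set1_epi32(static_cast<int>(predicate));   // broadcast predicate
    __m256i vals = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(column_b + i));              // transfer 8 values
    __m256i eq   = _mm256_cmpeq_epi32(vals, pred);                    // all-ones / all-zeros lanes
    // Reinterpret as packed floats (no computation) and extract the MSB of every lane.
    return static_cast<uint8_t>(_mm256_movemask_ps(_mm256_castsi256_ps(eq)));
}
```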
When the AggSum operator is called for the first time, a result SIMD register is zero-initialized. The operator returns this SIMD register after a complete vector has been processed. The same register is then used as input for every subsequent call and is updated with every processed vector. When all data has been processed, all elements of the resulting SIMD register are summed up and returned. For every vector, the AggSum operator transfers the relevant data from column \(A\) into an SIMD register and loads the positionally equivalent bitmask from the previous operator.
Note that the data-transferring method must be the same as in the previous operator. If the bitmask contains at least one set bit, it is used to perform a masked SIMD addition of the data SIMD register and the result SIMD register. While AVX512 supports masked addition natively, for AVX2 we need to create a special mask SIMD register on our own and use it to zero out the invalid data elements. This SIMD register contains elements with either all bits set to 1 or all bits set to 0. Thus, the bitmask is first broadcast into an SIMD register, shifted right by the lane id, and binary ANDed with 1. That way, every lane in the SIMD register contains the value one if the corresponding bit was 1, and zero otherwise. All elements in the SIMD register are then decreased by 1, utilizing the two's complement where \(-1\) equals all bits set to 1. Lastly, the resulting SIMD register is negated and binary ANDed with the data SIMD register.
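This AVX2 emulation can be sketched as follows for 32-bit lanes; it is an illustrative reconstruction, where _mm256_andnot_si256 combines the final negation and AND into one intrinsic.

```cpp
#include <immintrin.h>
#include <cstdint>

// Add the valid lanes of `data` (according to `bitmask`) to `result`.
__m256i masked_add_avx2(__m256i result, __m256i data, uint8_t bitmask) {
    const __m256i lane_ids = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i m = _mm256_set1_epi32(bitmask);           // broadcast the bitmask
    m = _mm256_srlv_epi32(m, lane_ids);               // shift right by the lane id
    m = _mm256_and_si256(m, _mm256_set1_epi32(1));    // isolate the lane's bit: 0 or 1
    m = _mm256_sub_epi32(m, _mm256_set1_epi32(1));    // 1 -> 0x00000000, 0 -> 0xFFFFFFFF
    __m256i masked = _mm256_andnot_si256(m, data);    // NOT m, then AND with the data
    return _mm256_add_epi32(result, masked);
}
```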
In our evaluation, we varied (i) the selectivity of the filter, (ii) the vector size, and (iii) the page gap factor \(g\). Representative single-threaded evaluation results for (a) (AVX2; uint32_t) with a vector size of 1024 values and (b) (AVX512; uint64_t) with a vector size of 2048 values are shown in Fig. 7. Since our implementation of AggSum branches, the overall query throughput depends on the selectivity of the filter. The branch checks the resulting bitmask of the filter, because the functionality of the AggSum operator is only executed when the bitmask contains at least one set bit indicating a valid filter result. If the bitmask only contains zeros, the aggregation is skipped. As shown in Fig. 7, our partition-based SIMD processing implementation does not amortize on the Xeon Phi 7250. However, it is on par with the linear access on the newer CPUs and even slightly better for selectivities of at least 0.2 on Skylake and 0.05 on Cascade Lake. Interestingly, while all investigated gap factors perform similarly on the Xeon Phi, \(g=1\) is optimal on our Skylake CPU and \(g=4\) on our Cascade Lake CPU. Thus, we conclude that our partition-based SIMD processing concept can be efficiently applied to the vectorized processing model.

4.2 Integer Compression

A second obvious use case in the context of column-stores is integer compression [1, 2, 10]. To keep the additional computational effort of (de)compression as low as possible, most integer (de)compression algorithms are explicitly SIMDified with a linear access pattern [9, 19, 33, 37]. To ensure a linear access pattern, the state-of-the-art SIMD approach can be characterized as follows [37]: While a scalar compression algorithm would compress a block of \(N\) consecutive integers, the state-of-the-art SIMD approach scales this block to \(k\cdot N\) consecutive integers, with \(k\) as the number of integers that can be simultaneously processed with an SIMD register.
To evaluate our partition-based SIMD processing concept in this use case, we decided to investigate BitPacking (BP), which is a heavily used and well-performing algorithm [19]. BP belongs to the class of null suppression algorithms and compresses by omitting leading zeros. The scalar version for 64-bit integer values successively compresses blocks of 64 consecutive values with the minimal number of bits required for the largest element within each block. For 512-bit SIMD registers, the blocks are scaled to 512 consecutive values in the state-of-the-art SIMD implementation. We implemented BP3 in its scalar variant as well as in two SIMD variants using a linear and a block-strided access pattern for 32-bit and 64-bit data. While the linear variant uses the STORE instruction to write out the compressed values, our block-strided variant requires a selective SCATTER instruction.
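For illustration, a simplified scalar sketch of compressing one BP block of 64 values is given below; the bit-stream writer append_bits is an assumed helper, the first output word must be zero-initialized, and the chosen bit width would additionally be stored as metadata.

```cpp
#include <algorithm>
#include <cstdint>

// Append the lowest `bits` bits of `v` to a bit stream (out = current word, fill = used bits).
static void append_bits(uint64_t*& out, unsigned& fill, uint64_t v, unsigned bits) {
    *out |= v << fill;                          // place the value at the current bit offset
    if (fill + bits >= 64) {                    // current word is full: spill the upper part
        ++out;
        *out = (fill == 0) ? 0 : v >> (64 - fill);
    }
    fill = (fill + bits) & 63;
}

// Compress one block of 64 uint64_t values with the bit width of its largest element.
void bp_compress_block(const uint64_t* in, uint64_t*& out, unsigned& fill) {
    uint64_t max_val = 0;
    for (int i = 0; i < 64; ++i) max_val = std::max(max_val, in[i]);
    unsigned bits = 64 - static_cast<unsigned>(__builtin_clzll(max_val | 1));
    for (int i = 0; i < 64; ++i) append_bits(out, fill, in[i], bits);
}
```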
Fig. 8a shows a representative single-threaded evaluation result in terms of million integers per second (mis) for all investigated BP variants on the Xeon Gold 6240R for (AVX512; uint64_t). Here, we used different synthetic data sets with randomly generated 64-bit unsigned integer values. Each data set contains 100 million values of a specific bit width, and the bit width is depicted on the x‑axis. As we can see in Fig. 8a, the SIMD variant using our block-strided access pattern (\(g=1\)) closely matches the speed of the linear variant. Both SIMD variants compress blocks of 512 consecutive values and produce the same compression output. However, the SCATTER instruction introduces a slight overhead for our block-strided variant.
In contrast to the state-of-the-art SIMD approach, the advantage of our partition-based SIMD implementation is that we do not necessarily have to scale the blocks by the factor \(k\). As shown in [13], this state-of-the-art scaling works well from a performance perspective but can lead to a degradation of the compression ratio compared to the scalar variant. This non-scaling property is made possible by our new access pattern because we are able to consider each SIMD lane separately for concurrent processing. In terms of integer compression, this means that our new access pattern allows us to concurrently compress, for example, different disjoint blocks of 64 consecutive values for 64-bit integer values per SIMD lane, just like the scalar variant. We also implemented these specific variants and conducted a comprehensive evaluation as well. For this evaluation, we created different synthetic data sets where the 64-bit integer values are mainly characterized by a bit width of 2, but with a certain probability, e.g., \(p(\text{bw})=0.001\), for integer values with a larger bit width bw. We varied this larger bit width bw from 3 to 64, as shown on the x‑axis in Fig. 8b. For these data sets, the compressed output for BP using blocks of 64 consecutive values is much smaller than for any \(k\)-scaled block. Since we have to write out less data, our partition-based SIMD implementation for blocks of 64 consecutive values clearly outperforms the state-of-the-art linear SIMD variant, as shown in Fig. 8b.

5 Related Work

The SIMD parallel paradigm has received a lot of attention for increasing query performance, in particular in the domain of in-memory column-stores [1, 3, 10, 29]. Without claiming completeness, it can be stated that current approaches mainly focus on SIMDifying isolated operators [12, 30, 35, 38], compression [2, 10, 19] and sorting algorithms [8, 26, 32], or on completely new query processing models [10, 16, 20, 28]. However, until now, the main assumption has been that the GATHER instruction provides only a marginal benefit for complex operators such as hash joins [4–6, 23, 28]. Our evaluation has shown the opposite: the GATHER instruction can be very efficient with a fine-tuned access pattern. It would now be interesting to see how our proposed block-strided access pattern could be used for hash joins, for example. Another promising use case would be Bloom filters, as shown in [27], where the authors gather parts of a Bloom filter. Moreover, a recent work has introduced the idea of sharing vector registers between concurrently running queries [24, 25]. Here, the authors used the GATHER instruction as one alternative, which was not the best-performing option but still gave reasonable results. Thus, we conclude that the GATHER instruction offers a very flexible way to populate SIMD registers with high performance, if applied properly. The same applies to the SCATTER instruction, which should be examined more closely in future work.

6 Conclusion

In this article, we have presented a systematic evaluation of the GATHER instruction for strided access patterns on different Intel processors. As we have shown experimentally, GATHER can achieve the same performance as the LOAD instruction if our proposed block-strided access pattern is applied. Furthermore, we have illustrated that our new access pattern can be used for a partition-based SIMD processing concept and that this new concept can be applied very well to more complex use cases such as (simple) analytical queries or integer compression algorithms. In both cases, we achieved slightly better performance compared to the state-of-the-art SIMD-based implementations using a linear access. Overall, our new access pattern opens up a new dimension for efficient fine-grained, partition-based SIMD implementations.

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Footnotes

1. This article is an extended version of [14]. In particular, compared to [14], this article includes an extensive GATHER evaluation and an additional representative example from columnar database systems.

2. The evaluation code is available open-source on GitHub: https://github.com/db-tu-dresden/gather-scatter-eval.
Metadata

Title: Partition-based SIMD Processing and its Application to Columnar Database Systems
Authors: Juliana Hildebrandt, Johannes Pietrzyk, Alexander Krause, Dirk Habich, Wolfgang Lehner
Publication date: 07.12.2022
Publisher: Springer Berlin Heidelberg
Published in: Datenbank-Spektrum, Issue 1/2023
Print ISSN: 1618-2162
Electronic ISSN: 1610-1995
DOI: https://doi.org/10.1007/s13222-022-00431-0
