1 Introduction
The memory operators of SIMD instruction set extensions come in two flavors: on the one hand, the contiguous LOAD and STORE instructions. Loading and writing data elements linearly into and from SIMD registers is usually considered the baseline for memory operators. On the other hand, GATHER/SCATTER instructions reflect the alternative for non-contiguous memory access, i.e. data elements are distributed over the memory. The common guideline is to avoid GATHER/SCATTER if possible due to significant performance penalties: GATHER and SCATTER are considered very expensive, and performance gains are significantly lower than for SIMDified operators with a linear access pattern. However, to the best of our knowledge, there is no clear understanding of the performance behavior of GATHER/SCATTER under fine-tuned circumstances. With this article, we aim to close this gap. In particular, we systematically evaluate the GATHER instruction on different Intel CPUs and derive guidelines on using GATHER efficiently. Within the evaluation, we particularly focus on strided access patterns, where data elements are accessed in an equidistant manner, i.e. populating an SIMD register with data from every \(n\)th element of a data array, with \(n\) as the stride size. Our evaluation results show that the performance of the GATHER instruction mainly depends on the application of the strided access and on the stride size. As a surprising result, we will show that the GATHER instruction achieves quite similar performance compared to a LOAD instruction, if used properly. The relevance of this finding lies in encouraging developers to use the GATHER instruction for fine-grained parallel, partition-based data access implementations.

The discussion and presentation of our experimental findings are structured as follows: We start in Sect. 2 by presenting comprehensive results of our systematic evaluation. As we will show, GATHER can be quite efficient, but must be used correctly. Based on these results, we propose a novel block-strided access pattern as the foundation for a partition-based SIMD processing concept heavily relying on GATHER in Sect. 3. To show the applicability and efficiency of our partition-based SIMD processing concept, we compare a simple analytical query template and an integer compression algorithm with their corresponding state-of-the-art SIMD implementations using a linear access pattern in Sect. 4. Finally, we close the paper with related work in Sect. 5 and a summary in Sect. 6.

2 Gather Evaluation
Populating an SIMD register with data elements from main memory is done using either a LOAD or a GATHER instruction. LOAD is used whenever a contiguous – also called linear – data access pattern is conducted. A linear access pattern requires that the accessed data elements are organized as a contiguous sequence like in an array, cf. Fig. 1a. GATHER is applied when a non-contiguous access pattern – data elements from non-consecutive memory locations – is needed. A special case of a non-contiguous data access, but with a well-defined and predictable behavior, is the strided access pattern realizing an equidistant data access, i.e. there is a constant (but configurable) distance between accessed data elements in a contiguous sequence [21, 36]. This distance is called the stride size. Thus, GATHER generalizes the LOAD instruction, and the question is when and how the GATHER instruction can be used efficiently. To answer this question, we conducted a systematic evaluation.
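To make the difference concrete, the following minimal C++ sketch – our own illustrative code, assuming AVX512F intrinsics and uint64_t data – populates one 512-bit register via a linear LOAD and one via a strided GATHER:

```cpp
#include <immintrin.h>
#include <cstdint>

// Linear access: 8 consecutive uint64_t elements (cf. Fig. 1a).
__m512i load_linear(const uint64_t* data) {
    return _mm512_loadu_si512(data);
}

// Strided access: elements 0, s, 2s, ..., 7s relative to `data`,
// with s as the stride size in elements.
__m512i gather_strided(const uint64_t* data, int64_t s) {
    const __m512i idx = _mm512_set_epi64(7 * s, 6 * s, 5 * s, 4 * s,
                                         3 * s, 2 * s, s, 0);
    return _mm512_i64gather_epi64(idx, data, 8);  // scale 8 = sizeof(uint64_t)
}
```

With \(s=1\), gather_strided performs the same access as load_linear, illustrating that GATHER generalizes LOAD.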
2.1 Evaluation Setup
We use a simple aggregation-sum (AggSum) operation in our evaluation, because the performance of AggSum is mainly dominated by the performance of the employed loading instruction. AggSum iterates over an input array, executes the arithmetic operation (addition) per iteration, and finally writes the total sum (a single value) back into main memory. Besides the scalar variant, we also implemented two SIMD variants – linear and stride. The linear variant iterates over the data with a linear access pattern and utilizes the LOAD operation as shown in Fig. 1a, while the stride variant uses a strided access pattern based on the GATHER instruction.

GATHER requires (i) a base address, passed as a pointer to the head of the data, and (ii) a stride size. Based on that, the strided access can be performed in two styles as shown in Fig. 1b and c. The traditional style called stride-full is shown in Fig. 1b, proceeding to load data elements according to the stride size until the end of the array is reached. In our example in Fig. 1b, a stride size of two is used and thus, only every second element is accessed during the first run. After this first run, not all data elements of the input array have been accessed. Thus, an additional strided access run starts at the head of the array, whereby the head is now shifted by one position. This is repeated until all data elements have been processed. In our depicted example in Fig. 1b, we require two runs to process all data elements. In general, an important property of this stride-full variant is that the number of runs to process all data elements is equal to the stride size; a simplified sketch of this style is given below. The second style called stride-block is shown in Fig. 1c: instead of restarting at the head of the array after each run, the base address is shifted by one position after each GATHER, so that all data elements within the current block of cache lines are processed before the access advances to the next block.
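As a minimal sketch of the stride-full style – our own simplified code, assuming AVX512F, uint64_t data, and an array length that is a multiple of \(8\times\) the stride size – consider:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Simplified stride-full AggSum: the number of runs equals the stride size.
uint64_t aggsum_stride_full(const uint64_t* data, size_t n, int64_t stride) {
    const __m512i idx = _mm512_set_epi64(7 * stride, 6 * stride, 5 * stride,
                                         4 * stride, 3 * stride, 2 * stride,
                                         stride, 0);
    __m512i sum = _mm512_setzero_si512();
    for (int64_t run = 0; run < stride; ++run) {
        // Run r visits the elements r, r+stride, r+2*stride, ...
        for (size_t i = run; i < n; i += 8 * stride) {
            __m512i v = _mm512_i64gather_epi64(idx, data + i, 8);
            sum = _mm512_add_epi64(sum, v);
        }
    }
    return static_cast<uint64_t>(_mm512_reduce_add_epi64(sum));
}
```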
All AggSum variants and styles – the styles are also called variants in the following – are implemented in C++ for the data types uint64_t, uint32_t, double, and float. Moreover, the SIMD variants of AggSum are explicitly SIMDified using (i) AVX2 intrinsics with 256-bit SIMD registers and (ii) AVX512 intrinsics with 512-bit SIMD registers. In this article, we restrict ourselves to the evaluation results for the data types uint64_t and uint32_t; the results for double and float are comparable and lead to the same findings. The hardware platforms used in our evaluation are summarized in the following table:

| Xeon | Phi 7250 | Gold 6126 | Gold 6240R |
|---|---|---|---|
| Arch. | MIC – KNL | Skylake | Cascade Lake |
| #Cores | 68 | 12 | 24 |
| RAM | 204 GB DDR4-2400 | 92 GB DDR4-2666 | 384 GB DDR4-2933 |
| SIMD | SSE, AVX(2), AVX512 | SSE, AVX(2), AVX512 | SSE, AVX(2), AVX512 |
| L1d | 32 KiB; 8-way set associative | 32 KiB; 8-way set associative | 32 KiB; 8-way set associative |
| L2 | 1 MiB; 16-way set associative | 1 MiB; 16-way set associative | 1 MiB; 16-way set associative |
| L3 | – | 19 MiB | 35.75 MiB |
All AggSum variants are evaluated in a single-threaded as well as a multi-threaded environment. In case of multi-threading (only a single CPU socket), each thread is pinned to an individual core and processes a disjoint subset of the input array. The final aggregation of the partial sums is then performed in a single-threaded fashion. All our AggSum variants were compiled using g++ (version 9.3.0) with the optimization flags -O3 -fno-tree-vectorize -mavx -mavx2 -mavx512f -mavx512cd. Moreover, we specified the corresponding compiler flag -march for each CPU. All experiments ran entirely in-memory on an input array of size 4 GiB containing randomly generated values, and were repeated 10 times; we report the averaged result.

2.2 Evaluation Results
The evaluation results for uint32_t and uint64_t are shown in Fig. 3. The diagrams are arranged in tabular form: while each column represents one of our used CPUs, each row represents a combination of (i) the investigated SIMD extension AVX2/AVX512 and (ii) the data type. In each diagram, the stride size in terms of the number of data elements (power of 2) is shown on the x-axis and the throughput in GiB/s on the y-axis. General outcomes across all experiments are that (i) the linear variants always achieve higher throughputs than the scalar variants and (ii) the throughput difference between scalar and linear varies depending on the CPU, SIMD extension, and data type. Moreover, while there is a throughput difference between uint32_t and uint64_t for the scalar variants, this difference is not visible for the linear variants. This clearly highlights the importance of SIMD to achieve the best throughput in all cases. Next, we discuss both stride variants per CPU separately, starting with the CPU whose architecture is the oldest in our evaluation setup.
On the Xeon Phi 7250, stride-full performs poorly, which can be explained using the combination (AVX512; uint64_t). On the one hand, a 512-bit cache line holds 8 64-bit values, but if the stride size is larger than \(2^{3}=8\), then 8 cache lines must be loaded per GATHER instruction to populate the 8 SIMD lanes of an AVX512 SIMD register. Then, for the subsequent GATHER instruction, 8 new cache lines have to be fetched, and these are not the adjacent cache lines of the previous ones. On the other hand, when the stride size (a large power of 2) is equivalent to a page size, successive cache line fetches possibly belong to the same cache set. For uint32_t, a stride size of \(2^{10}\) corresponds to a page size of 4 KiB, while for uint64_t a stride size of \(2^{9}\) does. This effectively shrinks the size of the L1d cache from 32 KiB to just 8 cache line blocks, or 512 bytes. Thus, all essential cache lines per GATHER instruction must always be fetched from slower parts of the memory hierarchy. This is called cache thrashing, and thus, stride-full cannot be very efficient.

The stride-block variant behaves differently (again consider AVX512; uint64_t), especially for large stride sizes. The throughputs for small stride sizes of stride-block are slightly higher compared to stride-full because fetched cache lines are used several times to successively populate multiple SIMD registers. However, looking at (AVX512; uint64_t
), the throughput increases dramatically for a stride size of \(2^{9}\) and the achieved throughput is close to the throughput of the linear variant. The throughput even increases slightly for a stride size of \(2^{10}\) and then it decreases again. As described above, a stride size of \(2^{9}\) for uint64_t
corresponds to the page size and this potentially shrinks the L1d cache to 8 cache line blocks. However, these 8 cache line blocks are filled with 8 cache lines from 8 different pages, as highlighted in Fig. 1c. Then, stride-block linearly accesses the elements of all cache lines in parallel. That means, once the cache lines are fetched into L1d, the first GATHER
fills an AVX512 SIMD register with all elements of position 0 from the 8 cache lines. Then, the next GATHER
uses all elements at position 1, and so on. When all elements of the 8 cache lines have been processed, the 8 adjacent cache lines within the 8 different pages are fetched and processed. The resulting access pattern corresponds to a linear access, but now, in contrast to the linear variant, 8 pages are accessed in parallel. This effect is diametrically opposed to stride-full, and we see it in all result diagrams for the Xeon Phi 7250. In all cases, the throughput of stride-block increases when the stride size is equivalent to the page size. However, we achieve throughput values quite similar to the linear variant only for (AVX512; uint64_t
). The difference between the stride sizes \(2^{9}\), \(2^{10}\), and \(2^{11}\) for stride-block on uint64_t
data is that for a stride of \(2^{9}\) elements, 8 consecutive pages are processed, while with \(2^{10}\) and \(2^{11}\) one or two pages are skipped in between. We may conclude that stride-block, when the stride size is equal to the page size or a small multiple of the page size, achieves high throughput values quite similar to linear.

This behavior is not limited to (AVX512; uint64_t
), but also for (AVX2; uint32_t
) and (AVX2; uint64_t
). Again, for large stride sizes equivalent to the page size, or a small multiple of it, the effect explained above may kick in. An exception is the combination (AVX512; uint32_t
), where an SIMD register can hold 16 elements and therefore 16 cache lines for 16 pages in the L1d cache would be necessary. Since all our CPUs feature only an 8‑way associative L1d cache, stride-block cannot achieve high throughput values for (AVX512; uint32_t
) as confirmed in our diagrams. Even the newest Intel CPU architecture Golden Cove apparently has only a 12-way associative L1d cache (48 KiB), which would probably not be enough.

The multi-threaded evaluation confirms this picture: when the stride size (\(2^{10}\) for uint32_t
or \(2^{9}\) for uint64_t
) is equal to the page size, we achieve similar throughput values as for the linear variant. For the Xeon Gold 6240R and the combination (AVX2; uint32_t
), the throughput of the stride-block is even slightly higher than for the linear variant. Thus, our multi-threaded evaluation shows that stride-block scales comparably when considering the linear variant.
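To make the underlying cache-set arithmetic explicit – using the L1d parameters from the table in Sect. 2.1 and assuming the usual 4 KiB pages and 64-byte cache lines – we obtain:

\[
\#\text{sets} \;=\; \frac{32\,\text{KiB}}{8\ \text{ways} \times 64\,\text{B}} \;=\; 64,
\qquad
64\ \text{sets} \times 64\,\text{B} \;=\; 4\,\text{KiB}.
\]

Hence, addresses that are a multiple of 4 KiB (one page) apart map to the same cache set. A GATHER over 8 pages therefore occupies exactly the 8 ways of a single set, while the 16 lanes of (AVX512; uint32_t) would require 16 ways.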
3 Partition-based SIMD
As our evaluation has shown, data elements from non-consecutive memory locations can be accessed using the GATHER operation with (almost) the same performance as data elements from consecutive memory locations using the LOAD operation, in single-threaded as well as multi-threaded environments. However, GATHER requires a proper access pattern to achieve its peak performance. According to the evaluation, our strided access pattern variant stride-block with the following properties \(P_{1}\) and \(P_{2}\) perfectly fits: (\(P_{1}\)) the stride size is equal to the page size (or a small multiple of it), so that every SIMD lane operates on its own page, and (\(P_{2}\)) the number of SIMD lanes – and thus the number of pages accessed in parallel – does not exceed the associativity of the L1d cache. Both properties together enable an efficient use of GATHER.
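A minimal sketch of such a block-strided kernel – our own simplified code, assuming AVX512F, uint64_t data, 4 KiB pages, and an array length that is a multiple of 8 pages – could look as follows:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Block-strided AggSum: P1 - the stride equals one 4 KiB page
// (512 uint64_t elements); P2 - 8 SIMD lanes process 8 pages in
// parallel, matching the 8-way associativity of the L1d cache.
uint64_t aggsum_stride_block(const uint64_t* data, size_t n) {
    constexpr int64_t kStride = 4096 / sizeof(uint64_t);  // 512 elements
    const __m512i idx = _mm512_set_epi64(7 * kStride, 6 * kStride,
                                         5 * kStride, 4 * kStride,
                                         3 * kStride, 2 * kStride,
                                         kStride, 0);
    __m512i sum = _mm512_setzero_si512();
    for (size_t block = 0; block < n; block += 8 * kStride) {
        // Shift the base by one element per GATHER so that every lane
        // scans its own page linearly (cf. Fig. 1c).
        for (int64_t off = 0; off < kStride; ++off) {
            __m512i v = _mm512_i64gather_epi64(idx, data + block + off, 8);
            sum = _mm512_add_epi64(sum, v);
        }
    }
    return static_cast<uint64_t>(_mm512_reduce_add_epi64(sum));
}
```

Each SIMD lane thereby acts as a worker that exclusively scans its own page, which is exactly the partitioning idea developed next.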
To efficiently utilize the increasing number of cores in scale-up hardware systems for data processing, the data-oriented architecture was proposed, which turned out to show superior scalability [17, 18, 22, 31]. The core concept of this data-oriented architecture is that all data objects are implicitly partitioned and disjoint partitions are exclusively accessed by their assigned worker thread, which is pinned to a specific core (hardware thread). In line with this architecture, our partition-based SIMD processing concept implicitly partitions data by our access pattern, and partitions – in this case pages – are assigned to SIMD lanes. Then, the pages are linearly processed in a distributed fashion by the SIMD lanes operating on their local pages. Thus, our partition-based SIMD processing concept is the logical continuation of the data-oriented architecture, which has only considered CPU cores so far.

4 Application Use Cases
4.1 Vectorized Query Processing
We implemented both operators of a simple analytical query template – a filter and an aggregation sum (AggSum) – using a linear access pattern as done in state-of-the-art implementations as well as using our block-strided access pattern. Both operators consume a vector of data in columnar format. The filter operator initially broadcasts the predicate value into an SIMD register. Then, in every iteration, the filter transfers data from column \(B\) into an SIMD register and compares it to the predicate vector register. The transfer is done either using a LOAD operation for the linear access pattern or a GATHER for the block-strided access pattern. AVX2 and AVX512 directly support comparing vector registers with the cmp operation. The result is transformed into a bitmask to reduce materialization costs. If the \(n\)th bit of the resulting bitmask is set to one, the \(n\)th element from the data SIMD register is valid for the specific filter. Since we are not storing any additional positional information alongside the bitmask, an operator that consumes this bitmask must decode the specific information implicitly. While this transformation is done on the fly when using AVX512, an efficient AVX2 implementation is more challenging, as highlighted in Fig. 6. For example, the _mm256_cmpeq_epi32 operation for AVX2 – to compare two SIMD registers containing 32-bit integers regarding equality – produces an SIMD register of the same size in which each element has either all bits set to one, if the corresponding values of the two input registers were equal, or all bits set to zero otherwise. To convert this output into a bitmask, the resulting SIMD register has to be cast into a single-precision floating-point SIMD register using the _mm256_castsi256_ps operation, which is done only for typing purposes and does not incur any actual computations. Next, the most significant bit of every 32-bit element is wrapped up in a single bitmask using the _mm256_movemask_ps operation. While this procedure seems expensive, our experiments showed that the resulting memory reduction of up to 32x over-compensates the additional computational costs for memory-bound algorithms.
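The described AVX2 conversion can be sketched as follows – a minimal example, with an illustrative function name and an equality predicate, following our reading of Fig. 6:

```cpp
#include <immintrin.h>
#include <cstdint>

// AVX2 filter step: compare 8 32-bit values against a broadcast
// predicate and condense the comparison result into an 8-bit bitmask.
uint8_t filter_eq_avx2(const int32_t* column, int32_t predicate) {
    __m256i keys = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(column));
    __m256i pred = _mm256_set1_epi32(predicate);
    // Per element: all bits one on equality, all bits zero otherwise.
    __m256i cmp = _mm256_cmpeq_epi32(keys, pred);
    // Bit-cast to packed floats (no computation), then extract the most
    // significant bit of each 32-bit element into one integer bitmask.
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp));
    return static_cast<uint8_t>(mask);  // 256-bit result shrunk to 8 bits (32x)
}
```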
When the AggSum operator is called for the first time, a result SIMD register is zero-initialized. The operator returns this SIMD register after a complete vector is processed. The same register is then used as input for every subsequent call and is updated with every processed vector. When all data is processed, all elements of the resulting SIMD register are summed up and returned. For every vector, the AggSum operator transfers the relevant data from column \(A\) into an SIMD register and loads the positionally equivalent bitmask from the previous operator.
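For AVX512, a single AggSum step consuming such a bitmask might look like this – a sketch with our own function name, where the 8-bit mask matches the 8 uint64_t lanes:

```cpp
#include <immintrin.h>
#include <cstdint>

// One AggSum step: add the current 8 uint64_t values of column A to the
// running sums, but only in the lanes whose filter bit is set.
__m512i aggsum_step(__m512i running_sum, const uint64_t* column_a,
                    __mmask8 bitmask) {
    if (bitmask == 0) return running_sum;  // branch: no lane is valid
    __m512i values = _mm512_loadu_si512(column_a);
    // Masked add: lanes with a zero bit keep their previous partial sum.
    return _mm512_mask_add_epi64(running_sum, bitmask, running_sum, values);
}
```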
The evaluation results for (a) (AVX2; uint32_t) with a vector size of 1024 values and (b) (AVX512; uint64_t) with a vector size of 2048 values are shown in Fig. 7. Since our implementation of AggSum branches, the overall query throughput depends on the selectivity of the filter. The branching checks the resulting bitmask of the filter, because the functionality of the AggSum operator is only executed when the bitmask contains at least one bit indicating a valid filter result. If the bitmask contains only zeros, the aggregation is skipped. As shown in Fig. 7, our partition-based SIMD processing implementation does not amortize on the Xeon Phi 7250. However, it is on par with the linear access on newer CPUs and even slightly better for selectivities of at least 0.2 on Skylake and 0.05 on Cascade Lake. Interestingly, while all investigated gap factors perform similarly on the Xeon Phi, \(g=1\) is optimal on our Skylake CPU and \(g=4\) is optimal on our Cascade Lake CPU. Thus, we conclude that our partition-based SIMD processing concept can be efficiently applied to the vectorized processing model.
4.2 Integer Compression
While the linear variant uses a STORE instruction to write out the compressed values, our block-strided variant requires a selective SCATTER instruction.

The evaluation results are shown in Fig. 8 for (AVX512; uint64_t). Here, we used different synthetic data sets with randomly generated 64-bit unsigned integer values. Each data set contains 100 million values of a specific bit width, and the bit width is depicted on the x-axis. As we can see in Fig. 8a, the SIMD variant using our block-strided access pattern (\(g=1\)) closely matches the speed of the linear variant. Both SIMD variants compress blocks of 512 consecutive values, producing the same compression output. However, the SCATTER instruction introduces a slight overhead for our block-strided variant.
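A selective SCATTER writes only the lanes flagged in a mask; a minimal AVX512F sketch – our own illustrative wrapper, not the exact compression code – is:

```cpp
#include <immintrin.h>
#include <cstdint>

// Write only the masked lanes of `values` to out[idx[lane]];
// lanes with a zero mask bit are not written at all.
void scatter_selective(uint64_t* out, __mmask8 mask,
                       __m512i idx, __m512i values) {
    _mm512_mask_i64scatter_epi64(out, mask, idx, values, 8);  // scale: 8 bytes
}
```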
5 Related Work
Most previous work concluded that the GATHER instruction provides only a marginal benefit for complex operators such as hash joins [4–6, 23, 28]. Our evaluation has shown the opposite: the GATHER instruction can be very efficient with a fine-tuned access pattern. Now it would be interesting to see how our proposed block-strided access pattern could be used, for example, for hash joins. Another promising use case would be Bloom filters as shown in [27], where the authors gather parts of a Bloom filter. Moreover, recent work has introduced the idea of sharing vector registers for concurrently running queries [24, 25]. Here, the authors used the GATHER instruction as one alternative, which was not the best, but still gave reasonable results. Thus, we conclude that the GATHER instruction offers a very flexible way to populate SIMD registers with high performance, if applied properly. The same applies to the SCATTER instruction, which should be examined more closely in future work.

6 Conclusion
In this article, we systematically evaluated the GATHER instruction for strided access patterns on different Intel processors. As we have shown experimentally, GATHER can achieve the same performance as the LOAD instruction if our proposed block-strided access pattern is applied. Furthermore, we have illustrated that our new access pattern can be used for a partition-based SIMD processing concept, and that this new concept can be applied very well to more complex use cases such as (simple) analytical queries or integer compression algorithms. In both cases, we achieved slightly better performance compared to the state-of-the-art SIMD-based implementations using a linear access pattern. Overall, our new access pattern opens up a new dimension for efficient fine-grained, partition-based SIMD implementations.