The development of STXXL started with sorting, because sorting is the fundamental tool for I/O-efficient processing of large data sets. An efficient sorter therefore largely defines the performance of an external memory software library as a whole. To achieve the best performance, our implementation [DemSan03] uses parallel disks, has an optimal I/O volume of O((N/(DB)) log_{M/B}(N/B)) (where N is the input size, M the internal memory size, B the block size, and D the number of disks), and guarantees almost perfect overlap between I/O and computation.
No previous implementation has all these properties, which are needed for good practical sorting performance. LEDA-SM [CraMeh99] and TPIE [APV02] concentrate on single-disk implementations. To overlap I/O and computation they rely on the prefetching and caching provided by the operating system, which is suboptimal since the system knows little about the application's access pattern.
Barve and Vitter implemented a parallel disk algorithm [BarGroVit97] that can be viewed as the immediate ancestor of our algorithm. Compared to it, our sorter adds the following innovations: a different allocation strategy that enables better theoretical I/O bounds [HutSanVit01b] [KalVar01]; a prefetching algorithm that optimizes the number of I/O steps and never evicts data previously fetched; overlapping of I/O and computation; a completely asynchronous implementation that reacts flexibly to fluctuations in disk speeds; and an implementation that sorts many gigabytes and does not have to limit the internal memory size artificially to obtain a nontrivial number of runs. Additionally, our implementation is not a prototype: it has a generic interface and is part of the software library STXXL.
The algorithms in [Raj98] [ChaCor02] [ChaCorWis01] have the theoretical advantage of being deterministic. However, they need three passes over the data even for inputs of moderate size.
Prefetch buffers for disk load balancing and for overlapping I/O and computation have been studied intensively for external memory merge sort ([PaiVar92] [CaoFelKarLi96] [AlbGarLeo98] [HutSanVit01b] [KalVar01] [KimKar00]). However, we are not aware of results that guarantee overlapping of I/O and computation during the parallel disk merging of arbitrary runs.
There are many good practical implementations of sorting (e.g., [NBCGL94] [Aga96] [NKG00] [Wyl99]) that address parallel disks and overlapping of I/O and computation, and that have low internal overhead. However, we are not aware of fast implementations that give theoretical performance guarantees on achieving asymptotically optimal I/O. Most practical implementations use a form of striping that treats the D disks as a single logical disk with block size DB; this reduces the merge fan-in from Θ(M/B) to Θ(M/(DB)) and hence needs more I/Os than the asymptotic optimum.
On the other hand, many of the practical merits of our implementation are at least comparable with those of the best current implementations: we come close to the peak performance of our system.
Perhaps the most widely used external memory sorting algorithm is k-way merge sort: during run formation, chunks of Θ(M) elements are read into internal memory, sorted internally, and written back to disk as sorted runs. The runs are then merged into larger runs until a single sorted run remains.
There are many ways to overlap I/O and run formation. We start with a very simple method that treats internal sorting as a black box and can therefore use the fastest available internal sorters. Two threads cooperate to build k runs of size M/2 each:
    post a read request for runs 1 and 2

    thread A:                        thread B:
    for r := 1 to k do               for r := 1 to k-2 do
        wait until run r is read         wait until run r is written
        sort run r                       post a read for run r+2
        post a write for run r
The figure above illustrates how this algorithm overlaps I/O and computation. Formalizing the figure, one can prove that with this approach an input of size N can be transformed into sorted runs of size about M/2 in time dominated by the maximum of the internal sorting time and the I/O time, up to lower-order terms.
In [DemSan03] one can find an algorithm that generates longer runs of average length 2M and otherwise offers similar performance guarantees.
We want to merge k sorted sequences comprising N' elements in total, stored in N'/B blocks. (In practice, where a single merging phase usually suffices, N' = N.)
The prediction sequence specifies the order in which the blocks of the runs will be needed during merging; it is obtained by sorting all blocks by their smallest keys, and blocks are fetched in this order.
The keys of the smallest elements in each buffer block are kept in a tournament tree data structure [Knu98] so that the currently smallest element can be selected in time O(log k).
We have now defined multi-way merging from the point of view of the sorting algorithm. Our approach to merging deviates slightly from previous approaches, which keep track of the run numbers of the merge blocks and pre-assign each merge block to the corresponding input sequence. In those approaches, the last key in the previous block also decides the position of a block in the prediction sequence.
Although we can predict the order in which blocks are read, we cannot easily predict how much internal work is done between two reads. For example, consider k identical runs: the merger consumes them in lockstep, so all merge buffer blocks run empty at almost the same time and k new blocks are needed nearly simultaneously, whereas at other times long stretches of merging may pass without any block being finished.
The I/Os for the run formation and for the output of merging are perfectly balanced over all disks if all sequences are striped over the disks, i.e., sequences are stored in blocks of B elements each and block number i is placed on disk i mod D.
The merging algorithm presented above is optimal for the unrealistic model of Aggarwal and Vitter [AggVit88], which allows any D blocks to be accessed in a single I/O step. This facilitates good performance for fetching very irregularly placed input blocks. However, this model can be simulated on D independent disks using randomized striping allocation [VitHut01] and a prefetch buffer of Θ(D) blocks.
Run Formation. We build runs of a size close to M/2.
We have two implementations with respect to the internal work: stxxl::sort is a comparison-based sort that uses std::sort from the STL to sort the runs internally; stxxl::ksort exploits integer keys and has smaller internal memory bandwidth requirements for large elements with small key fields. After reading elements using DMA (i.e., STXXL's direct access), we extract pairs of keys and pointers to elements, sort these small pairs, and only then move the elements into sorted order in the write buffers.
Furthermore, we exploit random keys: we use two passes of MSD (most significant digit) radix sort on the key-pointer pairs. The first pass uses the m most significant bits, where m is a tuning parameter depending on the size of the processor caches and of the TLB (translation look-aside buffer). This pass consists of a counting phase that determines the bucket sizes and a distribution phase that moves the pairs; the counting phase is fused with pair extraction into a single loop. The second radix sort pass uses the number of bits that brings the expected bucket size closest to two. This two-pass algorithm is much more cache-efficient than a one-pass radix sort. (On our system we get a factor of 3.8 speedup over the one-pass radix sort and a factor of 1.6 over the STL's sort, which in turn is faster than a hand-tuned quicksort.)
Multi-way Merging. We have adapted the tuned multi-way merger from [San00b], in which a tournament tree stores pointers to the current elements of each merge buffer.
Overlapping I/O and Computation. We integrate the prefetch buffer and the overlap buffer into a single read buffer. We distribute the buffer space between its two purposes, minimizing disk idle time and overlapping I/O and computation, indirectly, by computing an optimal prefetch sequence for a smaller buffer space.
Asynchronous I/O. I/O is performed without any synchronization between the disks. The prefetcher computes a sequence of blocks indicating the order in which they are to be fetched, and reads are issued asynchronously in this order as buffer blocks become available. Files are accessed with the O_DIRECT option, so that blocks are moved by DMA (direct memory access) straight to user memory. A fetched block then travels to the prefetch/overlap buffer and from there to a merge buffer simply by passing a pointer. Similarly, when an element is merged, it is moved directly from the merge buffer to the write buffer, and a block of the write buffer is passed to the output queue of a disk simply by passing a pointer to the I/O layer of STXXL, which then uses write to output the data, again via DMA.