In the previous post we briefly described what might stand beyond asymptotic analysis of algorithms and data structures when it comes to empirical measurements of performance. In this post I continue talking about latter considering more closely the impact of memory hierarchies of modern computer architectures. A few basic data structures are taken for comparison of locality utilization with short explanations. Further, we’re going to touch on effective utilization of CPU-cache by showing some techniques to improve performance. All benchmarks are done on JVM HotSpot 6. Due to different algorithms of GC and memory allocation, the techniques might not / partly work on other platforms, but the idea to improve locality should work on the majority of modern CPU-architectures. I advice that every software developer should read this excellent article (especially “sections 3 and 6) to better understand CPU-caches and memory and then come back to this post.
Memory hierarchies and difference in speed
There’s a huge gap between the speed of CPUs and the latency of DRAM-memory. CPU’s cache-memory is roughly 100 times faster than main memory which in turn 10k times faster than secondary storage. Thus, the processor will have to wait more than 100 cycles every time the memory is needed to deliver data. This problem is solved by sticking smaller, faster memory chips in between the processor and the main memory. These chips are called CPU-caches. Techniques in which cache is heavily utilized during the execution of a program can dramatically impact the performance of algorithms. Caches improve performance when memory accesses exhibit locality.
Locality of reference
Hierarchical memory is a hardware optimization that takes the benefits of spatial and temporal locality and can be used on several levels of the memory hierarchy. Locality describes the same value or related storage locations being frequently accessed from memory. There are two basic types of locality:
- Temporal locality refers to the reuse of specific data and/or resources within relatively small time durations. Access latency can be avoided by reusing the data fetched previously. With poor temporal locality, data that could have been reused is lost from a working area and must be fetched again. Loops over the same small amount of data result in excellent temporal locality.
- Spatial locality refers to the use of data elements within relatively close storage locations. The concept that likelihood of referencing a resource is higher if a resource near it was just referenced. With this type of locality, access latency can be amortized by retrieving multiple data items in a single fetch cycle and using good spatial locality to ensure that all the data fetched is useful. If the data is not useful, because of poor spatial locality, then more fetch cycles will be required to process the same data. A simple array is a good candidate for this type of locality.
When high speed is needed it is very important to understand what data structures exhibit better locality of CPU-caches and how to improve one. Good locality is a good speed of data access affecting both throughput and latency. For CPU-caches the most common replacement policy is to fetch blocks from slower memory (DRAM) into fast memory (SRAM) 1 cache-line at a time when block of memory is touched by a program which doesn’t reside in the cache, a kind of LRU that might differ not significantly from pure LRU-policy. Note that how the program-level locality is mapped onto memory-level depends on the compiler’s layout of objects and on the allocator’s placement of those objects in memory in the operating system. Most compilers layout data in consecutive areas of memory. Normally, compilers preserve the order of variables which helps to achieve a good locality on a program-level. The conventional wisdom is that programs spend 90% of their time executing 10% of the code. By placing the most common instructions and data in the fast-but-small storage, while leaving the rest in the slow-but-large storage, we can lower the average memory-access time of a program significantly. Next, we consider locality effects on classic algorithms and data structures with techniques.
Arrays, linked lists and locality
Now let’s consider a simple analysis by example exhibiting poor locality of reference.
I run the benchmark on an Intel Core 2 Duo CPU 3GHz using:
- JDK: 1.6.0_27
- JVM-params: -Xms512m -Xmx512m -XX:CompileThreshold=1 -XX:newRatio=128 -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
The benchmark was taken 100 times in a row to calculate average values. I performed a simple test by appending a single element in a loop N times to the end of a LinkedList and then did the same for an ArrayList. Mind you, the complexity of the operation on both collections is equal O(1).
As can be seen on the graph ArrayList is a winner. In this benchmark I tried to do the same test with preallocated array for the first test and by default for the second one. Of course, for non-preallocated ArrayList total time grew higher, but nevertheless one has beaten LinkedList with the above figures.
The thing is that a dynamic array allocates all elements contiguously in memory usually in a single continuous block of memory whereas a linked list contains its elements fragmentally. That is, nodes of a linked list can be scattered in arbitrary areas of memory. As have been said in the beginning caches of modern processors don’t like random access to memory. Therefore, sequential access in arrays is faster than on linked lists on many machines, because they have a very good locality of reference and thus make good use of data caching.
Recall that LinkedList will have to do additionally a memory allocation on each insertion of a new element. But, wait, access to elements in the memory takes longer than from CPU-cache. You can check this using a bit different test by iterating these two data structures from left to right with the exact parameters as above and see. Again, the iteration time is slower in LinkedList due to locality of reference. The ArrayList
- Unrolled LinkedList – it can dramatically increase cache performance, while decreasing the memory.
- False sharing – cache-optimization for arrays in SMP CPU-architecture – another technique used in multithreading.
The bottom line: Arrays are superior at exploiting CPU-cache in a sequential access unlike linked data structures. Thanks to spatial locality! The problem of linked lists is when the node is accessed, the whole cache line is fetched from main memory, yet it is mostly not used.
Large array and better cache utilization
Large Arrays in Hot Spot JVM are placed in contiguous memory as many other memory allocators try to do. However, all their elements might not fit into cache. An alternative is to split the array up into smaller ones so that each one fits into CPU-cache. The size of such a small array depends on many factors. I’ll try to explain how to tune array-sizes in later posts.
Hash tables exhibit poor locality of reference, because they cause access patterns that jump around, this can trigger microprocessor cache misses that cause long delays. Randomization causes bad locality. Compact data structures such as arrays searched with linear search may be faster in some cases. An interesting article “How caching affects hashing” reveals the fact that the number of collisions a hash table produces, may lead to superior performance. In this paper three basic approaches were analyzed: linear probing, linear double hashing and exponential double hashing. All three make up so called Open addressing or closed hashing method of collision resolution. Open addressing can lead to more concise tables yielding a better cache performance than classing bucketing. But as the load factor starts to get high its performance might downgrade. The experimental analysis provided that, assuming a nearly uniform distribution, linear probing outperforms double hashing due to the fact that the percentage of the cache hits per probe is higher in the case of linear probing provided that its data set is not very large. However, If data doesn’t fit into memory, linear probing may work slower. A great minus of this type of implementation is that the operation of deletion in open addressing is very expensive because of its O(n) worst time complexity on the array.
If a hash table tends to have many collisions, we can apply “Unrolled linked list” described above. Ideally each linked list’s element should occupy one cache line on the appropriate cache level. A great minus is that the size of a cache line is CPU-architecture-bound.
Binary Search Trees
Here I consider classic unbalanced Binary Searh trees (aka BST), Red-Black trees, AVL-trees and Splay-trees (aka The splay tree of Sleator and Tarjan) in terms of locality. Each tree should be applied in a different situation. All of these are linked data structures made up of nodes. Each node have 3 pointers: parent, left child and right child. Locality in trees is a tendency to look for the same element multiple times. Note that a set of operations exhibits no locality if every element is equally likely to be accessed at each point. Therefore, we’re going to consider here only those cases where elements are accessed multiple times.
Splay-trees are the most intriguing due to the fact that they simply have the ability to optimize themselves for better locality of reference. The tree performs rotations of nodes to the root every time an access is made. Also note that they are not balanced trees as BST. Despite the fact that a worst case bound on the Splay-operation is O(n) for n nodes the amortized time for a set of operations is quite efficient (O(lg n)) which is compensated by these rotations and locality. Here we are talking about “top-down splay-tree” variation. Splay trees are the winner of locality among these ones when insertions happen quite frequently in sorted order and later accesses are sequential. Frequently used nodes are located near the root. In almost all other cases, because of the high cost of maintaining self-adjustment. Random insertion is the worst among all 4 data structures due to splay-overhear. On the contrary, the AVL-tree outperforms a Splay-trees when the search-key distribution is uniform and very frequent.
If insertions happen quite frequently in sorted order the locality of AVL-trees is quite good provided that later accesses are more random. In other cases it may carry out far more comparisons than other trees which deteriorates its performance and therefore may stand behind others. The search performance of the AVL-tree is almost always better than that of Splay-trees.
If input is randomly ordered but sequential traversal happen frequently then red-black trees should be used. Random insertions perform better over AVL-trees. However, for pathological inputs AVL-operations work faster than in Red-Black tree due to the stricter rule of rebalancing.
When randomly ordered input can be relied upon it is best to use this basic kind of binary search trees over others. It requires the least extra overhead unlike the other tree-structures.
The bottom line: Input set and distribution of data both matter. In addition to locality, sometimes other factors are much more important for performance.
For random input set: BST – is the fastest among 4 remaining ones, then goes Red-Black tree, then goes the AVL-tree and the slowest one is a Splay-tree.
Splay trees take much of the CPU-time mostly on rotations where they lose in speed. There are some optimizations towards fewer splay-operations for certain cases, but they are not discussed in this blog. Unbalanced BSTs are simpler in implementation and have lighter operations and only best work against random data.
For pathological input set the picture is the opposite – from fastest to slowest: Splay-tree due to high locality, AVL-tree, Red-Black tree, BST – is extremely slow as it is unbalanced.
As this series is devoted solely to locality and some facts mentioned are not directly related to it, in later series I’ll try to give some empirical benchmarks on overall performance of these structures to make the picture more clear.
it is worth noting that hidden constants caused by locality of reference might differ depending on computer architecture and implementation. Multiple operations on data structures with non-sequential access to elements cause poor performance. Asymptotic comparison of cache-friendly data structures with others is meaningless because in reality the result can be quite the contrary. Defragmented location of related elements in memory causes CPU cache-losses which can drastically degrade overall performance. Especially It is sensible on large data volumes where low latency is at premium. Mind you, algorithm with good locality is not sufficient for better performance. A number of operations, their cost including CPU-time do matter too.