Background Information for STREAM Improvement Post
Post date: Sep 29, 2014 5:58:17 PM
Here is additional technical information for the previous post, "Suggestions for Improvement of STREAM OpenMP Code", for the benefit of readers unfamiliar with the technology of these systems.
AMD Opteron Has Integrated Memory Controller
AMD Opterons have integrated memory controllers. Memory bandwidth scales with the number of CPUs because every CPU has its own integrated memory controller and its own banks of DIMMs.
Opterons are connected to one another through point-to-point HyperTransport links, and access to non-local DIMMs on other CPUs goes over these links. If a thread needs data residing on DIMMs belonging to a CPU that is not directly connected to the CPU the thread runs on, multiple hops are required to reach the CPU with that DIMM.
Because memory access latency is non-uniform and depends on the locality of the memory, AMD Opteron systems are called Non-Uniform Memory Access systems, or NUMA for short. Local memory access is the fastest; the more node hops (in the case of Opteron, a CPU is a NUMA node) required to reach foreign memory, the greater the memory access latency.
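The difference between local and remote access can be seen directly. Below is a minimal sketch, assuming a Linux system with libnuma installed (build with gcc -O2 -fopenmp and link with -lnuma); the node numbers and buffer size are illustrative, not measurements from any particular Opteron system. It pins the process to one node, allocates one buffer locally and one on the highest-numbered node, and times a sweep over each.

/* numa_locality_sketch.c - illustrative only; assumes Linux with libnuma.
   Build: gcc -O2 -fopenmp numa_locality_sketch.c -lnuma */
#include <stdio.h>
#include <numa.h>
#include <omp.h>

#define N (64L * 1024 * 1024)              /* 64 Mi doubles = 512 MB per buffer */

static double sweep(const double *buf)     /* read the whole buffer, return a checksum */
{
    double s = 0.0;
    for (long i = 0; i < N; i++)
        s += buf[i];
    return s;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int local = 0, remote = numa_max_node();          /* e.g. node 0 vs. the farthest-numbered node */
    numa_run_on_node(local);                          /* keep this process on the local node */

    double *local_buf  = numa_alloc_onnode(N * sizeof(double), local);
    double *remote_buf = numa_alloc_onnode(N * sizeof(double), remote);
    for (long i = 0; i < N; i++) { local_buf[i] = 1.0; remote_buf[i] = 1.0; }

    double t0 = omp_get_wtime();
    double s1 = sweep(local_buf);
    double t1 = omp_get_wtime();
    double s2 = sweep(remote_buf);
    double t2 = omp_get_wtime();
    printf("local  sweep: %.3f s (sum %.0f)\n", t1 - t0, s1);
    printf("remote sweep: %.3f s (sum %.0f)\n", t2 - t1, s2);

    numa_free(local_buf,  N * sizeof(double));
    numa_free(remote_buf, N * sizeof(double));
    return 0;
}

On a machine with a single node the two times will be about the same; on a NUMA box with multiple hops between nodes, the remote sweep takes noticeably longer.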
Silicon Graphics Origin 2000, One of the Early NUMA Systems
One of the early NUMA systems was the Silicon Graphics Origin 2000, which scaled by connecting two-CPU nodes to other nodes through a system of hubs, routers and the CrayLink Interconnect. Silicon Graphics, Inc. was famous for its distributed shared memory, single system image ccNUMA (cache coherent Non-Uniform Memory Access) systems that could scale far bigger than other single systems. The Origin 2000 could scale up to 512 MIPS CPUs (single core in those days). NASA had multiple Origin 2000 systems, including one with 512 CPUs, and an Origin 3000 with 1024 CPUs.
As a student in 1998, I used two different Origin 2000 systems: one at the National University of Singapore with sixteen MIPS R10000 195 MHz CPUs, and the other at the Institute of High Performance Computing, which started off with sixteen MIPS R10000 195 MHz CPUs and was later expanded to sixty-four MIPS R10000 CPUs with a mixture of 195 and 250 MHz parts.
One has to take memory locality into account when writing programs for the Origin 2000, as remote memory access is more expensive: the memory access latency increases with the number of node hops needed to reach the memory where the required data resides.
IRIX, the Silicon Graphics operating system on the Origin 2000, has a first-touch memory placement policy that faults in pages on the local node of the thread whenever possible. IRIX also has a utility, "dplace", that lets the user either have the system run a parallel program on nodes that are close together, or specify the exact nodes on which to run it. One has to understand the topology of the particular Origin system to know which nodes are closest together and get the best performance. Obviously, achieving optimum program placement while time-sharing an Origin with other users' programs is a lot more difficult.
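The same first-touch idea carries over to STREAM-style OpenMP code on any NUMA system: if the arrays are initialized in parallel, by the same threads and with the same loop schedule that the kernels later use, each thread faults its own pages in on its local node. Here is a minimal sketch of that pattern (the array size and the schedule(static) choice are illustrative, not taken from the original STREAM source):

/* First-touch placement sketch for STREAM-style arrays (illustrative sizes).
   Build: gcc -O2 -fopenmp first_touch_sketch.c */
#include <stdlib.h>
#include <omp.h>

#define STREAM_ARRAY_SIZE 20000000L

int main(void)
{
    double *a = malloc(STREAM_ARRAY_SIZE * sizeof(double));
    double *b = malloc(STREAM_ARRAY_SIZE * sizeof(double));
    double *c = malloc(STREAM_ARRAY_SIZE * sizeof(double));

    /* Parallel initialization: each thread touches its own chunk first, so a
       first-touch policy places those pages on that thread's local node. */
    #pragma omp parallel for schedule(static)
    for (long j = 0; j < STREAM_ARRAY_SIZE; j++) {
        a[j] = 1.0; b[j] = 2.0; c[j] = 0.0;
    }

    /* The kernels must use the same schedule(static) so each thread works on
       the chunk it initialized, i.e. on memory local to its node. */
    #pragma omp parallel for schedule(static)
    for (long j = 0; j < STREAM_ARRAY_SIZE; j++)
        c[j] = a[j] + b[j];

    free(a); free(b); free(c);
    return 0;
}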
Therefore, as a student writing my own computational mechanics code, I was conscious of the need to be NUMA aware. The next NUMA system I used as a student was the Compaq AlphaServer GS320 with twenty-two EV6.7 Alpha 21264A 731 MHz CPUs at the National University of Singapore.
AMD Opteron Introduced Both 64 bit and NUMA into Commodity Computing Systems
64-bit computing and NUMA, which started off in high-end computing systems, went mainstream in commodity computing when AMD introduced the Opteron in 2003, for systems with one to eight CPUs.
In an AMD Opteron system, every CPU is a NUMA node.
Architecture of Sun Fire X4600 M2
I have attached a white paper, "Sun Fire X4600 M2 Server Architecture". This white paper gives detailed information on the architecture of the Sun Fire X4600 M2 and the AMD Opteron.
The CPU topology is explained on pages 21 to 23. The X4600 M2 uses an Enhanced Twisted Ladder CPU-to-CPU connection topology to minimize hop distance. As shown on page 23, with eight CPUs the maximum number of hops between any two CPUs is three, between CPU 0 and CPU 7. The hop count between any other pair of CPUs is either one or two. CPU 0 and CPU 7 each have a total hop count of thirteen to the other CPUs, while every other CPU has a total hop count of eleven.
What this topology implies is that CPU 0 and CPU 7 have higher cache coherency overhead than the other CPUs because more hops are required; threads running on CPU 0 and CPU 7 will have lower performance than threads running on the other CPUs.
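On operating systems that expose the firmware's node distance table, the relative cost of these hops can be inspected at run time. The following is a minimal sketch, assuming Linux with libnuma (build with gcc -O2 and link with -lnuma); the distances it prints come from the firmware's tables, so the actual values depend on the machine:

/* Print the NUMA node distance table (assumes Linux with libnuma). */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;
    int n = numa_max_node() + 1;                   /* number of NUMA nodes */
    printf("node ");
    for (int j = 0; j < n; j++)
        printf("%5d", j);
    printf("\n");
    for (int i = 0; i < n; i++) {
        printf("%4d ", i);
        for (int j = 0; j < n; j++)
            printf("%5d", numa_distance(i, j));    /* 10 means local; larger means more hops */
        printf("\n");
    }
    return 0;
}

The same information is reported by the numactl --hardware command on Linux.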
The Sun Fire X4600 and X4600 M2 models are identical except for differences in the CPU modules and PSUs. There are CPU modules with Socket 940 and DDR DIMMs, with Socket F and DDR2 DIMMs, and with Socket F+ and DDR2 DIMMs; the CPU modules have either four or eight DIMM slots. An X4600, which has CPU modules with Socket 940 Opterons, can be converted to an X4600 M2 by replacing them with new CPU modules carrying Socket F or Socket F+ Opterons.
Note that while the white paper describes the X4600 M2 as having eight DIMM slots in each CPU module, the X4600 M2 I had at the Data Storage Institute has eight CPU modules, each with an AMD Opteron 8224 SE and four DIMM slots.
Memory Bandwidth of AMD Opteron System Scales with the Number of CPUs, But Not Linearly
On the AMD Opteron 8224 SE and 2224 SE there is only one memory controller per CPU, so running on all sixteen cores of the Sun Fire X4600 M2 will not scale memory bandwidth sixteen times. There is also the overhead of cache coherency: besides maintaining cache coherency across the cores within the same CPU, coherency also has to be maintained across all the cores of the other CPUs in the system. The greater the number of CPUs, the greater the cache coherency overhead. Therefore the raw memory bandwidth will not scale linearly with the number of CPUs. And of course there is the overhead of threading and so on.
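One way to see how far the bandwidth does scale is to time a STREAM-style triad at increasing thread counts and compute the achieved rate. Below is a minimal sketch (the array size is illustrative, and it assumes the OpenMP runtime is told to spread and bind threads across the CPUs, for example via OMP_PROC_BIND):

/* Triad bandwidth sketch; run with different OMP_NUM_THREADS values.
   Build: gcc -O2 -fopenmp triad_scaling_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000L                        /* 20 M doubles per array, about 160 MB each */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    /* First-touch initialization with the same static schedule as the kernel. */
    #pragma omp parallel for schedule(static)
    for (long j = 0; j < N; j++) { a[j] = 0.0; b[j] = 2.0; c[j] = 1.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long j = 0; j < N; j++)
        a[j] = b[j] + 3.0 * c[j];          /* triad: two loads and one store per element */
    double t = omp_get_wtime() - t0;

    /* Three arrays of N doubles move through memory in one pass of the kernel. */
    printf("%d threads: %.2f GB/s\n",
           omp_get_max_threads(), 3.0 * N * sizeof(double) / t / 1e9);

    free(a); free(b); free(c);
    return 0;
}

Running this with OMP_NUM_THREADS set to 2, 4, 8 and 16 shows the bandwidth growing as threads spread across more CPUs, but by less than the thread-count ratio.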
Starting from the "Istanbul" generation of Opteron introduced in 2009, AMD implemented "HT Assist", which is enabled on systems with more than two CPUs. Part of the shared L3 cache on each CPU is used as a directory cache that tracks the location of cache lines in the system. This greatly reduces the cache coherency traffic on the HyperTransport links. The use of "HT Assist" and higher-clocked HyperTransport links results in higher system bandwidth and memory bandwidth.
According to AMD, the STREAM benchmark result on a four-CPU Istanbul system increases significantly when "HT Assist" is enabled.
Sun Fire X4600 M2 with Eight Opteron Has Higher Memory Bandwidth than Sun Ultra 40 M2 with Two Opteron
In my tests, the Sun Ultra 40 M2 with two Opteron 2224 SE CPUs (same clock speed and specifications as the 8224 SE in the Fire X4600 M2 except for the number of HyperTransport links) gave slightly higher stock STREAM results than the Sun Fire X4600 M2 with eight Opteron 8224 SE CPUs. The DIMMs in the Ultra 40 M2 are 2 GB DDR2 667 MHz (four DIMMs per CPU, 16 GB in total); those in the Fire X4600 M2 are 4 GB DDR2 667 MHz (four DIMMs per CPU, 128 GB in total). The memory bandwidth of the X4600 M2 is definitely much higher than that of the Ultra 40 M2.
If the measured memory bandwidth of a two-CPU workstation is higher, something is not right with the measurement. We had Fluent Computational Fluid Dynamics simulations running 24x7 on these Fire X4600 M2 and Ultra 40 M2 machines, and the Fluent parallel (MPI) licenses are not cheap. My colleagues would have found out if running with more than two CPUs gave no speedup; that was not the case.
TSUBAME at Tokyo Institute of Technology
The original TSUBAME at Tokyo Institute of Technology was built with more than six hundred Sun Fire X4600 servers, each with eight CPUs, running SUSE Linux Enterprise Server and connected over InfiniBand. TSUBAME was the flagship showcase of both Tokyo Institute of Technology and Sun Microsystems, and was used by researchers at Tokyo Institute of Technology to run various commercial and free Computer Aided Engineering and scientific computing applications.
TSUBAME was ranked seventh in the June 2006 Top500 list. Since TSUBAME was designed as a general purpose cluster for a wide user base, due diligence would have been done to ensure that the Sun Fire X4600 had satisfactory scaling and performance, including when running off-the-shelf applications not specifically optimized for the X4600; any bottleneck would have been apparent, and the investment certainly was not trivial.
Intel Xeon Systems with Front Side Bus Have Bottlenecked Memory Bandwidth
Intel Xeons of the same era have a Front Side Bus and a shared memory controller on the chipset. There is only ONE memory controller on the chipset to provide memory access to all the CPUs in the system, whether the system has one, two or four CPUs; all of the CPU cores have to access memory through the Front Side Bus and that single memory controller. The memory access latency is higher than on an Opteron system with integrated memory controllers.
Those Xeon systems are notorious for bottlenecked memory bandwidth and depend on big CPU caches for benchmark results. Xeon systems with the memory controller on the chipset do not scale memory bandwidth with the number of CPUs. An Intel Xeon system with four CPUs and many cores that has the same or even lower system memory bandwidth than a two-CPU Xeon system is obviously starved of memory bandwidth.
The AMD Opteron systems beat those Intel Xeon systems in floating point performance because floating point intensive codes often require intensive memory access, and even the big Xeon caches cannot compensate for poor memory bandwidth and latency.