Suggestions for Improvement of STREAM OpenMP Code

Post date: Jul 25, 2014 3:44:47 AM

STREAM is often used for system memory bandwidth benchmarking.

But did you know that the STREAM code does not actually measure the true system memory bandwidth, due to inefficiencies in the OpenMP program code?

The results reported by the STREAM benchmark are the true system memory bandwidth minus the overhead of the OpenMP implementation and minus the effects of non-local (foreign) NUMA memory access.

I last studied this problem in 2011. When I looked at the current version of STREAM (v5.10) recently, nothing had changed: the OpenMP code is written exactly as it was in 2011, so the same issues I identified then are still present in the current version.

I wrote the following email to STREAM creator John D. McCalpin (mccalpin@cs.virginia.edu) on 16 February 2011, but never received a reply.  I sent him a reminder email on 4 March 2011 but got no reply either.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Suggestions for Improvement of Stream OpenMP Code

To: mccalpin@cs.virginia.edu, CHIN_Gim_Leong@dsi.a-star.edu.sg

Hi John D. McCalpin,

I work for Data Storage Institute, part of A*STAR in Singapore.

I have run the stream benchmark on our system:

Sun Microsystems Fire X4600 M2, eight AMD Opteron 8224 SE 3.2 GHz dual-core CPUs with 1 MB L2 cache per core, 128 GB of memory in 4 GB DDR2 667 MHz DIMMs

SUSE Linux Enterprise Server 10 SP2

I have tried Sun Studio 12, Sun Studio 12 Update 1 ML, and Solaris Studio 12.2 on it.

The reported results were compiled with a fully patched Sun Studio 12 Update 1 ML, as there are performance regression issues with Studio 12.2.

I have attached the tar ball of the original code, various versions of modified code and the compiled binaries, plus the raw results.

The run time settings are in "runcmd", and the compilation flags are in "compile_stream_typhoon_vec".

The reason I am suggesting these improvements is that I am observing very low performance numbers when using all 16 cores.  The results with 1 core and 2 cores locked to one of the CPUs are in stream_typhoon1_simd_numa_m2_C4_run* and stream_typhoon2_simd_numa_m2_N2_run*.

numactl is used to lock memory to node 2 and the process to core ID 4 (numactl -m 2 -C 4), and to lock both memory and cores to node 2 (numactl -m 2 -N 2), respectively.  The results differ between CPUs, and CPU 2 gives some of the best single-core performance.

If I do not use numactl for two cores, the results in stream_typhoon2_simd_run* reflect the scheduler's own placement.  The array sizes differ: the single-core results use the default 2-million-element array size, while the two-core size is calculated as:

1048576        /* 4*2*1024^2/8 = 1048576 */

The 16-core array size is:

8388608        /* 4*16*1024^2/8 = 8388608 */
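
The rule behind these numbers is that each array is sized at four times the aggregate L2 cache, divided by 8 bytes per double, so that the test loops run from memory rather than cache.  A minimal sketch of that rule, with illustrative macro names:

    /* Size each array at 4x the aggregate L2 cache so the test loops
     * run from memory, not cache.  Macro names are illustrative only. */
    #define CORES       16
    #define L2_PER_CORE (1024 * 1024)                  /* 1 MB L2 per core  */
    #define N           (4 * CORES * L2_PER_CORE / 8)  /* = 8388608 doubles */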

You can see from stream_typhoon16_simd_run* how low the numbers are: 9k to 11k MB/s for Copy, when a single core already achieves 4k MB/s.  There are eight CPUs, and even taking the snooping traffic on the HyperTransport links into consideration, the numbers cannot be this low.

The theoretical bandwidth is (2 channels * 5.3 GB/s per DDR2-667 channel) * 8 CPUs = 84.8 GB/s.

The other odd thing about the results for the original code is that the numbers for Copy are almost always lower than those for Scale and Add, and most of the time even lower than Triad, although those tests involve additional floating point operations and sometimes more arrays.  Copy should sensibly give the highest numbers, since it is the simplest test, with no floating point operations.
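
For reference, the four kernels are (serial form, as in the STREAM source):

    /* The four STREAM kernels in serial form: */
    for (j=0; j<N; j++) c[j] = a[j];              /* Copy:  2 arrays, 0 FLOPs */
    for (j=0; j<N; j++) b[j] = scalar*c[j];       /* Scale: 2 arrays, 1 FLOP  */
    for (j=0; j<N; j++) c[j] = a[j]+b[j];         /* Add:   3 arrays, 1 FLOP  */
    for (j=0; j<N; j++) a[j] = b[j]+scalar*c[j];  /* Triad: 3 arrays, 2 FLOPs */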

I have also included the results for the MPI version in the "mpi" directory.  The numbers are very consistent at 12k MB/s throughout, just slightly higher than the OpenMP version but still far below potential.

My Sun Ultra 40 M2 workstation with two Opteron 2224 SE CPUs achieves Copy and Scale numbers of up to 13k MB/s with a full KDE desktop running.  It is inconceivable that the X4600 M2 has lower total bandwidth than the Ultra 40 M2: I know the X4600 M2 can do far more work than the Ultra 40 M2, so its actual memory bandwidth must be far higher than the workstation's.

You can see that I have tried various modified versions, cglstreamtyphoon16.c to cgl7streamtyphoon16.c.

I think cgl6streamtyphoon16.c is the best way to modify the OpenMP code.

The results are in cgl6streamtyphoon16_simd_run*.

cgl6streamtyphoon16_simd_numal_run* are the results with "numactl -l", which do not look good.

I have some suggestions for improvements to the Stream OpenMP code, with the aim of:

1) Improving NUMA performance by having each core fault in its pages under the first-touch policy at its local memory node, so that in the test loops the cores access only those local pages and no foreign pages (see the sketch after this list)

2) Using one single parallel region throughout, which removes the overhead and time of repeatedly forking and joining threads and doing thread scheduling, with the added benefit that memory access is always consistent and local.
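
A minimal sketch of the first-touch idea, assuming global arrays a, b and c (illustrative only, not the literal contents of cgl6streamtyphoon16.c):

    #define N 8388608
    static double a[N], b[N], c[N];

    void first_touch_then_copy(void)
    {
        int j;
        #pragma omp parallel private(j)
        {
            /* First touch: each thread initializes, and therefore faults in,
             * the pages of the chunk that schedule(static) assigns to it. */
            #pragma omp for schedule(static)
            for (j=0; j<N; j++) {
                a[j] = 1.0;
                b[j] = 2.0;
                c[j] = 0.0;
            }
            /* The test loop uses the same schedule, so each thread touches
             * exactly the chunk it faulted in: all accesses stay node-local. */
            #pragma omp for schedule(static)
            for (j=0; j<N; j++)
                c[j] = a[j];
        }
    }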

Looking at the code in cgl6streamtyphoon16.c, you will note that:

1) There is only one parallel region.

2) Variables in the parallel region are private whenever possible.

3) The pages for the arrays are faulted by using the same consistent array access pattern from the array initialization through the various tests, with the schedule "static" specified.  The exact same array elements are then accessed by each thread in every test loop, as specified on page 43 of "OpenMP Application Program Interface Version 3.0 May 2008".

4) Within the main loop all the threads run the exact same code and make the same function calls.

5) There is a barrier before the start time is taken at the beginning of each test loop, and an implicit barrier at the end of each test loop before the end time is taken; the timing code is executed by all threads (see the sketch after this list).

6) Only the master thread executes the code to check the clock precision and do the computation for the performance results.

7) Apart from the above, the rest of the code remains unmodified.
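
To make points 3) to 6) concrete, here is a rough sketch of the timing structure for the Copy test (illustrative only; mysecond() is STREAM's wall-clock timer):

    #define N      8388608
    #define NTIMES 10
    static double a[N], c[N];
    static double copy_times[NTIMES];
    double mysecond(void);                  /* STREAM's wall-clock timer  */

    void timed_copy(void)
    {
        #pragma omp parallel                /* one region for all trials  */
        {
            int j, k;
            double t0, t1;
            for (k=0; k<NTIMES; k++) {
                #pragma omp barrier         /* line up all threads, then  */
                t0 = mysecond();            /* every thread takes a start */
                #pragma omp for schedule(static)
                for (j=0; j<N; j++)         /* implicit barrier at the    */
                    c[j] = a[j];            /* end of the work-share loop */
                t1 = mysecond();            /* every thread takes an end  */
                #pragma omp master
                copy_times[k] = t1 - t0;    /* master records the result  */
            }
        }
    }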

As seen in the results in cgl6streamtyphoon16_simd_run*, I am now getting up to 39k MB/s for Copy, 31k MB/s for Scale, 34k MB/s for Add and 22k MB/s for Triad.  I also note that the numbers are not always consistent; there is actually a big variation.  I suspect this has to do with the HyperTransport snooping traffic, and maybe the OpenMP implementation.

I do note that there is an inconsistency in the definition of "MB/s" in your code.

In the computation of the total memory required, you use 1024 bytes as the definition of 1 kB and 1024^2 bytes as the definition of 1 MB:

    printf("Total memory required = %.1f MB.\n",

        (3.0 * BytesPerWord) * ( (double) N / 1048576.0));

However, for the computation of MB/s you multiply the number of bytes by 1.0E-06:

    printf("Function      Rate (MB/s)  Avg time    Min time    Max time\n");

    for (j=0; j<4; j++) {

        avgtime[j] = avgtime[j]/(double)(NTIMES-1);

        printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],

              1.0E-06 * bytes[j]/mintime[j],

              avgtime[j],

              mintime[j],

              maxtime[j]);

As this is somewhat a computer science matter, the definition 1 kB = 1024 bytes would be better; 1 kB = 1000 bytes is what the storage industry uses to get higher-looking numbers.
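
For example, the rate line could use the same binary-MB convention as the memory-size message above (a sketch of the change, keeping everything else as-is):

    printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
          bytes[j] / 1048576.0 / mintime[j],   /* MB/s with 1 MB = 1048576 B */
          avgtime[j],
          mintime[j],
          maxtime[j]);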

What do you think of my suggestions?

The checksum of the attached "CGLss12u1.tar":

9462547dd03c0f9f0d48448d670b1a6c  CGLss12u1.tar

Thank you.

Chin Gim Leong

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
