AWS New Instance Types and STREAM Benchmarking

Post date: Jun 06, 2017 9:12:52 AM

Introduction

This is a more detailed write-up of the "AWS New Instance Types and STREAM Benchmarking Presentation" post on 25 March 2016.

I had not logged into Amazon Web Services for two years.

The new instance types are very different, with the latest hardware including Intel Haswell processors and SSD storage. The virtualization type has also changed since the last time I used AWS.

The instance types table below is copied from Amazon EC2 Instance Types documentation.

Agenda

- A brief technical introduction to the new AWS instance types.
- There is always the concern that cloud VMs might be sharing compute resources that are over subscribed and so would be slow.
- A quick way to find out the compute performance is to run the STREAM benchmark by John D. McCalpin. This gives the memory bandwidth as well as performance of some floating point operations.
- Benchmarking with STREAM to answer the question if the new instance types are suitable for memory access intensive and floating point intensive work loads.

Disclaimer

- Note that as AWS does not in general reveal the details of its infrastructure; some of the statements made here are based on educated guess work, so I may be totally incorrect.
- If in doubt always consult your favourite AWS Account Manager and Architect to seek clarifications (if they are able to tell you).
- The screen shots are taken from AWS documentation (without permission, I do not claim copyright, and hopefully nobody minds).
- The AWS details and documentation are taken from March 2016.

AWS New Instance Types

The following are details of AWS new instance types taken from screen shots of the AWS documentation:

Virtualization Type

Paravirtualization (PV) used to be the default virtualization type in AWS for non-Windows and non-GPU instances. PV is fast and integrated with the Xen hypervisor and so is the recommended virtualization type.

In the new AWS instances, the virtualization type is the so called HVM, which is Hardware-assisted virtualizion. HVM gives access to new Intel AVX and AVX2 instructions and Enhanced Networking, which gives higher performance.

From what I observe of the storage and Ethernet kernel modules, it would be more accurate to describe the virtualization type as PVHVM, which is HVM with special paravirtual device drivers.

The block storage uses a paravirtual kernel module.

Network kernel module is Intel ixgbevf, which works through SR-IOV. That means the VM has direct access to the 10 Gigabit Ethernet hardware.

EBS SSD Storage Performance Profiles

The storage performance profiles of EBS General Purpose SSD volumes is not something that is well known. We are talking about a true Software Defined Storage QoS here, with concepts like Baseline IOPS, I/O Credit Balance, I/O Credit Accumulation, Maximum Burst IOPS, Throughput Limits.

One needs to carefully read the AWS documentation to understand these concepts. One also needs to understand the AWS definition of I/O size. I am blown away by the ability of AWS to do real time metering and control of VM I/O.

The following are details of AWS EBS General Purpose SSD I/O profiles taken from screen shots of the AWS documentation:

Test VM Details

All instances were run in US West (Oregon) as this region has the lowest pricing.

All instances are Amazon Machine Image (AMI) SUSE Linux Enterprise Server 12 SP1 (HVM), SSD Volume Type - ami-d2627db3.

The instances were run in us-west-2a, in my VPC, with one network interface.

The root disk is a 30GiB EBS General Purpose SSD (GP2). As the so-called IOPS is based on 3 IOPS per GiB, I provisioned 30 GiB so as to get a nominal 90 IOPS. The same EBS root disk was re-used for all benchmarked instances so all software installations were only done once.

STREAM Benchmarking

I wanted to know the compute performance of the new instances. There is always the concern that cloud VMs might be sharing compute resources that are over subscribed and so would be slow.

A quick way to find out the compute performance is to run the STREAM benchmark by John D. McCalpin. This gives the memory bandwidth as well as performance of some floating point operations.

Benchmarking with STREAM answers the question if the new instance types are suitable for memory access intensive and floating point intensive work loads.

STREAM was created by John D. McCalpin. The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

The following is the output:

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 4000000000 (elements), Offset = 0 (elements)

Memory per array = 30517.6 MiB (= 29.8 GiB).

Total memory required = 91552.7 MiB (= 89.4 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 20

Number of Threads counted = 20

-------------------------------------------------------------

Your clock granularity/precision appears to be 1 microseconds.

Each test below will take on the order of 793720 microseconds.

(= 793720 clock ticks)

Increase the size of the arrays if this shows that

you are not getting at least 20 clock ticks per test.

-------------------------------------------------------------

WARNING -- The above is only a rough guideline.

For best results, please be sure you know the

precision of your system timer.

-------------------------------------------------------------

Function Best Rate MB/s Avg time Min time Max time

Copy: 57490.6 1.117605 1.113225 1.125540

Scale: 57499.7 1.117975 1.113050 1.126502

Add: 65773.2 1.461796 1.459562 1.465515

Triad: 65489.6 1.472524 1.465881 1.491398

-------------------------------------------------------------

Solution Validates: avg error less than 1.000000e-13 on all three arrays

STREAM runs loop of 10 iterations to get best score of the following measurements, and is parallelized using OpenMP constructs:

Function Copy

#pragma omp parallel for

for (j=0; j<STREAM_ARRAY_SIZE; j++)

c[j] = a[j];

Function Scale

#pragma omp parallel for

for (j=0; j<STREAM_ARRAY_SIZE; j++)

b[j] = scalar*c[j];

Function Add

#pragma omp parallel for

for (j=0; j<STREAM_ARRAY_SIZE; j++)

c[j] = a[j]+b[j];

Function Triad

#pragma omp parallel for

for (j=0; j<STREAM_ARRAY_SIZE; j++)

a[j] = b[j]+scalar*c[j];

STREAM version 5.10 was used. There is a STREAM_ARRAY_SIZE definition that has to be customized for each AWS instance benchmarked. The number of OpenMP threads has to selected for the tests.

Additional Test VM Details

I updated all instances using SUSE Online Update.

The kernel is Linux version 3.12.51-60.20-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux) ) #1 SMP Fri Dec 11 12:01:38 UTC 2015 (1ca22d2).

Two compilers were installed and used to compile STREAM version 5.10 for the benchmarking runs:

GCC (gcc version 5.2.1 20150721 [gcc-5-branch revision 226027] ) installed from the SLES repository,

Oracle Solaris Studio 12.4, installed using RPM packages and unpatched as I have no support subscription.

The compiler command and options used to compile with gcc:

gcc-5 -v -fopenmp -Ofast -march=native -mtune=native -mcmodel=medium stream-80GB.c -o stream-80GB_gcc-m4.10xlarge

The compiler command and options used to compile with Oracle Solaris Studio:

cc -# -fast -xopenmp -xvector=simd -xmodel=medium stream-80GB.c -o stream-80GB_m4.10xlarge

Cron was disabled. The default image does not have X server installed.

The file system is the image default ext4. There is no device swap allocated in the AWS image.

The OpenMP environment variables are:

OMP_NUM_THREADS=numthreads

OMP_SCHEDULE=static

OMP_DYNAMIC=false

OMP_PROC_BIND=true

All tests were run 10 times, with the maximum result number during the 10 runs for each function reported in the tables below.

Notes

I have tested with more than half the memory of the instances to eliminate CPU cache effects. Do note the memory size used when comparing with other results.

Intel Xeon Haswell E5 has four memory channels per processor. There are two processors per server.

Memory used is DDR4. If DDR4 2133 MHz DIMMs, which has a peak transfer rate of 17,067 MB/s, are used, one processor has maximum theoretical bandwidth of about 68 GB/s, and two processors have maximum theoretical bandwidth of about 176 GB/s.

Results

The results of the various instance types are given below along with the review and comments.

m4.xlarge

4 vCPU 16 GiB Memory Intel Xeon E5-2676 v3 2.4 GHz

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 400000000 (elements), Offset = 0 (elements)

Memory per array = 3051.8 MiB (= 3.0 GiB).

Total memory required = 9155.3 MiB (= 8.9 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 4

Number of Threads counted = 4

-------------------------------------------------------------

1 Processor 2 Cores 4 logical CPUs

4 threads

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 800000000 (elements), Offset = 0 (elements)

Memory per array = 6103.5 MiB (= 6.0 GiB).

Total memory required = 18310.5 MiB (= 17.9 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 8

Number of Threads counted = 8

-------------------------------------------------------------

1 Processor 4 Cores 8 logical CPUs

8 threads

8 threads was found to give higher results than 4 threads

m4.2xlarge

8 vCPU 32 GiB Memory Intel Xeon E5-2676 v3 2.4 GHz

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 1600000000 (elements), Offset = 0 (elements)

Memory per array = 12207.0 MiB (= 11.9 GiB).

Total memory required = 36621.1 MiB (= 35.8 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 16

Number of Threads counted = 16

-------------------------------------------------------------

1 Processor 8 Cores 16 logical CPUs

16 threads

8 threads was found to give higher results than 16 threads except for Triad

m4.4xlarge

16 vCPU 64 GiB Memory Intel Xeon E5-2676 v3 2.4 GHz

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 4000000000 (elements), Offset = 0 (elements)

Memory per array = 30517.6 MiB (= 29.8 GiB).

Total memory required = 91552.7 MiB (= 89.4 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 20

Number of Threads counted = 20

-------------------------------------------------------------

2 NUMA Processors 10 Cores each 40 logical CPUs total

20 threads

20 threads was found to give higher results than 40 threads

Intel P-state driver controlled core frequencies

“performance” governor was used during benchmarking

m4.10xlarge

40 vCPU 160 GiB Memory Intel Xeon E5-2676 v3 2.4 GHz

- m4.10xlarge instance has 20 CPU cores out of a total of 24
- m4.10xlarge allows user to control CPU P-state and C-state
- 10 Gigabit Ethernet
- m4.10xlarge is an instance in a de facto dedicated server with no sharing of resources with other instances
- Very useful when it is desirable to have no time sharing of CPU and memory with other instances for maximum performance

c4.large

2 vCPU 3.75 GiB Memory Intel Xeon E5-2666 v3 2.9 GHz

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 100000000 (elements), Offset = 0 (elements)

Memory per array = 762.9 MiB (= 0.7 GiB).

Total memory required = 2288.8 MiB (= 2.2 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 2

Number of Threads counted = 2

-------------------------------------------------------------

1 Processor 1 Core 2 logical CPUs

2 threads

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 200000000 (elements), Offset = 0 (elements)

Memory per array = 1525.9 MiB (= 1.5 GiB).

Total memory required = 4577.6 MiB (= 4.5 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 4

Number of Threads counted = 4

-------------------------------------------------------------

1 Processor 2 Cores 4 logical CPUs

4 threads

4 threads was found to give higher results than 2 threads

c4.xlarge

4 vCPU 7.5 GiB Memory Intel Xeon E5-2666 v3 2.9 GHz

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 400000000 (elements), Offset = 0 (elements)

Memory per array = 3051.8 MiB (= 3.0 GiB).

Total memory required = 9155.3 MiB (= 8.9 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 8

Number of Threads counted = 8

-------------------------------------------------------------

1 Processor 4 Cores 8 logical CPUs

8 threads

8 threads was found to give higher results than 4 threads

c4.2xlarge

8 vCPU 15 GiB Memory Intel Xeon E5-2666 v3 2.9 GHz

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 800000000 (elements), Offset = 0 (elements)

Memory per array = 6103.5 MiB (= 6.0 GiB).

Total memory required = 18310.5 MiB (= 17.9 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 8

Number of Threads counted = 8

-------------------------------------------------------------

1 Processor 8 Cores 16 logical CPUs

8 threads

8 threads was found to give higher results than 16 threads

c4.4xlarge

16 vCPU 30 GiB Memory Intel Xeon E5-2666 v3 2.9 GHz

-------------------------------------------------------------

STREAM version $Revision: 5.10 $

-------------------------------------------------------------

This system uses 8 bytes per array element.

-------------------------------------------------------------

Array size = 1600000000 (elements), Offset = 0 (elements)

Memory per array = 12207.0 MiB (= 11.9 GiB).

Total memory required = 36621.1 MiB (= 35.8 GiB).

Each kernel will be executed 10 times.

The *best* time for each kernel (excluding the first iteration)

will be used to compute the reported bandwidth.

-------------------------------------------------------------

Number of Threads requested = 18

Number of Threads counted = 18

-------------------------------------------------------------

2 NUMA Processors 9 Cores each 36 logical CPUs total

18 threads

18 threads was found to give higher results than 36 threads

Intel P-state driver controlled core frequencies

“performance” governor was used during benchmarking

c4.8xlarge

36 vCPU 60 GiB Memory Intel Xeon E5-2666 v3 2.9 GHz

- c4.8xlarge instance has 18 CPU cores out of a total of 20
- c4.8xlarge allows user to control CPU P-state and C-state
- 10 Gigabit Ethernet
- c4.8xlarge is an instance in a de facto dedicated server with no sharing of resources with other instances
- Very useful when it is desirable to have no time sharing of CPU and memory with other instances for maximum performance

i2.8xlarge

- 32 vCPU, 244 GiB Memory, 8x 800 GB SSDs, 10 Gigabit Ethernet, Intel Xeon E5-2670 v2 2.5 GHz
- Lots of CPU cores and memory, plus local system SSDs for storage low latency applications

d2.8xlarge

- 36 vCPU, 244 GiB Memory, 24x 2000 GB hard drives, 10 Gigabit Ethernet, Intel Xeon E5-2676 v3 2.4 GHz
- Lots of CPU cores and memory, plus big number of local system hard drives for storage applications

When I did online update of SLES 12 SP1, the instance failed to boot after rebooting. I had the exact same problem after deleting the instance and creating a new instance, then doing the update. I suspect the disk device names got changed upon reboot after the kernel update, which is why the reboot failed.

AWS does not have interactive VM console where one can actually do recovery actions. Do remember to create snapshot of your EBS system disk before doing OS update to enable recovery action.

Conclusions

- I was blown away by the STREAM results of the new AWS M4 and C4 instances.
- It is possible to run memory access intensive and floating point intensive applications on the new AWS M4 and C4 instances.
- Note that a vCPU is defined as an Intel Hyperthread. Do testing to check whether running more or fewer threads gives higher performance for a specific application. In case of m4.10xlarge and c4.8xlarge instances which are sole instances in de facto dedicated servers, running STREAM with the same number of threads as physical CPU cores gives best results.

Note that NUMA effects should be taken into consideration for m4.10xlarge and c4.8xlarge as there are two NUMA nodes.
NUMA nodes are hidden by the virtualization for the other instance types.
M4 instances has more than double the memory of the C4 instances but C4 instances have higher memory bandwidth and faster floating point performance.
STREAM compiled with Oracle Solaris Studio 12.4 produces OpenMP parallelized runtime binaries with higher performance, due to lower overheads of parallelization, compared to GCC 5.2.1.

Caveats

The following notes about the results should be taken into consideration:

I ran the benchmark ten times successively in each AWS instance tested. All tests were done in US West (Oregon) region only. I did not repeat the tests at different hours or in different regions.

I do not have visibility about other time sharing loads on the servers hosting my instances during the tests. I did my tests during Singapore hours; the US West region is mainly for serving US customers. Moreover, the tests were done from December 30 2015 to 1 January 2016. This is the holiday period in the US when most people are away from work. It could be possible that during the tests there were little or even no other loads running on the servers. If that were the case, the results I obtained might not be reproducible when there are other instances with high loads time sharing the same servers.

To get a true picture for your production usage, test the performance of your instances in the required regions, and during hours of your production usage. Repeat the tests at different hours, and also on different days.

AWSInstSTREAMDetails ‎(Responses)‎

Page updated

Report abuse