These are a few suggestions for getting the best performance on the Intel Knights Landing (KNL).
Bind the memory allocation to the MCDRAM NUMA node
The KNL has two memory systems: DDR4 (~90 GB/s) and the high-bandwidth MCDRAM (~400 GB/s). Each of the two memory systems is attached to a different NUMA node.
On a KNL node, the command `numactl --hardware` reports which NUMA node is connected to the faster MCDRAM.
A typical report looks like this
node 0 size: 98178 MB
node 0 free: 92899 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15926 MB
In this case node 1 is the 16 GB MCDRAM: it has no CPUs attached and its size is 16384 MB (this is the typical situation on KNLs).
To bind the memory allocation to NUMA node 1 use
numactl --membind 1 ./your-executable
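The node lookup described above can be scripted. The following is a minimal sketch, assuming the MCDRAM shows up as the only NUMA node with an empty `cpus:` list, as in the report above. The function name `find_cpuless_nodes` is hypothetical and the sample text is copied from that report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IDs of NUMA nodes that have no CPUs
# attached, reading `numactl --hardware` style output from standard input.
# A line such as "node 1 cpus:" has exactly three fields when the CPU list
# is empty, while "node 0 cpus: 0 1 2 ..." has more.
find_cpuless_nodes() {
    awk '$1 == "node" && $3 == "cpus:" && NF == 3 { print $2 }'
}

# Sample lines copied from the report above (assumption: MCDRAM is the
# CPU-less node).
sample_report='node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15926 MB'

mcdram_node=$(printf '%s\n' "$sample_report" | find_cpuless_nodes)
echo "MCDRAM candidate node: ${mcdram_node}"
```

On a real KNL node the same function could be fed with `numactl --hardware | find_cpuless_nodes`, and the result passed to `numactl --membind`. It is worth checking the reported node size by hand before relying on it.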
Controlling threading
The number of threads can be set in GRID at runtime by the flag
--threads <#threads>
Finer control can be achieved using the environment variable `KMP_HW_SUBSETS` (or the deprecated `KMP_PLACE_THREADS`).
From the Intel developer guide:
The KMP_HW_SUBSETS variable controls the hardware resource that will be used by the program. This variable specifies the number of sockets to use, how many cores to use per socket and how many threads to assign per core. For example, on Intel® Xeon Phi™ coprocessors, while each coprocessor can take up to four threads, specifying fewer than four threads per core may result in a better performance. While specifying two threads per core often yields better performance than one thread per core, specifying three or four threads per core may or may not improve the performance. This variable enables you to conveniently measure the performance of up to four threads per core.
A typical setting for the best performance on a single node is to use 62 cores with 1 thread per core. In the bash shell this is set by
export KMP_HW_SUBSETS=62c,1t
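Since the best number of threads per core depends on the workload, it can help to list the candidate settings before benchmarking them. The sketch below only prints the settings for 1 to 4 threads per core. The helper names `hw_subsets` and `print_sweep` are hypothetical, and pairing each setting with `--threads` equal to cores times threads per core is an assumption, not something stated by the GRID documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format a KMP_HW_SUBSETS value such as "62c,1t"
# from a core count and a threads-per-core count.
hw_subsets() {
    printf '%sc,%st' "$1" "$2"
}

# Hypothetical helper: print one candidate setting per line for
# 1 to 4 threads per core on the given number of cores.
# Assumption: --threads is set to the total thread count (cores * threads).
print_sweep() {
    local cores=$1 t
    for t in 1 2 3 4; do
        echo "KMP_HW_SUBSETS=$(hw_subsets "$cores" "$t") --threads $((cores * t))"
    done
}

print_sweep 62
```

Each printed line can then be turned into an actual run, for example by exporting the variable and launching the executable under `numactl --membind 1` with the matching `--threads` value.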
Using the optimised Wilson Dslash kernels
Besides the generic implementation using stencils, GRID has optimised versions of the Dslash kernels (for Wilson and DWF fermions).
The optimised paths are selected with runtime flags:
| Flag | Description |
|---|---|
| --dslash-generic | This is the default option and uses the implementation with stencils |
| --dslash-unroll | This explicitly unrolls the colour loops. It is tied to Nc=3 |
| --dslash-asm | This is specific to AVX512-F architectures and Nc=3 |
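The flags in the table can be combined with the memory binding and threading settings from the earlier sections. The following sketch only assembles and prints such a command line. The helper `dslash_flag` is hypothetical, `./your-executable` is the same placeholder used above, and the values `--membind 1` and `--threads 62` are taken from the examples on this page rather than being universally correct.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a kernel name to one of the three flags listed
# in the table, rejecting anything else.
dslash_flag() {
    case "$1" in
        generic|unroll|asm) printf -- '--dslash-%s' "$1" ;;
        *) echo "unknown Dslash kernel: $1" >&2; return 1 ;;
    esac
}

flag=$(dslash_flag asm) || exit 1

# Print the full command instead of running it, so the sketch works on any
# machine; node 1 and 62 threads are assumptions taken from this page.
echo "numactl --membind 1 ./your-executable --threads 62 ${flag}"
```

Printing the command first makes it easy to check the combination before launching a long benchmark.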
The information on this page was last updated in November 2016 and is valid for release version 0.6.0.