Lattice objects may be constructed to contain the local portion of a distributed array of any tensor type. For performance reasons the tensor type uses a vectorised Real or Complex type as the fundamental datum.

Every lattice requires a GridBase object pointer to be provided in its constructor. Memory is allocated at construction time. If a Lattice is passed a RedBlack grid, it allocates half the storage of the full grid, and may store either the red or the black checkerboard. The Lattice object automatically tracks, through assignments, which checkerboard it refers to. For example, shifting an Even checkerboard field by an odd distance produces an Odd result field.
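
A minimal construction sketch follows, assuming the GridCartesian and GridRedBlackCartesian grid classes and Grid's default layout helpers:

    Coordinate latt = GridDefaultLatt();                         // global lattice size
    Coordinate simd = GridDefaultSimd(Nd, vComplexD::Nsimd());   // SIMD decomposition
    Coordinate mpi  = GridDefaultMpi();                          // MPI decomposition

    GridCartesian         Grid(latt, simd, mpi);   // full grid
    GridRedBlackCartesian RBGrid(&Grid);           // red-black half grid

    LatticeFermionD psi(&Grid);       // allocates the full local volume
    LatticeFermionD psi_e(&RBGrid);   // allocates half the storage; tracks its checkerboard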

Struct-of-array objects are defined, and are used as the template parameter to the Lattice class.

Example (lib/qcd/QCD.h):

       template<typename vtype> using iSpinMatrix = iScalar<iMatrix<iScalar<vtype>, Ns> >;
       typedef iSpinMatrix<ComplexF>           SpinMatrixF; //scalar
       typedef iSpinMatrix<vComplexF>          vSpinMatrixF;//vectorised
       typedef Lattice<vSpinMatrixF>           LatticeSpinMatrixF;

The full range of QCD relevant lattice objects is given below.

  Lattice  Lorentz  Spin    Colour  scalar_type  Field                         Synonym
  -------  -------  ------  ------  -----------  ----------------------------  -------------------
  Vector   Scalar   Scalar  Scalar  RealD        LatticeRealD                  N/A
  Vector   Scalar   Scalar  Scalar  ComplexD     LatticeComplexD               N/A
  Vector   Scalar   Scalar  Matrix  ComplexD     LatticeColourMatrixD          LatticeGaugeLink
  Vector   Vector   Scalar  Matrix  ComplexD     LatticeLorentzColourMatrixD   LatticeGaugeFieldD
  Vector   Scalar   Vector  Vector  ComplexD     LatticeSpinColourVectorD      LatticeFermionD
  Vector   Scalar   Vector  Vector  ComplexD     LatticeHalfSpinColourVectorD  LatticeHalfFermionD
  Vector   Scalar   Matrix  Matrix  ComplexD     LatticeSpinColourMatrixD      LatticePropagatorD

Additional single precision variants are defined with the suffix F. Other lattice objects can be defined using the sort of typedefs shown above if needed.

Opaque containers

The layout within the container is complicated in order to maximise the opportunity for vectorisation, and is opaque from the point of view of the API. The key implementation observation is that, so long as data parallel operations are performed and adjacent SIMD lanes correspond to well separated lattice sites, identical operations are performed on all SIMD lanes and good vectorisation results.

Because the layout is opaque, import and export routines from naturally ordered x,y,z,t arrays are provided (lib/lattice/Lattice_transfer.h):

    unvectorizeToLexOrdArray(std::vector<sobj> &out, const Lattice<vobj> &in);
    vectorizeFromLexOrdArray(std::vector<sobj> &in , Lattice<vobj> &out);
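
A minimal round-trip sketch; we assume the scalar buffer is sized to the local volume, grid.lSites(), and that sobj is the scalar counterpart of the vectorised vobj:

    LatticeSpinMatrixF U(&grid);
    std::vector<SpinMatrixF> buf(grid.lSites());   // one scalar object per local site

    unvectorizeToLexOrdArray(buf, U);   // export: opaque layout -> lexicographic order
    vectorizeFromLexOrdArray(buf, U);   // import: lexicographic order -> opaque layout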

The Lexicographic order of data in the external vector fields is defined by (lib/util/Lexicographic.h):

    Lexicographic::IndexFromCoor(const Coordinate &lcoor, int &lex,Coordinate *local_dims);

This ordering is x-fastest (the first coordinate varies most rapidly): lex = x + Lx*(y + Ly*(z + Lz*t)), where (Lx, Ly, Lz, Lt) are the local dimensions.

Peek and poke routines are provided to perform single site operations. These operations are extremely low performance and are not intended for algorithm development or performance critical code.

The following are “collective” operations and involve communication between nodes. All nodes receive the same result by broadcast from the owning node:

    void peekSite(sobj &s,const Lattice<vobj> &l,const Coordinate &site);
    void pokeSite(const sobj &s,Lattice<vobj> &l,const Coordinate &site);

The following are executed independently by each node:

    void peekLocalSite(sobj &s,const Lattice<vobj> &l,Coordinate &site);
    void pokeLocalSite(const sobj &s,Lattice<vobj> &l,Coordinate &site);
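
A minimal sketch of single-site access, assuming a global coordinate within the lattice extent:

    LatticeFermionD psi(&grid);
    SpinColourVectorD sc;
    Coordinate site({0, 1, 2, 3});   // hypothetical global coordinate

    peekSite(sc, psi, site);   // collective: every node receives the value at (0,1,2,3)
    pokeSite(sc, psi, site);   // collective: the value is written at (0,1,2,3)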

Lattices of one tensor type may be transformed into lattices of another tensor type by peeking and poking specific indices in a data parallel manner:

    template<int Index,class vobj> // Vector data parallel index peek
    auto PeekIndex(const Lattice<vobj> &lhs,int i);

    template<int Index,class vobj> // Matrix data parallel index peek
    auto PeekIndex(const Lattice<vobj> &lhs,int i,int j);

    template<int Index,class vobj>   // Vector poke
    void PokeIndex(Lattice<vobj> &lhs,const Lattice<> &rhs,int i);

    template<int Index,class vobj>   // Matrix poke
    void PokeIndex(Lattice<vobj> &lhs,const Lattice<> &rhs,int i,int j);

The elided template parameter of the rhs Lattice is the tensor type obtained by removing the Index-th level from vobj, i.e. the type returned by the corresponding PeekIndex.
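
For example, the individual directions of a gauge field can be peeked and poked at the Lorentz index level; a minimal sketch, assuming mu is a valid direction:

    LatticeGaugeFieldD   Umu(&grid);   // Lorentz vector of colour matrices
    LatticeColourMatrixD U(&grid);

    U = PeekIndex<LorentzIndex>(Umu, mu);   // extract the mu-th gauge link field
    PokeIndex<LorentzIndex>(Umu, U, mu);    // write it back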

The inconsistent capitalisation of the letter P is due to an obscure bug in g++ that, to our knowledge, has not been fixed in any version. The bug was reported in 2016.

Global Reduction operations

Reduction operations for any lattice field are provided. The result is identical on each computing node that is part of the relevant Grid communicator:

  template<class vobj> 
  RealD norm2(const Lattice<vobj> &arg);

  template<class vobj> 
  ComplexD innerProduct(const Lattice<vobj> &left,const Lattice<vobj> &right);

  template<class vobj> 
  vobj sum(const Lattice<vobj> &arg);
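
A typical solver iteration uses these to form residual norms and projections; a minimal sketch, assuming conformable fields phi and psi:

    RealD    n2 = norm2(psi);              // sum over all sites of |psi(x)|^2
    ComplexD ip = innerProduct(phi, psi);  // sum over all sites of phi(x)^dag psi(x)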

Site local reduction operations

Internal indices may be reduced, site by site, using the following routines:

  template<class vobj>
  auto localNorm2 (const Lattice<vobj> &rhs);

  template<class vobj>
  auto localInnerProduct (const Lattice<vobj> &lhs,const Lattice<vobj> &rhs);
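
These return a lattice of the site-reduced scalar type; a minimal sketch, assuming conformable fields phi and psi:

    auto n2 = localNorm2(psi);              // site-wise |psi(x)|^2
    auto ip = localInnerProduct(phi, psi);  // site-wise phi(x)^dag psi(x)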

Outer product

A site local outer product is defined:

  template<class ll,class rr>
  auto outerProduct (const Lattice<ll> &lhs,const Lattice<rr> &rhs);

Slice operations

Slice operations are defined to operate on one lower dimension than the full lattice. The omitted dimension is the parameter orthogdim:

  template<class vobj> 
  void sliceSum(const Lattice<vobj> &Data,
                std::vector<typename vobj::scalar_object> &result,
                int orthogdim);

  template<class vobj> 
  void sliceInnerProductVector(std::vector<ComplexD> &result, 
                               const Lattice<vobj> &lhs,
                               const Lattice<vobj> &rhs,
                               int orthogdim); 

  template<class vobj> 
  void sliceNorm(std::vector<RealD> &sn,
                 const Lattice<vobj> &rhs,
                 int orthogdim);
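
A common use of sliceSum is projecting a correlation function onto time slices; a minimal sketch, assuming the time direction is dimension 3:

    LatticePropagatorD q(&grid);
    std::vector<SpinColourMatrixD> Ct;

    sliceSum(q, Ct, 3);   // Ct[t] = sum of q(x) over the time slice x_3 = t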

Data parallel expression template engine

The standard arithmetic operators and some data parallel library functions are implemented site by site on lattice types.

Operations may only ever combine lattice objects that have been constructed from the same grid pointer.

Example:

    LatticeFermionD A(&grid);
    LatticeFermionD B(&grid);
    LatticeFermionD C(&grid);
    
    A = B - C;

Such operations are said to be conformable: the lattices are guaranteed to have the same dimensions and the same MPI and SIMD decomposition because they are based on the same grid object. The conformability check is lightweight and simply requires that the same grid pointer be passed to the lattice objects. The data members of the grid objects are not compared.

Conformable lattice fields may be combined with appropriate scalar types in expressions. The implemented rules follow those already documented for the tensor types.
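
Continuing the example above, scalar coefficients broadcast site-wise when combined with lattice fields (a and b are hypothetical coefficients):

    RealD a = 0.5;
    RealD b = 2.0;
    A = a * B + b * C;   // each scalar multiplies every site of its lattice operand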

Unary operators and functions

The following sitewise unary operations are defined:

  Operation        Description
  ---------------  ----------------------------------------------------------
  operator-        negate
  adj              Hermitian conjugate
  conjugate        complex conjugate
  trace            sitewise trace
  transpose        sitewise transpose
  Ta               take traceless anti-Hermitian part
  ProjectOnGroup   reunitarise or orthogonalise
  real             take the real part
  imag             take the imaginary part
  toReal           demote complex to real
  toComplex        promote real to complex
  timesI           elementwise multiplication by +i (zero entries are not multiplied)
  timesMinusI      elementwise multiplication by -i (zero entries are not multiplied)
  abs              elementwise absolute value
  sqrt             elementwise square root
  rsqrt            elementwise reciprocal square root
  sin              elementwise sine
  cos              elementwise cosine
  asin             elementwise inverse sine
  acos             elementwise inverse cosine
  log              elementwise logarithm
  exp              elementwise exponentiation
  operator!        logical negation of an Integer field
  Not              logical negation of an Integer field

The following sitewise functions taking additional parameters are also provided:

  template<class obj> Lattice<obj> pow(const Lattice<obj> &rhs_i,RealD y);

  template<class obj> Lattice<obj> mod(const Lattice<obj> &rhs_i,Integer y);

  template<class obj> Lattice<obj> div(const Lattice<obj> &rhs_i,Integer y);

  template<class obj> Lattice<obj> 
  expMat(const Lattice<obj> &rhs_i, RealD alpha, Integer Nexp = DEFAULT_MAT_EXP);
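
For example, expMat can be used to exponentiate a colour-matrix field; a minimal sketch, assuming A holds anti-Hermitian matrices so that U is unitary:

    LatticeColourMatrixD A(&grid);
    LatticeColourMatrixD U(&grid);

    U = expMat(A, 1.0);   // U(x) = exp(1.0 * A(x)), truncated at the default series order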

Binary operators

The following binary operators are defined:

  operator+
  operator-
  operator*
  operator/

Logical operators are defined on LatticeInteger types:

  operator&
  operator|
  operator&&
  operator||

Ternary operator, logical operations and where

Within the data parallel level of the API, the only way to perform operations that differ between sites is to use predicated execution.

The predicate takes the form of a LatticeInteger which is conformable with both the iftrue and iffalse arguments:

  template<class vobj,class iobj> Lattice<vobj> where(const Lattice<iobj> &pred,
                                                      Lattice<vobj> &iftrue,
                                                      Lattice<vobj> &iffalse);

This plays the data parallel analogue of the C++ ternary operator:

     a = b ? c : d;

In order to create the predicate in a coordinate dependent fashion it is often useful to use the lattice coordinates.

The LatticeCoordinate function:

    template<class iobj> void LatticeCoordinate(Lattice<iobj> &coor,int dir);

fills an Integer field with the value of the dir-th coordinate at each site. A usage example is given below.

Example:

        int dir   = 3;
        int block = 4;
        LatticeInteger coor(FineGrid);

        LatticeCoordinate(coor, dir);

        // sites with (coor mod block) == block-1 receive x, all others receive z
        result = where(mod(coor,block)==(block-1), x, z);

(Other usage cases of LatticeCoordinate include the generation of plane wave momentum phases.)
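
A minimal sketch of such a phase field, assuming a momentum vector p in lattice units and global extents latt[mu]:

    LatticeComplexD coor(&grid);
    LatticeComplexD phase(&grid);
    phase = Zero();

    std::vector<int> p({1, 0, 0});   // hypothetical momentum in lattice units
    for (int mu = 0; mu < 3; mu++) {
      RealD TwoPiL = M_PI * 2.0 / latt[mu];
      LatticeCoordinate(coor, mu);
      phase = phase + (TwoPiL * p[mu]) * coor;
    }
    phase = exp(timesI(phase));      // site-wise exp( i p . x )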

Site local fused operations

The biggest limitation of expression template engines is that the optimisation visibility is a single assignment statement in the original source code.

There is no scope for loop fusion between multiple statements. Multi-loop fusion gives scope for greater cache locality.

Two primitives for hardware-aware parallel loops are provided. These operate directly on the site objects, which are expanded by a factor of the vector length (in our struct-of-array datatypes).

Since the mapping of sites to data lanes is opaque, these vectorised loops are only appropriate for optimisation of site local operations.

View objects

Due to an obscure aspect of the way that Nvidia handles device C++11 lambda functions, it is necessary to disable direct indexing of a Lattice object.

Rather, a reference to a lattice object must first be obtained. The reference is copyable to a GPU, and can be indexed by either accelerator code or host code.

In order to prevent people developing code that dereferences Lattice objects in a way that works when compiled for CPU but fails when compiled for GPU, we have removed the ability to index a lattice object directly in CPU code.

As a result of Nvidia’s constraints, all accesses to lattice objects are required to be made through a View object.

In the following, the type is LatticeView<vobj>, however it is wise to use the C++11 auto keyword to avoid naming the type. See code examples below.

thread_loops

The first parallel primitive is the thread_loop:

Example:

  LatticeField r(grid); 
  LatticeField x(grid);
  LatticeField p(grid); 
  LatticeField mmp(grid);
  auto r_v = r.View();  
  auto x_v = x.View();
  auto p_v = p.View(); 
  auto mmp_v = mmp.View();
  // a and b are scalar coefficients, e.g. from a CG iteration (assumed defined)
  thread_loop(s , r_v, {
    r_v[s] = r_v[s]   - a * mmp_v[s];
    x_v[s] = x_v[s]   + a*p_v[s];
    p_v[s] = p_v[s]*b + r_v[s];
  });

accelerator_loops

The second parallel primitive is the accelerator_loop:

Example:

  LatticeField r(grid); 
  LatticeField x(grid);
  LatticeField p(grid); 
  LatticeField mmp(grid);
  auto r_v = r.View();  
  auto x_v = x.View();
  auto p_v = p.View(); 
  auto mmp_v = mmp.View();
  accelerator_loop(s , r_v, {
    r_v[s] = r_v[s]   - a * mmp_v[s];
    x_v[s] = x_v[s]   + a*p_v[s];
    p_v[s] = p_v[s]*b + r_v[s];
  });

Cshift

Site shifting operations are provided using the Cshift function:

  template<class vobj> 
  Lattice<vobj> Cshift(const Lattice<vobj> &rhs,int dimension,int shift);

This shifts the whole lattice field by any distance shift in the appropriate dimension.

For the avoidance of doubt on direction conventions, a positive shift moves the contents of site x+shift in the rhs to site x in the result; that is, result(x) = rhs(x+shift), with periodic wrap-around.

Example (benchmarks/Benchmark_wilson.cc):

  { // Naive Wilson implementation
    ref = Zero();
    for(int mu=0;mu<Nd;mu++){

      // Forward hop: (1 - gamma_mu) U_mu(x) psi(x+mu)
      tmp = U[mu]*Cshift(src,mu,1);
      {
        auto ref_v = ref.View();
        auto tmp_v = tmp.View();
        for(int i=0;i<ref_v.size();i++){
          ref_v[i] += tmp_v[i] - Gamma(Gmu[mu])*tmp_v[i];
        }
      }

      // Backward hop: (1 + gamma_mu) U_mu(x-mu)^dag psi(x-mu)
      tmp = adj(U[mu])*src;
      tmp = Cshift(tmp,mu,-1);
      {
        auto ref_v = ref.View();
        auto tmp_v = tmp.View();
        for(int i=0;i<ref_v.size();i++){
          ref_v[i] += tmp_v[i] + Gamma(Gmu[mu])*tmp_v[i];
        }
      }
    }
  }

CovariantCshift

Covariant Cshift operations are provided for the common cases of boundary conditions. These may be further optimised in future:

  template<class covariant,class gauge> 
  Lattice<covariant> CovShiftForward(const Lattice<gauge> &Link, int mu,
                                     const Lattice<covariant> &field);

  template<class covariant,class gauge> 
  Lattice<covariant> CovShiftBackward(const Lattice<gauge> &Link, int mu,
                                      const Lattice<covariant> &field);
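
In the periodic case these reduce to a link multiplication combined with a Cshift, consistent with the naive Wilson example above; a sketch for orientation:

    // PeriodicBC::CovShiftForward(Link,mu,field)  == Link * Cshift(field, mu, +1)
    // PeriodicBC::CovShiftBackward(Link,mu,field) == Cshift(adj(Link) * field, mu, -1)
    LatticeFermionD chi = PeriodicBC::CovShiftForward(U[mu], mu, psi);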

Boundary conditions

The covariant shift routines are provided in the namespaces PeriodicBC and ConjugateBC. The correct covariant shift for the boundary condition is passed into the gauge actions and Wilson loops via an “Impl” template policy class.

The relevant staples, plaquettes, and loops are formed by using the provided method:

    Impl::CovShiftForward
    Impl::CovShiftBackward

etc… This makes physics code transform appropriately with externally supplied rules about treating the boundary.

Example (lib/qcd/util/WilsonLoops.h):

  static void dirPlaquette(GaugeMat &plaq, const std::vector<GaugeMat> &U,
                           const int mu, const int nu) {
    // ___
    //|   |
    //|<__|
    plaq = Gimpl::CovShiftForward(U[mu],mu,
             Gimpl::CovShiftForward(U[nu],nu,
               Gimpl::CovShiftBackward(U[mu],mu,
                 Gimpl::CovShiftIdentityBackward(U[nu], nu))));
  }

Inter-grid transfer operations

Transferring between different checkerboards of the same global lattice:

  template<class vobj> void pickCheckerboard(int cb,Lattice<vobj> &half,const Lattice<vobj> &full);
  template<class vobj> void setCheckerboard(Lattice<vobj> &full,const Lattice<vobj> &half);

These are used to set up Schur red-black decomposed solvers, for example.
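
A sketch of the red-black pattern, assuming a full-grid field psi and a half-grid field psi_e (Even and Odd label the checkerboards):

    LatticeFermionD psi(&Grid);       // full grid
    LatticeFermionD psi_e(&RBGrid);   // half grid

    pickCheckerboard(Even, psi_e, psi);   // gather the even sites of psi into psi_e
    // ... operate on psi_e, e.g. one checkerboard of a Schur-decomposed solve ...
    setCheckerboard(psi, psi_e);          // scatter psi_e back into the even sites of psi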

Multi-grid projection between a fine and coarse grid:

 template<class vobj,class CComplex,int nbasis>
 void blockProject(Lattice<iVector<CComplex,nbasis > > &coarseData,
                   const             Lattice<vobj>   &fineData,
                   const std::vector<Lattice<vobj> > &Basis);

Multi-grid promotion to a finer grid:

  template<class vobj,class CComplex,int nbasis>
  void blockPromote(const Lattice<iVector<CComplex,nbasis > > &coarseData,
                    Lattice<vobj>   &fineData,
                    const std::vector<Lattice<vobj> > &Basis);

Support for sub-block linear algebra:

  template<class vobj,class CComplex>
  void blockZAXPY(Lattice<vobj> &fineZ,
                  const Lattice<CComplex> &coarseA,
                  const Lattice<vobj> &fineX,
                  const Lattice<vobj> &fineY);

  template<class vobj,class CComplex>
  void blockInnerProduct(Lattice<CComplex> &CoarseInner,
                         const Lattice<vobj> &fineX,
                         const Lattice<vobj> &fineY);

  template<class vobj,class CComplex>
  void blockNormalise(Lattice<CComplex> &ip,Lattice<vobj> &fineX);

  template<class vobj>
  void blockSum(Lattice<vobj> &coarseData,const Lattice<vobj> &fineData);

  template<class vobj,class CComplex>
  void blockOrthogonalise(Lattice<CComplex> &ip,std::vector<Lattice<vobj> > &Basis);

Conversion between different SIMD layouts:

  template<class vobj,class vvobj>
  void localConvert(const Lattice<vobj> &in,Lattice<vvobj> &out);

Slices between a grid of dimension N and a grid of dimension N+1:

  template<class vobj>
  void InsertSlice(const Lattice<vobj> &lowDim,Lattice<vobj> &higherDim,int slice, int orthog);

  template<class vobj>
  void ExtractSlice(Lattice<vobj> &lowDim,const Lattice<vobj> &higherDim,int slice, int orthog);

Growing a lattice by an integer factor in each dimension, with periodic replication:

  template<class vobj>
  void Replicate(Lattice<vobj> &coarse,Lattice<vobj> &fine);

The latter is useful, for example, to pre-thermalise a smaller volume and then grow the volume in HMC. It was written while debugging G-parity boundary conditions.