This thesis compares a GPU implementation of the Conjugate Residual method built as a sequence of generic library kernels against implementations of the method with custom kernels. The comparison exposes the performance gains of kernel fusion, a key optimization strategy for memory-bound operations whose purpose is to reuse the processed data efficiently.
For massive MIMO, the iterative solver is employed at the linear detection stage to overcome the computational bottleneck of the matrix inversion required in the equalization process, which is 𝒪(n³) for direct solvers. A detailed analysis is given of how one more of the Krylov subspace methods that are feasible for massive MIMO can be implemented on a GPU as a unified kernel.
A further goal is to show that kernel fusion can improve execution performance not only when the input data are large matrices and vectors, as in scientific computing, but also in massive MIMO and possibly similar cases, where the input data are a large number of small matrices and vectors that must be processed in parallel.
In more detail, two different approaches are proposed and tested. They focus on the small number of iterations the solver requires to reach a close enough approximation of the exact solution in massive MIMO, and on the case where the number of users matches the size of a warp. Both approaches allow the algorithm to be fully unrolled and the separate kernels to be fused gradually into a single kernel, ending in a top-down hardcoded implementation.
To overcome the algorithm's computational burden, which is the matrix-vector product, further optimization techniques are proposed and tested to achieve high efficiency and high parallelism. These include two ways of using the fast on-chip memories: preloading the matrix in shared memory and preloading the vector in shared memory.
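
For orientation, the Conjugate Residual iteration can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it assumes a small real symmetric positive definite system (the massive MIMO case would typically involve a complex Hermitian matrix), a zero initial guess, and a fixed iteration count. The names `conjugate_residual`, `matvec`, and `dot` are hypothetical. The sketch shows that each iteration needs a single matrix-vector product, with the remaining work being dot products and vector updates, which is the kind of separate step a fused kernel would combine.

```python
def dot(u, v):
    """Inner product of two real vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, v) for row in A]


def conjugate_residual(A, b, iters):
    """Approximate the solution of A x = b with a fixed number of
    Conjugate Residual iterations.

    Assumptions (not taken from the thesis): A is real and symmetric
    positive definite, and the initial guess is the zero vector.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)            # residual b - A x for x = 0
    p = list(r)            # search direction
    Ar = matvec(A, r)
    Ap = list(Ar)          # A p, kept up to date by recurrence
    rAr = dot(r, Ar)

    for _ in range(iters):
        ApAp = dot(Ap, Ap)
        if ApAp == 0.0 or rAr == 0.0:
            break          # already converged, avoid dividing by zero
        alpha = rAr / ApAp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        Ar = matvec(A, r)  # the only matrix-vector product per iteration
        rAr_new = dot(r, Ar)
        beta = rAr_new / rAr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        Ap = [ari + beta * api for ari, api in zip(Ar, Ap)]
        rAr = rAr_new
    return x
```

Calling `conjugate_residual(A, b, k)` with a small `k` mirrors the setting described above, where only a few iterations are needed; with a fixed `k` the loop can be unrolled completely.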