Christopher Rodrigues, John A. Stratton
Some molecular modeling tasks require a high-resolution map of the electrostatic potential field produced by charged atoms distributed throughout a volume. Cutoff-limited Coulombic Potential (CUTCP) computes a short-range component of this map, in which the potential at a given map point comes only from atoms within a cutoff radius of 12 Å. In a complete application, this would be added to a long-range component computed with a less computationally demanding algorithm. In a simple, sequential implementation, each atom is visited in turn, and the electrostatic potential contributions made by the visited atom are accumulated into all output cells within the cutoff distance before proceeding to the next atom.
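The sequential scheme can be illustrated with a minimal Python sketch. It is not the benchmark's source code: the function name, the lattice layout, and the plain q/r potential term are assumptions made for illustration, and the benchmark's actual potential function may differ (for example, by smoothing the potential near the cutoff).

```python
import math

def cutoff_potential_sequential(atoms, dims, spacing, cutoff):
    """Accumulate a cutoff-limited potential map, one atom at a time.

    atoms   : list of (x, y, z, q) tuples (hypothetical layout)
    dims    : (nx, ny, nz) number of lattice points per axis
    spacing : distance between neighbouring lattice points
    cutoff  : cutoff radius, in the same units as the coordinates
    """
    nx, ny, nz = dims
    grid = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    cutoff2 = cutoff * cutoff

    def index_range(center, n):
        # Lattice indices along one axis that can fall inside the cutoff sphere.
        lo = max(0, math.ceil((center - cutoff) / spacing))
        hi = min(n - 1, math.floor((center + cutoff) / spacing))
        return range(lo, hi + 1)

    for ax, ay, az, q in atoms:
        for k in index_range(az, nz):
            dz = k * spacing - az
            for j in index_range(ay, ny):
                dy = j * spacing - ay
                for i in index_range(ax, nx):
                    dx = i * spacing - ax
                    r2 = dx * dx + dy * dy + dz * dz
                    # Assumption: unsmoothed Coulomb term q/r inside the cutoff;
                    # a point that coincides with the atom is skipped.
                    if 0.0 < r2 < cutoff2:
                        grid[k][j][i] += q / math.sqrt(r2)
    return grid
```

Each atom touches only the lattice points inside its bounding box, so the cost grows with the number of atoms times the number of points per cutoff sphere rather than with the size of the whole map.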
In an accelerated implementation, the atom data is first presorted into a spatial data structure as follows. The atom-filled volume is partitioned into a 3D uniform grid of cells, and atoms are placed into a 3D array at an index determined by which cell they occupy in the space. The array has capacity for up to eight atoms in one cell; excess atoms are processed separately on the CPU. For biomolecules, where the atom density is uniform and close to the density of water, this data structure utilizes memory efficiently. This presorting step is performed on the CPU.
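The presorting step described above might look like the following Python sketch. The function name, the flat x-fastest cell indexing, and the handling of atoms that fall outside the grid are assumptions for illustration; only the capacity of eight atoms per cell and the separate handling of excess atoms come from the description.

```python
CELL_CAPACITY = 8  # from the description: up to eight atoms per cell

def bin_atoms(atoms, origin, cell_size, cell_dims, capacity=CELL_CAPACITY):
    """Sort atoms into a uniform 3D grid of fixed-capacity cells.

    atoms     : list of (x, y, z, q) tuples (hypothetical layout)
    origin    : (x, y, z) of the grid's lower corner
    cell_size : edge length of one cubic cell
    cell_dims : (cx, cy, cz) number of cells per axis
    Returns (cells, overflow), where cells is a flat list indexed
    x-fastest and overflow holds atoms to be processed separately.
    """
    cx, cy, cz = cell_dims
    cells = [[] for _ in range(cx * cy * cz)]
    overflow = []
    for atom in atoms:
        x, y, z, _q = atom
        i = int((x - origin[0]) // cell_size)
        j = int((y - origin[1]) // cell_size)
        k = int((z - origin[2]) // cell_size)
        if not (0 <= i < cx and 0 <= j < cy and 0 <= k < cz):
            # Assumption of this sketch: atoms outside the grid are
            # treated like excess atoms.
            overflow.append(atom)
            continue
        slot = cells[(k * cy + j) * cx + i]
        if len(slot) < capacity:
            slot.append(atom)
        else:
            overflow.append(atom)  # cell is full; handle on the CPU
    return cells, overflow
```

A fixed capacity keeps every cell the same size in memory, which makes the array easy to index from a kernel; the price is unused slots in sparse cells and an overflow list for dense ones.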
In the accelerated kernel, an electrostatic potential value at a point is computed by scanning the contents of all cells within that point's cutoff radius, finding the atoms in those cells that are actually within the cutoff radius, and accumulating the potentials resulting from those atoms. In the most optimized versions, to reduce redundant computation, each thread computes the potential at multiple output points. To reduce memory bandwidth, one cell's worth of atom data at a time is loaded into local memory by a work-group, where it can be reused by multiple threads. All threads in a work-group scan the same set of atoms. This greatly reduces memory traffic at the cost of increased computation, since threads scan more atoms that do not contribute to the final calculation.
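The per-point cell scan can be sketched serially in Python as follows. This is an illustration only: it does not model work-groups, local memory, or multiple output points per thread, and the function name, the grid origin at (0, 0, 0), and the plain q/r term are assumptions. The returned count of scanned atoms shows the redundant work mentioned above, since every atom in a candidate cell is tested even if it lies outside the cutoff.

```python
import math

def potential_at_point(point, cells, cell_dims, cell_size, cutoff):
    """Compute the cutoff-limited potential at one point by scanning cells.

    point     : (x, y, z) of the output point
    cells     : flat list of per-cell atom lists, indexed x-fastest,
                with the grid's lower corner at (0, 0, 0) (assumption)
    cell_dims : (cx, cy, cz) number of cells per axis
    Returns (potential, atoms_scanned).
    """
    px, py, pz = point
    cx, cy, cz = cell_dims
    cutoff2 = cutoff * cutoff

    def cell_range(p, n):
        # Cells along one axis that overlap the interval [p - cutoff, p + cutoff].
        lo = max(0, math.floor((p - cutoff) / cell_size))
        hi = min(n - 1, math.floor((p + cutoff) / cell_size))
        return range(lo, hi + 1)

    total = 0.0
    scanned = 0
    for k in cell_range(pz, cz):
        for j in cell_range(py, cy):
            for i in cell_range(px, cx):
                for ax, ay, az, q in cells[(k * cy + j) * cx + i]:
                    scanned += 1
                    dx, dy, dz = ax - px, ay - py, az - pz
                    r2 = dx * dx + dy * dy + dz * dz
                    # Only atoms actually inside the cutoff contribute.
                    if 0.0 < r2 < cutoff2:
                        total += q / math.sqrt(r2)
    return total, scanned
```

In the example below, three atoms are scanned but only two lie within the cutoff; the third is the kind of non-contributing atom that the accelerated kernel trades for lower memory traffic.

```python
cells = [[(0.5, 0.5, 0.5, 1.0)], [(1.5, 0.5, 0.5, 1.0)], [(2.9, 0.5, 0.5, 1.0)]]
value, scanned = potential_at_point((1.0, 0.5, 0.5), cells, (3, 1, 1), 1.0, 1.0)
```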
The optimized CUTCP application is compute-bound. Unlike the other compute-bound benchmarks, this kernel achieves high computational throughput partly at the cost of performing redundant computation. The percentage of redundant computation is the primary performance limiter for the hardware configurations we have studied.
The benchmark comes from the Parboil benchmark suite. To increase the run time, the SPEC benchmark simulates a time-averaged CUTCP computation, a frequently used analysis, by repeatedly computing and averaging the potential field for the fixed atoms.
118.cutcp's input consists of a PQR molecular structure file, which defines the size of the simulated area and the explicit molecules contained within it.
118.cutcp outputs the potential field as a list of real values for each cell.
The output file cutcp.out contains detailed timing information about the run. It also shows which device was selected, along with which devices were available to OpenCL, and contains status updates from the run.
Last updated: $Date: 2014-02-03 16:05:20 -0500 (Mon, 03 Feb 2014) $