Acta Chimica Sinica ›› 2021, Vol. 79 ›› Issue (5): 653-657. DOI: 10.6023/A21020044

Communication

Hartree-Fock and Density Functional Calculations on Graphics Processing Unit

Yan Wang a, Yingqi Tian b,c, Zhong Jin b,*, Bingbing Suo a,*

  1. a Institute of Modern Physics, Shaanxi Key Laboratory for Theoretical Physics Frontiers, Northwest University, Xi'an 710127, China
    b Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
    c University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2021-02-03 Published: 2021-03-30
  • Contact: Zhong Jin, Bingbing Suo
  • Supported by:
    National Key R&D Program of China (2017YFB0203404); National Natural Science Foundation of China (21873077)

Graphics processing units (GPUs) have become a promising architecture for tackling many computational bottlenecks in quantum chemistry calculations. Herein, we present our recent development of GPU-accelerated Hartree-Fock (HF) and density functional theory (DFT) calculations in the Beijing Density Functional (BDF) package. Our program uses the OpenCL platform and can therefore run on a variety of computing devices from different vendors, such as NVIDIA and AMD. All time-consuming steps in HF/DFT, such as the calculation of electron repulsion integrals (ERIs), the formation of the Fock matrix, and the exchange-correlation (XC) functional related work, have been implemented on the GPU. In our algorithm, the Coulomb (J) and exchange (K) matrices are calculated directly on the GPU by contracting the primitive ERIs with the density matrix. The 1T1PI (1 thread 1 primitive integral) algorithm, in which each thread evaluates one primitive ERI, is used to schedule the computational tasks on the GPU. To this end, the primitive Gaussian basis shell pairs μν are first prescreened and sorted according to their values. The Gaussian product theorem (GPT) is applied to each shell pair, and the intermediate values are calculated and copied into GPU memory for further use. Then, a one-dimensional mapping is used that assigns 32 work items (threads) to one workgroup to calculate the J matrix elements, and the permutation symmetry of the primitive ERIs is fully utilized. To calculate the K matrix, a two-dimensional mapping is used and every 64 work items are packed into one workgroup. The permutation symmetry of exchanging the bra pair μλ and the ket pair νσ is ignored to reduce the expensive communication between different workgroups on the GPU. After a batch of Coulomb or exchange matrix elements is computed on the GPU, the results are copied back to the CPU and accumulated into the Fock matrix. The XC terms are calculated through a numerical procedure because of the complex form of the XC functionals.
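The shell-pair preparation described above (prescreening, sorting, and applying the Gaussian product theorem) can be illustrated with a minimal CPU-side sketch. It is not the BDF implementation: it assumes s-type primitives only, and it uses the GPT prefactor as a stand-in for the screening value, which the abstract does not specify. The names `gaussian_product`, `build_pair_list`, and `threshold` are hypothetical.

```python
import math

def gaussian_product(a, A, b, B):
    """Gaussian product theorem for two s-type primitives.

    exp(-a|r-A|^2) * exp(-b|r-B|^2) = K * exp(-p|r-P|^2)
    """
    p = a + b                                    # combined exponent
    P = tuple((a * Ai + b * Bi) / p for Ai, Bi in zip(A, B))  # product center
    dist2 = sum((Ai - Bi) ** 2 for Ai, Bi in zip(A, B))
    K = math.exp(-a * b / p * dist2)             # prefactor
    return p, P, K

def build_pair_list(primitives, threshold=1e-12):
    """Prescreen primitive pairs and sort them by descending prefactor.

    Assumption: the prefactor K is used as the screening value here;
    a production code would typically use a tighter integral bound.
    """
    pairs = []
    for i, (a, A) in enumerate(primitives):
        for j in range(i + 1):                   # j <= i: each pair once
            b, B = primitives[j]
            p, P, K = gaussian_product(a, A, b, B)
            if K >= threshold:                   # drop negligible pairs
                pairs.append({"i": i, "j": j, "p": p, "P": P, "K": K})
    pairs.sort(key=lambda rec: rec["K"], reverse=True)
    return pairs

# Toy input: (exponent, center); the third primitive is far from the others.
primitives = [
    (1.0, (0.0, 0.0, 0.0)),
    (0.5, (0.0, 0.0, 1.4)),
    (2.0, (0.0, 0.0, 20.0)),
]
pairs = build_pair_list(primitives)
```

In this toy input the two pairs involving the distant third primitive and either of the first two fall below the threshold and are discarded, so only the surviving pair records would need to be copied to device memory.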
We first pack the numerical grid points into batches of 128 points each. Then the Gaussian basis shells that are nonzero on each grid batch are sifted out. The grid batches and the basis function sieving indices are duplicated in CPU and GPU memory to avoid unnecessary communication between the CPU and the GPU. The computational tasks are scheduled dynamically on the GPU according to the grid batches. All steps, such as calculating the numerical grids and their weights, the electron density and density gradient, the XC functional and its derivative, and the XC energy and the matrix elements of the XC potential, are optimized step by step on the GPU. All calculations are carried out in 64-bit double precision to achieve the same numerical precision as on the CPU. Benchmark calculations are carried out on several different GPUs from NVIDIA and AMD to assess the performance of our code. The benchmark results indicate that the algorithm implemented on the GPU can achieve up to a 148-fold speedup over a serial CPU implementation, and the total energy calculated on the GPU is as accurate as that calculated on the CPU.
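The grid batching and basis-shell sieving steps can be sketched as follows. This is an illustrative CPU-side example, not the authors' code: the batch size of 128 comes from the abstract, while the cutoff-radius criterion, the tolerance `tol`, the s-type shell model, and all function names are assumptions made for the example.

```python
import math

BATCH_SIZE = 128  # grid points per batch, as stated in the abstract

def make_batches(points, batch_size=BATCH_SIZE):
    """Split the list of grid points into consecutive fixed-size batches."""
    return [points[k:k + batch_size] for k in range(0, len(points), batch_size)]

def shell_cutoff_radius(exponent, tol=1e-10):
    """Radius beyond which exp(-exponent * r^2) drops below tol.

    Assumption: an s-type primitive stands in for a whole shell.
    """
    return math.sqrt(-math.log(tol) / exponent)

def significant_shells(batch, shells, tol=1e-10):
    """Indices of shells that are non-negligible on at least one batch point."""
    keep = []
    for idx, (exponent, center) in enumerate(shells):
        rcut = shell_cutoff_radius(exponent, tol)
        if any(math.dist(pt, center) <= rcut for pt in batch):
            keep.append(idx)
    return keep

# Toy data: 300 grid points on a line and two well-separated shells.
points = [(0.0, 0.0, 0.1 * k) for k in range(300)]
shells = [(1.0, (0.0, 0.0, 0.0)), (0.5, (0.0, 0.0, 30.0))]

batches = make_batches(points)
sieve_index = [significant_shells(batch, shells) for batch in batches]
```

Here `sieve_index` plays the role of the per-batch sieving indices: each batch carries only the shells that contribute on its points, so the density and XC potential loops for that batch can skip every other shell.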

Key words: GPU, OpenCL, Hartree-Fock, density functional theory, direct self-consistent-field calculation