Acta Chimica Sinica, 2022, Vol. 80, Issue 6: 714-723. DOI: 10.6023/A22010017




Accurate Prediction of the Boiling Point of Organic Molecules by Multi-Component Heterogeneous Learning Model

Yuze Liu (a), Kunhua Li (a), Jiaxing Huang (a), Xi Yu (a, b), Wenping Hu (a, b)

  a Department of Chemistry, Tianjin University, Tianjin 300072
  b Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Tianjin University, Tianjin 300072
  • Received: 2022-01-10 Published: 2022-07-07
  • Contact: Xi Yu, Wenping Hu
  • Supported by:
    National Natural Science Foundation of China (21973069); National Natural Science Foundation of China (21773169); PEIYANG Young Scholars Program of Tianjin University (2018XRX-0007); Tianjin University-Qinghai Nationalities University Innovation Cooperation Fund (2021XZC-0064); Open Fund of the State Ethnic Affairs Commission Key Laboratory of Qinghai-Tibet Plateau Resource Chemistry and Ecological Environmental Protection, Qinghai Nationalities University (2021GKF-1002); College Student Innovation and Entrepreneurship Training Program of Tianjin University (201910056451)

Boiling point (BP) is a fundamental physicochemical property of organic liquids and an important parameter in the chemical industry. It is determined by molecular structure through a complex structure-property relationship. Traditional approaches, such as the function method and the group contribution method, cannot cope with BP prediction for organic molecules of complex and diverse structures. In this study, we developed an ensemble machine learning method that combines artificial neural networks (ANN) and support vector machines (SVM) into a multi-component model for accurate BP prediction over a wide structural space. We built three heterogeneous component models: an ANN based on interpretable descriptors, an ANN based on correlated descriptors, and an SVM based on hybrid molecular fingerprints. The three models were evaluated and optimized on a boiling point data set of 4550 organic molecules from various categories. The first model (Model A) is an interpretable ANN with a 19-14-1 structure whose selected descriptors carry physical meaning, such as molecular composition and chain branching. The second model (Model B) is an ANN with a 35-30-1 structure whose descriptors were obtained by correlation coefficient selection and bidirectional stepwise regression. The third model (Model C) is a fingerprint-based SVM that integrates 1679 bits of molecular fingerprint information. Among the three component models, Model A had the best generalization ability and Model B had the best prediction accuracy. The heterogeneous multi-component learner that combines the three models showed better prediction accuracy and generalization, with less overfitting, than any single model or homogeneous model.
In 5-fold cross validation on the 4550-molecule data set, the multi-component model reached a test-set root mean square error of 12 K, better than all previously reported BP prediction methods. The ensemble machine learning strategy developed here, based on heterogeneous models, can also be generalized to the prediction of physicochemical properties other than BP.
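
As a rough illustration of the evaluation scheme described above, the following Python sketch combines several different component regressors into one predictor and scores it by k-fold cross validation. It rests on assumptions throughout. The abstract does not state how the three models are combined, so simple averaging of their predictions is assumed here. The component learners (a mean predictor, a one-feature linear fit, and a nearest-neighbour regressor) are hypothetical stand-ins, not the ANN and SVM models of the paper. The data are synthetic, and the per-fold root mean square error is used as the error measure.

```python
import math
import random
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Hypothetical stand-in learners. Each takes training data and
# returns a function that predicts a value for one feature vector.
def fit_mean(X, y):
    m = mean(y)
    return lambda x: m


def fit_linear(X, y):
    """Least-squares line on the first feature only."""
    xs = [row[0] for row in X]
    mx, my = mean(xs), mean(y)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, y))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x[0] - mx)


def fit_knn(X, y, k=3):
    """Average the targets of the k nearest training points."""
    def predict(x):
        order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))
        return mean(y[i] for i in order[:k])
    return predict


def fit_ensemble(X, y, fitters):
    """Train every component, then average their predictions (assumed rule)."""
    models = [fit(X, y) for fit in fitters]
    return lambda x: mean(m(x) for m in models)


def cross_validated_rmse(X, y, fit, k=5):
    """Mean held-out RMSE over k folds."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held = set(fold)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(rmse([y[i] for i in fold], [model(X[i]) for i in fold]))
    return mean(scores)


# Synthetic toy data (not boiling points): two made-up descriptors per sample.
rng = random.Random(1)
X_demo = [(rng.uniform(0, 10), rng.uniform(0, 3)) for _ in range(60)]
y_demo = [300 + 20 * a - 5 * b + rng.gauss(0, 2) for a, b in X_demo]

components = [fit_mean, fit_linear, fit_knn]
ensemble_score = cross_validated_rmse(
    X_demo, y_demo, lambda X, y: fit_ensemble(X, y, components), k=5
)
```

Each component can be scored the same way by passing it directly as `fit`, which allows a side-by-side comparison of single models against the combined one on identical folds.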

Key words: cheminformatics, machine learning, heterogeneous learning model, artificial neural network, support vector machine, ensemble learning, organic molecular boiling point