Acta Chimica Sinica (化学学报) ›› 2022, Vol. 80 ›› Issue (6): 714-723. DOI: 10.6023/A22010017

Research Article

Accurate Prediction of the Boiling Points of Organic Molecules with a Multi-Component Learner

Yuze Liu a, Kunhua Li a, Jiaxing Huang a, Xi Yu a,b,*, Wenping Hu a,b,*

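  a Department of Chemistry, Tianjin University, Tianjin 300072
  b Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Tianjin University, Tianjin 300072
  • Received: 2022-01-10 Published: 2022-07-07
  • Corresponding authors: Xi Yu, Wenping Hu
  • Supported by:
    National Natural Science Foundation of China (21973069); National Natural Science Foundation of China (21773169); PEIYANG Young Scholars Program of Tianjin University (2018XRX-0007); Tianjin University-Qinghai Nationalities University Innovation Cooperation Fund (2021XZC-0064); Qinghai University for Nationalities and Qinghai-Tibet Plateau Resource Chemistry and Ecological Environmental Protection Key Laboratory of National Civil Affairs Commission Open Fund (2021GKF-1002); College Student Innovation and Entrepreneurship Training Program of Tianjin University (201910056451)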

Accurate Prediction of the Boiling Point of Organic Molecules by Multi-Component Heterogeneous Learning Model

Yuze Liu a, Kunhua Li a, Jiaxing Huang a, Xi Yu a,b,*, Wenping Hu a,b,*

  a Department of Chemistry, Tianjin University, Tianjin 300072
  b Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Tianjin University, Tianjin 300072
  • Received: 2022-01-10 Published: 2022-07-07
  • Contact: Xi Yu, Wenping Hu
  • Supported by:
    National Natural Science Foundation of China (21973069); National Natural Science Foundation of China (21773169); PEIYANG Young Scholars Program of Tianjin University (2018XRX-0007); Tianjin University-Qinghai Nationalities University Innovation Cooperation Fund (2021XZC-0064); Qinghai University for Nationalities and Qinghai-Tibet Plateau Resource Chemistry and Ecological Environmental Protection Key Laboratory of National Civil Affairs Commission Open Fund (2021GKF-1002); College Student Innovation and Entrepreneurship Training Program of Tianjin University (201910056451)

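Boiling point (BP) is a fundamental physicochemical property of organic liquids and an important parameter in industrial chemical production. The boiling point of an organic molecule is determined by its molecular structure, and the structure-boiling point relationship is complex. Traditional approaches such as the function method and the group contribution method cannot handle complex and diverse organic structures: their range of application is narrow and their prediction accuracy is low. In this study, we use a multi-component learner based on artificial neural networks (ANN) and support vector machines (SVM) to predict the boiling points of organic molecules accurately. We constructed three heterogeneous models: an ANN based on interpretable descriptors, an ANN based on correlation-selected descriptors, and an SVM based on composite molecular fingerprints. Training them on a data set containing the boiling points of 4550 organic molecules of various classes gave three heterogeneous learners, which were then integrated to predict boiling points. Compared with traditional methods and earlier quantitative structure-property relationship (QSPR) models, the multi-component model combines the strengths of the three models, shows good prediction accuracy and generalization ability with little overfitting, and predicts the boiling points of many types of organic molecules effectively.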

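Key words: chemical informatics, machine learning, heterogeneous learning model, artificial neural network, support vector machine, ensemble learning, boiling point of organic molecules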

Boiling point (BP) is a basic physicochemical property of organic liquids and an important parameter in the chemical industry. The boiling point of an organic molecule is determined by its molecular structure, and the structure-property relationship is complex. Traditional methods such as the function method and the group contribution method cannot cope with BP prediction for organic molecules of complex and diverse structures. In this study, we developed an ensemble machine learning method that combines artificial neural networks (ANN) and support vector machines (SVM) into a multi-component model for accurate boiling point prediction over a wide structure space. We developed three heterogeneous models: an ANN based on interpretable descriptors, an ANN based on correlated descriptors, and an SVM based on hybrid molecular fingerprints. The three models were evaluated and optimized on a boiling point data set containing 4550 organic molecules of various categories. The first model (Model A) is an interpretable ANN with a 19-14-1 structure whose selected descriptors carry physical meaning, such as molecular composition and chain branching. The second model (Model B) is an ANN with a 35-30-1 structure whose descriptors were obtained by correlation coefficient selection and bidirectional stepwise regression. The third model (Model C) is a fingerprint SVM that integrates 1679 bits of molecular fingerprint information. Among the three component models, Model A had the best generalization ability and Model B had the best prediction accuracy. The heterogeneous multi-component learner that combines the three models showed better prediction accuracy and generalization, with less overfitting, than a single model or a homogeneous model.
The mean square error of the multi-component model on the test set is 12 K under 5-fold cross validation on the data set of 4550 organic molecules, better than all previous BP prediction methods. The ensemble machine learning strategy developed here, based on heterogeneous models, can also be generalized to the prediction of physicochemical properties other than BP.
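The structure of such a multi-component learner can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation. Several things here are assumptions made purely for illustration: the abstract does not state how the three component predictions are combined, so a plain average is used; the two small regressors are stand-ins for the ANN and SVM components; the class and function names, the feature-view names, and the toy data are all hypothetical; and the error is computed as a root mean square error so that it carries the unit K.

```python
import math
import random
from statistics import fmean


class LinearRegressor1D:
    """Stand-in learner: least-squares line through the first feature column."""

    def fit(self, X, y):
        xs = [row[0] for row in X]
        mx, my = fmean(xs), fmean(y)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (t - my) for x, t in zip(xs, y))
        self.slope_ = sxy / sxx if sxx else 0.0
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]


class NearestNeighbourRegressor:
    """Stand-in learner: returns the target of the closest training row."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        nearest = lambda row: min(range(len(self.X_)),
                                  key=lambda i: math.dist(row, self.X_[i]))
        return [self.y_[nearest(row)] for row in X]


class HeterogeneousEnsemble:
    """Averages component learners, each trained on its own feature view."""

    def __init__(self, components):
        self.components = components  # list of (learner, view_name) pairs

    def fit(self, views, y):
        for learner, name in self.components:
            learner.fit(views[name], y)
        return self

    def predict(self, views):
        preds = [learner.predict(views[name]) for learner, name in self.components]
        # Assumed combination rule: unweighted mean of the component predictions.
        return [fmean(col) for col in zip(*preds)]


def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as the targets."""
    return math.sqrt(fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def select_rows(views, rows):
    """Pick the same sample indices out of every feature view."""
    return {name: [X[i] for i in rows] for name, X in views.items()}


def k_fold_rmse(make_model, views, y, k=5, seed=0):
    """Mean test-fold RMSE over k folds; a fresh model is trained per fold."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    scores = []
    for test_idx in (idx[i::k] for i in range(k)):
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = make_model().fit(select_rows(views, train_idx),
                                 [y[i] for i in train_idx])
        y_pred = model.predict(select_rows(views, test_idx))
        scores.append(rmse([y[i] for i in test_idx], y_pred))
    return fmean(scores)


# Toy data: one hypothetical "size" descriptor and noisy boiling points in K.
rng = random.Random(0)
size = [rng.uniform(1, 10) for _ in range(60)]
bp = [250.0 + 25.0 * s + rng.gauss(0, 5) for s in size]
views = {
    "descriptors_a": [[s] for s in size],
    "descriptors_b": [[s, s * s] for s in size],
    "fingerprint": [[float(s > t) for t in range(1, 10)] for s in size],
}


def make_ensemble():
    return HeterogeneousEnsemble([
        (LinearRegressor1D(), "descriptors_a"),
        (NearestNeighbourRegressor(), "descriptors_b"),
        (NearestNeighbourRegressor(), "fingerprint"),
    ])


print(f"5-fold CV RMSE on toy data: {k_fold_rmse(make_ensemble, views, bp):.1f} K")
```

Because each component only needs `fit` and `predict`, the stand-ins could in principle be swapped for real learners, for example scikit-learn's `MLPRegressor(hidden_layer_sizes=(14,))` for a 19-14-1 network and `SVR` for the fingerprint model. Whether that reproduces the reported accuracy depends on descriptors, hyperparameters, and a combination rule that the abstract does not specify.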

Key words: chemical informatics, machine learning, heterogeneous learning model, artificial neural network, support vector machine, ensemble learning, organic molecular boiling point