Accurate Prediction of the Boiling Point of Organic Molecules by Multi-Component Heterogeneous Learning Model
Received date: 2022-01-10
Online published: 2022-03-01
Supported by
National Natural Science Foundation of China (21973069, 21773169); PEIYANG Young Scholars Program of Tianjin University (2018XRX-0007); Tianjin University-Qinghai Nationalities University Innovation Cooperation Fund (2021XZC-0064); Open Fund of the State Ethnic Affairs Commission Key Laboratory of Qinghai-Tibet Plateau Resource Chemistry and Ecological Environmental Protection, Qinghai Nationalities University (2021GKF-1002); College Student Innovation and Entrepreneurship Training Program of Tianjin University (201910056451)
Cite this article: Liu Yuze, Li Kunhua, Huang Jiaxing, Yu Xi, Hu Wenping. Accurate Prediction of the Boiling Point of Organic Molecules by Multi-Component Heterogeneous Learning Model[J]. Acta Chimica Sinica, 2022, 80(6): 714-723. DOI: 10.6023/A22010017
Boiling point (BP) is a basic physicochemical property of organic liquids and an important parameter in chemical industrial production. The boiling point of an organic molecule is determined by its molecular structure, which gives rise to a complex structure-property relationship. Traditional approaches such as the function method and the group contribution method cannot cope with complex and diverse organic structures; their range of application is narrow and their prediction accuracy is low. In this study, we developed an ensemble machine learning method that combines artificial neural networks (ANN) and support vector machines (SVM) into a multi-component model for accurate boiling point prediction over a wide structural space. We built three heterogeneous models: an ANN based on interpretable descriptors, an ANN based on correlated descriptors, and an SVM based on hybrid molecular fingerprints. The three models were evaluated and optimized on a boiling point data set containing 4550 organic molecules of various categories. The first model (Model A) is an interpretable ANN with a 19-14-1 structure whose descriptors were selected for their physical meaning, such as molecular composition and chain branching. The second model (Model B) is an ANN with a 35-30-1 structure whose descriptors were obtained by correlation-coefficient screening and bidirectional stepwise regression. The third model (Model C) is a fingerprint-based SVM that integrates 1679 bits of molecular fingerprint information. Among the three component models, Model A had the best generalization ability and Model B had the best prediction accuracy. The heterogeneous multi-component learner that combines the three models showed better prediction accuracy and generalization, with less overfitting, than a single model or a homogeneous model.
In 5-fold cross-validation on the data set of 4550 organic molecules, the mean square error of the multi-component model on the test set is 12 K, better than all previous BP prediction methods. Compared with traditional methods and earlier quantitative structure-property relationship (QSPR) models, the multi-component model combines the strengths of its three components and predicts boiling points effectively for many types of organic molecules. The ensemble machine learning strategy based on heterogeneous models developed here can also be extended to the prediction of physicochemical properties other than BP.
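The evaluation scheme described above, several different component learners whose predictions are combined and then scored by 5-fold cross-validation, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three fitters (`fit_mean`, `fit_linear`, `fit_nearest`) are hypothetical stand-ins for Models A, B and C, not the ANNs or the SVM of the paper; the unweighted mean in `ensemble_predict` is an assumed combination rule, since the abstract does not state how the three learners are integrated; and the single-feature data are synthetic numbers, not boiling points.

```python
import random
from statistics import mean


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Hypothetical stand-in learners. Each fitter takes training data and
# returns a prediction function; none of them is a model from the paper.
def fit_mean(X, y):
    m = mean(y)
    return lambda x: m


def fit_linear(X, y):
    # Ordinary least squares on a single feature.
    mx, my = mean(X), mean(y)
    sxx = sum((x - mx) ** 2 for x in X)
    slope = sum((x - mx) * (t - my) for x, t in zip(X, y)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def fit_nearest(X, y):
    # 1-nearest-neighbour lookup on a single feature.
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def ensemble_predict(learners, x):
    """Combine component predictions by an unweighted mean (an assumption)."""
    return mean([f(x) for f in learners])


def mse(y_true, y_pred):
    """Mean square error between targets and predictions."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def cross_validate(X, y, fitters, k=5, seed=0):
    """Average test-fold MSE of the ensemble over k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Refit every component on the training folds only.
        learners = [fit(X_tr, y_tr) for fit in fitters]
        preds = [ensemble_predict(learners, X[i]) for i in test_idx]
        fold_errors.append(mse([y[i] for i in test_idx], preds))
    return mean(fold_errors)


if __name__ == "__main__":
    # Synthetic data: a linear trend with an alternating +/-1 offset.
    X = [float(i) for i in range(30)]
    y = [3.0 * x + 2.0 + (1.0 if i % 2 else -1.0) for i, x in enumerate(X)]
    print("linear only :", cross_validate(X, y, [fit_linear]))
    print("all three   :", cross_validate(X, y, [fit_mean, fit_linear, fit_nearest]))
```

Every component is refitted inside each fold, so the held-out molecules never influence the learners that predict them. In a real workflow the single feature would be replaced by descriptor or fingerprint vectors and the stand-ins by trained ANN and SVM regressors.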