Accurate Prediction of the Boiling Point of Organic Molecules by Multi-Component Heterogeneous Learning Model
Received date: 2022-01-10
Online published: 2022-03-01
Supported by
National Natural Science Foundation of China (21973069); National Natural Science Foundation of China (21773169); PEIYANG Young Scholars Program of Tianjin University (2018XRX-0007); Tianjin University-Qinghai Nationalities University Innovation Cooperation Fund (2021XZC-0064); Qinghai University for Nationalities and Qinghai-Tibet Plateau Resource Chemistry and Ecological Environmental Protection Key Laboratory of National Civil Affairs Commission Open Fund (2021GKF-1002); College Student Innovation and Entrepreneurship Training Program of Tianjin University (201910056451)
Boiling point (BP) is a fundamental physicochemical property of organic molecular liquids and an important parameter in the chemical industry. The boiling point of an organic molecule is determined by its molecular structure through a complex structure-property relationship. Traditional approaches such as the function method and the group contribution method cannot cope with BP prediction for organic molecules with complex and diverse structures. In this study, we developed an ensemble machine learning method that combines artificial neural networks (ANN) and support vector machines (SVM) into a multi-component model for high-accuracy boiling point prediction over a wide structure space. We built three heterogeneous models: an ANN based on interpretable descriptors, an ANN based on correlated descriptors, and an SVM based on hybrid molecular fingerprints. The three models were evaluated and optimized on a boiling point data set of 4550 organic molecules from various categories. The first model (Model A) is an interpretable ANN with a 19-14-1 structure, whose selected descriptors carry physical meaning such as molecular composition and chain branching. The second model (Model B) is an ANN with a 35-30-1 structure, whose descriptors were obtained by correlation coefficient selection and bidirectional stepwise regression. The third model (Model C) is a fingerprint SVM that integrates 1679 bits of molecular fingerprint information. Among the three component models, Model A had the best generalization ability and Model B had the best prediction accuracy. The heterogeneous multi-component learner combining the three models showed better prediction accuracy and generalization, with less overfitting, than a single model or a homogeneous model. The mean square error of the multi-component model on the test set is 12 K under 5-fold cross validation on the 4550-molecule data set, better than previous BP prediction methods. The ensemble machine learning strategy developed here, based on heterogeneous models, can also be generalized to the prediction of physicochemical properties beyond BP.
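The workflow summarized in the abstract (three heterogeneous component models, each with its own input features, combined and scored by 5-fold cross validation) can be illustrated with a minimal scikit-learn sketch. Everything concrete here is an assumption made for illustration: the data are synthetic, the fingerprint block is shortened to 64 bits, the hyperparameters are arbitrary, the component predictions are combined by a plain average (the abstract does not state the combination rule), and the score is computed as a root mean square error in kelvin. It is not the authors' implementation.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_molecules = 300  # synthetic stand-in; the study used 4550 molecules

# Hypothetical feature blocks standing in for the three model inputs.
X_a = rng.normal(size=(n_molecules, 19))  # 19 "interpretable" descriptors
X_b = rng.normal(size=(n_molecules, 35))  # 35 "correlated" descriptors
X_c = rng.integers(0, 2, size=(n_molecules, 64)).astype(float)  # fingerprint bits

# Synthetic boiling points in kelvin (made up, for illustration only).
y = (400.0 + 30.0 * X_a[:, 0] + 20.0 * X_b[:, 0]
     + 10.0 * X_c[:, :5].sum(axis=1)
     + rng.normal(scale=5.0, size=n_molecules))


def make_models():
    """Return fresh, unfitted stand-ins for Models A, B and C."""
    ann_a = make_pipeline(StandardScaler(), MLPRegressor(
        hidden_layer_sizes=(14,), max_iter=1000, random_state=0))  # 19-14-1
    ann_b = make_pipeline(StandardScaler(), MLPRegressor(
        hidden_layer_sizes=(30,), max_iter=1000, random_state=0))  # 35-30-1
    svm_c = SVR(kernel="rbf", C=10.0)
    # Standardize the target during fitting; predictions come back in kelvin.
    return [TransformedTargetRegressor(regressor=m, transformer=StandardScaler())
            for m in (ann_a, ann_b, svm_c)]


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def cross_validate_ensemble(blocks, y, n_splits=5, seed=0):
    """Out-of-fold RMSE of each component and of their plain average."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    out_of_fold = np.zeros((len(blocks), len(y)))
    for train_idx, test_idx in folds.split(blocks[0]):
        for i, (X, model) in enumerate(zip(blocks, make_models())):
            model.fit(X[train_idx], y[train_idx])
            out_of_fold[i, test_idx] = model.predict(X[test_idx])
    ensemble_pred = out_of_fold.mean(axis=0)  # assumed combination rule
    return [rmse(y, p) for p in out_of_fold], rmse(y, ensemble_pred)


single_rmse, ensemble_rmse = cross_validate_ensemble([X_a, X_b, X_c], y)
for name, err in zip("ABC", single_rmse):
    print(f"Model {name} stand-in, 5-fold RMSE: {err:.1f} K")
print(f"Averaged ensemble, 5-fold RMSE: {ensemble_rmse:.1f} K")
```

With a plain average, the ensemble error can never exceed that of the worst component, but it is not guaranteed to beat the best one. A weighted or stacked combination is a natural alternative when one component is much stronger than the others.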
Yuze Liu, Kunhua Li, Jiaxing Huang, Xi Yu, Wenping Hu. Accurate Prediction of the Boiling Point of Organic Molecules by Multi-Component Heterogeneous Learning Model[J]. Acta Chimica Sinica, 2022, 80(6): 714-723. DOI: 10.6023/A22010017