Acta Chimica Sinica, 2011, Vol. 69, Issue (10): 1232-1238.

Research Article

QSAR Study of the Permeability of Organic Compounds Through Polyethylene Membrane Using the PLS Variable Selection Method

Zhang Yonghong1,2,3, Liu Shushen*,2, Xiao Qianfen2, Qin Litang2, Xia Zhining*,3

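  1. (1 College of Pharmaceutical Sciences, Chongqing Medical University, Chongqing 400016)
    (2 Key Laboratory of Yangtze River Water Environment, Ministry of Education, College of Environmental Science and Engineering, Tongji University, Shanghai 200092)
    (3 College of Bioengineering, Chongqing University, Chongqing 400030)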
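  • Received: 2010-07-26 Revised: 2010-12-30 Published: 2011-01-17
  • Corresponding author: Liu Shushen E-mail: ssliuhl@263.net
  • Funding:

    Integration and Demonstration of Technologies for Optimal Allocation of Regional Drinking Water Sources and Water Quality Improvement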

PLS Variable Selection Procedure in QSAR Study on the Performance of Organic Compounds Through Polyethylene Membrane

Zhang Yonghong1,2,3, Liu Shushen*,2, Xiao Qianfen2, Qin Litang2, Xia Zhining*,3

  1. (1 College of Pharmaceutical Sciences, Chongqing Medical University, Chongqing 400016)
    (2 Key Laboratory of Yangtze River Water Environment, Ministry of Education, College of Environmental Science and Engineering, Tongji University, Shanghai 200092)
    (3 College of Bioengineering, Chongqing University, Chongqing 400030)
  • Received:2010-07-26 Revised:2010-12-30 Published:2011-01-17
  • Contact: Shu-Shen LIU E-mail:ssliuhl@263.net

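With the large number of molecular descriptors now applied in QSAR/QSPR, selecting a descriptor set with good stability and predictive ability has become a bottleneck that urgently needs to be resolved. After a preliminary preselection of the 1664 descriptors of 63 organic compounds, the partial least squares (PLS) method was used for variable selection, yielding 42 important descriptors. Forty-three randomly selected organic compounds were used as a training set for permeability through polyethylene membrane, giving a model with excellent estimation ability and good stability (A=6, r²=0.9647, RMSE=0.213, q²=0.8364, RMSV=0.467). Prediction for the 20 organic compounds outside the model showed that it has good predictive ability (predictive correlation statistic of 0.9306, RMSP=0.326). The PLS variable selection method can quickly and effectively select the important descriptors closely related to activity, and thereby build QSAR models with good stability and predictive ability.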

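Keywords: variable selection, partial least squares (PLS), variable importance in projection (VIP), quantitative structure-activity relationship (QSAR), permeability through polyethylene membrane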

With the growing number of descriptors used in QSAR/QSPR, choosing a descriptor set that yields a stable and predictive model has become a bottleneck. In this work, the partial least squares (PLS) method was used to screen the important descriptors. Forty-two molecular descriptors were selected from an original pool of 1664 descriptors of 63 organic compounds. A PLS regression model relating the 42 descriptors to the logarithm of the permeability coefficients of the organic compounds through low-density polyethylene was developed and validated by the variable selection and modeling based on prediction (VSMP) technique. The PLS regression model is of good quality, with r²=0.9647 and q²=0.8364 for the training set of 43 samples and a predictive correlation statistic of 0.9306 for the test set of 20 compounds. Using the PLS variable selection procedure, it is possible to rapidly and effectively select the important variables closely related to the activity of compounds and to construct a model with good stability and predictability.

Key words: variable selection, partial least squares (PLS), variable importance in projection (VIP), QSAR, permeability coefficient
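
The abstract names PLS with variable importance in projection (VIP) as the screening tool but does not give the computation or the selection threshold. The following is a minimal sketch of how such a screen might look, assuming a single-response PLS fitted by NIPALS on autoscaled descriptors, the standard VIP definition, and the common VIP > 1 rule of thumb. The function name `pls1_vip`, the threshold, and the synthetic data are all illustrative assumptions, not the authors' procedure or data.

```python
import numpy as np


def pls1_vip(X, y, n_components):
    """Fit single-response PLS by NIPALS and return one VIP score per column of X.

    Hypothetical helper for illustration; not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Autoscale descriptors and center the response (assumed preprocessing).
    Xr = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yr = y - y.mean()
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_components))  # unit-length weight vectors
    ssy = np.zeros(n_components)          # y sum of squares explained per component
    for a in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w                        # scores
        tt = t @ t
        p = Xr.T @ t / tt                 # X loadings
        q = (yr @ t) / tt                 # y loading
        Xr = Xr - np.outer(t, p)          # deflate X
        yr = yr - q * t                   # deflate y
        W[:, a] = w
        ssy[a] = q * q * tt
    # VIP_j = sqrt(p * sum_a(ssy_a * w_ja^2) / sum_a(ssy_a))
    return np.sqrt(n_vars * ((W ** 2) @ ssy) / ssy.sum())


# Synthetic stand-in for a descriptor matrix: only columns 0, 1, 2 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

vip = pls1_vip(X, y, n_components=3)
selected = np.flatnonzero(vip > 1.0)      # assumed cutoff, not from the paper
print("selected descriptor columns:", selected)
```

Because each weight vector has unit length, the squared VIP scores sum to the number of descriptors, so a cutoff of 1 corresponds to "more important than average". A model built on the retained columns would still need separate training and external test statistics, as reported in the abstract.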