Acta Chimica Sinica ›› 2010, Vol. 68 ›› Issue (11): 1137-1142.

Research Article

Prediction of Cell-Penetrating Peptides Using Both Support Vector Machine and Linear Discriminant Analysis

Chen Guohua1,2, Xia Zhining*,1, Lu Yao3

  1. College of Bioengineering, Chongqing University, Chongqing 400030
  2. School of Chemistry and Pharmaceutical Engineering, Sichuan University of Science and Engineering, Zigong 643000
  3. College of Materials and Chemical Engineering, Sichuan University of Science and Engineering, Zigong 643000
  • Received: 2009-11-26 Revised: 2010-01-28 Published: 2010-02-11
  • Contact: Guohua Chen E-mail: chgh29@163.com
  • Supported by:

    National Natural Science Foundation of China (No. 20775096)

To identify new potential cell-penetrating peptides (CPPs), two methods, Fisher's linear discriminant analysis (LDA) and the support vector machine (SVM), were used to construct classifiers. A total of 123 known peptides from the literature were divided into two data sets: a training set of 25 CPPs and 16 non-CPPs, and a test set of 61 CPPs and 21 non-CPPs. Each peptide was encoded with the amino acid z-scales (principal properties), and both the original 72 auto cross covariance (ACC) variables and their principal components were used as classifier inputs. With LDA, the models gave high classification rates on the training set, both in fitting and in leave-one-out cross-validation, but the best overall classification rate on the test set was only 57.3%. With the nonlinear SVM models, the overall classification rates on the training set were 100% using the 72 ACCs and 85.4% using their principal components, and the corresponding leave-one-out cross-validation rates were 75.6% and 80.5%. The best SVM result on the test set was 74.4%, obtained with the 72 ACCs. These results indicate that the SVM is better at extracting subtle pattern changes among the original variables, and that its overall performance in identifying CPPs is superior to that of LDA.

Key words: cell-penetrating peptide, support vector machine, linear discriminant analysis, z-scale, QSAR
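
The ACC encoding mentioned in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: that the 72 variables arise from three z-scale components combined pairwise (3 x 3 terms) over lags 1 to 8, and that each term is normalized by the number of residue pairs at that lag, which is one common form of the transform. The function names `encode` and `acc_features` are hypothetical, and the descriptor table holds made-up placeholder numbers rather than the published z-scale values.

```python
# Placeholder 3-component descriptors per residue. These numbers are made up
# for illustration and are NOT the published z-scale values.
TOY_DESCRIPTORS = {
    "R": (1.0, -0.5, 2.0),
    "K": (0.5, 1.5, -1.0),
    "Q": (-1.0, 0.5, 0.0),
}


def encode(sequence, table):
    """Map each residue letter to its descriptor tuple."""
    return [table[residue] for residue in sequence]


def acc_features(desc, max_lag):
    """Auto and cross covariance terms for lags 1..max_lag.

    desc is a list of equal-length descriptor tuples, one per residue.
    For each lag and each ordered descriptor pair (j, k), the term is the
    mean over positions i of desc[i][j] * desc[i + lag][k].
    The result has len(desc[0]) ** 2 * max_lag entries, independent of
    the peptide length.
    """
    n = len(desc)
    d = len(desc[0])
    if not 1 <= max_lag < n:
        raise ValueError("max_lag must be at least 1 and less than the peptide length")
    features = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag  # number of residue pairs separated by this lag
        for j in range(d):
            for k in range(d):
                total = sum(desc[i][j] * desc[i + lag][k] for i in range(pairs))
                features.append(total / pairs)
    return features


# A nine-residue example sequence; the longest usable lag is 8.
features = acc_features(encode("RKKRRQRRR", TOY_DESCRIPTORS), max_lag=8)
print(len(features))  # 3 descriptors x 3 descriptors x 8 lags = 72
```

Because the output length depends only on the number of descriptors and lags, peptides of different lengths map to vectors of the same size, which is what allows them to be fed to LDA or SVM directly or after principal component analysis.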