PHP code vulnerability mining technology based on the improved ASTNN

doi:10.19665/j.issn1001-2400.2020.06.023

Abstract

Traditional static and dynamic PHP vulnerability mining techniques suffer from low efficiency, high false-positive rates, and overly narrow vulnerability matching rules that do not generalize. Existing neural network models that use token sequences or software metrics as features, in turn, do not capture code semantics well. To address both problems, a PHP vulnerability mining method based on the ASTNN deep neural network is proposed. First, rules for splitting the tree into expression subtrees (statement trees) are defined according to the concept of the expression subtree and the characteristics of the PHP abstract syntax tree. Second, the encoding layer of the original ASTNN deep neural network is improved according to the special structure of the PHP abstract syntax tree, which raises model efficiency while better preserving the semantic information contained in the abstract syntax tree. Experimental results show that the PHP vulnerability mining method based on the improved ASTNN achieves higher accuracy and recall than traditional vulnerability mining methods. The improved ASTNN deep neural network model is suitable for PHP vulnerability mining.

Key words: abstract syntax tree, deep learning, recurrent neural network, vulnerability mining

CLC number: TP311.5

HU Jianwei, ZHAO Wei, CUI Yanpeng, CUI Junjie. PHP code vulnerability mining technology based on the improved ASTNN[J]. Journal of Xidian University, 2020, 47(6): 164-173.
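
The subtree-splitting step described in the abstract can be illustrated with a small Python sketch that walks a toy syntax tree in preorder and cuts it into per-statement subtrees. The abstract does not state the paper's actual splitting rules, so the `Node` class, the node kind names, and the choice to emit a compound statement's condition as its own header subtree are all assumptions made for illustration, loosely following the general idea of statement-level splitting in ASTNN.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy AST node (hypothetical; not the output format of any real PHP parser)."""
    kind: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical node kinds; the paper's real rule set is not given in the abstract.
SIMPLE_STATEMENTS = {"Assign", "Echo", "Return"}
COMPOUND_STATEMENTS = {"If", "While"}


def split_subtrees(node: Node, out: Optional[List[Node]] = None) -> List[Node]:
    """Preorder walk that collects one subtree per statement."""
    if out is None:
        out = []
    if node.kind in SIMPLE_STATEMENTS:
        # A simple statement is kept whole as one subtree.
        out.append(node)
    elif node.kind in COMPOUND_STATEMENTS:
        # Assumption: children[0] is the condition, the rest are bodies.
        out.append(Node(node.kind, [node.children[0]]))
        for body in node.children[1:]:
            split_subtrees(body, out)
    else:
        # Containers such as Program or Block are only traversed.
        for child in node.children:
            split_subtrees(child, out)
    return out


def tokens(node: Node) -> List[str]:
    """Flatten a subtree into its preorder sequence of node kinds."""
    result = [node.kind]
    for child in node.children:
        result.extend(tokens(child))
    return result


# Toy tree for: $id = $_GET['id']; if ($id) { echo $id; }
tree = Node("Program", [
    Node("Assign", [Node("Var:id"),
                    Node("Dim", [Node("Var:_GET"), Node("Str:id")])]),
    Node("If", [Node("Var:id"),
                Node("Block", [Node("Echo", [Node("Var:id")])])]),
])

for subtree in split_subtrees(tree):
    print(tokens(subtree))
```

In an ASTNN-style model, each subtree would then be encoded into a vector and the resulting sequence passed to a recurrent network; that part is omitted here.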



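Group	Vulnerability	Positive samples	Negative samples	Total
1	SQL injection	912	1 824	2 736
2	Command execution	624	1 248	1 872
3	Cross-site scripting	4 352	5 728	10 080
4	XPath injection	1 264	2 528	3 792
5	Mixed vulnerabilities	2 048	4 096	6 144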
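
To help read the accuracy figures against the class balance of each group, the following Python sketch checks that the sample counts above add up and computes, for each group, the share of positive samples and the accuracy a trivial majority-class predictor would reach. The counts are copied from the table; the `summarize` helper and the majority-class baseline are illustrative assumptions and are not part of the paper's evaluation.

```python
# Sample counts copied from the dataset table: (positive, negative, total).
groups = {
    "SQL injection": (912, 1824, 2736),
    "Command execution": (624, 1248, 1872),
    "Cross-site scripting": (4352, 5728, 10080),
    "XPath injection": (1264, 2528, 3792),
    "Mixed vulnerabilities": (2048, 4096, 6144),
}


def summarize(pos: int, neg: int, total: int) -> dict:
    """Check the row total and derive two simple class-balance figures."""
    if pos + neg != total:
        raise ValueError("positive + negative does not equal total")
    return {
        "positive_share": pos / total,
        # Accuracy of always predicting the larger class (illustrative baseline).
        "majority_baseline": max(pos, neg) / total,
    }


summary = {name: summarize(*counts) for name, counts in groups.items()}

for name, stats in summary.items():
    print(f"{name}: positive share {stats['positive_share']:.1%}, "
          f"majority-class accuracy {stats['majority_baseline']:.1%}")
```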

Vulnerability	Algorithm	Accuracy/%	Recall/%	F1 score
SQL injection	SVM	77.74	48.11	75.81
	LSTM	60.25	60.89	80.23
	A_ASTNN	99.64	100.0	99.53
Command execution	SVM	88.83	84.64	83.46
	LSTM	89.23	82.16	86.15
	A_ASTNN	100.0	100.0	100.0
Cross-site scripting	SVM	60.02	52.98	53.41
	LSTM	70.36	67.35	69.23
	A_ASTNN	94.84	88.07	93.66
XPath injection	SVM	96.58	95.07	95.41
	LSTM	98.23	96.74	96.20
	A_ASTNN	99.20	97.89	98.93