Improved Random Forest Algorithm Based on Spark

Electronic Science and Technology, 2019, Vol. 32, Issue (4): 60-64. doi: 10.16180/j.cnki.issn1007-7820.2019.04.013

SUN Yue, YUAN Jian

School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Received: 2018-03-18  Online: 2019-04-15  Published: 2019-03-27
About the authors: SUN Yue (b. 1993), male, master's student; research interest: data mining. | YUAN Jian (b. 1971), female, Ph.D., associate professor; research interests: cloud computing security, big data association, and intelligent transportation.
Supported by:
National Natural Science Foundation of China (61775139)

Abstract:

Because the classical single-machine random forest algorithm cannot meet the demands of processing massive data, an improved random forest algorithm was designed and implemented using Spark's distributed in-memory computing technology. First, the importance of each feature was calculated, and the features were divided into common features, unique features, and non-important features. Then, features were randomly selected from each feature subspace in order and in proportion. Finally, experiments were run on a Spark cluster to analyze the classification performance, speedup ratio, and efficiency of the improved random forest algorithm. The results showed that the improved algorithm increases the efficiency of random forest construction, can be used to solve massive data mining problems, and has good scalability.

Key words: random forest, Spark, feature space, ReliefF algorithm, high-dimensional data, classification model
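
The feature-selection step described in the abstract (draw features from each subspace in order and in proportion) can be illustrated with a minimal single-machine sketch. The abstract does not say how the three subspaces are formed from the importance scores or which proportions are used, so the partition, the proportions, and the names `sample_features`, `subspaces`, and `proportions` below are all assumptions made for illustration. This is not the authors' implementation, and it leaves out the Spark parallelization entirely.

```python
import random


def sample_features(subspaces, proportions, k, rng):
    """Pick k feature indices, visiting the subspaces in their given order.

    subspaces:   dict mapping subspace name -> list of feature indices
                 (dict order is the sampling order)
    proportions: dict mapping subspace name -> share of k drawn from it
                 (hypothetical values; the abstract does not state them)
    """
    chosen = []
    for name, features in subspaces.items():
        # Quota for this subspace, capped by how many features it holds.
        quota = min(len(features), round(k * proportions[name]))
        chosen.extend(rng.sample(features, quota))

    # If rounding or small subspaces left a shortfall, top up at random
    # from the features that have not been chosen yet.
    remaining = [f for fs in subspaces.values() for f in fs if f not in chosen]
    shortfall = min(k - len(chosen), len(remaining))
    if shortfall > 0:
        chosen.extend(rng.sample(remaining, shortfall))
    return chosen[:k]


# Assumed partition of 20 features; in the paper it would come from the
# computed feature importance rather than being hard-coded.
subspaces = {
    "common": list(range(0, 5)),
    "unique": list(range(5, 10)),
    "non_important": list(range(10, 20)),
}
proportions = {"common": 0.5, "unique": 0.3, "non_important": 0.2}

picked = sample_features(subspaces, proportions, k=6, rng=random.Random(0))
print(picked)
```

With these assumed proportions and `k=6`, each tree's candidate set holds three common, two unique, and one non-important feature, so the more informative subspaces are always represented while some randomness is kept.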

CLC number:

TP311.13

SUN Yue, YUAN Jian. Improved Random Forest Algorithm Based on Spark[J]. Electronic Science and Technology, 2019, 32(4): 60-64.

Algorithm	Average accuracy/%	Average CPU time/ms
RF	86.885	42.159
IRFA	86.282	27.897
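
The trade-off in the table above can be made concrete with a little arithmetic. The sketch below only reuses the four reported numbers; the variable names are illustrative. The ratio it computes is a plain quotient of the two reported CPU times, not the cluster speedup ratio mentioned in the abstract.

```python
# Values copied from the table: (average accuracy in %, average CPU time in ms).
results = {"RF": (86.885, 42.159), "IRFA": (86.282, 27.897)}

rf_acc, rf_ms = results["RF"]
irfa_acc, irfa_ms = results["IRFA"]

accuracy_gap = rf_acc - irfa_acc            # in percentage points
time_reduction = (rf_ms - irfa_ms) / rf_ms  # fraction of RF's CPU time saved
time_ratio = rf_ms / irfa_ms                # RF time relative to IRFA time

print(f"accuracy gap: {accuracy_gap:.3f} percentage points")
print(f"CPU time reduction: {time_reduction:.1%}")
print(f"RF / IRFA CPU time ratio: {time_ratio:.2f}")
```

On these figures, IRFA gives up about 0.6 percentage points of average accuracy while using roughly a third less average CPU time than RF.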

