K-Anonymity Data Publishing Algorithm Based on Hybrid Clustering

Electronic Science and Technology, 2022, Vol. 35, Issue 12: 78-83. doi: 10.16180/j.cnki.issn1007-7820.2022.12.011

FANG Kai¹, SHI Zhicai¹,², JIA Yuanyuan¹

1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2. Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China

Received: 2021-05-19  Online: 2022-12-15  Published: 2022-12-13
About the authors: FANG Kai (born 1995), male, master's student. Research interests: network security, privacy protection. | SHI Zhicai (born 1964), male, Ph.D., professor. Research interests: computer networks, privacy protection, Internet of Things and embedded systems. | JIA Yuanyuan (born 1995), female, master's student. Research interests: privacy protection, network information security.
Supported by: National Natural Science Foundation of China (61802252)

Abstract

To reduce information loss in data publishing, a k-anonymity data publishing algorithm based on hybrid clustering is proposed, addressing the low data availability of existing clustering-based anonymization schemes. Unlike traditional single-clustering methods, the proposed algorithm combines density clustering with partition clustering: it selects the initial cluster centers according to the density characteristics of the data set and then iterates with partition clustering to reach the optimal clustering. The method also removes some of the outlier noise in the data set, reducing its effect on the clustering result. For records with mixed attributes, a distance measure combining k-means and k-modes is adopted, and a bucket generalization algorithm is introduced to reduce the information loss caused by generalization. Experimental results show that, compared with existing methods, the proposed algorithm effectively reduces the information loss of data anonymization and improves the quality of the published data.

Key words: privacy preserving, data publishing, k-anonymity, clustering, bucket generalization algorithm, mixed attributes, network security, information loss
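
The abstract mentions a distance measure that combines k-means and k-modes for mixed records but does not give its formula here. As a rough illustration only, the sketch below uses a common k-prototypes-style form: a range-normalized squared Euclidean term for numeric fields plus a mismatch count for categorical fields. The function name `mixed_distance`, the weight `gamma`, the normalization by `ranges`, and the sample records are all assumptions for illustration, not the authors' definition.

```python
from typing import Dict, Sequence


def mixed_distance(
    a: Sequence,
    b: Sequence,
    numeric_idx: Sequence[int],
    categorical_idx: Sequence[int],
    ranges: Dict[int, float],
    gamma: float = 1.0,
) -> float:
    """Distance between two records with numeric and categorical fields.

    Numeric part: squared Euclidean distance on range-normalized values
    (the k-means component). Categorical part: number of mismatched fields
    (the k-modes component). `gamma` is an assumed weight balancing the two.
    """
    numeric_part = sum(((a[i] - b[i]) / ranges[i]) ** 2 for i in numeric_idx)
    categorical_part = sum(1 for i in categorical_idx if a[i] != b[i])
    return numeric_part + gamma * categorical_part


# Illustrative records: (age, work_class, education_num); values are made up.
r1 = (30, "private", 10)
r2 = (40, "private", 12)
r3 = (30, "public", 10)
ranges = {0: 20.0, 2: 4.0}  # assumed value ranges used for normalization

d12 = mixed_distance(r1, r2, [0, 2], [1], ranges)  # numeric fields differ
d13 = mixed_distance(r1, r3, [0, 2], [1], ranges)  # categorical field differs
```

Normalizing numeric fields by their range keeps a wide attribute such as age from dominating the categorical mismatches; other weightings are equally plausible.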

CLC Number: TP309

FANG Kai, SHI Zhicai, JIA Yuanyuan. K-Anonymity Data Publishing Algorithm Based on Hybrid Clustering[J]. Electronic Science and Technology, 2022, 35(12): 78-83.

Figures/Tables (8): Figures 1-4, Table 1, Figures 5-7


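No.	Attribute name	Attribute type	Number of distinct values
0	age	Numeric	73
1	work_class	Categorical	7
2	education_num	Numeric	16
3	marital_status	Categorical	7
4	occupation	Sensitive attribute	14
5	race	Categorical	5
6	sex	Categorical	2
7	native_country	Categorical	41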
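
To make the link between generalization and information loss concrete, the sketch below generalizes one cluster of records over a few attributes from the table above and scores the result. It is not the paper's bucket generalization algorithm, whose details are not given here. It uses plain range generalization for numeric attributes and value-set generalization for categorical ones, with a simple normalized loss per attribute. The function names, the sample records, and the age domain width of 72 (which assumes the 73 distinct ages are consecutive integers) are assumptions; the categorical domain sizes for sex (2) and race (5) come from the table.

```python
from typing import Dict, List, Sequence


def generalize_cluster(
    records: List[dict],
    numeric_cols: Sequence[str],
    categorical_cols: Sequence[str],
) -> dict:
    """Generalize a cluster: numeric -> (min, max), categorical -> value set."""
    generalized = {}
    for col in numeric_cols:
        values = [r[col] for r in records]
        generalized[col] = (min(values), max(values))
    for col in categorical_cols:
        generalized[col] = frozenset(r[col] for r in records)
    return generalized


def information_loss(
    generalized: dict,
    numeric_widths: Dict[str, float],
    categorical_sizes: Dict[str, int],
) -> float:
    """Mean per-attribute loss in [0, 1]; 0 means no generalization."""
    losses = []
    for col, width in numeric_widths.items():
        low, high = generalized[col]
        losses.append((high - low) / width)
    for col, size in categorical_sizes.items():
        losses.append((len(generalized[col]) - 1) / (size - 1))
    return sum(losses) / len(losses)


# Illustrative cluster; attribute values are made up.
cluster = [
    {"age": 25, "sex": "M", "race": "r1"},
    {"age": 31, "sex": "F", "race": "r1"},
    {"age": 43, "sex": "M", "race": "r1"},
]
k = 3
is_k_anonymous = len(cluster) >= k  # every cluster needs at least k records

gen = generalize_cluster(cluster, ["age"], ["sex", "race"])
loss = information_loss(
    gen,
    numeric_widths={"age": 72.0},            # assumed domain width
    categorical_sizes={"sex": 2, "race": 5}  # domain sizes from the table
)
```

Tighter clusters give narrower ranges and smaller value sets, so the score drops as clustering quality improves.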
