J4, 2014, Vol. 41, Issue (3): 123-130. doi: 10.3969/j.issn.1001-2400.2014.03.018

• Research Article •

New nearest neighbor affinity similarity function based on separation and compactness between samples

LI Juan1,2; WANG Yuping1

  1. School of Computer Science and Technology, Xidian Univ., Xi'an 710071, China;
  2. School of Distance Education, Shaanxi Normal Univ., Xi'an 710062, China
  • Received: 2013-03-13 Online: 2014-06-20 Published: 2014-07-10
  • Contact: LI Juan
  • About the author: LI Juan (born 1979), female, lecturer, Ph.D. candidate at Xidian University, E-mail: ally_2004@126.com.
  • Funding:

    National Natural Science Foundation of China (61272119)

Abstract:

Traditional distance and similarity measures do not take into account the influence of an individual sample on the whole sample set. To address this, a similarity improvement strategy for the k-nearest neighbor (KNN) algorithm is proposed. First, a new affinity distance function is introduced, which focuses on the separation and compactness between each individual sample and the whole sample set. Second, a new affinity similarity function built on this affinity distance function is proposed and used as the similarity measure in KNN. Third, the proposed affinity similarity function is compared with classical distance and similarity functions through theoretical analysis and experiments on eighteen numerical UCI (University of California, Irvine) datasets under 5-fold cross-validation. The classification results indicate that the proposed affinity similarity function is an effective similarity strategy for classification and that, combined with efficient indexing algorithms, it can reduce classification time on large-scale datasets.

Key words: machine learning, nearest neighbors, affinity similarity, separation, compactness
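
The evaluation setup described in the abstract (KNN with a replaceable similarity measure, scored by 5-fold cross-validation) can be sketched roughly as follows. This is only an illustrative sketch: the abstract does not give the formula of the affinity distance or affinity similarity function, so `placeholder_similarity` below is an assumed stand-in based on plain Euclidean distance, and all function names and the fold assignment scheme are hypothetical.

```python
import math
from collections import Counter


def placeholder_similarity(a, b):
    """Assumed stand-in similarity: 1 / (1 + Euclidean distance).

    This is NOT the affinity similarity function of the paper, whose
    definition is not stated in the abstract.
    """
    return 1.0 / (1.0 + math.dist(a, b))


def knn_predict(train_X, train_y, query, k, similarity=placeholder_similarity):
    """Majority vote among the k training samples most similar to the query."""
    ranked = sorted(
        range(len(train_X)),
        key=lambda i: similarity(train_X[i], query),
        reverse=True,  # larger similarity means closer neighbor
    )
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def k_fold_accuracy(X, y, k_neighbors=3, folds=5, similarity=placeholder_similarity):
    """Accuracy over a simple interleaved k-fold split (hypothetical scheme)."""
    n = len(X)
    correct = 0
    for f in range(folds):
        # Fold f holds samples f, f + folds, f + 2 * folds, ...
        test_idx = set(range(f, n, folds))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            pred = knn_predict(train_X, train_y, X[i], k_neighbors, similarity)
            if pred == y[i]:
                correct += 1
    return correct / n
```

Any other measure, such as an implementation of the proposed affinity similarity, could be passed through the `similarity` argument and compared against the placeholder under the same folds.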
