What Clustering Algorithms Are Used in Data Mining? (Clustering AAAI Papers with Machine Learning)


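Task description

Countless academic conferences, large and small, are held around the world every year, and they publish a very large number of papers. At some of the major conferences in computer science, a single edition can publish several hundred papers spanning many research directions. Clustering papers by topic and content helps people find and obtain the papers they need efficiently. The data for this case study comes from about 400 papers published at AAAI 2014, made publicly available by UCI, and includes each paper's title, authors, keywords, and abstract. The goal is to construct sensible feature vectors that represent these papers from this information, and to implement or call a clustering algorithm to cluster them. Finally, you can inspect the clustering result to see what kind of papers fall into each cluster and whether any themes emerge.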

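Basic requirements:

Convert the text to vectors and implement or call an unsupervised clustering algorithm to cluster the papers, for example into 10 clusters (existing toolkits such as sklearn may be used);
Inspect the papers in each cluster and tune the algorithm so that the results are reasonably sensible;
Unsupervised clustering has no labels, so its quality is hard to evaluate. There is therefore no hard metric, and getting the pipeline to run is enough. The main aim is to get a feel for clustering algorithms, and the task is fairly simple.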

Extended requirements:

Reduce the dimensionality of the text vectors and visualize the clustering result as a scatter plot.

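Note: group and topic cannot be fully treated as labels, because

When submitting, some authors may have chosen one group/topic even though the paper is also related, or even more closely related, to another group/topic;
A paper may have several groups and topics, so using them as labels would put some papers in more than one class; that kind of clustering is not considered here;
group and topic take many distinct values, whereas clustering usually calls for a specified number of clusters, for example 5, 10, or 20;
Interested students may think about how to use the group and topic information to quantitatively evaluate the unsupervised clustering result; this is not required.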
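One possible way to approach the optional evaluation idea above is to treat a single group per paper as a rough pseudo-label and compare it with the cluster assignments using external metrics. The sketch below uses made-up group names and cluster labels (both hypothetical, not taken from the dataset), and it assumes each paper has been reduced to one group, which sidesteps the multi-group issue noted above.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical pseudo-labels: one group name per paper (e.g. the first group listed).
groups = ["ML", "ML", "Vision", "Vision", "Planning", "Planning"]
# Hypothetical cluster assignments produced by some clustering algorithm.
clusters = [0, 0, 1, 1, 2, 1]

# Both scores are invariant to how the cluster ids are numbered.
ari = adjusted_rand_score(groups, clusters)
nmi = normalized_mutual_info_score(groups, clusters)
print(f"ARI = {ari:.3f}, NMI = {nmi:.3f}")
```

Both scores reach 1 only when the clusters reproduce the groups exactly, so they should be read as a rough agreement measure rather than a ground-truth accuracy.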

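Hints:

Dimensionality reduction of high-dimensional vectors aims to remove highly correlated feature dimensions, keep the most useful information, and represent high-dimensional data with lower-dimensional vectors. Common methods include PCA and t-SNE;
Dimensionality reduction and clustering are two different things. Clustering can in fact be carried out on either the high-dimensional vectors before reduction or the low-dimensional vectors after it, and the results may differ completely;
When clustering is done on the high-dimensional vectors, it is normal for points of the same cluster to end up apart after reduction and visualization. They may be close together in the high-dimensional space, and some information is lost in the reduction.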
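To illustrate the point that clustering before and after dimensionality reduction can give different results, here is a minimal sketch on synthetic data. The data from `make_blobs`, the choice of 5 clusters, and the variable names are all assumptions made for illustration; this is not the AAAI dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for high-dimensional document vectors.
X, _ = make_blobs(n_samples=300, n_features=50, centers=5, random_state=0)

# Cluster in the original 50-dimensional space.
labels_high = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Reduce to 2 dimensions first, then cluster in the reduced space.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)
labels_low = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_2d)

# Agreement between the two partitions (1.0 means identical up to relabeling).
agreement = adjusted_rand_score(labels_high, labels_low)
print("agreement between the two clusterings:", round(agreement, 3))
```

On sparse text vectors the agreement may be much lower than on clean synthetic blobs, which is why a 2-D scatter plot of a high-dimensional clustering can look mixed.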

import pandas as pd
from collections import Counter

article = pd.read_csv(r"./data/[UCI] AAAI-14 Accepted Papers - Papers.csv")
article

Cluster the papers according to the content of their abstracts

article.abstract[0]

Out: 'Transfer learning considers related but distinct tasks defined on heterogenous domains and tries to transfer knowledge between these tasks to improve generalization performance. It is particularly useful when we do not have sufficient amount of labeled training data in some tasks, which may be very costly, laborious, or even infeasible to obtain. Instead, learning the tasks jointly enables us to effectively increase the amount of labeled training data. In this paper, we formulate a kernelized Bayesian transfer learning framework that is a principled combination of kernel-based dimensionality reduction models with task-specific projection matrices to find a shared subspace and a coupled classification model for all of the tasks in this subspace. Our two main contributions are: (i) two novel probabilistic models for binary and multiclass classification, and (ii) very efficient variational approximation procedures for these models. We illustrate the generalization performance of our algorithms on two different applications. In computer vision experiments, our method outperforms the state-of-the-art algorithms on nine out of 12 benchmark supervised domain adaptation experiments defined on two object recognition data sets. In cancer biology experiments, we use our algorithm to predict mutation status of important cancer genes from gene expression profiles using two distinct cancer populations, namely, patient-derived primary tumor data and in-vitro-derived cancer cell line data. We show that we can increase our generalization performance on primary tumors using cell lines as an auxiliary data source.'

from sklearn.feature_extraction.text import CountVectorizer

# Create the vectorizer
count_vect = CountVectorizer()
# Fit the model on the data
X_train_counts = count_vect.fit_transform(list(article.abstract))
print('Vocabulary:\n', count_vect.vocabulary_)  # the vocabulary; its size is the dimensionality of the word vectors
print('Word-vector matrix:\n', X_train_counts.toarray())  # inspect the vectors after fit_transform

Word-vector matrix:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
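Because the matrix printed for the real abstracts is almost entirely zeros, a tiny example may make the bag-of-words representation easier to see. The two toy sentences below are made up for illustration; the sketch shows how `vocabulary_` maps each word to a column and how each row counts the words of one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "documents" (hypothetical, not from the dataset).
docs = ["transfer learning", "learning to plan"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index (columns are in alphabetical order).
print(vectorizer.vocabulary_)
# Each row is one document; each column counts one vocabulary word.
print(counts.toarray())
```

With four distinct words the vectors have four dimensions; for the roughly 400 abstracts the vocabulary is far larger, which is why most entries of each row are zero.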

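Building the bag of words from raw occurrence counts causes a problem: words occur more often in a long document than in a short one, even though the two documents may be about the same topic.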

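So we use tf (term frequency): the number of times a word occurs divided by the total number of words in the document. It replaces the raw count when building the bag-of-words dictionary.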
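The effect of dividing by document length can be checked with a few lines of plain Python. This is a minimal sketch: the helper `term_frequencies` and the two toy documents are hypothetical, and tokenization is a simple whitespace split rather than what sklearn does.

```python
from collections import Counter

def term_frequencies(text):
    """Return {word: count / total number of words} for one document."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

short_doc = "clustering groups similar papers"
long_doc = " ".join([short_doc] * 3)  # same content repeated three times

tf_short = term_frequencies(short_doc)
tf_long = term_frequencies(long_doc)
print(tf_short["clustering"], tf_long["clustering"])
```

The raw count of "clustering" is three times larger in the long document, but its term frequency is the same in both, which is the normalization the text describes.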

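There is one more problem: if a word appears in many documents, it does very little to distinguish document categories. In other words, it provides very little information for identifying a document.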

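This leads to tf-idf (Term Frequency times Inverse Document Frequency): each word's term frequency is additionally multiplied by a weight, which shrinks as the word appears in more documents, when building the word features.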
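As a rough check of what the weighting does in practice, the sketch below recomputes sklearn's default tf-idf by hand on three made-up documents. It relies on the defaults of `TfidfTransformer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); note that with these defaults sklearn starts from raw counts and normalizes each row's length afterwards, instead of dividing by the document's word count.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy documents (hypothetical). "learning" appears in two of the three.
docs = ["learning transfer learning", "learning planning", "planning search"]

counts_sparse = CountVectorizer().fit_transform(docs)
counts = counts_sparse.toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (sklearn default)

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

from_sklearn = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(manual, from_sklearn))
```

Words that occur in fewer documents get a larger idf, so they contribute more to the document vector than words that occur almost everywhere.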

from sklearn.feature_extraction.text import TfidfTransformer

tfidfer = TfidfTransformer()
tfidf = tfidfer.fit_transform(X_train_counts)
print('tfidf vector matrix:\n', tfidf.toarray())  # inspect the matrix after fit_transform

tfidf vector matrix:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Clustering

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10)
kmeans.fit(tfidf)
y_kmeans = kmeans.predict(tfidf)
print("Clustering result for k = 10:", y_kmeans)

Clustering result for k = 10: [2 2 1 7 1 5 1 0 2 5 8 1 2 1 1 8 1 1 2 5 1 1 1 2 2 7 1 1 7 1 1 8 0 1 4 1 1 4 0 7 5 8 7 2 7 1 1 1 0 1 7 1 1 1 1 1 2 1 1 0 1 0 7 1 0 9 1 6 2 0 6 8 1 1 7 2 1 4 1 1 1 2 1 5 1 1 1 2 1 0 5 4 1 1 1 1 2 1 1 1 0 7 2 7 0 1 1 1 2 0 1 1 1 7 1 8 7 1 1 1 1 1 1 8 1 2 0 7 7 1 1 5 1 1 1 6 7 5 2 5 1 0 0 0 2 1 0 0 1 4 1 0 8 1 1 1 7 1 2 1 1 7 1 2 1 7 1 1 1 1 0 1 1 2 2 1 1 1 1 9 1 1 2 0 5 3 1 8 1 1 1 6 8 1 2 1 2 1 9 1 1 8 1 1 2 1 1 0 2 8 1 5 1 1 1 1 1 1 1 1 1 2 1 1 6 1 1 1 1 1 1 1 0 1 0 1 8 8 1 8 1 3 1 2 7 1 1 8 1 5 1 5 1 0 1 1 2 1 0 2 2 7 7 0 1 4 8 5 1 7 6 2 1 2 1 7 2 5 2 7 1 1 0 8 1 5 1 1 1 8 2 1 8 2 0 1 2 1 2 2 1 1 1 1 1 0 7 6 1 1 1 1 1 7 8 1 0 1 2 8 8 1 1 8 0 2 1 0 1 2 1 2 0 9 9 2 0 7 1 1 1 1 7 1 1 8 1 4 2 0 2 1 2 2 8 1 1 1 1 1 0 1 0 1 0 0 6 0 0 1 4 0 2 1 2 1 1 1 0 1 1 1 1 1 1 5 3 8 0 8 4 1 0 7 7 1 3 2]

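Clustering evaluation metric: the silhouette coefficient

The silhouette coefficient of a point is defined as:

s = (disMean_out − disMean_in) / max(disMean_out, disMean_in)

where disMean_in is the average distance between the point and the other points in its own cluster, and disMean_out is the average distance between the point and the points outside its cluster (sklearn uses the mean distance to the points of the nearest other cluster). The value lies in the range [−1, 1]; the closer it is to 1, the better the clustering. In sklearn, the function silhouette_score() computes the mean silhouette coefficient over all points, while silhouette_samples() returns the silhouette coefficient of each point.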
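The definition can be checked numerically on a handful of points. The sketch below is an illustration on made-up 2-D data with hypothetical labels: it computes the coefficient by hand, taking disMean_out as the mean distance to the nearest other cluster (the convention sklearn follows), and compares the result with `silhouette_samples` and `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Six toy points in three clusters (hypothetical data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0],
              [5.0, 6.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

def silhouette_of_point(i):
    dists = np.linalg.norm(X - X[i], axis=1)   # Euclidean distance to every point
    same = labels == labels[i]
    same[i] = False                            # exclude the point itself
    a = dists[same].mean()                     # disMean_in
    # disMean_out: mean distance to the nearest other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of_point(i) for i in range(len(X))])
print(np.allclose(manual, silhouette_samples(X, labels)))
print(np.isclose(manual.mean(), silhouette_score(X, labels)))
```

A score close to 0, like the one obtained for the TF-IDF vectors, suggests that points are about as close to a neighboring cluster as to their own.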

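# Evaluation metric: the silhouette coefficient. silhouette_score gives the mean over all points; silhouette_samples returns the coefficient of each point
from sklearn.metrics import silhouette_score, silhouette_samples

s = silhouette_score(tfidf, y_kmeans)
s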

0.0021600981161913014

def metrics_n(data, n):
    '''Take the data to cluster and the number of clusters n; return the cluster labels and the silhouette coefficient.'''
    kmeans = KMeans(n_clusters=n, random_state=0)
    kmeans.fit(data)
    y_kmeans = kmeans.predict(data)
    s = silhouette_score(data, y_kmeans)
    return y_kmeans, s

score = []
result = []
for i in range(2, 20):
    tmp_result = metrics_n(tfidf, i)
    score.append(tmp_result[1])
    result.append(tmp_result[0])

import matplotlib.pyplot as plt
import matplotlib.pylab as pylab

plt.style.use("ggplot")
params = {'legend.fontsize': 25,
          'figure.figsize': (15, 8),
          'axes.labelsize': 25,
          'axes.titlesize': 25,
          'xtick.labelsize': 25,
          'ytick.labelsize': 25}
pylab.rcParams.update(params)  # basic matplotlib settings
plt.plot(range(2, 20), score)
plt.xlabel("$n$")
plt.ylabel("$s$");

As the curve shows, the result is relatively good at n = 7
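Instead of reading the best n off the plot, it can also be picked programmatically. The sketch below is an illustration on synthetic data with four well-separated blobs (the centers, the range of k, and the variable names are assumptions); it keeps the silhouette score for each k in a dictionary and takes the k with the highest score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters (hypothetical data).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette coefficient.
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

On real text data the curve is usually much flatter than on such clean blobs, so the maximum is best treated as a suggestion to be checked against the cluster contents.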

For dimensionality reduction we use PCA

len(result)

tfidf.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

from sklearn.decomposition import PCA

clf2 = PCA(2)
clf2.fit(tfidf.toarray())
result_MDS = clf2.fit_transform(tfidf.toarray())
result_MDS

result_MDS = pd.DataFrame(result_MDS, columns=['x', 'y'])
result_MDS['label'] = result[5]  # result[0] corresponds to n = 2, so result[5] holds the labels for n = 7
result_MDS

import seaborn as sns

sns.scatterplot(x="x", y="y", hue="label", data=result_MDS)
plt.title("n = 7")

Text(0.5, 1.0, 'n = 7')
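When reading such a 2-D scatter plot, it can help to know how much of the original variance the two principal components actually keep. The sketch below shows one way to check this with `explained_variance_ratio_`; the random matrix is only a stand-in for `tfidf.toarray()`, so the printed number says nothing about the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the dense TF-IDF matrix (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance kept by the two plotted components.
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained by 2 components: {retained:.1%}")
```

If the retained share is small, clusters that are well separated in the full space may still overlap in the plot, in line with the hint about information loss.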

plt.figure(figsize=(36, 36))
n_col = 3
n_line = 6
position = 1
n = 2
# One subplot per value of n from 2 to 19 (6 rows x 3 columns)
for i in range(n_line):
    for j in range(n_col):
        plt.subplot(n_line, n_col, position)
        result_MDS['label'] = result[position - 1]
        sns.scatterplot(x="x", y="y", hue="label", data=result_MDS)
        plt.title("k-means n = " + str(n))
        n += 1
        position += 1
plt.tight_layout()

Analyzing the results for n = 7

import numpy as np
from wordcloud import WordCloud

# Concatenate all abstracts assigned to cluster 5
" ".join(list(article.abstract[np.array(result[5]) == 5]))

# Word cloud of the abstracts in cluster 0
wordcloud = WordCloud(background_color='white', scale=1.5).generate(" ".join(list(article.abstract[np.array(result[5]) == 0])))
plt.imshow(wordcloud)
plt.axis('off')

(-0.5, 599.5, 299.5, -0.5)

Main content of cluster 0

n_line = 3
n_col = 3
position = 1
plt.figure(figsize=(16, 16))
# One word cloud per cluster for n = 7
for i in range(7):
    plt.subplot(n_col, n_line, position)
    position += 1
    wordcloud = WordCloud(background_color='white', scale=1.5).generate(" ".join(list(article.abstract[np.array(result[5]) == i])))
    plt.title("label = " + str(i))
    plt.imshow(wordcloud)
    plt.axis('off')
plt.tight_layout()
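As a text-only complement to the word clouds, the highest-weighted terms of each k-means centroid can be listed. This is a minimal sketch on four made-up sentences, not the original analysis: it uses `TfidfVectorizer` with English stop words and 2 clusters, all of which are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy abstracts (hypothetical).
docs = [
    "neural networks learn image features",
    "deep neural networks classify images",
    "planning agents search state spaces",
    "heuristic search guides planning agents",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Terms ordered by column index, so terms[j] names column j of X.
terms = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

top_terms = {}
for label, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:3]        # indices of the 3 largest centroid weights
    top_terms[label] = terms[top].tolist()
    print(label, top_terms[label])
```

Unlike a word cloud built from raw text, the centroid weights already discount words that are common across all documents, so the listed terms tend to be more specific to each cluster.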