拉普拉斯分数特征选择算法代码解析

这段代码实现了一种特征选择方法-拉普拉斯分数(Laplacian Score)。

该方法用于评估特征对数据分类或聚类的重要性。代码包含以下几个函数:

construct_W: 用于构造样本之间的权重矩阵 W。该函数使用 K 近邻法 (K-nearest neighbor，KNN) 计算样本之间的相似度，并使用高斯核函数对相似度进行转换，得到权重矩阵 W。最后，为了使得权重矩阵 W 成为对称矩阵，使用了一种方法将 W 转换成对称矩阵。
lap_score: 用于计算每个特征的拉普拉斯分数。该函数首先判断是否传入了权重矩阵 W。如果没有，则调用 construct_W 函数构造权重矩阵 W。然后，计算对角矩阵 D 和图拉普拉斯矩阵 L，最后计算每个特征的拉普拉斯分数。
feature_ranking: 用于对特征按照拉普拉斯分数进行排序，返回排名。
LaplacianScore: 用于计算每个特征的拉普拉斯分数，返回一个一维数组，其中每个元素表示每个特征的拉普拉斯分数。该函数的实现与 lap_score 函数略有不同，它使用了一个不同的公式来计算拉普拉斯分数。

下面是代码的详细解析:

def construct_W(X, **kwargs):
	n_samples, n_features = numpy.shape(X)
	k=kwargs['neighbour_size']
	t = kwargs['t_param']
	S=kng(X, k+1, mode='distance',metric='euclidean') #sqecludian distance works only with mode=connectivity  results were absurd
	S = (-1*(S*S))/(2*t*t)
	S=S.tocsc()
	S=expm(S)
	S=S.tocsr()
	
	
	#[1]  M. Belkin and P. Niyogi, 'Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering', Advances in Neural Information Processing Systems,
	#Vol. 14, 2001. Following the paper to make the weights matrix symmetrix we use this method
	 
	bigger = numpy.transpose(S) > S
	S = S - S.multiply(bigger) + numpy.transpose(S).multiply(bigger)
	return S

	

def lap_score(X, **kwargs):
	
	
    # if 'W' is not specified, use the default W
    if 'W' not in kwargs.keys():
        if 't_param' not in kwargs.keys():
            t_param=1
        else:
            t = kwargs['t_param']
		
        if 'neighbour_size' not in kwargs.keys():
            neighbour_size=5
        else:
            n=kwargs['neighbour_size']
		
        W = construct_W(X,t_param=t,neighbour_size=n)
		
    # construct the affinity matrix W
    else:
        W = kwargs['W']
    
    # build the diagonal D matrix from affinity matrix W
    D = numpy.array(W.sum(axis=1))
    
    
    L = W
    
    
    tmp = numpy.dot(numpy.transpose(D),X)
    D = diags(numpy.transpose(D), [0])
    Xt = numpy.transpose(X)
    t1 = numpy.transpose(numpy.dot(Xt, D.todense()))
    t2 = numpy.transpose(numpy.dot(Xt, L.todense()))
    # compute the numerator of Lr
    tmp=numpy.multiply(tmp, tmp)/D.sum()
    D_prime = numpy.sum(numpy.multiply(t1, X), 0) -tmp 
    # compute the denominator of Lr
    L_prime = numpy.sum(numpy.multiply(t2, X), 0) -tmp 
    # avoid the denominator of Lr to be 0
    D_prime[D_prime < 1e-12] = 10000
    # compute laplacian score for all features
    score = 1 - numpy.array(numpy.multiply(L_prime, 1/D_prime))[0, :]
    return numpy.transpose(score)

'''
    Rank features in ascending order according to their laplacian scores, the smaller the laplacian score is, the more
    important the feature is
'''
def feature_ranking(score):
    idx = numpy.argsort(score, 0)
    return idx+1

def LaplacianScore(X, **kwargs):
	
	if 'W' not in kwargs.keys():
		
		if 't_param' not in kwargs.keys():
			t_param=1
		else:
			t = kwargs['t_param']
		
		if 'neighbour_size' not in kwargs.keys():
			neighbour_size=5
		else:
			n=kwargs['neighbour_size']
			
		W = construct_W(X,t_param=t,neighbour_size=n)
		n_samples, n_features = numpy.shape(X)
    # construct the affinity matrix W
	else:
		W = kwargs['W']
    
    #construct the diagonal matrix
	D=numpy.array(W.sum(axis=1))
	D = diags(numpy.transpose(D), [0])
	#construct graph Laplacian L
	L=D-W.toarray()
	
	#construct 1= [1,···,1]' 
	I=numpy.ones((n_samples,n_features))
	
	#construct fr' => fr= [fr1,...,frn]'
	Xt = numpy.transpose(X)
	
	#construct fr^=fr-(frt D I/It D I)I
	t=numpy.matmul(numpy.matmul(Xt,D.toarray()),I)/numpy.matmul(numpy.matmul(numpy.transpose(I),D.toarray()),I)
	t=t[:,0]
	t=numpy.tile(t,(n_samples,1))
	fr=X-t
	
	#Compute Laplacian Score
	fr_t=numpy.transpose(fr)
	Lr=numpy.matmul(numpy.matmul(fr_t,L),fr)/numpy.matmul(numpy.dot(fr_t,D.toarray()),fr)
	
	return numpy.diag(Lr)```

代码中使用了 numpy 库进行矩阵运算，并使用了 scipy 库中的 kng 函数进行 K 近邻计算。

代码中的 Laplacian Score 的计算方法来自以下论文:

* [1] M. Belkin and P. Niyogi, 'Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering', Advances in Neural Information Processing Systems, Vol. 14, 2001.

使用该方法进行特征选择时，通常将特征按照拉普拉斯分数进行排序，选择分数较低的特征作为重要特征。

代码中使用了几个参数来控制算法的行为，例如:

* **neighbour_size**: K 近邻法中的 K 值，表示每个样本需要考虑多少个最近邻样本。

* **t_param**: 高斯核函数中的参数，用于控制相似度的衰减速度。

用户可以通过修改这些参数来调整算法的行为，以获得最佳的特征选择结果。

需要注意的是，拉普拉斯分数只是一种特征选择方法，它可能不适用于所有类型的特征。用户需要根据实际情况选择合适的特征选择方法。

此外，该代码中还包含了一些注释，可以帮助用户理解代码的功能。用户可以根据自己的需求修改代码，以适应不同的应用场景。