Cosine Similarity: Worked Example and Code Implementation
Cosine Similarity: Formula and Application
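Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them. The formula is:
COS = ∑(x_n * y_n) / (√∑(x_n^2) * √∑(y_n^2))
where:
- x and y are the two vectors, with components x_n and y_n
- the numerator of COS is the dot product of the two vectors
- the denominator of COS is the product of the two vectors' norms (lengths)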
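For reference, the same formula can be restated in vector notation. This is only a rewriting of the expression above, with n running over the vector components:

```latex
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
           = \frac{\sum_{n} x_n y_n}{\sqrt{\sum_{n} x_n^{2}} \; \sqrt{\sum_{n} y_n^{2}}}
```

By the Cauchy-Schwarz inequality the value always lies in [-1, 1]. When all weights are non-negative, as in the data used in this article, it lies in [0, 1]. The expression is undefined if either vector is all zeros, because the denominator is then 0.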
Example:
Suppose two vectors x and y hold the following data:

| id_x | tag_x | weight_x |
|---|---|---|
| x1 | 1 | 1 |
| x2 | 0 | 0 |
| x3 | 1 | 1 |

| id_y | tag_y | weight_y |
|---|---|---|
| y1 | 0.5 | 0.5 |
| y2 | 0.6 | 0.6 |
| y3 | 0.1 | 0.1 |

Then:
- cos numerator = 1 * 0.5 + 0 * 0.6 + 1 * 0.1 = 0.6
- cos denominator = √(1 * 1 + 0 * 0 + 1 * 1) * √(0.5 * 0.5 + 0.6 * 0.6 + 0.1 * 0.1) = 1.113552872566
- cos = 0.6 / 1.113552872566 = 0.538815906080325
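The arithmetic above can be reproduced with a few lines of Java. This is a minimal sketch for dense vectors of equal length. The class name `WorkedExample` and its helper methods are illustrative and not part of the original article, and `double` is used here only to make the intermediate values easy to compare with the figures above.

```java
class WorkedExample {
    // Dot product of two equal-length vectors: the numerator of the formula.
    static double dot(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0;
        for (int n = 0; n < x.length; n++) {
            sum += x[n] * y[n];
        }
        return sum;
    }

    // Euclidean norm (length) of a vector.
    static double norm(double[] x) {
        return Math.sqrt(dot(x, x));
    }

    // Cosine similarity: dot product divided by the product of the norms.
    // The result is NaN if either vector is all zeros.
    static double cosine(double[] x, double[] y) {
        return dot(x, y) / (norm(x) * norm(y));
    }

    public static void main(String[] args) {
        double[] x = {1, 0, 1};        // weight_x column
        double[] y = {0.5, 0.6, 0.1};  // weight_y column
        System.out.println("numerator   = " + dot(x, y));
        System.out.println("denominator = " + norm(x) * norm(y));
        System.out.println("cos         = " + cosine(x, y));
    }
}
```

Running `main` should print values that agree with the three figures above up to floating-point rounding.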
Task
Input data:

| caseid | tagid | weight |
|---|---|---|
| 10 | 100001 | 2.391216368 |
| 10 | 100002 | 3.678794412 |
| 10 | 100011 | 4.357588823 |
| 20 | 100002 | 5.518191618 |
| 20 | 100003 | 1.839397206 |
| 20 | 100004 | 12.87578044 |
| 30 | 100003 | 59.21755365 |
| 30 | 100004 | 1.839397206 |
| 30 | 100005 | 1.839397206 |
| 40 | 100004 | 33.10914971 |
| 40 | 100005 | 9.196986029 |
| 40 | 100006 | 183.9397206 |
| 50 | 100006 | 11.03638324 |
| 50 | 100007 | 15.45093653 |
| 50 | 100008 | 16.55457485 |
| 60 | 100006 | 2.023336926 |
| 60 | 100008 | 1.839397206 |
| 60 | 100009 | 59.21755365 |
| 70 | 100006 | 1.839397206 |
| 70 | 100009 | 1.839397206 |
| 70 | 100010 | 2.575156088 |

Requirements:
- Compute COS for every pair of caseid values
- For each caseid, output its three highest COS values
- Implement in Java or any other language, and submit single-precision results
Sample output:

| caseid | caseid | COS |
|---|---|---|
| 20 | 20 | 1 |
| 20 | 10 | 0.232349 |
| 20 | 40 | 0.161248 |
Code Implementation
    import java.util.*;

    public class CosineSimilarity {
        public static void main(String[] args) {
            // Reads whitespace-separated "caseid tagid weight" triples from stdin (no header row).
            Scanner scanner = new Scanner(System.in);
            // caseid -> sparse vector (tagid -> weight); float keeps the results single precision.
            Map<Integer, Map<Integer, Float>> data = new HashMap<>();
            Set<Integer> caseIds = new HashSet<>();
            while (scanner.hasNext()) {
                int caseId = scanner.nextInt();
                int tagId = scanner.nextInt();
                float weight = scanner.nextFloat();
                data.computeIfAbsent(caseId, k -> new HashMap<>()).put(tagId, weight);
                caseIds.add(caseId);
            }
            List<Integer> caseIdList = new ArrayList<>(caseIds);
            Collections.sort(caseIdList);

            // Requirement 1: COS for every unordered pair of case IDs.
            for (int i = 0; i < caseIdList.size(); i++) {
                for (int j = i + 1; j < caseIdList.size(); j++) {
                    int caseId1 = caseIdList.get(i);
                    int caseId2 = caseIdList.get(j);
                    float cos = cosineSimilarity(data.get(caseId1), data.get(caseId2));
                    // Fixed: the format string needs a leading '%' ("%.6f", not ".6f").
                    System.out.println(caseId1 + " " + caseId2 + " " + String.format("%.6f", cos));
                }
            }

            // Requirement 2: for each case ID, its three highest COS values.
            // The case itself is included, as in the sample output (20 vs 20 = 1).
            for (int caseId1 : caseIdList) {
                List<Map.Entry<Integer, Float>> scores = new ArrayList<>();
                for (int caseId2 : caseIdList) {
                    float cos = cosineSimilarity(data.get(caseId1), data.get(caseId2));
                    scores.add(new AbstractMap.SimpleEntry<Integer, Float>(caseId2, cos));
                }
                // Sort by COS, highest first.
                scores.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
                for (int k = 0; k < Math.min(3, scores.size()); k++) {
                    System.out.println(caseId1 + " " + scores.get(k).getKey() + " "
                            + String.format("%.6f", scores.get(k).getValue()));
                }
            }
        }

        // Cosine similarity of two sparse vectors; a tag missing from a vector counts as weight 0.
        // Returns NaN if either vector has zero length.
        private static float cosineSimilarity(Map<Integer, Float> vector1, Map<Integer, Float> vector2) {
            // Dot product: only tags present in both vectors contribute.
            float dotProduct = 0;
            for (Map.Entry<Integer, Float> entry : vector1.entrySet()) {
                int tagId = entry.getKey();
                float weight1 = entry.getValue();
                if (vector2.containsKey(tagId)) {
                    float weight2 = vector2.get(tagId);
                    dotProduct += weight1 * weight2;
                }
            }
            // Norm of the first vector.
            float norm1 = 0;
            for (float weight : vector1.values()) {
                norm1 += weight * weight;
            }
            norm1 = (float) Math.sqrt(norm1);
            // Norm of the second vector.
            float norm2 = 0;
            for (float weight : vector2.values()) {
                norm2 += weight * weight;
            }
            norm2 = (float) Math.sqrt(norm2);
            return dotProduct / (norm1 * norm2);
        }
    }
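As a sanity check against the sample output, the sketch below hard-codes the input table and ranks all case IDs by their similarity to a given case. It is an illustration rather than part of the original solution. The class `SampleOutputCheck` and its methods `vectors`, `cosine` and `topK` are hypothetical names, and it uses `double` instead of the single precision the task asks for, so the last printed digit may differ from a `float` run.

```java
import java.util.*;

class SampleOutputCheck {
    // Input rows copied from the article: {caseid, tagid, weight}.
    static final double[][] ROWS = {
        {10, 100001, 2.391216368}, {10, 100002, 3.678794412}, {10, 100011, 4.357588823},
        {20, 100002, 5.518191618}, {20, 100003, 1.839397206}, {20, 100004, 12.87578044},
        {30, 100003, 59.21755365}, {30, 100004, 1.839397206}, {30, 100005, 1.839397206},
        {40, 100004, 33.10914971}, {40, 100005, 9.196986029}, {40, 100006, 183.9397206},
        {50, 100006, 11.03638324}, {50, 100007, 15.45093653}, {50, 100008, 16.55457485},
        {60, 100006, 2.023336926}, {60, 100008, 1.839397206}, {60, 100009, 59.21755365},
        {70, 100006, 1.839397206}, {70, 100009, 1.839397206}, {70, 100010, 2.575156088},
    };

    // Group the rows into sparse vectors: caseid -> (tagid -> weight).
    static Map<Integer, Map<Integer, Double>> vectors() {
        Map<Integer, Map<Integer, Double>> data = new TreeMap<>();
        for (double[] r : ROWS) {
            data.computeIfAbsent((int) r[0], k -> new HashMap<>()).put((int) r[1], r[2]);
        }
        return data;
    }

    // Cosine similarity of two sparse vectors; tags absent from a vector count as 0.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // The k case IDs most similar to target (target itself included), highest first.
    static List<Integer> topK(int target, int k) {
        Map<Integer, Map<Integer, Double>> data = vectors();
        List<Integer> ids = new ArrayList<>(data.keySet());
        ids.sort((p, q) -> Double.compare(
                cosine(data.get(target), data.get(q)),
                cosine(data.get(target), data.get(p))));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> data = vectors();
        for (int other : topK(20, 3)) {
            System.out.println("20 " + other + " " + cosine(data.get(20), data.get(other)));
        }
    }
}
```

For caseid 20 the ranking comes out as 20, 10, 40, the same order as the sample output, with similarities close to the listed 1, 0.232349 and 0.161248.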
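Summary
This article introduced the cosine similarity formula and walked through the calculation on sample data. It also provided a Java implementation that computes the cosine similarity of two vectors stored as tag-to-weight maps. The method is widely used in natural language processing, recommender systems, and image recognition.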
Original article: https://www.cveoy.top/t/topic/oon0 Copyright belongs to the author. Please do not repost or scrape.