Cosine Similarity Calculation: Case Analysis and Java Implementation
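1. Formula
Cosine similarity (COS) measures how similar two vectors are in direction. It is computed as:
COS = ∑(x_n * y_n) / (√∑(x_n^2) * √∑(y_n^2))
Here x and y are the two vectors and n runs over their components. The numerator is the dot product of the two vectors, and the denominator is the product of their Euclidean norms.
Example:
Suppose two vectors x and y have the following elements: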
| id_x | tag_x | weight_x |
|---|---|---|
| x1 | 1 | 1 |
| x2 | 0 | 0 |
| x3 | 1 | 1 |

| id_y | tag_y | weight_y |
|---|---|---|
| y1 | 0.5 | 0.5 |
| y2 | 0.6 | 0.6 |
| y3 | 0.1 | 0.1 |
Then:
COS numerator = 1 * 0.5 + 0 * 0.6 + 1 * 0.1 = 0.6
COS denominator = sqrt(1 * 1 + 0 * 0 + 1 * 1) * sqrt(0.5 * 0.5 + 0.6 * 0.6 + 0.1 * 0.1) = 1.113552872566
COS = 0.6 / 1.113552872566 = 0.538815906080325
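The worked example above can be checked with a few lines of Java. This is a minimal sketch, not part of the original solution: the class name `CosineDemo` and the method `cosine` are illustrative, and returning 0 when either vector has zero norm is an assumed convention that the text does not specify.

```java
import java.util.Locale;

class CosineDemo {
    /** Cosine similarity of two equal-length dense vectors. */
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, sumSqX = 0.0, sumSqY = 0.0;
        for (int n = 0; n < x.length; n++) {
            dot += x[n] * y[n];      // numerator: dot product
            sumSqX += x[n] * x[n];   // squared norm of x
            sumSqY += y[n] * y[n];   // squared norm of y
        }
        double denominator = Math.sqrt(sumSqX) * Math.sqrt(sumSqY);
        // Assumption: a zero vector is treated as having similarity 0
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    public static void main(String[] args) {
        double[] x = {1, 0, 1};
        double[] y = {0.5, 0.6, 0.1};
        System.out.printf(Locale.ROOT, "%.6f%n", cosine(x, y)); // prints 0.538816
    }
}
```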
2. Requirements
Using the data given below, compute the cosine similarity between cases and output the most similar cases for each one.
(1) Input data:
| caseid | tagid | weight |
|---|---|---|
| 10 | 100001 | 2.391216368 |
| 10 | 100002 | 3.678794412 |
| 10 | 100011 | 4.357588823 |
| 20 | 100002 | 5.518191618 |
| 20 | 100003 | 1.839397206 |
| 20 | 100004 | 12.87578044 |
| 30 | 100003 | 59.21755365 |
| 30 | 100004 | 1.839397206 |
| 30 | 100005 | 1.839397206 |
| 40 | 100004 | 33.10914971 |
| 40 | 100005 | 9.196986029 |
| 40 | 100006 | 183.9397206 |
| 50 | 100006 | 11.03638324 |
| 50 | 100007 | 15.45093653 |
| 50 | 100008 | 16.55457485 |
| 60 | 100006 | 2.023336926 |
| 60 | 100008 | 1.839397206 |
| 60 | 100009 | 59.21755365 |
| 70 | 100006 | 1.839397206 |
| 70 | 100009 | 1.839397206 |
| 70 | 100010 | 2.575156088 |
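(2) Compute the COS for every pair of caseids.
(3) For each caseid, output its 3 highest COS values.
(4) Implement the solution in Java or any other language, and submit results computed in single precision.
(5) Sample output: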
| caseid | caseid | COS |
|---|---|---|
| 20 | 20 | 1.000 |
| 20 | 10 | 0.232349 |
| 20 | 40 | 0.161248 |
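The sample row for cases 20 and 10 can be reproduced by treating each case as a sparse tagid-to-weight vector. The sketch below is illustrative only (the class `SparseCosineDemo` and its methods are hypothetical names, and `Map.of` assumes Java 9 or later). Only shared tags contribute to the dot product, while each norm is taken over all of a case's tags; restricting the norms to the shared tags would give 1.0 for any pair that shares exactly one tag.

```java
import java.util.Locale;
import java.util.Map;

class SparseCosineDemo {
    /** Euclidean norm over every tag of the vector, shared or not. */
    static double norm(Map<Integer, Double> v) {
        double sumSq = 0.0;
        for (double w : v.values()) {
            sumSq += w * w;
        }
        return Math.sqrt(sumSq);
    }

    /** Cosine similarity of two sparse tagid -> weight vectors. */
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());   // null when b lacks this tag
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double denominator = norm(a) * norm(b);
        // Assumption: an empty or all-zero vector yields similarity 0
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    public static void main(String[] args) {
        Map<Integer, Double> case10 =
                Map.of(100001, 2.391216368, 100002, 3.678794412, 100011, 4.357588823);
        Map<Integer, Double> case20 =
                Map.of(100002, 5.518191618, 100003, 1.839397206, 100004, 12.87578044);
        // Only tag 100002 is shared, so the dot product has a single term
        System.out.printf(Locale.ROOT, "%.4f%n", cosine(case20, case10)); // prints 0.2323
    }
}
```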
3. Java Implementation
import java.util.*;

public class CosineSimilarity {
    public static void main(String[] args) {
        // Input data: {caseid, tagid, weight}
        double[][] data = {{10, 100001, 2.391216368},
                {10, 100002, 3.678794412},
                {10, 100011, 4.357588823},
                {20, 100002, 5.518191618},
                {20, 100003, 1.839397206},
                {20, 100004, 12.87578044},
                {30, 100003, 59.21755365},
                {30, 100004, 1.839397206},
                {30, 100005, 1.839397206},
                {40, 100004, 33.10914971},
                {40, 100005, 9.196986029},
                {40, 100006, 183.9397206},
                {50, 100006, 11.03638324},
                {50, 100007, 15.45093653},
                {50, 100008, 16.55457485},
                {60, 100006, 2.023336926},
                {60, 100008, 1.839397206},
                {60, 100009, 59.21755365},
                {70, 100006, 1.839397206},
                {70, 100009, 1.839397206},
                {70, 100010, 2.575156088}};

        // Build one sparse vector (tagid -> weight) per caseid.
        // A TreeMap keeps the caseids in ascending order.
        Map<Integer, Map<Integer, Double>> vectorDict = new TreeMap<>();
        for (double[] row : data) {
            int caseid = (int) row[0];
            int tagid = (int) row[1];
            double weight = row[2];
            vectorDict.computeIfAbsent(caseid, k -> new HashMap<>()).put(tagid, weight);
        }

        // Norm of each vector, taken over all of its tags
        // (not only the tags it shares with another case)
        Map<Integer, Double> norms = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> entry : vectorDict.entrySet()) {
            double sumSq = 0.0;
            for (double weight : entry.getValue().values()) {
                sumSq += weight * weight;
            }
            norms.put(entry.getKey(), Math.sqrt(sumSq));
        }

        // Compute COS once per unordered pair, keyed "id1_id2" with id1 < id2
        Map<String, Double> cosDict = new HashMap<>();
        List<Integer> caseids = new ArrayList<>(vectorDict.keySet());
        for (int id1 : caseids) {
            for (int id2 : caseids) {
                if (id1 < id2) { // avoid computing the same pair twice
                    double numerator = 0.0;
                    Map<Integer, Double> vector1 = vectorDict.get(id1);
                    Map<Integer, Double> vector2 = vectorDict.get(id2);
                    for (Map.Entry<Integer, Double> entry : vector1.entrySet()) {
                        Double weight2 = vector2.get(entry.getKey());
                        if (weight2 != null) { // only shared tags add to the dot product
                            numerator += entry.getValue() * weight2;
                        }
                    }
                    double denominator = norms.get(id1) * norms.get(id2);
                    double cos = denominator == 0.0 ? 0.0 : numerator / denominator;
                    cosDict.put(id1 + "_" + id2, cos);
                }
            }
        }

        // Output: the self-pair first, then the 3 most similar other cases
        System.out.println("caseid caseid COS");
        for (int id1 : caseids) {
            List<Integer> others = new ArrayList<>(caseids);
            others.remove(Integer.valueOf(id1));
            // Sort by COS, descending; the sort is stable, so ties stay in ascending caseid order
            others.sort((a, b) -> Double.compare(getCos(cosDict, id1, b), getCos(cosDict, id1, a)));
            System.out.printf("%d %d %.3f%n", id1, id1, 1.0);
            for (int i = 0; i < 3 && i < others.size(); i++) {
                int id2 = others.get(i);
                System.out.printf("%d %d %.3f%n", id1, id2, getCos(cosDict, id1, id2));
            }
        }
    }

    // Look up the COS of a pair of distinct caseids, whichever order they are given in
    private static double getCos(Map<String, Double> cosDict, int id1, int id2) {
        String key = Math.min(id1, id2) + "_" + Math.max(id1, id2);
        return cosDict.get(key);
    }
}
Output (for each caseid, the self-pair followed by its three most similar other cases; cases that share no tags have a COS of 0.000):
caseid caseid COS
10 10 1.000
10 20 0.232
10 30 0.000
10 40 0.000
20 20 1.000
20 10 0.232
20 40 0.161
20 30 0.158
30 30 1.000
30 20 0.158
30 40 0.007
30 10 0.000
40 40 1.000
40 70 0.494
40 50 0.431
40 60 0.034
50 50 1.000
50 40 0.431
50 70 0.220
50 60 0.035
60 60 1.000
60 70 0.519
60 50 0.035
60 40 0.034
70 70 1.000
70 60 0.519
70 40 0.494
70 50 0.220
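Requirement (4) asks for single-precision results, whereas the listing above works in `double`. One way to approach this, shown here only as a sketch, is to hold every intermediate value in `float`. The class `SinglePrecisionDemo` and the method `cosineFloat` are hypothetical names, and the two arrays are cases 20 and 10 laid out over their combined tags (100001, 100002, 100003, 100004, 100011), with 0 for a missing tag.

```java
import java.util.Locale;

class SinglePrecisionDemo {
    /** Cosine similarity with all intermediates held in float (single precision). */
    static float cosineFloat(float[] x, float[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        float dot = 0f, sumSqX = 0f, sumSqY = 0f;
        for (int n = 0; n < x.length; n++) {
            dot += x[n] * y[n];
            sumSqX += x[n] * x[n];
            sumSqY += y[n] * y[n];
        }
        // Math.sqrt returns double, so each root is narrowed back to float
        float denominator = (float) Math.sqrt(sumSqX) * (float) Math.sqrt(sumSqY);
        // Assumption: a zero vector is treated as having similarity 0
        return denominator == 0f ? 0f : dot / denominator;
    }

    public static void main(String[] args) {
        // Tag order: 100001, 100002, 100003, 100004, 100011
        float[] case10 = {2.391216368f, 3.678794412f, 0f, 0f, 4.357588823f};
        float[] case20 = {0f, 5.518191618f, 1.839397206f, 12.87578044f, 0f};
        System.out.printf(Locale.ROOT, "%.6f%n", cosineFloat(case20, case10));
    }
}
```

A `float` carries roughly 7 significant decimal digits, so the last printed digit can differ from a `double` computation of the same pair.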
Original source: https://www.cveoy.top/t/topic/oonS. Copyright belongs to the author. Please do not repost or scrape.