Vector Cosine Similarity: Formula and Java Implementation - Finding Correlations Between Data
1. Formula
$$\mathrm{COS} = \frac{\sum_{n} x_n y_n}{\sqrt{\sum_{n} x_n^2}\,\sqrt{\sum_{n} y_n^2}}$$
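Here x and y are two vectors. The numerator of COS is the dot product of the two vectors, and the denominator is the product of their magnitudes (Euclidean norms).
Example: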

| id_x | tag_x | weight_x |
|---|---|---|
| x1 | 1 | 1 |
| x2 | 0 | 0 |
| x3 | 1 | 1 |

| id_y | tag_y | weight_y |
|---|---|---|
| y1 | 0.5 | 0.5 |
| y2 | 0.6 | 0.6 |
| y3 | 0.1 | 0.1 |

COS numerator = 1 × 0.5 + 0 × 0.6 + 1 × 0.1 = 0.6
COS denominator = sqrt(1 × 1 + 0 × 0 + 1 × 1) × sqrt(0.5 × 0.5 + 0.6 × 0.6 + 0.1 × 0.1) = 1.113552872566
COS = 0.6 / 1.113552872566 = 0.538815906080325
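The arithmetic above can be reproduced with a minimal Java sketch for dense vectors of equal length. The class and method names (`DenseCosineExample`, `cosine`) are illustrative assumptions, not part of the original post.

```java
/** Worked example: cosine similarity of two dense vectors of equal length. */
final class DenseCosineExample {

    /** Returns dot(x, y) / (|x| * |y|); both vectors must have the same length. */
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;   // numerator: sum of x_n * y_n
        double sumX2 = 0.0; // sum of x_n^2
        double sumY2 = 0.0; // sum of y_n^2
        for (int n = 0; n < x.length; n++) {
            dot += x[n] * y[n];
            sumX2 += x[n] * x[n];
            sumY2 += y[n] * y[n];
        }
        return dot / (Math.sqrt(sumX2) * Math.sqrt(sumY2));
    }

    public static void main(String[] args) {
        double[] x = {1, 0, 1};       // weight_x column of the example
        double[] y = {0.5, 0.6, 0.1}; // weight_y column of the example
        System.out.println(cosine(x, y)); // approximately 0.5388159
    }
}
```

If either vector contains only zeros, the denominator is 0 and this sketch returns NaN, so a real implementation would need to decide how to treat that case.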
2. Requirements
(1) The input data is given in the following table:

| caseid | tagid | weight |
|---|---|---|
| 10 | 100001 | 2.391216368 |
| 10 | 100002 | 3.678794412 |
| 10 | 100011 | 4.357588823 |
| 20 | 100002 | 5.518191618 |
| 20 | 100003 | 1.839397206 |
| 20 | 100004 | 12.87578044 |
| 30 | 100003 | 59.21755365 |
| 30 | 100004 | 1.839397206 |
| 30 | 100005 | 1.839397206 |
| 40 | 100004 | 33.10914971 |
| 40 | 100005 | 9.196986029 |
| 40 | 100006 | 183.9397206 |
| 50 | 100006 | 11.03638324 |
| 50 | 100007 | 15.45093653 |
| 50 | 100008 | 16.55457485 |
| 60 | 100006 | 2.023336926 |
| 60 | 100008 | 1.839397206 |
| 60 | 100009 | 59.21755365 |
| 70 | 100006 | 1.839397206 |
| 70 | 100009 | 1.839397206 |
| 70 | 100010 | 2.575156088 |

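(2) Compute COS for every pair of caseids.
(3) For each caseid, output its three largest COS values.
(4) Implement the solution in Java or any other language, and submit the results in single precision.
(5) Sample output: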

| caseid | caseid | COS |
|---|---|---|
| 20 | 20 | 1 |
| 20 | 10 | 0.232349 |
| 20 | 40 | 0.161248 |

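As a quick sanity check on the sample values above, each caseid can be treated as a sparse vector keyed by tagid, where a tagid missing from a vector counts as weight 0. The sketch below recomputes the pairs (20, 10) and (20, 40) from the input table. The class, method, and constant names are illustrative assumptions, and `Map.of` requires Java 9 or later.

```java
import java.util.Map;

/** Sketch: cosine similarity of sparse vectors stored as tagid -> weight maps. */
final class SparseCosineCheck {

    // Rows of the input table for caseids 10, 20 and 40.
    static final Map<Integer, Double> CASE_10 =
            Map.of(100001, 2.391216368, 100002, 3.678794412, 100011, 4.357588823);
    static final Map<Integer, Double> CASE_20 =
            Map.of(100002, 5.518191618, 100003, 1.839397206, 100004, 12.87578044);
    static final Map<Integer, Double> CASE_40 =
            Map.of(100004, 33.10914971, 100005, 9.196986029, 100006, 183.9397206);

    /** Dot product over shared tagids divided by the product of the two norms. */
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) { // only tagids present in both vectors contribute
                dot += e.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    /** Euclidean norm of a sparse vector. */
    static double norm(Map<Integer, Double> v) {
        double sumOfSquares = 0.0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        System.out.printf("20 10 %.6f%n", cosine(CASE_20, CASE_10));
        System.out.printf("20 40 %.6f%n", cosine(CASE_20, CASE_40));
    }
}
```

Caseids 20 and 10 share only tagid 100002, and caseids 20 and 40 share only tagid 100004, so each dot product reduces to a single term.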
The Java code is as follows (it uses Map to store the data and to hold the computed COS values):

    import java.util.*;

    public class CosCalculator {
        public static void main(String[] args) {
            // Input data: each row is {caseid, tagid, weight}
            double[][] data = {{10, 100001, 2.391216368},
                    {10, 100002, 3.678794412},
                    {10, 100011, 4.357588823},
                    {20, 100002, 5.518191618},
                    {20, 100003, 1.839397206},
                    {20, 100004, 12.87578044},
                    {30, 100003, 59.21755365},
                    {30, 100004, 1.839397206},
                    {30, 100005, 1.839397206},
                    {40, 100004, 33.10914971},
                    {40, 100005, 9.196986029},
                    {40, 100006, 183.9397206},
                    {50, 100006, 11.03638324},
                    {50, 100007, 15.45093653},
                    {50, 100008, 16.55457485},
                    {60, 100006, 2.023336926},
                    {60, 100008, 1.839397206},
                    {60, 100009, 59.21755365},
                    {70, 100006, 1.839397206},
                    {70, 100009, 1.839397206},
                    {70, 100010, 2.575156088}};

            // Store the data as caseid -> (tagid -> weight).
            // A TreeMap keeps the caseids in ascending order for stable output.
            Map<Integer, Map<Integer, Double>> dataMap = new TreeMap<>();
            for (double[] d : data) {
                int caseid = (int) d[0];
                int tagid = (int) d[1];
                double weight = d[2];
                dataMap.computeIfAbsent(caseid, k -> new HashMap<>()).put(tagid, weight);
            }

            // Compute COS for every ordered pair of caseids, keyed as "caseid1,caseid2".
            // The self-pair is kept because the sample output lists it (20, 20, 1).
            Map<String, Double> cosMap = new HashMap<>();
            Set<Integer> caseids = dataMap.keySet();
            for (int caseid1 : caseids) {
                for (int caseid2 : caseids) {
                    Map<Integer, Double> vector1 = dataMap.get(caseid1);
                    Map<Integer, Double> vector2 = dataMap.get(caseid2);
                    double numerator = 0;    // dot product over shared tagids
                    double denominator1 = 0; // sum of squared weights of vector1
                    double denominator2 = 0; // sum of squared weights of vector2
                    for (int tagid : vector1.keySet()) {
                        if (vector2.containsKey(tagid)) {
                            numerator += vector1.get(tagid) * vector2.get(tagid);
                        }
                        denominator1 += Math.pow(vector1.get(tagid), 2);
                    }
                    for (int tagid : vector2.keySet()) {
                        denominator2 += Math.pow(vector2.get(tagid), 2);
                    }
                    double cos = numerator / (Math.sqrt(denominator1) * Math.sqrt(denominator2));
                    cosMap.put(caseid1 + "," + caseid2, cos);
                }
            }

            // Output the three largest COS values for each caseid
            System.out.println("caseid caseid COS");
            for (int caseid1 : caseids) {
                // Sort all caseids by their COS with caseid1, largest first
                List<Integer> ranked = new ArrayList<>(caseids);
                ranked.sort(Comparator.comparingDouble(
                        (Integer c) -> cosMap.get(caseid1 + "," + c)).reversed());
                for (int i = 0; i < 3 && i < ranked.size(); i++) {
                    int caseid2 = ranked.get(i);
                    System.out.printf("%d %d %.6f%n", caseid1, caseid2,
                            cosMap.get(caseid1 + "," + caseid2));
                }
            }
        }
    }
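The "top 3" step does not require ordering every candidate. A common alternative, sketched here, is a min-heap capped at k entries, so that the weakest kept candidate is evicted whenever a better one arrives. The names `TopKExample` and `topK` are illustrative assumptions. The scores for caseid 20 come from the sample output in section 2, with caseids 50 and 70 set to 0 because they share no tagid with caseid 20.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Sketch: select the k entries with the largest values using a size-bounded min-heap. */
final class TopKExample {

    static <K> List<Map.Entry<K, Double>> topK(Map<K, Double> scores, int k) {
        Comparator<Map.Entry<K, Double>> byScore = Map.Entry.comparingByValue();
        // Min-heap: the head is always the smallest score currently kept.
        PriorityQueue<Map.Entry<K, Double>> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<K, Double> e : scores.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the current minimum so at most k entries remain
            }
        }
        // Copy the survivors out and order them from largest to smallest.
        List<Map.Entry<K, Double>> result = new ArrayList<>(heap);
        result.sort(byScore.reversed());
        return result;
    }

    public static void main(String[] args) {
        // COS of caseid 20 against some other caseids (keys are the other caseids).
        Map<Integer, Double> cosFor20 =
                Map.of(20, 1.0, 10, 0.232349, 40, 0.161248, 50, 0.0, 70, 0.0);
        for (Map.Entry<Integer, Double> e : topK(cosFor20, 3)) {
            System.out.printf("20 %d %.6f%n", e.getKey(), e.getValue());
        }
    }
}
```

With only seven caseids, a full sort is just as practical. The bounded heap mainly pays off when each caseid has many candidates and k stays small.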
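The original post reports the following output for the CosCalculator program. It does not match the sample in section 2 (for caseids 20 and 10 it shows 0.784 rather than 0.232349), so treat it as unverified: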
caseid caseid COS
70 60 0.828 10.000 0.000
60 70 0.828 50.000 0.000
10 20 0.784 40.000 0.232
20 10 0.784 40.000 0.232
40 20 0.678 10.000 0.161
20 40 0.678 10.000 0.161
50 20 0.464 10.000 0.393
20 50 0.464 10.000 0.393
10 30 0.327 20.000 0.232
30 10 0.327 20.000 0.232
50 10 0.233 20.000 0.196
10 50 0.233 20.000 0.196
40 10 0.232 20.000 0.161
10 40 0.232 20.000 0.161
30 20 0.153 10.000 0.327
20 30 0.153 10.000 0.327
50 30 0.000 10.000 0.000
30 50 0.000 10.000 0.000
70 10 0.000 20.000 0.000
10 70 0.000 20.000 0.000
70 20 0.000 60.000 0.000
20 70 0.000 60.000 0.000
70 30 0.000 60.000 0.000
30 70 0.000 60.000 0.000
70 40 0.000 60.000 0.000
40 70 0.000 60.000 0.000
70 50 0.000 60.000 0.000
50 70 0.000 60.000 0.000
Original article: https://www.cveoy.top/t/topic/oook (copyright belongs to the author; please do not repost or scrape).