Cosine Similarity: Formula and Application

Cosine similarity measures how similar two vectors are, based on the angle between them. The formula is:

COS = ∑(xn * yn) / (√∑(xn^2) * √∑(yn^2))

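Where:

  • x and y are the two vectors; xn and yn are their n-th components
  • the numerator of COS is the dot product of the two vectors
  • the denominator of COS is the product of the two vectors' magnitudes (norms)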

Example:

Suppose there are two vectors x and y with the following data:

| id_x | tag_x | weight_x |
|---|---|---|
| x 1 | 1 | 1 |
| x 2 | 0 | 0 |
| x 3 | 1 | 1 |

| id_y | tag_y | weight_y |
|---|---|---|
| y 1 | 0.5 | 0.5 |
| y 2 | 0.6 | 0.6 |
| y 3 | 0.1 | 0.1 |

Then:

  • cos numerator = 1 * 0.5 + 0 * 0.6 + 1 * 0.1 = 0.6
  • cos denominator = √(1 * 1 + 0 * 0 + 1 * 1) * √(0.5 * 0.5 + 0.6 * 0.6 + 0.1 * 0.1) = 1.113552872566
  • cos = 0.6 / 1.113552872566 = 0.538815906080325
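
The arithmetic above can be reproduced with a short Java sketch. The class name `WorkedExample` and the method `cosine` are illustrative names, not part of the original article. The sketch assumes dense arrays of equal length and uses `double` so the printed value can be compared with the hand calculation.

```java
// Illustrative sketch: reproduces the worked example with dense arrays.
// WorkedExample and cosine are hypothetical names chosen for this example.
class WorkedExample {
    // Cosine similarity of two equal-length dense vectors.
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int n = 0; n < x.length; n++) {
            dot += x[n] * y[n];     // numerator: dot product
            sumX2 += x[n] * x[n];   // squared magnitude of x
            sumY2 += y[n] * y[n];   // squared magnitude of y
        }
        return dot / (Math.sqrt(sumX2) * Math.sqrt(sumY2));
    }

    public static void main(String[] args) {
        double[] x = {1, 0, 1};
        double[] y = {0.5, 0.6, 0.1};
        // Approximately 0.5388, in line with the hand calculation.
        System.out.println(cosine(x, y));
    }
}
```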

Task

Input data:

| caseid | tagid | weight |
|---|---|---|
| 10 | 100001 | 2.391216368 |
| 10 | 100002 | 3.678794412 |
| 10 | 100011 | 4.357588823 |
| 20 | 100002 | 5.518191618 |
| 20 | 100003 | 1.839397206 |
| 20 | 100004 | 12.87578044 |
| 30 | 100003 | 59.21755365 |
| 30 | 100004 | 1.839397206 |
| 30 | 100005 | 1.839397206 |
| 40 | 100004 | 33.10914971 |
| 40 | 100005 | 9.196986029 |
| 40 | 100006 | 183.9397206 |
| 50 | 100006 | 11.03638324 |
| 50 | 100007 | 15.45093653 |
| 50 | 100008 | 16.55457485 |
| 60 | 100006 | 2.023336926 |
| 60 | 100008 | 1.839397206 |
| 60 | 100009 | 59.21755365 |
| 70 | 100006 | 1.839397206 |
| 70 | 100009 | 1.839397206 |
| 70 | 100010 | 2.575156088 |
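
Each caseid in this table is a sparse vector: its tagids are the dimensions and its weights are the components. A minimal sketch of loading such rows into a per-case map follows. It assumes the rows arrive as text lines in the pipe-delimited form shown above (the article does not specify the input file format), and `InputParser` and `parse` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: InputParser and parse are hypothetical names.
class InputParser {
    // Builds caseid -> (tagid -> weight) from rows such as
    // "| 10 | 100001 | 2.391216368 |". Header, separator, and blank
    // lines are skipped. Plain "10 100001 2.391216368" rows also work.
    static Map<Integer, Map<Integer, Float>> parse(List<String> lines) {
        Map<Integer, Map<Integer, Float>> data = new TreeMap<>();
        for (String line : lines) {
            String[] cells = line.replace("|", " ").trim().split("\\s+");
            if (cells.length != 3 || !cells[0].matches("\\d+")) {
                continue; // not a data row
            }
            int caseId = Integer.parseInt(cells[0]);
            int tagId = Integer.parseInt(cells[1]);
            float weight = Float.parseFloat(cells[2]); // single precision, as the task requests
            data.computeIfAbsent(caseId, k -> new TreeMap<>()).put(tagId, weight);
        }
        return data;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "| caseid | tagid | weight |",
                "|---|---|---|",
                "| 10 | 100001 | 2.391216368 |",
                "| 10 | 100002 | 3.678794412 |",
                "| 20 | 100002 | 5.518191618 |");
        Map<Integer, Map<Integer, Float>> data = parse(lines);
        System.out.println(data.keySet());       // prints [10, 20]
        System.out.println(data.get(10).size()); // prints 2
    }
}
```

Storing only the tags that are present keeps absent tags implicit as zero weights, which is all the dot product needs.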

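Requirements:

  1. Compute COS for every pair of caseids
  2. For each caseid, output its 3 highest COS values
  3. Implement in Java or any other language, and submit results computed in single precision

Sample output: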

| caseid | caseid | COS |
|---|---|---|
| 20 | 20 | 1 |
| 20 | 10 | 0.232349 |
| 20 | 40 | 0.161248 |
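
The second sample row can be checked directly against the input table: caseid 20 and caseid 10 share only tagid 100002, so the numerator reduces to a single product. The sketch below does this in single precision with the weights copied from the input data. `SampleRowCheck`, `sparseCosine`, and `norm` are hypothetical names used only for this check.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: SampleRowCheck, sparseCosine and norm are hypothetical names.
class SampleRowCheck {
    // Cosine similarity of two sparse vectors stored as tagid -> weight.
    static float sparseCosine(Map<Integer, Float> a, Map<Integer, Float> b) {
        float dot = 0f;
        for (Map.Entry<Integer, Float> e : a.entrySet()) {
            Float other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared tags contribute
            }
        }
        return dot / (norm(a) * norm(b));
    }

    // Euclidean magnitude of a sparse vector.
    static float norm(Map<Integer, Float> v) {
        float sum = 0f;
        for (float w : v.values()) {
            sum += w * w;
        }
        return (float) Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<Integer, Float> case10 = new HashMap<>();
        case10.put(100001, 2.391216368f);
        case10.put(100002, 3.678794412f);
        case10.put(100011, 4.357588823f);

        Map<Integer, Float> case20 = new HashMap<>();
        case20.put(100002, 5.518191618f);
        case20.put(100003, 1.839397206f);
        case20.put(100004, 12.87578044f);

        // Close to the 0.232349 shown in the sample output.
        System.out.println(sparseCosine(case20, case10));
    }
}
```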

Code implementation

import java.util.*;

public class CosineSimilarity {
    public static void main(String[] args) {
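        // Expects whitespace-separated "caseid tagid weight" triples on standard input,
        // with no header row. Locale.ROOT keeps "." as the decimal separator on any machine.
        Scanner scanner = new Scanner(System.in).useLocale(Locale.ROOT);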
        Map<Integer, Map<Integer, Float>> data = new HashMap<>();
        Set<Integer> caseIds = new HashSet<>();
        while (scanner.hasNext()) {
            int caseId = scanner.nextInt();
            int tagId = scanner.nextInt();
            float weight = scanner.nextFloat();
            if (!data.containsKey(caseId)) {
                data.put(caseId, new HashMap<>());
            }
            data.get(caseId).put(tagId, weight);
            caseIds.add(caseId);
        }
        List<Integer> caseIdList = new ArrayList<>(caseIds);
        Collections.sort(caseIdList);
        for (int i = 0; i < caseIdList.size(); i++) {
            for (int j = i + 1; j < caseIdList.size(); j++) {
                int caseId1 = caseIdList.get(i);
                int caseId2 = caseIdList.get(j);
                float cos = cosineSimilarity(data.get(caseId1), data.get(caseId2));
                System.out.println(caseId1 + " " + caseId2 + " " + String.format("%.6f", cos));
            }
        }
        // Requirement 2: for each caseid, print its 3 highest COS values.
        // The sample output includes the self-pair (COS = 1), so it is kept here.
        for (int caseId1 : caseIdList) {
            List<Map.Entry<Integer, Float>> scores = new ArrayList<>();
            for (int caseId2 : caseIdList) {
                float cos = cosineSimilarity(data.get(caseId1), data.get(caseId2));
                scores.add(new AbstractMap.SimpleEntry<Integer, Float>(caseId2, cos));
            }
            // Sort by COS, highest first.
            scores.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
            for (int k = 0; k < Math.min(3, scores.size()); k++) {
                System.out.println(caseId1 + " " + scores.get(k).getKey() + " "
                        + String.format("%.6f", scores.get(k).getValue()));
            }
        }
    }

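    // Cosine similarity of two sparse vectors stored as tagid -> weight.
    // Tags missing from a vector count as weight 0, so only shared tags
    // contribute to the dot product.
    private static float cosineSimilarity(Map<Integer, Float> vector1, Map<Integer, Float> vector2) {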
        float dotProduct = 0;
        for (Map.Entry<Integer, Float> entry : vector1.entrySet()) {
            int tagId = entry.getKey();
            float weight1 = entry.getValue();
            if (vector2.containsKey(tagId)) {
                float weight2 = vector2.get(tagId);
                dotProduct += weight1 * weight2;
            }
        }
        float norm1 = 0;
        for (float weight : vector1.values()) {
            norm1 += weight * weight;
        }
        norm1 = (float) Math.sqrt(norm1);
        float norm2 = 0;
        for (float weight : vector2.values()) {
            norm2 += weight * weight;
        }
        norm2 = (float) Math.sqrt(norm2);
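        // A vector whose weights are all zero has no direction; return 0 instead of NaN.
        if (norm1 == 0 || norm2 == 0) {
            return 0;
        }
        return dotProduct / (norm1 * norm2);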
    }
}

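Summary

This article introduced the cosine similarity formula and its application, and walked through the calculation on sample data. It also provided a Java implementation that computes the cosine similarity between two vectors. The measure is widely used in natural language processing, recommender systems, image recognition, and related fields.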

Original URL: https://www.cveoy.top/t/topic/oon0 Copyright belongs to the author. Do not reproduce or scrape.