Implementing the Apriori Algorithm in Python to Mine Frequent Itemsets

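This article shows how to implement the Apriori algorithm in Python and uses an example to demonstrate mining frequent itemsets from a transaction dataset.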

Code example:

from itertools import combinations
from copy import deepcopy

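# Count each item and drop 1-itemsets whose support count is below min_support
# (min_support is a module-level variable set in the __main__ block)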
def load_data(data):
    I_dict = {}
    for i in data:
        for j in i:
            I_dict[j] = I_dict.get(j, 0) + 1
    F_dict = deepcopy(I_dict)
    for k in I_dict.keys():
        if F_dict.get(k) < min_support:
            del F_dict[k]
    return F_dict


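# Count the support of each candidate itemset and keep those with count >= min_support
# (reads the module-level data_set and min_support)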
def get_support_set(p_set):
    item_supp_set = []
    for item in p_set:
        count = 0
        for ds in data_set:
            if item.issubset(ds):
                count += 1
        if count >= min_support:
            item_supp_set.append([item, count])
    return item_supp_set

# Find all frequent itemsets of size k >= 3,
# starting from the frequent 2-itemsets
def get_all_items(two_set, k=3):
    all_frequent = []
    flag = True
    while flag:
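        mid_set = []
        temp = []
        # Collect the distinct items that appear in the frequent (k-1)-itemsets
        t_ = [ks[0] for ks in two_set]
        for kk in t_:
            for tt in kk:
                if tt not in temp:
                    temp.append(tt)
        # Enumerate every k-item combination of those items as a raw candidate
        k_ = [set(t) for t in combinations(temp, k)]
        for ff in k_:
            # Prune: a k-itemset has exactly k subsets of size k-1,
            # so keep it only if all k of them are frequent
            count_k = 0
            for d in t_:
                if ff.issuperset(d):
                    count_k += 1
            if count_k == k:
                mid_set.append(ff)
        frequent_mid_set = get_support_set(mid_set)
        # Stop as soon as a level yields no frequent itemsets
        if frequent_mid_set:
            k += 1
            two_set = frequent_mid_set
            all_frequent.extend(frequent_mid_set)
        else:
            flag = False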
    return all_frequent

if __name__ == '__main__':
    data = [['I1', 'I2', 'I5'],
            ['I2', 'I4'],
            ['I2', 'I3'],
            ['I1', 'I2', 'I4'],
            ['I1', 'I3'],
            ['I2', 'I3'],
            ['I1', 'I3'],
            ['I1', 'I2', 'I3', 'I5'],
            ['I1', 'I2', 'I3']]
    data_set = [set(d) for d in data]
    min_support = 1  # minimum support count (an absolute number of transactions)
    one = [[{lk}, lv] for lk, lv in load_data(data).items()]
    two = [set(t) for t in combinations(list(load_data(data).keys()), 2)]
    two_f_set = get_support_set(two)
    all_frequent_set = one + two_f_set + get_all_items(two_f_set)
    for afs in all_frequent_set:
        print(afs)
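
The check count_k == k in get_all_items relies on the fact that a k-itemset has exactly k subsets of size k-1, so a candidate survives only when all of them are frequent. The same pruning test can be sketched in isolation as follows. The helper name has_all_frequent_subsets and the sample frequent 2-itemsets are hypothetical, chosen only for illustration, and are not part of the code above.

```python
from itertools import combinations


def has_all_frequent_subsets(candidate, frequent_smaller):
    """Return True if every (k-1)-subset of the k-item candidate is frequent."""
    k = len(candidate)
    return all(frozenset(sub) in frequent_smaller
               for sub in combinations(candidate, k - 1))


# Hypothetical frequent 2-itemsets, for illustration only
frequent_2 = {frozenset(p) for p in [('I1', 'I2'), ('I1', 'I3'),
                                     ('I2', 'I3'), ('I1', 'I5')]}

print(has_all_frequent_subsets({'I1', 'I2', 'I3'}, frequent_2))  # True
print(has_all_frequent_subsets({'I1', 'I2', 'I5'}, frequent_2))  # False: {I2, I5} is not frequent
```

Storing the frequent itemsets as frozensets makes each subset lookup a hash-table test, instead of the linear scan over t_ used above.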

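Brief program flow:

Load and preprocess the data: load the transaction dataset, count how often each item occurs, and drop 1-itemsets whose support count is below the threshold (min_support).
Generate candidate itemsets: starting from the frequent 1-itemsets, build candidate 2-itemsets, then 3-itemsets, and so on, until no new frequent itemsets can be generated.
Compute support: scan the dataset and count, for each candidate itemset, the number of transactions that contain it. The code works with this absolute support count; dividing it by the total number of transactions gives the relative support.
Filter frequent itemsets: keep the candidates whose support count is greater than or equal to the threshold as frequent itemsets.
Output the results: print every frequent itemset found together with its support count.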
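
The steps above can also be condensed into a single level-wise loop. The following is a minimal sketch, not a replacement for the article's code: the function name apriori, its inner helper count_frequent, and the threshold of 2 are assumptions made for this illustration (the article's main block uses min_support = 1). It uses frozensets as dictionary keys and passes the threshold as a parameter instead of reading a global variable.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # Count the transactions containing each candidate, then filter by threshold
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    level = count_frequent({frozenset([i]) for i in items})
    frequent = dict(level)

    k = 2
    while level:
        # Candidates: k-item combinations whose (k-1)-subsets are all frequent
        pool = sorted({i for s in level for i in s})
        candidates = {frozenset(c) for c in combinations(pool, k)
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = count_frequent(candidates)
        frequent.update(level)
        k += 1
    return frequent


data = [['I1', 'I2', 'I5'], ['I2', 'I4'], ['I2', 'I3'],
        ['I1', 'I2', 'I4'], ['I1', 'I3'], ['I2', 'I3'],
        ['I1', 'I3'], ['I1', 'I2', 'I3', 'I5'], ['I1', 'I2', 'I3']]

result = apriori(data, min_support=2)
print(len(result))  # 13
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    # Relative support is the count divided by the number of transactions
    print(sorted(itemset), n, round(n / len(data), 2))
```

With a threshold of 2 on the nine sample transactions, this yields five frequent 1-itemsets, six 2-itemsets, and two 3-itemsets.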

Summary:

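This article presented a Python implementation of the Apriori algorithm and used a code example to show how to mine frequent itemsets from a transaction dataset. Apriori is a classic algorithm for association rule mining and is widely used in areas such as market basket analysis and recommender systems.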