1. Implementing entropy, empirical conditional entropy, and information gain:

Entropy: measures the disorder of a dataset; the larger the entropy, the more disordered the dataset.

Empirical conditional entropy: measures the disorder that remains in the dataset once the value of a given feature is known.

Information gain: measures how much a feature contributes to classification; the larger the information gain, the more useful the feature is as a split.
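A quick numeric sanity check of these three definitions (the probabilities below are invented for illustration):

```python
import math

def entropy(probs):
    """H = -sum(p * log2(p)); zero-probability terms contribute nothing."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

# A 50/50 binary label split is maximally disordered: H = 1 bit.
h_uniform = entropy([0.5, 0.5])

# A 90/10 split is much more ordered: H is roughly 0.469 bits.
h_skewed = entropy([0.9, 0.1])

# Suppose a feature splits the 50/50 dataset into two halves that are each
# pure (all one class). The conditional entropy is then 0, and the
# information gain is H(D) - H(D|A) = 1 bit: the feature is maximally useful.
h_cond = 0.5 * entropy([1.0]) + 0.5 * entropy([1.0])
info_gain = h_uniform - h_cond

print(h_uniform, round(h_skewed, 3), info_gain)
```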

Code implementation:

import math

def calc_entropy(dataset):
    """Shannon entropy (in bits) of the class labels in the last column."""
    num_entries = len(dataset)
    label_counts = {}
    for feat_vec in dataset:
        current_label = feat_vec[-1]
        label_counts[current_label] = label_counts.get(current_label, 0) + 1
    entropy = 0.0
    for key in label_counts:
        prob = float(label_counts[key]) / num_entries
        entropy -= prob * math.log(prob, 2)
    return entropy

def split_dataset(dataset, axis, value):
    """Rows where feature `axis` equals `value`, with that feature column removed."""
    ret_dataset = []
    for feat_vec in dataset:
        if feat_vec[axis] == value:
            reduced_feat_vec = feat_vec[:axis]
            reduced_feat_vec.extend(feat_vec[axis+1:])
            ret_dataset.append(reduced_feat_vec)
    return ret_dataset

def calc_cond_entropy(dataset, axis, value):
    """Weighted entropy term for the subset where feature `axis` == `value`.

    Summing this term over every value of the feature gives the empirical
    conditional entropy H(D|A), as computed inline in calc_info_gain below.
    """
    sub_dataset = split_dataset(dataset, axis, value)
    prob = len(sub_dataset) / float(len(dataset))
    return prob * calc_entropy(sub_dataset)

def calc_info_gain(dataset, base_entropy, axis):
    """Information gain of feature `axis`: base_entropy minus H(D|A)."""
    feat_list = [example[axis] for example in dataset]
    unique_vals = set(feat_list)
    new_entropy = 0.0
    for value in unique_vals:
        sub_dataset = split_dataset(dataset, axis, value)
        prob = len(sub_dataset) / float(len(dataset))
        new_entropy += prob * calc_entropy(sub_dataset)
    info_gain = base_entropy - new_entropy
    return info_gain
2. Implementing the ID3 algorithm:

ID3 is a decision-tree algorithm based on information gain. At each node it chooses the feature with the largest information gain as the split criterion, recursing until all samples at a node belong to the same class or no features remain to split on.

Code implementation:

def choose_best_feature_to_split(dataset):
    """Index of the feature with the largest information gain (-1 if none is positive)."""
    num_features = len(dataset[0]) - 1
    base_entropy = calc_entropy(dataset)
    best_info_gain = 0.0
    best_feature = -1
    for i in range(num_features):
        info_gain = calc_info_gain(dataset, base_entropy, i)
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = i
    return best_feature

def majority_cnt(class_list):
    """Most frequent class label, used when no features remain to split on."""
    class_count = {}
    for vote in class_list:
        class_count[vote] = class_count.get(vote, 0) + 1
    sorted_class_count = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
    return sorted_class_count[0][0]

def create_tree(dataset, labels):
    """Recursively build an ID3 tree as nested dicts keyed by feature name."""
    class_list = [example[-1] for example in dataset]
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]  # all samples share one class: leaf node
    if len(dataset[0]) == 1:
        return majority_cnt(class_list)  # no features left: majority vote
    best_feat = choose_best_feature_to_split(dataset)
    best_feat_label = labels[best_feat]
    my_tree = {best_feat_label: {}}
    # Build the reduced label list by slicing instead of del, so the
    # caller's labels list is not mutated as a side effect.
    sub_labels = labels[:best_feat] + labels[best_feat+1:]
    feat_values = [example[best_feat] for example in dataset]
    unique_vals = set(feat_values)
    for value in unique_vals:
        my_tree[best_feat_label][value] = create_tree(
            split_dataset(dataset, best_feat, value), sub_labels)
    return my_tree
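To exercise the ID3 recursion end to end, here is a condensed, self-contained restatement of the same functions (using `collections.Counter`) run on a small toy dataset; the dataset and feature names ("no surfacing", "flippers") are invented for illustration:

```python
import math
from collections import Counter

def calc_entropy(dataset):
    counts = Counter(row[-1] for row in dataset)
    n = len(dataset)
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())

def split_dataset(dataset, axis, value):
    return [row[:axis] + row[axis + 1:] for row in dataset if row[axis] == value]

def choose_best_feature(dataset):
    base = calc_entropy(dataset)
    gains = []
    for i in range(len(dataset[0]) - 1):
        new_entropy = 0.0
        for v in set(row[i] for row in dataset):
            sub = split_dataset(dataset, i, v)
            new_entropy += len(sub) / len(dataset) * calc_entropy(sub)
        gains.append(base - new_entropy)
    return gains.index(max(gains))

def create_tree(dataset, labels):
    classes = [row[-1] for row in dataset]
    if classes.count(classes[0]) == len(classes):
        return classes[0]                             # pure node: leaf
    if len(dataset[0]) == 1:
        return Counter(classes).most_common(1)[0][0]  # no features left
    best = choose_best_feature(dataset)
    tree = {labels[best]: {}}
    sub_labels = labels[:best] + labels[best + 1:]
    for v in set(row[best] for row in dataset):
        tree[labels[best]][v] = create_tree(split_dataset(dataset, best, v),
                                            sub_labels)
    return tree

# Toy data: does the animal surface without flippers, does it have flippers,
# and is it a fish?
dataset = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'], [0, 1, 'no']]
tree = create_tree(dataset, ['no surfacing', 'flippers'])
print(tree)
```

Because this version builds the reduced label list by slicing, the caller's labels list is never mutated between recursive calls.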
3. Decision trees in the sklearn library:

sklearn provides decision trees through DecisionTreeClassifier and DecisionTreeRegressor. Note that sklearn implements only an optimized version of CART, not ID3 or C4.5. By default CART chooses, at each node, the split that minimizes the Gini index, though an entropy-based criterion is also available.

Code implementation:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
# 70/30 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=42)  # CART with the Gini index by default
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
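The split criterion of DecisionTreeClassifier is configurable: `criterion="entropy"` makes the splits information-gain based (closer in spirit to ID3), while the default `"gini"` is classic CART. A minimal comparison sketch on the same iris split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Train one tree per criterion and compare test accuracy.
results = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42)
    clf.fit(X_train, y_train)
    results[criterion] = accuracy_score(y_test, clf.predict(X_test))
    print(criterion, results[criterion])
```

On a small, well-separated dataset like iris the two criteria usually give near-identical accuracy; the choice matters more on larger, noisier data.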
4. Classifying the iris dataset:

The iris dataset is a classic classification dataset: 150 samples of 3 iris species, each described by 4 features (sepal length, sepal width, petal length, petal width).

Approach:

First, split the dataset into a training set and a test set. Then train a sklearn decision tree on the training set and predict on the test set. Finally, evaluate the model's performance with accuracy.

Limitations of this program:

  1. Decision trees overfit easily and need pruning;
  2. ID3-style trees cannot handle continuous features directly; such features must be discretized first (CART handles them natively via threshold splits);
  3. Missing values are handled poorly and need imputation, or a different algorithm altogether.
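On the first limitation, sklearn supports both pre-pruning (max_depth, min_samples_leaf, min_samples_split) and post-pruning via cost-complexity pruning (ccp_alpha). A minimal sketch; the parameter values here are illustrative, not tuned:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Unpruned baseline: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: cap the depth and require a minimum leaf size.
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                random_state=42).fit(X_train, y_train)

# Post-pruning: cost-complexity pruning with a small illustrative alpha.
ccp = DecisionTreeClassifier(ccp_alpha=0.02,
                             random_state=42).fit(X_train, y_train)

print("unpruned  depth:", full.get_depth(), "nodes:", full.tree_.node_count)
print("pre-pruned depth:", pruned.get_depth(), "nodes:", pruned.tree_.node_count)
print("ccp-pruned nodes:", ccp.tree_.node_count)
```

In practice ccp_alpha is chosen with `cost_complexity_pruning_path` plus cross-validation rather than fixed by hand.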

Original source: http://www.cveoy.top/t/topic/bGoa — copyright belongs to the author. Do not repost or scrape.
