You have decided to compare the solution from k-means to solutions derived from agglomerative hierarchical clustering to ensure that your solution is optimal. Do NOT assume that the optimal number of cl

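Q2.1a. Using complete linkage: After running the linkage function with complete linkage and the Euclidean distance metric, we can plot the merge tree with the dendrogram function. The dendrogram does not mark an optimal allocation by itself; it shows the height at which clusters merge, and a large gap between successive merge heights suggests a natural number of clusters at which to cut the tree.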

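Z = linkage(X, 'complete', 'euclidean');
figure; dendrogram(Z);                        % inspect the merge heights before choosing K
% K is the number of clusters (recharge stations) read off the dendrogram; this cut needs K >= 2.
cutoff = mean([Z(end-K+1,3) Z(end-K+2,3)]);   % midpoint between the merges that leave K and K-1 clusters
inds = cluster(Z, 'Cutoff', cutoff, 'Criterion', 'distance');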
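
The midpoint cutoff used above is one way of asking for exactly K clusters. As a minimal sketch, assuming synthetic coordinates in place of the real pump data (X_demo, K_demo and the other _demo names are illustrative and not part of the assignment), it can be checked against the 'MaxClust' option of cluster, which requests the cluster count directly:

```matlab
% Synthetic stand-in for the pump coordinates (assumption: not the real data).
rng(1);
X_demo = [randn(20,2); ...
          bsxfun(@plus, randn(20,2), [8 8]); ...
          bsxfun(@plus, randn(20,2), [8 -8])];
K_demo = 3;   % illustrative value only

Z_demo = linkage(X_demo, 'complete', 'euclidean');

% Midpoint between the merge that leaves K clusters and the merge that leaves K-1.
cutoff_demo = mean([Z_demo(end-K_demo+1,3) Z_demo(end-K_demo+2,3)]);
inds_cut = cluster(Z_demo, 'Cutoff', cutoff_demo, 'Criterion', 'distance');

% Same request expressed as a cluster count.
inds_max = cluster(Z_demo, 'MaxClust', K_demo);

n_cut = numel(unique(inds_cut));
n_max = numel(unique(inds_max));
fprintf('cutoff: %d clusters, MaxClust: %d clusters\n', n_cut, n_max);
```

The 'MaxClust' form also works for a single cluster, where the index arithmetic of the midpoint cutoff would run past the end of Z.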

Given the number of clusters K, we take the location of each recharge station to be the mean location of the pumps in its cluster, using the cluster indices returned by the cluster function.

recharge_stations = zeros(K, 2);
for i = 1:K
    members = (inds == i);                 % logical mask of pumps in cluster i
    % Average down dimension 1 so a single-pump cluster still gives a 1-by-2 location.
    recharge_stations(i, :) = mean(X(members, :), 1);
end
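
One detail of the centroid step is worth a small illustration: when a cluster contains a single pump, mean without a dimension argument averages along the row and returns a scalar. The sketch below uses a made-up coordinate pair (single_pump is hypothetical) to show the difference:

```matlab
% Hypothetical coordinates of a cluster that contains a single pump.
single_pump = [3 7];

collapsed = mean(single_pump);      % averages along the row: scalar 5
centroid  = mean(single_pump, 1);   % averages down dimension 1: stays [3 7]

disp(collapsed);
disp(centroid);
```

This matters most for single linkage, which often leaves isolated pumps in clusters of their own.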

We can then plot the pump locations and recharge stations on the same graph and colour the pumps according to their recharge station assignments.

figure;
scatter(X(:,1), X(:,2), [], inds, 'filled');
hold on;
scatter(recharge_stations(:,1), recharge_stations(:,2), 'rx');
title('Complete linkage');

Q2.1b. We can obtain a plot of the cost versus the number of recharge stations by calculating the total cost for each number of recharge stations and plotting it against the number of recharge stations.

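% pump_costs and recharge_costs are defined outside this section (assumed: per-pump vector, function of station location).
costs = zeros(K, 1);
for k = 1:K                                  % candidate number of stations
    inds_k = cluster(Z, 'MaxClust', k);      % re-cut the tree into k clusters
    for i = 1:k
        members = (inds_k == i);
        station = mean(X(members, :), 1);    % centroid; dimension 1 keeps singletons 1-by-2
        costs(k) = costs(k) + sum(pump_costs(members)) + sum(recharge_costs(station));
    end
end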

figure;
plot(1:K, costs);
xlabel('Number of recharge stations');
ylabel('Total cost');
title('Complete linkage');
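
Because the number of stations should not be assumed in advance, the cost curve is what selects it. The following is only a sketch under stated assumptions: the data are synthetic, and the cost model (total Euclidean pump-to-station distance plus a fixed station_cost per station) is a hypothetical stand-in for the assignment's actual cost function, which is not shown in this section.

```matlab
% Synthetic stand-in for the pump coordinates (assumption: not the real data).
rng(3);
X_demo = [randn(25,2); ...
          bsxfun(@plus, randn(25,2), [12 0]); ...
          bsxfun(@plus, randn(25,2), [0 12])];

station_cost = 30;   % hypothetical fixed cost per recharge station
max_k = 8;           % largest number of stations considered

Z_demo = linkage(X_demo, 'complete', 'euclidean');
costs_demo = zeros(max_k, 1);
for k = 1:max_k
    idx = cluster(Z_demo, 'MaxClust', k);          % cut the tree into k clusters
    for c = 1:k
        pts = X_demo(idx == c, :);
        centre = mean(pts, 1);                     % station at the cluster centroid
        d = sqrt(sum(bsxfun(@minus, pts, centre).^2, 2));
        costs_demo(k) = costs_demo(k) + sum(d);    % hypothetical distance cost
    end
    costs_demo(k) = costs_demo(k) + station_cost * k;
end

[~, best_k] = min(costs_demo);                     % number of stations with lowest total cost
fprintf('lowest-cost number of stations: %d\n', best_k);
```

The same loop can be rerun with 'average' or 'single' in the linkage call to compare the curves across linkage methods.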

Q2.2. Repeat using average linkage: We can repeat the above steps using average linkage instead of complete linkage.

Z = linkage(X, 'average', 'euclidean');
cutoff = mean([Z(end-K+1,3) Z(end-K+2,3)]);
inds = cluster(Z, 'Cutoff', cutoff, 'Criterion', 'distance');

recharge_stations = zeros(K, 2);
for i = 1:K
    members = (inds == i);                 % logical mask of pumps in cluster i
    % Average down dimension 1 so a single-pump cluster still gives a 1-by-2 location.
    recharge_stations(i, :) = mean(X(members, :), 1);
end

figure;
scatter(X(:,1), X(:,2), [], inds, 'filled');
hold on;
scatter(recharge_stations(:,1), recharge_stations(:,2), 'rx');
title('Average linkage');

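% pump_costs and recharge_costs are defined outside this section (assumed: per-pump vector, function of station location).
costs = zeros(K, 1);
for k = 1:K                                  % candidate number of stations
    inds_k = cluster(Z, 'MaxClust', k);      % re-cut the tree into k clusters
    for i = 1:k
        members = (inds_k == i);
        station = mean(X(members, :), 1);    % centroid; dimension 1 keeps singletons 1-by-2
        costs(k) = costs(k) + sum(pump_costs(members)) + sum(recharge_costs(station));
    end
end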

figure;
plot(1:K, costs);
xlabel('Number of recharge stations');
ylabel('Total cost');
title('Average linkage');

Q2.3. Repeat using single linkage: We can repeat the above steps using single linkage instead of complete linkage.

Z = linkage(X, 'single', 'euclidean');
cutoff = mean([Z(end-K+1,3) Z(end-K+2,3)]);
inds = cluster(Z, 'Cutoff', cutoff, 'Criterion', 'distance');

recharge_stations = zeros(K, 2);
for i = 1:K
    members = (inds == i);                 % logical mask of pumps in cluster i
    % Average down dimension 1 so a single-pump cluster still gives a 1-by-2 location.
    recharge_stations(i, :) = mean(X(members, :), 1);
end

figure;
scatter(X(:,1), X(:,2), [], inds, 'filled');
hold on;
scatter(recharge_stations(:,1), recharge_stations(:,2), 'rx');
title('Single linkage');

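% pump_costs and recharge_costs are defined outside this section (assumed: per-pump vector, function of station location).
costs = zeros(K, 1);
for k = 1:K                                  % candidate number of stations
    inds_k = cluster(Z, 'MaxClust', k);      % re-cut the tree into k clusters
    for i = 1:k
        members = (inds_k == i);
        station = mean(X(members, :), 1);    % centroid; dimension 1 keeps singletons 1-by-2
        costs(k) = costs(k) + sum(pump_costs(members)) + sum(recharge_costs(station));
    end
end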

figure;
plot(1:K, costs);
xlabel('Number of recharge stations');
ylabel('Total cost');
title('Single linkage');

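Q2.4. Which linkage gives the same result as k-means? Why? Average linkage gives the same result as k-means. K-means minimises the within-cluster sum of squared Euclidean distances: it assigns each point to the closest cluster centre, then recalculates each centre as the mean of its points, and repeats until the assignments stop changing. Average linkage does not minimise variance directly. It computes the average distance between all pairs of points in two clusters and merges the two clusters with the smallest average distance until the desired number of clusters is obtained. Because this criterion takes every point in both clusters into account, it favours compact, roughly spherical clusters, which is the same kind of cluster that k-means produces, so on reasonably well-separated data the two methods arrive at the same partition. (Among hierarchical methods, Ward linkage is the one that explicitly minimises the increase in within-cluster variance.)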
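
Whether two methods give "the same result" has to be checked on the partitions rather than on the raw indices, since cluster numbering is arbitrary. A minimal sketch, assuming synthetic well-separated data in place of the pump locations (all _demo names are illustrative):

```matlab
% Synthetic stand-in for the pump coordinates (assumption: not the real data).
rng(2);
X_demo = [randn(30,2); ...
          bsxfun(@plus, randn(30,2), [10 0]); ...
          bsxfun(@plus, randn(30,2), [0 10])];
K_demo = 3;   % illustrative value only

idx_km  = kmeans(X_demo, K_demo, 'Replicates', 5);
idx_avg = cluster(linkage(X_demo, 'average', 'euclidean'), 'MaxClust', K_demo);

% Contingency table of k-means labels (rows) against average-linkage labels (columns).
tbl = crosstab(idx_km, idx_avg);

% The partitions match when every row and every column has exactly one non-zero cell.
same_partition = all(sum(tbl > 0, 1) == 1) && all(sum(tbl > 0, 2) == 1);
disp(tbl);
disp(same_partition);
```

Swapping 'average' for 'complete' or 'single' in the linkage call gives the corresponding comparison for the other two methods.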

Q2.5. Explain why single linkage gives such poor results. Single linkage can give poor results because it is susceptible to the chaining effect: two clusters are merged based on their closest pair of points, regardless of how far apart the rest of the points in the clusters are. This can lead to long, thin clusters that are not meaningful. Additionally, single linkage is sensitive to noise and outliers, as a single intermediate point can be enough to link two clusters together.
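
The chaining effect can be reproduced on a small constructed layout. The sketch below is an illustration only, with made-up coordinates: two dense groups joined by a line of closely spaced points, plus one mildly isolated point. Single linkage is compared with complete linkage when two clusters are requested.

```matlab
% Two dense 5-by-5 groups of points, 10 units apart (hypothetical layout).
[gx, gy] = meshgrid(-1:0.5:1, -1:0.5:1);
groupA = [gx(:), gy(:)];
groupB = [gx(:) + 10, gy(:)];

% A thin bridge of points between the groups, and one mildly isolated point.
bridge   = [(1.5:0.5:8.5)', zeros(15, 1)];
isolated = [5 3];

X_demo = [groupA; groupB; bridge; isolated];

idx_single   = cluster(linkage(X_demo, 'single',   'euclidean'), 'MaxClust', 2);
idx_complete = cluster(linkage(X_demo, 'complete', 'euclidean'), 'MaxClust', 2);

% Number of points in each of the two clusters.
sizes_single   = accumarray(idx_single, 1);
sizes_complete = accumarray(idx_complete, 1);

disp(sort(sizes_single)');
disp(sort(sizes_complete)');
```

With single linkage the bridge links both dense groups into one cluster at a merge height of 0.5, so the second cluster is spent on the isolated point alone. Complete linkage merges on the farthest pair instead, so it does not leave a single point as its own cluster here.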
