Machine Learning - Unit 5: Clustering

Overview

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data analysis and is used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

My Reflection

This week focused on clustering, a family of unsupervised machine learning algorithms that group similar data points together and can then assign unseen data to the most relevant group.

The unit's materials included a lecturecast and a short list of readings. One of the readings covered clustering evaluation metrics, specifically the Silhouette method and the Sum of Squared Errors (SSE). The Silhouette score, ranging from -1 to 1, assesses how well each data point fits within its assigned cluster compared to the other clusters; higher scores indicate better-defined clusters. SSE, on the other hand, measures intra-cluster compactness by summing the squared distances between data points and their respective centroids; lower SSE values suggest tighter groupings. Together, these metrics support techniques like the elbow method and silhouette analysis for choosing the number of clusters, balancing cohesion and separation in unsupervised learning.
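To make the two metrics concrete for myself, here is a minimal sketch of how they can be computed in plain Python. This is my own illustration, not code from the unit readings: the function names are hypothetical, Euclidean distance is assumed, and scoring a single-member cluster as 0 is a convention I chose for the sketch.

```python
import math


def sum_of_squared_errors(points, labels, centroids):
    """SSE: squared Euclidean distance from each point to its own centroid, summed."""
    return sum(math.dist(p, centroids[k]) ** 2 for p, k in zip(points, labels))


def mean_distance(point, others):
    """Average Euclidean distance from `point` to every point in `others`."""
    return sum(math.dist(point, q) for q in others) / len(others)


def silhouette_score(points, labels):
    """Mean silhouette over all points; each point scores (b - a) / max(a, b)."""
    cluster_ids = set(labels)
    if len(cluster_ids) < 2:
        raise ValueError("silhouette needs at least two clusters")
    members = {k: [p for p, l in zip(points, labels) if l == k] for k in cluster_ids}
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        same = [q for j, (q, l) in enumerate(zip(points, labels)) if l == own and j != i]
        if not same:
            scores.append(0.0)  # assumption: a single-member cluster scores 0
            continue
        a = mean_distance(p, same)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean_distance(p, members[k]) for k in cluster_ids if k != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Running both functions for several candidate values of k, and comparing where SSE stops dropping sharply and where the silhouette score peaks, is the idea behind the elbow method and silhouette analysis.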

The materials also included two interactive animations that explain the concept of clustering. The first lets the user choose the initial centroids, either randomly or by picking them manually, and then shows how the algorithm proceeds on a few pre-loaded datasets. The second shows how the centroids are repositioned gradually to minimise the distance between each centroid and the members of its cluster.
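The behaviour shown in the animations can be sketched as the usual assign-then-update loop of k-means. The code below is my own minimal sketch and not the code behind the animations: the function names are hypothetical, Euclidean distance is assumed, the initial centroids are passed in by hand (like picking them in the first animation), and a centroid that loses all its members is simply left where it is.

```python
import math


def assign_clusters(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == k]
        if not members:
            updated.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        updated.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return updated


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        labels = assign_clusters(points, centroids)
        updated = update_centroids(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return assign_clusters(points, centroids), centroids
```

For example, calling `k_means(points, [(0, 0), (10, 10)])` on a list of 2-D tuples returns one label per point together with the final centroid positions.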

For the team project, my colleagues started to refine the linear regression model I built in the previous week, generating further evaluation metrics and visualisations. We also started drafting our report. One colleague also started building a clustering model to infer the different clusters of Airbnb assets, based on the given features.

Artefacts

Jaccard Coefficient Calculations

This formative activity introduced the concept of similarity measurement in machine learning through the Jaccard coefficient, applied to a small dataset of pathological test results. We were asked to compute the Jaccard coefficient for three pairs of individuals (Jack and Mary, Jack and Jim, Jim and Mary) based on binary and categorical features such as symptoms and test results. The task encouraged critical thinking about how data representation affects similarity metrics and clustering outcomes.

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       A
Mary  F       Y      N      P       A       P       N
Jim   M       Y      P      N       N       N       A
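
As a minimal sketch of the calculation itself, the Jaccard coefficient for two binary vectors is the number of positions where both are 1, divided by the number of positions where at least one is 1. The function name and the example vectors below are hypothetical: how the table's values (Y, N, P, A) should be mapped to 0 and 1 is a representation choice that the sketch deliberately leaves open, and returning 0 when neither vector has any 1 is my own convention.

```python
def jaccard_coefficient(a, b):
    """Jaccard coefficient for two equal-length binary (0/1) vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either == 0:
        return 0.0  # assumption: no positive attributes at all counts as 0
    return both / either


# Hypothetical encoded records, not the encoding of the table above.
person_a = [1, 0, 1, 0, 0, 0]
person_b = [1, 0, 1, 0, 1, 0]
print(jaccard_coefficient(person_a, person_b))
```

Positions where both vectors are 0 are ignored, which is why the choice of which values count as 1 changes the resulting similarity.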