k-means clustering
Procedure
- Pick k starting points from the list of sample points.
- Each data point is labelled based on which of the starting points it is closest to.
- Calculate the mean location of each cluster (i.e., its centroid) and perform another pass of re-clustering.
- If the clusters in this pass are the same as those in the previous pass, we can stop.
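
A minimal NumPy sketch of this procedure; the function name `k_means` and the toy data are illustrative assumptions, not from the original notes.

```python
import numpy as np

def k_means(points, k, rng=None):
    rng = np.random.default_rng(rng)
    # Pick k starting points from the list of sample points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Label each point by the centroid it is closest to.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Calculate the mean location (centroid) of each cluster.
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once the centroids (and hence the clusters) no longer change.
        if np.allclose(new_centroids, centroids):
            return labels, new_centroids
        centroids = new_centroids

# Example usage on toy 2-D data (assumed, for illustration only).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
labels, centroids = k_means(data, k=2, rng=0)
```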
Considerations
- Vary the sample points chosen to start the algorithm; ideally, we want the result with the lowest total variance from points to their centroids.
Choosing k
- Vary k depending on domain knowledge or the quality of the clustering.
- Use the k that is at the "elbow" of the variance-vs-k plot.
  - This is subjective and sometimes not obvious in the plot.
- Use the Silhouette method to measure how similar points are to points in their own cluster compared to points in other clusters.
  - Pick the k which maximizes the silhouette score.
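
A hedged sketch of both heuristics using scikit-learn's `KMeans` and `silhouette_score`; the toy data and the range of k values tried are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(60, 2)) + offset for offset in (0, 5, 10)])

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_            # total within-cluster variance, for the elbow plot
    silhouettes[k] = silhouette_score(X, model.labels_)

best_k = max(silhouettes, key=silhouettes.get)  # k that maximizes the silhouette score
```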
DBSCAN
- Involves creating clusters starting from select core points: points which have at least minPts other points within distance ε of them (for tunable minPts and ε).
- The cluster is extended by adding all core points that are reachable (within distance ε) from core points already in the cluster.
- Finally, all non-core points that are within distance ε of a core point in the cluster are added to the cluster.
- This process is repeated until every point has been assigned to a cluster or marked as noise (see the sketch after this list).
- For more information on this see the Wikipedia page
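
A short sketch using scikit-learn's `DBSCAN`; the `eps` and `min_samples` values below are arbitrary illustrative choices, not recommendations from the notes.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(80, 2)), rng.normal(size=(80, 2)) + 6])

# eps is the distance threshold; min_samples is how many neighbours a point
# needs within eps to count as a core point.
clustering = DBSCAN(eps=0.7, min_samples=5).fit(X)

labels = clustering.labels_          # -1 marks noise points not reachable from any core point
core_mask = np.zeros(len(X), dtype=bool)
core_mask[clustering.core_sample_indices_] = True
```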
Hierarchical Clustering
- A technique that seeks to build a hierarchy of clusters from a specified distance metric.
- Agglomerative Clustering is bottom-up—each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive Clustering is top-down—all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Calculating Distance between Clusters
For each of these, let A and B be the two clusters whose distance is being measured.
- Complete Linkage Clustering - clusters based on the maximum distance between an element of A and an element of B.
- Single Linkage Clustering - clusters based on the minimum distance between an element of A and an element of B.
- Unweighted Average Linkage Clustering - clusters based on the average distance between all pairs of elements from A and B.
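
A sketch of agglomerative clustering with SciPy's `linkage` and `fcluster`, where the `method` argument selects among the linkage criteria above; the toy data and the cut threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(30, 2)), rng.normal(size=(30, 2)) + 4])

# method='complete', 'single', or 'average' corresponds to the three linkage
# criteria above (maximum, minimum, and unweighted average pairwise distance).
Z = linkage(X, method='average', metric='euclidean')

# Cut the hierarchy into flat clusters at a chosen distance threshold.
labels = fcluster(Z, t=3.0, criterion='distance')
```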
Gaussian Mixture Models
- Assume all datapoints with a certain label are distributed based on a Gaussian distribution parameterized by a label-specific mean μ, covariance Σ, and label probability π.
- The goal, then, is to maximize the probability of observing the data given the label-specific Gaussian distributions.
- This is done by tweaking each of the μ, Σ, and π, typically via the Expectation-Maximization (EM) algorithm (see the sketch after this list).
- See more here
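
A sketch using scikit-learn's `GaussianMixture`, which fits the means, covariances, and label probabilities via EM; the toy data and `n_components` are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 4])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

means = gmm.means_              # label-specific means
covariances = gmm.covariances_  # label-specific covariances
weights = gmm.weights_          # label probabilities
labels = gmm.predict(X)         # most likely label for each datapoint
```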