-means clustering

Procedure

  • Pick starting points from a list of sample points.
  • Each data point is labelled based on which of the starting points it is closest to.
  • Calculate the mean location of each cluster (i.e., its clustroid) and perform another pass of re-clustering.
    • If the clusters in this pass are the same as those in the previous pass, we can stop.

Considerations

  • Vary the sample points chosen to start the algorithm. Ideally, we want low total variance from points to their clustroid.

Choosing

  • Vary depending on the domain knowledge or quality of clustering.
  • Use the that is at the ”elbow” of the variance-vs- plot.
    • This is subjective and sometimes not-obvious in the plot.
  • Use the Silhouette method to measure how similar points tend to be to points in their clusters compared to other clusters.
    • Pick the which maximizes the silhouette score.

DBSCAN

  • Involves creating clusters starting from select core points— points which have other points within distance of them (for tunable and ).
  • The cluster is extended by adding all core points that are reachable (within distance ) from core points already in the cluster.
  • Finally, all non-core points that are distance to a core point in the cluster are added to the cluster.
  • This process is repeated until all points are added.
  • For more information on this see the Wikipedia page

Hierarchical

  • A technique that seeks to build a hierarchy of clusters from a specified distance metric
  • Agglomerative Clustering is bottom up—each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive Clustering is top-down—all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Calculating Distance between Clusters

For each of these,

let denote the distance between two clusters , and

the distance between two data points.

  • Complete Linkage Clustering - clusters based on the maximum distance between two elements of two clusters.

  • Single Linkage Clustering - clusters based on the minimum distance between two elements of two clusters.

  • Unweighted Average Linkage Clustering - clusters based on the average distance between all pairs of elements from .

Gaussian Mixture Models

  • Assume all datapoints with a certain label are distributed based on a Gaussian distribution parameterized by label-specific mean , covariance and label probability .
  • The goal, then, is to maximize the probability of observing a datapoint given the label-specific Gaussian distributions.
    • This is done by tweaking each of the , and .
  • See more here

Links