Friday 13 January 2012

K-means clustering

Hello friends,

Today in class we learnt about k-means clustering, a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

The two key features of k-means which make it efficient are often regarded as its biggest drawbacks:

- Euclidean distance is used as a metric and variance is used as a measure of cluster scatter

- The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set
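To make the first point concrete, here is a small sketch (the data values are made up for illustration) of the within-cluster sum of squares, the Euclidean/variance criterion that k-means tries to minimise. A grouping that matches the real structure of the data scores lower than an arbitrary one:

```python
import numpy as np

# Toy 1-D data set with two obvious groups (hypothetical values).
points = np.array([1.0, 1.5, 2.0, 9.0, 9.5, 10.0])

def wcss(points, labels):
    """Within-cluster sum of squares: the quantity k-means minimises."""
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        total += np.sum((members - members.mean()) ** 2)
    return total

good = np.array([0, 0, 0, 1, 1, 1])   # matches the two natural groups
bad = np.array([0, 0, 1, 1, 0, 1])    # an arbitrary split
print(wcss(points, good) < wcss(points, bad))  # prints True
```

Running this objective for several values of k (an "elbow" plot) is one of the diagnostic checks mentioned above for choosing the number of clusters.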

K-means clustering, particularly when implemented with heuristics such as Lloyd's algorithm, is fairly easy to implement and apply even to large data sets. As such, it has been used successfully in areas ranging from market segmentation, computer vision, geostatistics and astronomy to agriculture.
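A minimal sketch of Lloyd's algorithm, in Python with numpy (the data and parameters here are just illustrative): alternate between assigning each point to its nearest centre and moving each centre to the mean of its assigned points.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialise centres by picking k distinct data points at random.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centre moves to the mean of its cluster.
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return centres, labels

# Two well-separated groups of points (hypothetical data).
X = np.array([[1.0, 1.0], [1.2, 0.8], [9.0, 9.0], [8.8, 9.2]])
centres, labels = lloyd_kmeans(X, k=2)
```

On data this clean the algorithm recovers the two groups regardless of which points are drawn as initial centres; on harder data the random initialisation matters, which is exactly the weakness k-means++ (below in this post) addresses.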

K-means clustering is often used for STP (Segmentation, Targeting and Positioning) analysis. Once we get valid clusters, we profile each cluster by frequency, and then use graphs to represent the data and spot any errors.
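The frequency-profiling step can be as simple as counting how many observations fall into each segment (the labels below are hypothetical output from a segmentation run):

```python
from collections import Counter

# Hypothetical cluster labels assigned to eight customers.
labels = [0, 1, 0, 2, 1, 0, 2, 0]

# Frequency profile: how large is each segment?
freq = Counter(labels)
print(dict(freq))  # prints {0: 4, 1: 2, 2: 2}
```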

There is also an advanced version in applied statistics: k-means++, an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii as an approximation algorithm for the NP-hard k-means problem, a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm.
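The core idea of k-means++ is D² sampling: pick the first centre uniformly at random, then pick each further centre with probability proportional to its squared distance from the nearest centre already chosen, so seeds tend to spread out. A sketch (example data made up for illustration):

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-means++ seeding: spread initial centres out via D^2 sampling."""
    rng = np.random.default_rng(seed)
    # First centre: a uniformly random data point.
    centres = [X[rng.integers(len(X))]]
    while len(centres) < k:
        # Squared distance from each point to its nearest chosen centre.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        # Sample the next centre with probability proportional to d^2.
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)

# Two distant groups of points (hypothetical data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
seeds = kmeans_pp_seeds(X, k=2)
```

Because an already-chosen centre has distance zero to itself, it can never be picked twice, and far-away points are heavily favoured; the seeds then replace the uniform random initialisation in standard k-means.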

I hope this information will clear up the basics (which I was confused about).

Will keep you posted with new stuff…till then

Cya ;)
