
Cluster Detection

Cluster detection in quality analytics identifies distinct subgroups (modes) within process data. Unlike outlier detection, which flags individual extreme points, cluster detection finds coherent subpopulations that may have different means, variances, or distribution shapes.

Why It Matters

Multimodal data is surprisingly common in manufacturing. Two tool inserts with slightly different geometry produce two clusters of measurements. A thermal drift creates a gradual transition between two operating modes. Material from two different suppliers creates distinct populations within the same production run.

When clusters are present, single-distribution statistics (mean, standard deviation, Cpk, control limits) no longer describe the process. The pooled mean can fall between the clusters, where little or no data exists; the standard deviation is inflated by the inter-cluster distance; and the resulting control limits are too wide to detect shifts within either cluster.
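The effect on pooled statistics can be illustrated with a small simulation. This is a minimal sketch using only the Python standard library; the two cluster centers (10.00 and 10.30), the within-cluster standard deviation (0.02), and the sample sizes are hypothetical values chosen for illustration, not data from any real process.

```python
import random
import statistics

random.seed(0)

# Hypothetical measurements: two tool inserts centered at 10.00 and 10.30,
# each with a within-cluster standard deviation of 0.02 (assumed values).
cluster_a = [random.gauss(10.00, 0.02) for _ in range(200)]
cluster_b = [random.gauss(10.30, 0.02) for _ in range(200)]
pooled = cluster_a + cluster_b

pooled_mean = statistics.fmean(pooled)
pooled_sd = statistics.stdev(pooled)
within_sd = statistics.stdev(cluster_a)

# Count how many measurements lie within 0.05 of the pooled mean.
near_mean = sum(abs(x - pooled_mean) < 0.05 for x in pooled)

print(f"pooled mean            = {pooled_mean:.3f}")
print(f"pooled std dev         = {pooled_sd:.3f}")
print(f"within-cluster std dev = {within_sd:.3f}")
print(f"points near the mean   = {near_mean}")
```

Under these assumptions the pooled mean lands in the empty gap between the two clusters, and the pooled standard deviation is several times larger than the spread inside either cluster, so any limits derived from it would be correspondingly wide.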

Traditional cluster detection methods such as k-means and Gaussian mixture models (GMM) typically require the number of clusters to be specified in advance. GMM additionally assumes each cluster is normally distributed, and k-means implicitly favors compact, similarly sized clusters. In practice, quality engineers often do not know how many clusters to expect, and the clusters may not be Gaussian.

The EntropyStat Perspective

EntropyStat's ELDF (Entropic Local Distribution Function) detects clusters as a natural byproduct of local distribution analysis. Unlike k-means or GMM, the ELDF does not require specifying the number of clusters in advance — it discovers them from the local density structure of the data. And unlike GMM, it does not assume each cluster is normally distributed.

The detection works by analyzing the ELDF's density structure. Peaks in the local density correspond to cluster centers, and valleys correspond to boundaries between clusters. This topographic approach is robust to cluster shape: overlapping clusters, clusters of different sizes, and non-elliptical clusters are all detected naturally.
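The peak-and-valley idea can be sketched with a generic density estimate. The example below is not the ELDF: it swaps in a plain fixed-bandwidth Gaussian kernel density estimate as a stand-in, and the helper names (`gaussian_kde`, `local_extrema`), the bandwidth of 0.03, and the simulated data are all assumptions made for illustration.

```python
import math
import random

def gaussian_kde(data, grid, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian kernel density estimate on grid."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in data)
        for g in grid
    ]

def local_extrema(grid, density):
    """Return (peaks, valleys): grid positions of interior local maxima/minima."""
    peaks, valleys = [], []
    for i in range(1, len(grid) - 1):
        left, mid, right = density[i - 1], density[i], density[i + 1]
        if mid > left and mid > right:
            peaks.append(grid[i])
        elif mid < left and mid < right:
            valleys.append(grid[i])
    return peaks, valleys

random.seed(1)
# Hypothetical two-cluster sample (assumed centers 10.00 and 10.30).
data = ([random.gauss(10.00, 0.02) for _ in range(150)]
        + [random.gauss(10.30, 0.02) for _ in range(150)])

# Evaluate the density on an evenly spaced grid spanning the data.
lo, hi = min(data), max(data)
grid = [lo + (hi - lo) * i / 200 for i in range(201)]
density = gaussian_kde(data, grid, bandwidth=0.03)

peaks, valleys = local_extrema(grid, density)
print("cluster centers (peaks):", [round(p, 3) for p in peaks])
print("boundaries (valleys):   ", [round(v, 3) for v in valleys])
```

Note that the number of clusters is never passed in: it falls out of how many peaks the density has. In this stand-in the result does depend on the chosen bandwidth, which is a property of the kernel estimate used here and says nothing about how the ELDF behaves.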

Once clusters are detected, EntropyStat fits a separate EGDF to each one. This means you get per-cluster capability indices, per-cluster control limits, and per-cluster tolerance intervals — all computed with the same assumption-free entropy methods that handle the global analysis. The engineer sees not just "your data has two clusters" but a complete analytical profile of each subpopulation.
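The value of per-cluster analysis can be illustrated with conventional formulas. The sketch below does not fit an EGDF; it uses the textbook normal-theory Cpk as a stand-in, and the specification limits, the cluster boundary, and the simulated data are hypothetical values chosen for illustration.

```python
import random
import statistics

def cpk(values, lsl, usl):
    """Conventional Cpk: distance from the mean to the nearest limit over 3 sigma."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return min(usl - mean, mean - lsl) / (3.0 * sd)

random.seed(2)
# Hypothetical two-cluster sample (assumed centers 10.00 and 10.30).
data = ([random.gauss(10.00, 0.02) for _ in range(150)]
        + [random.gauss(10.30, 0.02) for _ in range(150)])

LSL, USL = 9.90, 10.40  # hypothetical specification limits
BOUNDARY = 10.15        # hypothetical valley location between the clusters

# Assign each measurement to a cluster by which side of the boundary it is on.
clusters = {
    "low": [x for x in data if x < BOUNDARY],
    "high": [x for x in data if x >= BOUNDARY],
}

pooled_cpk = cpk(data, LSL, USL)
cluster_cpk = {name: cpk(values, LSL, USL) for name, values in clusters.items()}

print(f"pooled: Cpk = {pooled_cpk:.2f}")
for name, values in clusters.items():
    print(f"{name}: n = {len(values)}, mean = {statistics.fmean(values):.3f}, "
          f"sd = {statistics.stdev(values):.3f}, Cpk = {cluster_cpk[name]:.2f}")
```

With these assumed numbers the pooled index suggests a process that is not capable, while each subpopulation taken alone sits comfortably inside the limits. That is the kind of difference a per-cluster profile is meant to expose.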

