
How to determine epsilon and MinPts parameters of DBSCAN clustering

Every data mining task has the problem of parameters, and every parameter influences the algorithm in specific ways. DBSCAN needs two parameters, ε and minPts, and both must be specified by the user. Ideally, the value of ε is given by the problem to solve (e.g. a physical distance), and minPts is then the desired minimum cluster size.

  • MinPts: As a rule of thumb, a minimum minPts can be derived from the number of dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not make sense, as then every point on its own would already be a cluster. With minPts ≤ 2, the result will be the same as that of hierarchical clustering with the single-link metric, with the dendrogram cut at height ε. Therefore, minPts must be chosen to be at least 3. However, larger values are usually better for data sets with noise and will yield more significant clusters. As a rule of thumb, minPts = 2·dim can be used, but it may be necessary to choose larger values for very large data, for noisy data, or for data that contains many duplicates.
  • ε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to the k = minPts − 1 nearest neighbor, ordered from the largest to the smallest value. Good values of ε are where this plot shows an “elbow”. If ε is chosen much too small, a large part of the data will not be clustered; whereas for a too-high value of ε, clusters will merge and the majority of objects will end up in the same cluster. In general, small values of ε are preferable, and as a rule of thumb only a small fraction of points should be within this distance of each other. Alternatively, an OPTICS plot can be used to choose ε, but then the OPTICS algorithm itself can be used to cluster the data.
  • Distance function: The choice of distance function is tightly coupled to the choice of ε and has a major impact on the results. In general, it is necessary to first identify a reasonable measure of similarity for the data set before ε can be chosen. There is no estimation for this parameter; the distance function simply needs to be chosen appropriately for the data set. For example, on geographic data, the great-circle distance is often a good choice (see the sketch after this list).
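
As an illustrative sketch (not from the original text): scikit-learn’s DBSCAN supports the haversine (great-circle) metric, so the heuristics above can be combined for geographic data roughly as follows. The coords array and the 1.5 km radius are hypothetical placeholders.

import numpy as np
from sklearn.cluster import DBSCAN

# coords: hypothetical (n_samples, 2) array of [latitude, longitude] in degrees
coords = np.array([[52.52, 13.40], [52.51, 13.41], [48.86, 2.35]])

earth_radius_km = 6371.0
eps_km = 1.5  # illustrative: "points within 1.5 km are neighbors"

db = DBSCAN(
    eps=eps_km / earth_radius_km,  # the haversine metric works in radians
    min_samples=4,                 # 2 * dim for 2-dimensional data
    metric="haversine",
    algorithm="ball_tree",
).fit(np.radians(coords))
labels = db.labels_  # -1 marks noise points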

OPTICS can be seen as a generalization of DBSCAN that replaces the ε parameter with a maximum value that mostly affects performance. MinPts then essentially becomes the minimum cluster size to find. While the algorithm is much easier to parameterize than DBSCAN, the results are a bit more difficult to use, as it will usually produce a hierarchical clustering instead of the simple data partitioning that DBSCAN produces.

Recently, one of the original authors of DBSCAN has revisited DBSCAN and OPTICS, and published a refined version of hierarchical DBSCAN (HDBSCAN*), which no longer has the notion of border points. Instead, only the core points form the cluster.
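As a minimal sketch, assuming the third-party hdbscan package is installed (pip install hdbscan) and using toy data from scikit-learn:

import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # toy data

# min_cluster_size plays the role of the minimum cluster size;
# no eps parameter needs to be chosen at all
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)  # -1 = noise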

There are several ways to determine ε:

1) k-distance plot

In a clustering with minPts = k, we expect that the k-distances of core points and border points fall within a certain range, while noise points can have much greater k-distances; thus we can observe a knee point in the k-distance plot. However, sometimes there may be no obvious knee, or there can be multiple knees, which makes it hard to decide on a value.

The method proposed here consists of computing the k-nearest-neighbor distances for every point in the data set.

The idea is to calculate the average of the distances from every point to its k nearest neighbors. The value of k is specified by the user and corresponds to MinPts.

Next, these k-distances are plotted in ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter.

A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.

In the example from the original article, the optimal eps value can be seen to be around a distance of 0.15.
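
One way to locate the knee programmatically is the third-party kneed package; the sketch below is an assumption about tooling (not part of the original method) and uses toy data from scikit-learn:

import numpy as np
from kneed import KneeLocator  # third-party: pip install kneed
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # toy data
k = 4  # k = the MinPts value of the clustering you plan to run

# distance from each point to its k-th nearest neighbor
# (n_neighbors = k + 1 because each point is returned as its own neighbor)
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_distances = np.sort(dists[:, -1])

knee = KneeLocator(
    np.arange(len(k_distances)), k_distances,
    curve="convex", direction="increasing",
)
print("suggested eps:", knee.knee_y)  # y-value at the detected knee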

2) DBSCAN extensions like OPTICS

OPTICS produces hierarchical clusters, from which we can extract significant flat clusters by visual inspection. An OPTICS implementation is available in the Python module pyclustering. One of the original authors of DBSCAN and OPTICS has also proposed an automatic way to extract flat clusters that requires no human intervention; for more information, you can read this paper.
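
A minimal sketch of this approach, using scikit-learn’s OPTICS implementation as an alternative to pyclustering (the toy data and eps value are illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # toy data
opt = OPTICS(min_samples=10).fit(X)

# reachability plot: valleys correspond to clusters, and the height
# at which a valley is "cut" plays the role of eps
plt.plot(opt.reachability_[opt.ordering_])
plt.ylabel("reachability distance")
plt.show()

# extract a DBSCAN-like flat clustering at a chosen eps
labels = cluster_optics_dbscan(
    reachability=opt.reachability_, core_distances=opt.core_distances_,
    ordering=opt.ordering_, eps=0.8,
)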

3) sensitivity analysis

Basically, we want to choose a radius that clusters as many truly regular points as possible (points that are similar to other points) while at the same time detecting more noise (outlier points). We can plot the percentage of regular points (points belonging to a cluster) vs. epsilon: set different epsilon values as the x-axis and their corresponding percentages of regular points as the y-axis. Hopefully, we can spot a segment where the percentage of regular points is most sensitive to the epsilon value, and we choose the upper-bound epsilon of that segment as our optimal parameter; a sketch of such a sweep follows.
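
A rough sketch of such a sweep with scikit-learn; the toy data, MinPts value, and eps range are all illustrative placeholders:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # toy data
min_pts = 4  # 2 * dim for this 2-dimensional toy data

eps_values = np.linspace(0.05, 2.0, 40)  # illustrative sweep range
clustered_pct = [
    np.mean(DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X) != -1) * 100
    for eps in eps_values
]

plt.plot(eps_values, clustered_pct)
plt.xlabel("epsilon")
plt.ylabel("% of points assigned to a cluster")
plt.show()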

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised machine learning technique used to identify clusters of varying shape in a data set (Ester et al. 1996). Another post I wrote goes into what DBSCAN is and when to use it. You can find it here. This post will focus on estimating DBSCAN’s two parameters:

  1. Minimum samples (“MinPts”): the fewest number of points required to form a cluster
  2. ε (epsilon or “eps”): the radius of a point’s neighborhood, i.e. the maximum distance between two points for them to be considered neighbors of one another

Minimum Samples (“MinPts”)

There is no automatic way to determine the MinPts value for DBSCAN. Ultimately, the MinPts value should be set using domain knowledge and familiarity with the data set. From some research I’ve done, here are a few rules of thumb for selecting the MinPts value:

  • The larger the data set, the larger the value of MinPts should be
  • If the data set is noisier, choose a larger value of MinPts
  • Generally, MinPts should be greater than or equal to the dimensionality of the data set
  • For 2-dimensional data, use DBSCAN’s default value of MinPts = 4 (Ester et al., 1996).
  • If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim = the dimensionality of your data set (Sander et al., 1998); a small helper condensing these rules is sketched after this list.
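
A hypothetical helper condensing these rules of thumb (the function name is mine, not from the cited papers):

def suggest_min_pts(n_dims: int) -> int:
    """Rule-of-thumb MinPts: DBSCAN's default of 4 for 2-d data, else 2 * dim."""
    return 4 if n_dims <= 2 else 2 * n_dims

print(suggest_min_pts(10))  # -> 20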

Epsilon (ε)

After you select your MinPts value, you can move on to determining ε. One technique to automatically determine the optimal ε value is described in this paper. This technique calculates the average distance between each point and its k nearest neighbors, where k = the MinPts value you selected. The average k-distances are then plotted in ascending order on a k-distance graph. You’ll find the optimal value for ε at the point of maximum curvature, i.e. where the curve bends most sharply.

Example

Recently, I worked with a data set with 10 dimensions. Following the rules above, my minimum samples value should be at least 10. Ultimately, I decided to go with 20 (2*dim), as suggested by Sander et al. (1998).

After I selected my MinPts value, I used NearestNeighbors from Scikit-learn to calculate the average distance between each point and its n_neighbors. The one parameter you need to define is n_neighbors, which in this case is the value you chose for MinPts.

Step 1: Import Libraries

import numpy as np
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt

Step 2: Calculate the average distance between each point in the data set and its 20 nearest neighbors (my selected MinPts value).

neighbors = NearestNeighbors(n_neighbors=20)
neighbors_fit = neighbors.fit(dataset)
distances, indices = neighbors_fit.kneighbors(dataset)

Step 3: Average the neighbor distances for each point, sort the values in ascending order, and plot

# average distance from each point to its k nearest neighbors;
# column 0 is each point's distance to itself, so it is excluded
avg_distances = np.sort(distances[:, 1:].mean(axis=1))
plt.plot(avg_distances)

For my data set, the sorted average distances produce a k-distance elbow plot that looks like this:

[Figure: points sorted by average distance to their 20 nearest neighbors]

The ideal value for ε will be equal to the distance value at the “crook of the elbow”, i.e. the point of maximum curvature. This is the point of diminishing returns: increasing ε beyond it brings few additional points into clusters, while greatly increasing the risk that separate clusters merge into one.

[Figure: zoom in on the “elbow” optimization point, points sorted by average distance to their 20 nearest neighbors]

Zooming in on my k-distance plot, it looks like the optimal value for ε is around 225. I ended up looping through combinations of MinPts and ε values slightly above and below the values estimated here to find the model of best fit.
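
A rough sketch of such a loop; the post does not say which goodness-of-fit measure was used, so the silhouette score over non-noise points is shown here as one possible criterion, and the parameter ranges simply bracket the estimates above:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# dataset: the same 10-dimensional NumPy array used in the steps above
best_params, best_score = None, -1.0
for min_pts in (15, 20, 25):
    for eps in (200.0, 225.0, 250.0):
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(dataset)
        mask = labels != -1  # ignore noise points when scoring
        if mask.sum() > min_pts and len(set(labels[mask])) > 1:
            score = silhouette_score(dataset[mask], labels[mask])
            if score > best_score:
                best_params, best_score = (min_pts, eps), score
print("best (MinPts, eps):", best_params, "silhouette:", best_score)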

As mentioned previously, we must provide a value for epsilon, which defines the maximum distance between two neighboring points. The following paper describes an approach for automatically determining the optimal value for eps:

https://iopscience.iop.org/article/10.1088/1755-1315/31/1/012012/pdf

In layman’s terms, we find a suitable value for epsilon by calculating the distance to the nearest n points for each point, sorting and plotting the results. Then we look to see where the change is most pronounced (think of the angle between your arm and forearm) and select that as epsilon.


We can calculate the distance from each point to its closest neighbour using NearestNeighbors. The point itself is included in n_neighbors, so with n_neighbors=2 the second column holds the distance to the true nearest neighbour. The kneighbors method returns two arrays: one contains the distance to the closest n_neighbors points, and the other contains the index of each of those points.

neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)

Next, we sort and plot results.

distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)

The optimal value for epsilon will be found at the point of maximum curvature.


https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc
