Every data mining task has the problem of parameters, and every parameter influences the algorithm in specific ways. DBSCAN requires two parameters, ε and minPts, which must be specified by the user. Ideally, the value of ε is given by the problem to solve (e.g. a physical distance), and minPts is then the desired minimum cluster size.
OPTICS can be seen as a generalization of DBSCAN that replaces the ε parameter with a maximum value that mostly affects performance. MinPts then essentially becomes the minimum cluster size to find. While the algorithm is much easier to parameterize than DBSCAN, the results are a bit more difficult to use, as it will usually produce a hierarchical clustering instead of the simple data partitioning that DBSCAN produces.
Recently, one of the original authors of DBSCAN has revisited DBSCAN and OPTICS, and published a refined version of hierarchical DBSCAN (HDBSCAN*), which no longer has the notion of border points. Instead, only the core points form the cluster.
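As a rough illustration of how this changes the parameters (not part of the original text), a minimal sketch using the third-party hdbscan package, with an assumed data array X and an illustrative minimum cluster size, could look like this:
import hdbscan
# HDBSCAN* needs no eps; min_cluster_size plays the role of the minimum cluster size
clusterer = hdbscan.HDBSCAN(min_cluster_size=20)
labels = clusterer.fit_predict(X)   # label -1 marks noise points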
There are several ways to determine the ε parameter:
1) k-distance plot
In a clustering with minPts = k, we expect that the k-distances of core points and border points fall within a certain range, while noise points can have much larger k-distances, so we can observe a knee point in the k-distance plot. However, sometimes there is no obvious knee, or there are multiple knees, which makes the choice difficult.
The method proposed here consists of computing the k-nearest neighbor distances in a matrix of points.
The idea is to calculate the average of the distances from every point to its k nearest neighbors. The value of k is specified by the user and corresponds to MinPts.
Next, these k-distances are plotted in ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter.
A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.
It can be seen that the optimal eps value is around a distance of 0.15.
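When the knee is hard to pick out by eye, a knee-detection routine can help. As a sketch, assuming the third-party kneed package is installed and that X and k (= MinPts) are illustrative placeholders:
import numpy as np
from kneed import KneeLocator
from sklearn.neighbors import NearestNeighbors

k = 5                                              # minPts (illustrative value)
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own nearest neighbour
dist, _ = nbrs.kneighbors(X)
avg_kdist = np.sort(dist[:, 1:].mean(axis=1))      # average k-distance per point, ascending

kl = KneeLocator(np.arange(len(avg_kdist)), avg_kdist, curve="convex", direction="increasing")
eps_estimate = avg_kdist[kl.knee] if kl.knee is not None else None  # knee may be None if the curve has no clear bend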
2) DBSCAN extensions like OPTICS
OPTICS produces hierarchical clusters, from which we can extract significant flat clusters by visual inspection. An OPTICS implementation is available in the Python module pyclustering. One of the original authors of DBSCAN and OPTICS has also proposed an automatic way to extract flat clusters that requires no human intervention; for more information you can read this paper.
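scikit-learn also ships an OPTICS implementation with an automatic ξ-based flat-cluster extraction; a sketch (using sklearn rather than pyclustering, with an assumed data array X and illustrative parameter values):
from sklearn.cluster import OPTICS
# no eps is required; xi controls how flat clusters are cut out of the reachability plot
optics = OPTICS(min_samples=20, xi=0.05, min_cluster_size=0.05)
labels = optics.fit_predict(X)                          # -1 marks noise
reachability = optics.reachability_[optics.ordering_]   # values for a reachability plot, if desired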
3) sensitivity analysis
Basically, we want to choose a radius that clusters as many truly regular points as possible (points that are similar to other points) while still detecting the noise (outlier points). We can plot the percentage of regular points (points belonging to a cluster) vs. epsilon: different epsilon values on the x-axis and the corresponding percentage of regular points on the y-axis. Hopefully we can spot a segment where the percentage of regular points is most sensitive to the epsilon value, and we then choose the upper bound of that segment as our optimal parameter.
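A sketch of such a sweep, assuming a data array X, a chosen minPts of 5, and a candidate range of epsilon values (all illustrative):
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import DBSCAN

eps_values = np.linspace(0.05, 1.0, 20)                    # candidate radii (illustrative range)
regular_fraction = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    regular_fraction.append(np.mean(labels != -1))         # share of points assigned to some cluster

plt.plot(eps_values, regular_fraction)
plt.xlabel("epsilon")
plt.ylabel("percentage of regular (clustered) points")
plt.show()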
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised machine learning technique used to identify clusters of varying shape in a data set (Ester et al. 1996). Another post I wrote goes into what DBSCAN is and when to use it. You can find it here. This post will focus on estimating DBSCAN’s two parameters, MinPts and ε (eps).
There is no automatic way to determine the MinPts value for DBSCAN. Ultimately, the MinPts value should be set using domain knowledge and familiarity with the data set. From some research I’ve done, here are a few rules of thumb for selecting the MinPts value: the larger the data set, the larger MinPts should be; the noisier the data set, the larger MinPts should be; MinPts should generally be at least the dimensionality of the data set (the original DBSCAN paper uses MinPts = 4 for 2-dimensional data); and for higher-dimensional data, MinPts = 2*dim is a common suggestion.
After you select your MinPts value, you can move on to determining ε. One technique to automatically determine the optimal ε value is described in this paper. This technique calculates the average distance between each point and its k nearest neighbors, where k = the MinPts value you selected. The average k-distances are then plotted in ascending order on a k-distance graph. You’ll find the optimal value for ε at the point of maximum curvature, i.e. the “elbow” where the curve bends most sharply upward.
Recently, I worked with a data set with 10 dimensions. Following the rules above, my minimum samples value should be at least 10. Ultimately, I decided to go with 20 (2*dim), as suggested in papers 1 and 2 listed under resources at the end of this post.
After I selected my MinPts value, I used NearestNeighbors from Scikit-learn (documentation here) to calculate the distance between each point and its n_neighbors nearest neighbors. The one parameter you need to define is n_neighbors, which in this case is the value you choose for MinPts.
Step 1: Import Libraries
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
import numpy as np
Step 2: Calculate the average distance between each point in the data set and its 20 nearest neighbors (my selected MinPts value).
neighbors = NearestNeighbors(n_neighbors=20)
neighbors_fit = neighbors.fit(dataset)
distances, indices = neighbors_fit.kneighbors(dataset)
Step 3: Sort the distance values in ascending order and plot
# drop the zero self-distance in column 0, average the remaining neighbour
# distances for each point, then sort the averages in ascending order
distances = np.sort(distances[:, 1:].mean(axis=1))
plt.plot(distances)
For my data set, the sorted distances produce a k-distance elbow plot that looks like this:
The ideal value for ε will be equal to the distance value at the “crook of the elbow”, or the point of maximum curvature. This point marks a trade-off of diminishing returns: with a smaller ε, many points are left unclustered as noise, while increasing ε beyond the elbow adds little additional coverage and mainly risks merging distinct clusters and absorbing noise points into them.
Zooming in on my k-distance plot, it looks like the optimal value for ε is around 225. I ended up looping through combinations of MinPts and ε values slightly above and below the values estimated here to find the model of best fit.
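The post does not say which criterion defined the model of best fit; as one possible sketch, using silhouette score (my assumption, not the author’s stated method) and illustrative parameter ranges around the values above:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

best = None
for min_pts in (15, 20, 25):                       # values around the chosen MinPts of 20
    for eps in np.linspace(200, 250, 6):           # values around the estimated eps of ~225
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(dataset)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            continue                               # silhouette needs at least two clusters
        score = silhouette_score(dataset, labels)  # noise points scored as their own group (a simplification)
        if best is None or score > best[0]:
            best = (score, eps, min_pts)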
As mentioned previously, we must provide a value for epsilon, which defines the maximum distance between two points for them to be considered neighbours. The following paper describes an approach for automatically determining the optimal value for Eps.
In layman’s terms, we find a suitable value for epsilon by calculating the distance to the nearest n points for each point, sorting and plotting the results. Then we look to see where the change is most pronounced (think of the angle between your arm and forearm) and select that as epsilon.
We can calculate the distance from each point to its closest neighbour using NearestNeighbors. The point itself is included in n_neighbors. The kneighbors method returns two arrays, one which contains the distance to the closest n_neighbors points and the other which contains the index for each of those points.
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)
Next, we sort and plot results.
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances)
The optimal value for epsilon will be found at the point of maximum curvature.
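As a crude way to read that point off programmatically from the sorted distances array produced above (a heuristic only; noisy curves usually benefit from smoothing or a dedicated knee-detection package):
# discrete second difference as a rough curvature proxy along the sorted curve
second_diff = np.diff(distances, 2)
knee_index = int(np.argmax(second_diff)) + 1
eps_estimate = distances[knee_index]
print(eps_estimate)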