Create the optimal clustering automatically, as measured by the Silhouette Score.

Overview

Clustering is one of the most important methods in the field of Machine Learning. However, to process the data, clustering algorithms require specific parameters to be defined. These parameters affect the quality of the clustering results, and the user usually has no idea which parameter value gives the optimal output. Without enough prior knowledge of the data being studied, we do not know which parameter to choose. For data with 2 or 3 dimensions, we can review the results manually and determine whether the clustering was successful. But for a dataset with more dimensions, this becomes a much more complex procedure. As a result, a lot of time is wasted finding the optimal value for the parameter. The scluster.py code was created to solve this difficulty.

Entering different parameters into the clustering function will probably give different outcomes. For example, we will use KMeans on the data_3.xlsx dataset, setting a different K value each time. As you can see, we get different results for each input.

By simply looking at the scatter plot, it can be seen that there are six clusters in the dataset. But how can we automatically determine that K = 6 gives the best outcome? To answer this question we can use the silhouette score.

The Silhouette method gives us an assessment of the quality of the clustering output. It calculates a score S(xi) for every row in the given dataframe, and the silhouette score is then obtained from these per-row values. In fact, the Silhouette is calculated for each record from its distance to the other records in the same cluster and its distance to the remaining records. The output of the Silhouette ranges from 1 (which means good clustering) to -1 (which means bad clustering). You can see a demonstration of the silhouette score calculation on a simple 2D example.

So, we can use the Silhouette to find the best K for splitting the data. As we have already seen, the closer the score is to 1, the better the function separates the data into clusters. As we can see in the 3x3 plot, each subplot has its own silhouette score. We see that K = 6 gives the highest silhouette score: 0.662!

Using silhouette in the code

The scluster.py code is based on the logic demonstrated in the case we have seen. To determine the best input value, the code runs over a specific range of values, and each of them is entered into the clustering function. Then, the code compares the different results using the Silhouette Score. As you can see in the example, different input values to the clustering function return different silhouette scores. This example (based on the KMeans algorithm) shows the differences in score between different clustering results. Every result is based on a different number of clusters. It can be seen that we get the best score when k=6 for this specific example. At the end of the process, the code returns only the clustering labels with the highest silhouette result.

The code uses three different types of clustering algorithms (of course, other types can be adapted to it):

KMeans: At each step of the loop, this function receives a different n_clusters value. According to this K value, it determines the number of clusters into which the dataframe must be divided.
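As a rough illustration of the selection logic described in this document, the following plain-Python sketch scores several candidate clusterings and keeps the one with the highest silhouette score. It uses the standard silhouette definition, S(xi) = (b - a) / max(a, b), where a is the mean distance from a record to the other records in its own cluster and b is the smallest mean distance from that record to the records of any other cluster. This is not taken from scluster.py: the function names, the toy points, and the hand-written candidate labelings are all assumptions made for the example.

```python
from math import dist
from statistics import mean


def silhouette_samples(points, labels):
    """Return S(x_i) for every point (standard silhouette definition)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points of the same cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def silhouette_score(points, labels):
    """Mean of S(x_i) over all points."""
    return mean(silhouette_samples(points, labels))


def best_labeling(points, candidates):
    """candidates maps a parameter value (e.g. K) to the labels it produced."""
    scored = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best], scored


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Hand-written labelings standing in for clustering runs with K=2 and K=3.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}

best_k, best_score, all_scores = best_labeling(points, candidates)
print(best_k, round(best_score, 3), all_scores)
```

In practice, the candidate labelings would come from a clustering function run with different parameter values, for example scikit-learn's KMeans with different n_clusters, and sklearn.metrics.silhouette_score computes the same mean score.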