How can I improve the performance of a clustering algorithm with respect to similar/duplicate records?

I want to experiment with improving the efficiency of a clustering algorithm through statistical preprocessing, i.e. by adding a statistical frequency (count) column to the dataframe for similar/duplicate records. According to this paper:

Statistical preprocessing is mainly used to get the frequency of samples having the same features, which are then used as inputs of the DBSCAN algorithm to improve the efficiency of DBSCAN clustering. Statistical preprocessing counts repeated samples with the same features in the URL parameter and uses the statistics frequency as a feature to reduce the size of the matrix, thus avoiding the repeated distance calculation to improve clustering performance.

I think the picture I drew below could be a possible roadmap to achieve this. Let's say I have the following synthetic dataframe with some similar/duplicate incidents/records:

+---+-------------+------+------------+-------------+-----------------+
|id |Type         |Length|Token_number|Encoding_type|Character_feature|
+---+-------------+------+------------+-------------+-----------------+
|0  |Sentence     |4014  |198         |False        |136              |
|1  |contextid    |90    |2           |False        |15               |
|2  |Sentence     |172   |11          |False        |118              |
|3  |String       |12    |0           |True         |11               |
|4  |version-style|16    |0           |False        |13               |
|5  |-            |339   |42          |False        |110              |
|6  |version-style|16    |0           |False        |13               |
|7  |url_variable |10    |2           |False        |9                |
|8  |url_variable |10    |2           |False        |9                |
|9  |null         |172   |11          |False        |117              |
|10 |contextid    |90    |2           |False        |15               |
|11 |-            |170   |11          |False        |114              |
|12 |version-style|16    |0           |False        |13               |
|13 |Sentence     |68    |10          |False        |59               |
|14 |String       |12    |0           |True         |11               |
|15 |Sentence     |173   |11          |False        |118              |
|16 |String       |12    |0           |True         |11               |
|17 |Sentence     |132   |8           |False        |96               |
|18 |String       |12    |0           |True         |11               |
|19 |contextid    |88    |2           |False        |0                |
+---+-------------+------+------------+-------------+-----------------+

After the statistical preprocessing step, which counts identical rows into a Freq column and normalizes it, I have:

+---+-------------+------+------------+-------------+-----------------+----+---------------+
|id |Type         |Length|Token_number|Encoding_type|Character_feature|Freq|Normalized_Freq|
+---+-------------+------+------------+-------------+-----------------+----+---------------+
|0  |Sentence     |4014  |198         |False        |136              |1   |0.0            |
|1  |contextid    |90    |2           |False        |15               |2   |0.3333333333333|
|2  |Sentence     |172   |11          |False        |118              |1   |0.0            |
|3  |String       |12    |0           |True         |11               |4   |1.0            |
|4  |version-style|16    |0           |False        |13               |3   |0.6666666666666|
|5  |-            |339   |42          |False        |110              |1   |0.0            |
|6  |version-style|16    |0           |False        |13               |3   |0.6666666666666|
|7  |url_variable |10    |2           |False        |9                |2   |0.3333333333333|
|8  |url_variable |10    |2           |False        |9                |2   |0.3333333333333|
|9  |null         |172   |11          |False        |117              |1   |0.0            |
|10 |contextid    |90    |2           |False        |15               |2   |0.3333333333333|
|11 |-            |170   |11          |False        |114              |1   |0.0            |
|12 |version-style|16    |0           |False        |13               |3   |0.6666666666666|
|13 |Sentence     |68    |10          |False        |59               |1   |0.0            |
|14 |String       |12    |0           |True         |11               |4   |1.0            |
|15 |Sentence     |173   |11          |False        |118              |1   |0.0            |
|16 |String       |12    |0           |True         |11               |4   |1.0            |
|17 |Sentence     |132   |8           |False        |96               |1   |0.0            |
|18 |String       |12    |0           |True         |11               |4   |1.0            |
|19 |contextid    |88    |2           |False        |0                |1   |0.0            |
+---+-------------+------+------------+-------------+-----------------+----+---------------+
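For reference, here is how the Freq and Normalized_Freq columns above could be produced. This is a minimal PySpark sketch, assuming the first table is loaded in a dataframe named df; df and the column names come from the tables above, everything else (variable names, the min-max normalization choice) is an assumption on my part:

```python
from pyspark.sql import functions as F

# Feature columns that define "the same record" (id is deliberately excluded).
feature_cols = ["Type", "Length", "Token_number", "Encoding_type", "Character_feature"]

# Count how often each distinct feature combination occurs.
freq = df.groupBy(feature_cols).agg(F.count("*").alias("Freq"))

# Attach the count to every original row.
df_freq = df.join(freq, on=feature_cols, how="left")

# Min-max normalize Freq into [0, 1] (assumes the counts are not all identical,
# otherwise the denominator would be zero).
stats = df_freq.agg(F.min("Freq").alias("mn"), F.max("Freq").alias("mx")).first()
df_freq = df_freq.withColumn(
    "Normalized_Freq",
    (F.col("Freq") - F.lit(stats["mn"])) / F.lit(stats["mx"] - stats["mn"]),
)
```

With the toy data above this reproduces the table: Freq 1 maps to 0.0, Freq 2 to 0.333…, Freq 3 to 0.666…, and Freq 4 to 1.0.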

Now I want to apply an inexpensive clustering algorithm such as K-Means and see how this enhances clustering efficiency, either by checking the number of distance computations or by measuring the size of the matrix before and after adding the statistical preprocessing output, i.e. the Normalized_Freq column. In the reference I cited, they used DBSCAN.
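One way to measure the efficiency gain the paper describes is to compare K-Means on the full matrix against K-Means on the deduplicated matrix, passing Freq as a sample weight so duplicate rows still count toward the centroids. A minimal scikit-learn sketch under stated assumptions (pdf is a pandas version of the preprocessed dataframe, e.g. pdf = df_freq.toPandas(), Encoding_type already cast to 0/1, and n_clusters=4 chosen arbitrarily):

```python
import time
from sklearn.cluster import KMeans

# Assumed: pdf = df_freq.toPandas(), with Encoding_type cast to 0/1.
feature_cols = ["Length", "Token_number", "Encoding_type", "Character_feature"]

# Baseline: one row per record, so K-Means recomputes distances for every duplicate.
X_full = pdf[feature_cols].to_numpy()
t0 = time.perf_counter()
KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_full)
t_full = time.perf_counter() - t0

# Paper-style variant: keep only distinct rows and pass Freq as a sample weight,
# so repeated records no longer trigger repeated distance computations.
pdf_unique = pdf.drop_duplicates(subset=["Type"] + feature_cols)
X_small = pdf_unique[feature_cols].to_numpy()
t0 = time.perf_counter()
KMeans(n_clusters=4, n_init=10, random_state=0).fit(
    X_small, sample_weight=pdf_unique["Freq"].to_numpy()
)
t_small = time.perf_counter() - t0

print(f"matrix size: {X_full.shape} -> {X_small.shape}")
print(f"fit time:    {t_full:.4f}s -> {t_small:.4f}s")
```

The matrix-size comparison (20 rows shrinking to the number of distinct feature combinations) is the direct analogue of the "reduce the size of the matrix" claim in the quoted paper.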

So, in general, the question is: what is the best practice for checking this? In other words, how can I measure the positive/negative effect of the newly extracted Normalized_Freq column on clustering performance/efficiency? Can clustering SHAP values explain this?
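A simple sanity check for the effect on clustering quality (as opposed to speed) is to cluster with and without the new column and compare an internal metric such as the silhouette score. A minimal scikit-learn sketch, again assuming the preprocessed data is in a pandas dataframe pdf with Encoding_type cast to 0/1; note that silhouette scores computed in different feature spaces are only a rough signal, not a strict apples-to-apples comparison:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumed: pdf = df_freq.toPandas(), with Encoding_type cast to 0/1.
base_cols = ["Length", "Token_number", "Encoding_type", "Character_feature"]

def cluster_and_score(cols, k=4):
    # Scale features so Length does not dominate the distance metric.
    X = StandardScaler().fit_transform(pdf[cols])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

print("without Normalized_Freq:", cluster_and_score(base_cols))
print("with    Normalized_Freq:", cluster_and_score(base_cols + ["Normalized_Freq"]))
```

As for SHAP: SHAP explains supervised models, so the usual workaround is to fit a classifier on the cluster labels and inspect its SHAP values to see how strongly Normalized_Freq drives the cluster assignments; that tells you about feature influence on the clusters, not about runtime efficiency.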

To investigate this, I have provided a Colab notebook so that we can explore the matter using Python/PySpark.

