K-means Clustering for Anomaly Detection [Python Example]

This example shows how to use k-means clustering for anomaly detection in Python. It relies on NumPy and on the scikit-learn library, which provides various machine learning algorithms and tools. Make sure both are installed before running the code (for example, with pip install numpy scikit-learn).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Generate some sample data (seeded so the run is reproducible)
np.random.seed(42)
data = np.random.rand(100, 2)

# Create a k-means clustering model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Fit the model to the data; predict() fails on an unfitted model
kmeans.fit(data)

# Predict the closest cluster for each data point
closest_cluster = kmeans.predict(data)

# Calculate the distance of each data point to its closest cluster center
distances = pairwise_distances_argmin_min(data, kmeans.cluster_centers_)[1]

# Define a threshold to identify anomalies
threshold = np.percentile(distances, 95)

# Find the indices of the anomalies
anomaly_indices = np.where(distances > threshold)[0]

# Print the indices of the anomalies
print("Anomaly indices:", anomaly_indices)

In this example, we generate 100 random two-dimensional points (data). We create a KMeans object with n_clusters=3, which means we want to identify three clusters in the data. We then fit the k-means model to the data using the fit method, which computes the cluster centers.
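The fit step is easy to overlook, so here is a minimal sketch of why it matters. It assumes a recent scikit-learn version; the seed and the n_init and random_state values are illustrative choices, not part of the original example. Calling predict on an unfitted model raises NotFittedError, and the cluster centers only exist after fit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.exceptions import NotFittedError

# Illustrative data: 100 random 2-D points (seed chosen arbitrarily)
rng = np.random.default_rng(0)
data = rng.random((100, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# predict() on an unfitted model raises NotFittedError
try:
    kmeans.predict(data)
    raised = False
except NotFittedError:
    raised = True

# After fit(), the learned centers are available as cluster_centers_
kmeans.fit(data)
centers = kmeans.cluster_centers_

print("predict before fit raised NotFittedError:", raised)
print("cluster centers shape:", centers.shape)
```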

Next, we predict the closest cluster for each data point using the predict method. We calculate the distance of each data point to its closest cluster center using the pairwise_distances_argmin_min function.
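As a cross-check on the distance step, the sketch below computes the same distances a second way. It assumes Euclidean distance (the default for pairwise_distances_argmin_min) and uses KMeans.transform, which returns the distance from each point to every cluster center. The seed and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
data = rng.random((100, 2))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Route 1: index of the nearest center and the distance to it
labels_a, dist_a = pairwise_distances_argmin_min(data, kmeans.cluster_centers_)

# Route 2: distances to all centers, then take the smallest per point
all_dist = kmeans.transform(data)   # shape (n_samples, n_clusters)
dist_b = all_dist.min(axis=1)
labels_b = kmeans.predict(data)

same_distances = np.allclose(dist_a, dist_b)
same_labels = np.array_equal(labels_a, labels_b)
print("distances agree:", same_distances)
print("labels agree:", same_labels)
```

The transform route is convenient when you also want the distance to the second-nearest center, for example to see how ambiguous an assignment is.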

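After that, we define a threshold to identify anomalies. In this case, we use the 95th percentile of the distances as the threshold, which means any data point with a distance greater than this threshold is considered an anomaly. Note that a percentile threshold flags roughly the top 5% of points by construction, whether or not the data contains genuine anomalies.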
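To make the effect of the percentile concrete, here is a small sketch. The helper flag_by_percentile is a hypothetical name introduced for illustration, and the distances are random stand-ins rather than real distances to cluster centers. With 100 distinct values, a percentile threshold flags a fixed share of points regardless of how the values are distributed.

```python
import numpy as np

def flag_by_percentile(distances, percentile):
    """Return indices of points whose distance exceeds the given percentile."""
    threshold = np.percentile(distances, percentile)
    return np.where(distances > threshold)[0]

# Stand-in for the distances to the nearest cluster center (illustrative only)
rng = np.random.default_rng(0)
distances = rng.random(100)

flagged_95 = flag_by_percentile(distances, 95)
flagged_99 = flag_by_percentile(distances, 99)

print("flagged at 95th percentile:", len(flagged_95))
print("flagged at 99th percentile:", len(flagged_99))
```

If the expected share of anomalies is unknown, a fixed distance chosen from domain knowledge may be a better fit than a percentile.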

Finally, we find the indices of the anomalies by comparing the distances to the threshold. The indices of the anomalies are stored in the anomaly_indices variable, which we print in the last line of the code.

Note that in real-world scenarios, you would typically use a more meaningful dataset and tune the parameters of the k-means algorithm based on your specific problem.
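One common way to tune the number of clusters is to compare silhouette scores across candidate values. The sketch below is one possible approach, not the only one: the synthetic make_blobs data, the candidate range 2 to 6, and the use of StandardScaler are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (three blobs, chosen for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# k-means uses Euclidean distance, so features on different scales
# are usually standardized first
X = StandardScaler().fit_transform(X)

# Score each candidate number of clusters; higher silhouette is better
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("silhouette scores:", scores)
print("best k by silhouette:", best_k)
```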

This example serves as a basic illustration of using k-means clustering for anomaly detection.

Can k-means be used for anomaly detection?

Yes, k-means can be used for anomaly detection and outlier detection, although it is not the most commonly used method for these tasks.

Can clustering be used for anomaly detection?

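Yes. Clustering-based methods flag points that lie far from every cluster or that belong to no cluster at all. k-means itself is not specifically designed for anomaly detection, though, and other algorithms may perform better in this regard, such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which labels low-density points as noise, or Isolation Forest, which is a tree-based method rather than a clustering algorithm.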
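For comparison, here is a minimal sketch of the two alternatives named above. The data (a tight cluster plus three distant points), the DBSCAN values eps=0.5 and min_samples=5, and the Isolation Forest contamination=0.05 are illustrative assumptions that would need tuning on real data. Both estimators mark anomalies with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

# Illustrative data: 100 points in a tight cluster plus 3 distant points
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
far_points = np.array([[5.0, 5.0], [-5.0, 5.0], [6.0, -6.0]])
X = np.vstack([normal, far_points])

# DBSCAN: points that belong to no dense region get the label -1 (noise)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
db_anomalies = np.where(db_labels == -1)[0]

# Isolation Forest: fit_predict returns -1 for outliers and 1 for inliers
iso_labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
iso_anomalies = np.where(iso_labels == -1)[0]

print("DBSCAN noise indices:", db_anomalies)
print("Isolation Forest outlier indices:", iso_anomalies)
```

Unlike the k-means approach, DBSCAN needs no separate distance threshold: the noise label falls out of the clustering itself.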

Can k-means be used for outlier detection?

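It can, but it is often not the most suitable choice. k-means assigns every point to some cluster, and outliers can pull the cluster centers toward themselves, which makes them harder to spot by distance alone.

Instead, algorithms such as Local Outlier Factor (LOF) or Isolation Forest are often used for outlier detection as they are specifically designed for this purpose.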
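A minimal sketch of Local Outlier Factor with scikit-learn follows. The synthetic data and n_neighbors=20 are assumptions chosen for illustration. LOF compares the local density around each point with the density around its neighbors, and fit_predict returns -1 for points it considers outliers.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative data: 100 points in a tight cluster plus 3 distant points
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
far_points = np.array([[5.0, 5.0], [-5.0, 5.0], [6.0, -6.0]])
X = np.vstack([normal, far_points])

lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)           # -1 = outlier, 1 = inlier
lof_anomalies = np.where(lof_labels == -1)[0]

# negative_outlier_factor_ holds the negated LOF score;
# flipping the sign gives values where larger means more anomalous
lof_scores = -lof.negative_outlier_factor_

print("LOF outlier indices:", lof_anomalies)
print("highest LOF score:", lof_scores.max())
```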

  • Dmytro Iliushko

    I am a mid-level Python software engineer with a bachelor's degree in Software Engineering from Kharkiv National Aerospace University. My expertise lies in Python, Django, Flask, Docker, REST API, Odoo development, relational databases, and web development. I am passionate about creating efficient and scalable software solutions that drive innovation in the industry.
