K-means Clustering and its real use-case in the Security Domain

Ayushmilan
4 min read · Aug 12, 2021

Introduction to K-means Clustering

K-means clustering is a type of unsupervised learning, used when the data has no defined categories or groups (i.e. unlabeled data). It is a centroid-based, or distance-based, algorithm: each point is assigned to a cluster according to its distance from the cluster centroids.

This algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. It tries to partition the dataset into a pre-defined number K of distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
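The quantity being minimized can be written down compactly. The notation here is an assumption made for illustration and does not come from any particular library: x_i is a data point, C_j is the set of points in cluster j, and μ_j is that cluster's centroid.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller J means the points sit closer to their own centroids, which is the "less variation within clusters" described above.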

Method for K-means Clustering

The K-means algorithm is executed by following these steps:

  • First, partition the objects into k non-empty subsets.
  • Compute the centroid of each subset in the current partition.
  • Compute the distance from each point to every centroid and assign the point to the cluster whose centroid is nearest.
  • After reassigning, recompute the centroid of each newly formed cluster.
  • Repeat the last two steps until the assignments no longer change.
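The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names `centroid` and `kmeans` are made up for this example, the initial partition is a simple round-robin split (an assumption, since the steps do not say how the first partition is chosen), and distances are Euclidean.

```python
import math

def centroid(points):
    """Arithmetic mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100):
    """Partition-style k-means; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Initial partition into k non-empty subsets (round-robin is an assumption).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Compute the centroid of every cluster in the current partition.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = centroid(members)
        # Assign each point to the cluster whose centroid is nearest.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
    return labels, centroids

# Two visibly separated groups of 2-D points (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other. Real data usually needs feature scaling first, and a different initial partition can lead to a different result.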

Its Use Cases in the Security Domain

  • Identifying crime localities

Given crime data for specific localities in a city, clustering on the category of crime and the area where it occurred, and examining the association between the two, can give useful insight into crime-prone areas within a city or locality.

  • Insurance fraud detection

Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
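As a rough sketch of the "proximity to clusters" idea, the snippet below flags a new claim when its nearest centroid belongs to a cluster that was previously marked as fraud-heavy. Everything concrete here is hypothetical: the two scaled features, the centroid values, the set `fraud_clusters`, and the `max_distance` threshold are invented for illustration, and the centroids are assumed to come from an earlier k-means run on historical claims.

```python
import math

def nearest_cluster(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]

def flag_claim(claim, centroids, fraud_clusters, max_distance):
    """True if the claim lies close to a cluster marked as fraud-heavy."""
    idx, dist = nearest_cluster(claim, centroids)
    return idx in fraud_clusters and dist <= max_distance

# Hypothetical centroids over two scaled features:
# (claim amount, days since policy start), both in the range 0..1.
centroids = [(0.2, 0.8), (0.9, 0.1)]
fraud_clusters = {1}  # assumption: analysts marked cluster 1 as fraud-heavy

print(flag_claim((0.85, 0.15), centroids, fraud_clusters, max_distance=0.2))
print(flag_claim((0.25, 0.75), centroids, fraud_clusters, max_distance=0.2))
```

A flag like this is only a signal for further review, since being near a suspicious cluster does not prove that a claim is fraudulent.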

  • Cyber-profiling criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which provides information to the investigation division to classify the types of criminals who were at the crime scene.

  • Document classification

Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a standard document-grouping problem, and k-means is a suitable algorithm when no labels are available. Initial processing is needed to represent each document as a vector, using term frequency to identify commonly used terms that help characterize the document. The document vectors are then clustered to identify groups of similar documents.
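The vectorization step might look like the sketch below. It is deliberately simplified: `tokenize` and `tf_vectors` are hypothetical helper names, tokenization is a plain whitespace split, and the sample documents are made up. Real pipelines usually also remove stop words and apply a weighting such as TF-IDF.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [t for t in tokens if t]

def tf_vectors(documents):
    """Return (vocabulary, vectors) of relative term frequencies."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors

# Made-up sample documents.
docs = [
    "Phishing email detected",
    "Malware detected in email attachment",
    "Quarterly budget report",
]
vocab, vectors = tf_vectors(docs)
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a vector with one entry per vocabulary term, so the vectors all have the same length and can be passed to any k-means implementation.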

These were a few use cases of K-means clustering in the security domain. K-means is an effective and easy-to-use clustering method in machine learning.

THANK YOU :)

Ayushmilan

Associate Data Scientist, Technical Content Writer, GATE CSE(2022, 2023) Qualified