What's the difference between data classification and clustering (from a data point of view)?

What are the differences and similarities between data classification (using dedicated distance-based methods) and data clustering (which has certain defined methods such as k-means)?

Is data classification a sub-topic of data clustering?


Just to put together the good answers and comments, and to answer more explicitly the part of the question about classification being a subtopic of clustering.

As pointed out by @ncasas, from the viewpoint of the data, classification requires labeled data to train the model (supervised learning), while clustering can make use of unlabeled data (unsupervised learning).

You can actually take a labeled dataset and run a clustering algorithm on it (you'd just discard the information contained in the labels). This would indeed partition the samples into groups, as a classification algorithm would. However, the result is not guaranteed to be the same, or even similar (even with the same number of partitions). This is because clustering algorithms try to build groups of samples that are similar to each other and different from the samples of other groups, while classification algorithms try to minimize some measure of misclassification (how different the proposed partition is from the one given by the labels).

As a simple example, imagine a dataset of face images of two individuals with different facial expressions (e.g. sad and happy). If you have labels for facial expression, you can train a classifier, which will try to reproduce the sad/happy label of each image as well as possible. If you cluster the same data (without labels) using k=2 clusters, you might find that the two clusters correspond to the images of the two individuals (since images of the same face tend to be very similar).
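The face example can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: each "image" is reduced to two invented features (one dominated by identity, one by expression), the clustering is a bare-bones k-means with a fixed initialisation, and the classifier is a one-split decision stump that picks the threshold with the fewest training errors. It is a toy, not how real face data would be handled.

```python
import math

# Hypothetical toy data: (features, person, expression).
# Feature 0 mostly encodes identity (large gap between the two people),
# feature 1 mostly encodes expression (small gap between sad and happy).
samples = [
    ((0.0, 0.0), "A", "sad"), ((0.2, 1.0), "A", "happy"),
    ((0.1, 0.1), "A", "sad"), ((0.3, 0.9), "A", "happy"),
    ((10.0, 0.0), "B", "sad"), ((10.2, 1.0), "B", "happy"),
    ((10.1, 0.2), "B", "sad"), ((9.9, 1.1), "B", "happy"),
]
X = [s[0] for s in samples]
person = [s[1] for s in samples]
expression = [s[2] for s in samples]


def kmeans(points, k, iterations=10):
    """Bare-bones k-means; never looks at any label."""
    # Simplification: evenly spaced points as initial centroids (deterministic).
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


def fit_stump(points, labels):
    """Pick the single (feature, threshold) split with the fewest training errors.

    Assumes exactly two classes.
    """
    classes = sorted(set(labels))
    best = None
    for f in range(len(points[0])):
        values = sorted(set(p[f] for p in points))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for below, above in (classes, classes[::-1]):
                errors = sum(
                    (below if p[f] <= t else above) != y
                    for p, y in zip(points, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, below, above)
    return best[1:]


def predict_stump(stump, point):
    f, t, below, above = stump
    return below if point[f] <= t else above


clusters = kmeans(X, k=2)              # unsupervised: features only
stump = fit_stump(X, expression)       # supervised: features + labels
predicted = [predict_stump(stump, p) for p in X]

print("cluster / person:      ", list(zip(clusters, person)))
print("predicted / expression:", list(zip(predicted, expression)))
```

On this toy data the two clusters line up with the two people, because identity accounts for most of the distance between samples, while the stump splits on the expression feature, because that is what the labels ask for.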

Without entering into a debate about what constitutes a "subtopic", I wanted to point out that clustering and classification actually differ in their objectives.


[Note: my answer is essentially the same as @ncasas's, just with an alternative phrasing]

Classification belongs to supervised learning whereas clustering belongs to unsupervised learning:

  • In supervised learning there is a training stage during which some instances (examples) are provided together with their answer (the target). During training the model "studies" all the examples in the training data (represented with features) in order to be able to find the target from the features. After it has been trained, the model can be applied to new instances and use their features to predict their target. In short the main characteristics of supervised learning are:
    • The goal is to predict a specific piece of information defined from the start (the target).
    • It requires some training data: features and answers for a large set of instances.
  • In unsupervised learning the goal is to discover patterns within the data. There is no predefined target and no training stage (thus no need for annotated data). Unsupervised learning can only do general tasks based on comparing instances, such as clustering (grouping similar instances together) or ranking (ordering instances relative to each other).
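As a rough illustration of the supervised side, here is a minimal distance-based classifier (k-nearest neighbours) in plain Python. The data, the class names and the `predict` helper are all made up for this sketch. Note that for k-NN the "training stage" amounts to storing the labelled examples; the point is only that a target is given for every training instance and then predicted for a new one.

```python
import math
from collections import Counter

# Hypothetical training data: (features, target) pairs with invented values.
training_data = [
    ((1.0, 1.2), "small"),
    ((0.8, 1.0), "small"),
    ((1.1, 0.9), "small"),
    ((4.0, 4.2), "large"),
    ((4.3, 3.9), "large"),
    ((3.8, 4.1), "large"),
]


def predict(training_data, features, k=3):
    """Return the majority target among the k training instances closest to `features`."""
    neighbours = sorted(
        training_data, key=lambda item: math.dist(item[0], features)
    )[:k]
    votes = Counter(target for _, target in neighbours)
    return votes.most_common(1)[0][0]


# A new instance: only its features are known, the target is predicted.
new_instance = (3.9, 4.0)
prediction = predict(training_data, new_instance)
print(prediction)  # large
```

A clustering method would receive only the feature tuples, without the "small"/"large" column, and would have to come up with the groups on its own.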

This is the fundamental difference between classification and clustering. Based on this understanding:

What's the difference between data classification and clustering (from a Data point of view)

From a strict data point of view, the difference is the requirement for annotated data in classification. There is no such requirement for clustering.

Is data classification a sub topic of data clustering ?

No, because they belong to different families of ML which have different goals.

Example:

  • In spam classification (supervised task) a model is trained with some documents (usually emails) labelled as spam or not spam. The resulting model can predict whether a new document is spam or not.
  • In topic modelling (unsupervised task) a model groups semantically similar documents together, based on the words they contain.

The first task separates documents into classes, but these classes are predefined: here spam vs. non-spam. The model uses features specifically as indicators for this goal. It would use features in a completely different way if the classes were news vs. entertainment, business vs. personal, or sci-fi vs. romance. Hence the term supervised learning: the model focuses on what it is told (trained) to focus on.
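To make "uses features in a completely different way" concrete, here is a small sketch with an invented six-document corpus and two invented labelings of the same documents. The `indicative_words` helper is hypothetical and deliberately crude (a document-frequency difference, not a real classifier); it only shows that which words count as indicators depends entirely on the labels supplied.

```python
from collections import Counter

# Hypothetical mini-corpus; texts and labels are invented for illustration.
documents = [
    "win free money now",
    "free money for your company",
    "company meeting on monday",
    "quarterly report for the company meeting",
    "family dinner on sunday",
    "photos from the family trip",
]
# Two different sets of labels for the very same documents.
spam_labels = ["spam", "spam", "not_spam", "not_spam", "not_spam", "not_spam"]
context_labels = ["personal", "business", "business", "business", "personal", "personal"]


def indicative_words(docs, labels, positive):
    """Words whose document frequency is highest inside `positive` relative to outside.

    Score = (#positive docs containing the word) - (#other docs containing it).
    Returns all words tied for the best score, sorted alphabetically.
    """
    inside, outside = Counter(), Counter()
    for text, label in zip(docs, labels):
        target = inside if label == positive else outside
        target.update(set(text.split()))
    scores = {word: inside[word] - outside[word] for word in inside}
    best = max(scores.values())
    return sorted(word for word, score in scores.items() if score == best)


print(indicative_words(documents, spam_labels, "spam"))         # ['free', 'money']
print(indicative_words(documents, context_labels, "business"))  # ['company']
```

Same documents, same features; only the labels changed, and with them the words the model would rely on.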

Topic modelling separates documents into several clusters, but even if we assume exactly two clusters, these are extremely unlikely to correspond to spam vs. non-spam (or news vs. entertainment, etc.). A clustering algorithm follows a neutral similarity method which uses the features indiscriminately. The main outcome is the clusters themselves, which represent unknown patterns in the data. For example, applying topic modelling to a large collection of documents may reveal the main categories of documents: the new knowledge is the existence of these groups. Clustering is unsupervised because it doesn't follow a predetermined goal.


Classification is a problem where your input data consists of elements with 2 parts:

  1. Some data features that reflect the traits of an entity
  2. A label that assigns the entity to a group or class.

With that kind of data, you can train a model that receives the data features (first part) and generates the label (second part). This kind of training, where you train a system to generate some output when it receives a specific input, is called "supervised learning".

On the other hand, in clustering your dataset only has the data features; that is, it does not have the labels. Clustering methods allow you to group the entities into classes without any labels, normally by defining a priori how many groups you want and then grouping the entities by their similarity. This kind of training, where there are no labels and you have to learn just from the entity data features, is called "unsupervised learning".
