What does "S" in Shannon's entropy stand for?

I see many machine learning texts using the following notation to represent Shannon's entropy in classification/supervised learning contexts:

$$ H(S) = -\sum_{i \in Y} p_i \log(p_i) $$

Where $p_i$ is the probability of a given point being of class $i$. I just do not understand what $S$ is, because no further explanation about it is provided. Does it have something to do with a feature $S$ in the dataset?

$S$ seems to appear again in Information Gain formula:

$$ \operatorname{IG}(S,A) = H(S) - \sum_{a \in A} \frac{|S_a|}{|S|}H(S_a) $$

I know the concepts of Information Gain and Entropy; I just would like to understand the mathematical formalism.

Topic information-theory supervised-learning decision-trees classification

Category Data Science


To answer your question,

  • $S$ in Shannon entropy represents a discrete random variable with values $s_1, s_2, \ldots, s_n$.
  • $S$ in Information Gain represents the set of training examples, each of the form $(\mathbf{s}, t) = (s_1, s_2, s_3, \ldots, s_k, t)$, where $s_a \in \operatorname{vals}(a)$ is the value of the $a^{\text{th}}$ attribute or feature of example $\mathbf{s}$ and $t$ is the class label.

Below is the relevant information from Wikipedia.

Shannon Entropy: wiki link

Given a discrete random variable $X$ with possible outcomes $x_1, x_2, \ldots, x_n$, which occur with probabilities $\mathrm{P}(x_1), \ldots, \mathrm{P}(x_n)$, the entropy of $X$ is formally defined as:

$$ \mathrm{H}(X) = -\sum_{i=1}^{n} \mathrm{P}(x_i) \log \mathrm{P}(x_i) $$
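As a minimal sketch of this definition (assuming NumPy; the function name `entropy` is just for illustration), the entropy of a discrete random variable can be computed directly from its outcome probabilities:

```python
import numpy as np

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution, given its outcome probabilities."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # convention: 0 * log(0) is taken as 0
    return float(-np.sum(probs * np.log(probs)) / np.log(base))

# A fair coin carries 1 bit of entropy; a heavily biased coin carries less.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.9, 0.1]))  # ~0.47
```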

Information Gain: wiki link

Let $T$ denote a set of training examples, each of the form $(\mathbf{x}, y) = (x_1, x_2, x_3, \ldots, x_k, y)$, where $x_a \in \operatorname{vals}(a)$ is the value of the $a^{\text{th}}$ attribute or feature of example $\mathbf{x}$ and $y$ is the corresponding class label. The information gain for an attribute $a$ is defined in terms of Shannon entropy $\mathrm{H}(\cdot)$ as follows. For a value $v$ taken by attribute $a$, let

$S_a(v) = \{\mathbf{x} \in T \mid x_a = v\}$ be defined as the set of training inputs of $T$ for which attribute $a$ is equal to $v$. Then the information gain of $T$ for attribute $a$ is the difference between the a priori Shannon entropy $\mathrm{H}(T)$ of the training set and the conditional entropy $\mathrm{H}(T|a)$.

$$ \mathrm{H}(T|a) = \sum_{v \in \operatorname{vals}(a)} \frac{|S_a(v)|}{|T|} \cdot \mathrm{H}\left(S_a(v)\right) $$

$$ \operatorname{IG}(T,a) = \mathrm{H}(T) - \mathrm{H}(T|a) $$
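To tie the two notions together, here is an illustrative sketch (the toy dataset, the attribute meanings, and the helper names `entropy_of_labels` and `information_gain` are all made up for the example) that computes $\mathrm{H}(T)$, $\mathrm{H}(T|a)$, and $\operatorname{IG}(T,a)$ for one attribute of a small training set $T$:

```python
import numpy as np
from collections import Counter

def entropy_of_labels(labels):
    """Shannon entropy (base 2) of the empirical class-label distribution."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(examples, labels, attribute_index):
    """IG(T, a) = H(T) - H(T|a), with T given as parallel lists of examples and labels."""
    prior = entropy_of_labels(labels)      # H(T)
    conditional = 0.0                      # H(T|a)
    for v in set(x[attribute_index] for x in examples):
        # S_a(v): labels of the examples whose a-th attribute equals v
        subset = [y for x, y in zip(examples, labels) if x[attribute_index] == v]
        conditional += len(subset) / len(labels) * entropy_of_labels(subset)
    return prior - conditional

# Toy training set: each example is (outlook, windy); the label is whether to play.
X = [("sunny", True), ("sunny", False), ("rain", True), ("rain", False)]
y = ["no", "yes", "no", "yes"]
print(information_gain(X, y, 0))  # splitting on outlook -> 0.0 (uninformative)
print(information_gain(X, y, 1))  # splitting on windy   -> 1.0 (fully separates classes)
```

In a decision-tree learner such as ID3, this quantity is evaluated for every candidate attribute and the attribute with the largest gain is chosen for the split.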
