Self-Attention Summation and Loss of Information

In self-attention, the attention for a word is calculated as: $$ A(q, K, V) = \sum_{i} \frac{\exp(q \cdot k^{<i>})}{\sum_{j} \exp(q \cdot k^{<j>})} v^{<i>} $$ My question is: why do we sum over the softmax-weighted value vectors? Doesn't this lose information about which other words in particular are important to the word under consideration? In other words, how does this summed vector point to which words are relevant? For example, consider two extreme scenarios where practically the entire output depends on the attention vector of word $x^{<t>}$, and …
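A minimal NumPy sketch (all vectors made up for illustration) may help separate the two pieces: the softmax weights are where the "which words matter" information lives, while the sum merely mixes the value vectors according to those weights.

```python
import numpy as np

# Minimal sketch: one query attending over three words' keys/values.
# All vectors here are made up for illustration.
q = np.array([1.0, 0.5])                                # query for the word under consideration
K = np.array([[1.0, 0.4], [0.1, -0.2], [0.9, 0.6]])     # keys k^<i>
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])      # values v^<i>

scores = K @ q                                          # q . k^<i> for each i
alpha = np.exp(scores) / np.exp(scores).sum()           # softmax weights, sum to 1
output = alpha @ V                                      # the summed vector A(q, K, V)

print(alpha)   # ~[0.43, 0.13, 0.43]: words 0 and 2 dominate; this is the "which words" info
print(output)  # the weighted sum itself no longer lists the contributing words
```

The output is a convex combination of the value vectors, so which words contributed is explicit in the weights but only implicit in the sum.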
Category: Data Science

Feature selection with information gain (KL divergence) and mutual information yields different results

I'm comparing different techniques for feature selection / feature ranking. Two of the techniques under scrutiny are mutual information (MI) and information gain (IG) as used in decision trees, i.e. the Kullback-Leibler divergence. My data (class and features) is all binary. All sources I could find state that MI and IG are basically "two sides of the same coin", i.e. that one can be transformed into the other via mathematical manipulation. (For example [source 1, source 2]) Yet, …
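For empirical (counted) distributions the two should agree exactly once the log base matches; a small sketch, assuming binary toy data, that checks this:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Toy binary data (made up): one feature x, one class y.
x = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])

def entropy_bits(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Information gain as used in decision trees: H(y) - H(y | x)
ig = entropy_bits(y) - sum(
    (x == v).mean() * entropy_bits(y[x == v]) for v in np.unique(x)
)

# sklearn's mutual_info_score is in nats; divide by ln(2) to get bits.
mi = mutual_info_score(x, y) / np.log(2)

print(ig, mi)  # identical up to floating-point error
```

When the two diverge in practice, the usual culprits are a different log base, smoothing, or an estimator such as `mutual_info_classif`'s kNN-based one rather than plug-in counting.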
Category: Data Science

Shannon Information Content related to Uncertainty?

I'm a data science student currently writing my master's thesis, which revolves around the Cross Entropy (CE) loss function for neural networks. From my understanding, the CE is based on the entropy, which in turn is based on the Shannon Information Content (SIC); however, I struggle to interpret and explain it in such a way that my fellow students can understand it without using concepts of information theory (which itself is already a completely different and complicated area). In the …
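One informal framing that tends to land without an information-theory background: SIC is the "surprise" of a single outcome, entropy is the average surprise, and cross-entropy is the average surprise when you believe the wrong distribution. A small sketch with made-up distributions:

```python
import numpy as np

def sic(p):
    """Shannon information content ('surprise') of an outcome with probability p."""
    return -np.log2(p)

# Rare events are more surprising / more informative:
print(sic(0.5))   # 1 bit (a fair coin flip)
print(sic(0.01))  # ~6.64 bits (an unlikely event tells you a lot)

p = np.array([0.7, 0.2, 0.1])   # true distribution (made up)
q = np.array([0.6, 0.3, 0.1])   # model's predicted distribution (made up)

entropy = (p * sic(p)).sum()        # average surprise under the truth
cross_entropy = (p * sic(q)).sum()  # average surprise if you *believe* q

print(entropy, cross_entropy)  # cross-entropy >= entropy; the gap is KL(p || q)
```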
Category: Data Science

How to calculate the information conveyed in a message for a given dataset

Given the data sets:

Test set:

```
Venue,color,Model,Category,Location,weight,Veriety,Material,Volume
1,6,4,4,4,1,1,1,6
2,5,4,4,4,2,6,1,1
1,6,2,1,4,1,4,2,4
1,6,2,1,4,1,2,1,2
2,6,5,5,5,2,2,1,2
1,5,4,4,4,1,6,2,2
1,3,3,3,3,1,6,2,2
```

Training set:

```
Venue,color,Model,Category,Location,weight,Veriety,Material,Volume
2,6,4,4,4,2,2,1,1
1,2,4,4,4,1,6,2,6
1,5,4,4,4,1,2,1,6
2,4,4,4,4,2,6,1,4
1,4,4,4,4,1,2,2,2
2,4,3,3,3,2,1,1,1
1,5,2,1,4,1,6,2,6
1,2,3,3,3,1,2,1,6
2,6,4,4,4,2,3,1,1
```

I'd like to calculate the message conveyed / information gained via $$ MC = -p_1 \log_2(p_1) - p_2 \log_2(p_2), $$ where $p_1$ and $p_2$ are the probabilities of assigning class 1 or class 2. Ideally, I'd like to do this for $n$ classes: $$ MC = -p_1 \log_2(p_1) - p_2 \log_2(p_2) - \cdots - p_n \log_2(p_n) $$ The step for this calculation is at Step 1: `from numpy.core.defchararray import count` …
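Assuming the first column (Venue) is the class being predicted, a sketch of the $n$-class calculation by counting:

```python
import numpy as np

# Training-set class column (Venue from the question's data).
venue = np.array([2, 1, 1, 2, 1, 2, 1, 1, 2])

def message_content(labels):
    """MC = -sum_i p_i * log2(p_i), for any number of classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

print(message_content(venue))  # ~0.991 bits: 5 of class 1, 4 of class 2
```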
Category: Data Science

Entropy loss from collapsing/merging two categories

Suppose I am counting occurrences in a sequence. For a classical example, let's say I'm counting how many of each kind of car comes down a highway. After keeping tally for a while, I see there are thousands of models. But only a handful show up frequently, whereas many show up once or only a few times (in other words, the histogram resembles exponential decay). When thinking about the statistics of this situation, it hardly seems to matter that …
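The intuition can be checked directly: merging outcomes can only reduce (or preserve) entropy, and if the merged tail carries little probability mass, the reduction is small. A sketch with made-up tallies:

```python
import numpy as np

def entropy_bits(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Made-up tallies with an exponential-ish tail: a few frequent models, many rare ones.
counts = [500, 300, 120, 50, 20] + [3] * 10 + [1] * 40

h_full = entropy_bits(counts)

# Collapse everything below a threshold into a single 'other' bucket.
head = [c for c in counts if c >= 20]
other = sum(c for c in counts if c < 20)
h_merged = entropy_bits(head + [other])

print(h_full, h_merged, h_full - h_merged)  # the difference is the entropy lost
```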
Category: Data Science

What does "S" in Shannon's entropy stands for?

I see many machine learning texts using the following notation to represent Shannon's entropy in classification/supervised learning contexts: $$ H(S) = -\sum_{i \in Y} p_i \log(p_i) $$ Where $p_i$ is the probability of a given point being of class $i$. I just do not understand what $S$ is, because no further explanation about it is provided. Does it have something to do with a feature $S$ in the dataset? $S$ seems to appear again in the Information Gain formula: $$ \operatorname{IG}(S,A) = …
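For context, $S$ conventionally denotes the set of training samples at the current node of the tree, not a feature. A worked example, assuming the classic split of 9 positive and 5 negative samples:

$$ H(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits} $$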
Category: Data Science

A measure of redundancy in mutual information

Mutual information quantifies to what degree $X$ decreases the uncertainty about $Y$. However, to my understanding, it does not quantify "in how many ways" $X$ decreases the uncertainty. E.g., consider the case where $X$ is a 3D vector, and consider $X_1=[Y,0,0]$ vs. $X_2 = [Y,Y^2, 3.5Y]$. Intuitively, $X_2$ contains "more information" about $Y$, or is more redundant with respect to $Y$, than $X_1$; but if I understand correctly, both have the same mutual information. Is there an alternative information-theoretic measure …
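The premise can be checked numerically with a discrete $Y$, where the joint MI is exactly countable; a sketch (all data made up):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=5000)             # discrete Y, H(Y) = 2 bits

# Encode each 3D vector as a single discrete symbol so joint MI is countable.
x1 = [f"{a},0,0" for a in y]                  # X1 = [Y, 0, 0]
x2 = [f"{a},{a**2},{3.5*a}" for a in y]       # X2 = [Y, Y^2, 3.5Y]

# Both equal H(Y): adding redundant copies of Y changes nothing.
print(mutual_info_score(x1, y) / np.log(2))   # ~2 bits
print(mutual_info_score(x2, y) / np.log(2))   # ~2 bits
```

Both come out at $H(Y)$, confirming that plain MI is blind to this kind of redundancy: the joint distribution simply cannot say more about $Y$ than $H(Y)$.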
Category: Data Science

Visualizing mutual information of each convolution layer for image classification problem

I recently came across this paper, where the author has proposed a compression-based theory for understanding the layers of a DNN. To visualize what was going on, the authors showed Figure 2 of the paper, which is also available as a video here. For my image classification problem I want to visualize the mutual information in exactly this format. Can someone kindly explain to me how to calculate this numerically for images passing through conv layers in …
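The estimator behind those information-plane plots is usually a simple binning one: discretize each unit's activation, treat each binned activation pattern as one symbol $T$, and compute $I(T;Y)$ (and analogously $I(X;T)$) from counts. A sketch under those assumptions; the bin count here is hypothetical and materially affects the picture:

```python
import numpy as np

def discrete_mi_bits(t_ids, y):
    """MI (bits) between discretized activation patterns and labels, by counting."""
    joint = {}
    for a, b in zip(t_ids, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    n = len(y)
    pt, py = {}, {}
    for (a, b), c in joint.items():
        pt[a] = pt.get(a, 0) + c
        py[b] = py.get(b, 0) + c
    return sum(
        (c / n) * np.log2((c / n) / ((pt[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )

def layer_mi_with_labels(activations, labels, n_bins=30):
    """Bin each unit's activation, then treat the binned pattern as one symbol.

    activations: (n_samples, n_units) array, e.g. flattened conv feature maps.
    """
    edges = np.linspace(activations.min(), activations.max(), n_bins)
    binned = np.digitize(activations, edges)
    t_ids = [hash(row.tobytes()) for row in binned]   # one symbol per pattern
    return discrete_mi_bits(t_ids, labels)

# Tiny synthetic check: random "activations", 200 samples, 8 units, 2 classes.
# With few samples most patterns are unique, so the estimate saturates at H(Y);
# real use needs many samples and/or coarser bins, per layer and per epoch.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 8))
labels = rng.integers(0, 2, size=200)
print(layer_mi_with_labels(acts, labels))
```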
Category: Data Science

Combining "expert-assigned labels" and "real-observed labels"?

Combining "expert-assigned labels" and "real-observed labels"? That is, if I have a data set, where it's possible to have labels that are "true observations" and also labels that are "the expert strongly believes that these features should result in this label". Then how should these be combined? Particularly, these do give different information, but they also contain, possibly, different problems. Real-observed labels are assumed to be or are true. However, they might not always exist, but instead there may exist …
Category: Data Science

The meaning of the difference of two entropy values

I want to understand the meaning of the difference of two information entropy values. I have the following scenario. Let $x$ be the number of hours a user spends on some video sharing websites. Thus, we may have the sets $X_{A} = \{x_1,x_2,\cdots,x_{n_A}\}$ and $X_{B} = \{x_1,x_2,\cdots,x_{n_B}\}$ that represent the number of hours the users of $A$ and $B$ spent on the websites $A$ and $B$, respectively. Now, we can calculate the Cumulative distribution function (CDF) probability values, described here, …
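If the hours are discretized with a shared set of bins (the entropies are not comparable otherwise), the difference can be read as "how many more bits, on average, are needed to describe a user of B than a user of A". A sketch with made-up samples:

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
hours_a = rng.exponential(2.0, size=500)   # made-up watch-time samples for site A
hours_b = rng.exponential(4.0, size=500)   # made-up watch-time samples for site B

# Use the same bin edges for both, or the entropies are not comparable.
edges = np.histogram_bin_edges(np.concatenate([hours_a, hours_b]), bins=20)
p_a, _ = np.histogram(hours_a, bins=edges)
p_b, _ = np.histogram(hours_b, bins=edges)

h_a = entropy(p_a, base=2)   # scipy normalizes counts to probabilities
h_b = entropy(p_b, base=2)
print(h_b - h_a)  # here positive: B's usage is more spread out / less predictable
```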
Category: Data Science

Given a sequence of inputs/outputs and a set of nodes that modify that input, can you find the topology of a graph?

I am working on a problem where I have to model a graph topology, where the nodes are logic/arithmetic operations that can be applied to the input. The network receives a multi-dimensional input and returns a multi-dimensional processed output with fewer dimensions. The only things I have to work with are a set of input/output pairs and a set of probable nodes. The system itself follows a few constraints: the nodes are unique, so there is a limited amount of …
Category: Data Science

Information bottleneck and deep neural network

I learned about "the information bottleneck view of deep learning." But in a nutshell, what does this tell us? I don't see what the role is of depth in this approach as long as it is larger than 2 or 3. Is there a rigorous theory? Or just some hypothesis or heuristic explanations on deep neural net? I saw the author's talk on YouTube. But, probably my ignorance, I don't really get the main point and the implication is. I …
Category: Data Science

Calculating the entropy of a neural network

I am looking to calculate the information contained in a neural network. I am also looking to calculate the maximum information that can be contained by any neural network in a certain number of bits. These two measures should be comparable (i.e., I can check whether my current neural network has reached the maximum, or how far below it lies). Information is relative, so I define it relative to the real a priori distribution of the data that …
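One crude operationalization, offered only as a starting point: the storage format gives a trivial upper bound (32 bits per float32 parameter), and the entropy of the quantized weight distribution gives a rough, assumption-laden proxy for how much of that capacity is used (it treats weights as i.i.d., which they are not):

```python
import numpy as np

def max_bits(n_params, bits_per_param=32):
    """Trivial upper bound: a float32 network cannot hold more than 32 bits/param."""
    return n_params * bits_per_param

def empirical_weight_entropy_bits(weights, n_bins=256):
    """Entropy of the quantized weight distribution: one crude proxy among many."""
    counts, _ = np.histogram(weights, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum() * len(weights)

# Hypothetical usage: w = np.concatenate([p.ravel() for p in model_params])
w = np.random.default_rng(0).normal(size=10_000)   # stand-in for real weights
print(max_bits(w.size))                            # 320,000 bits
print(empirical_weight_entropy_bits(w))            # far below the bound
```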
Category: Data Science

How to measure the information of covariates in a ML task?

Background: Recently, I worked on two different ML projects. One is Lending Club loan prediction; the other uses a private dataset from the online-experiment field, predicting whether a customer will take the treatment. Both tasks are binary classification with 100+ million observations and about a hundred covariates. However, my Lending Club model has a very high PR-AUC (0.86), which shows good model performance. My online-experiment model suffers, with only a 0.03 PR-AUC; that model is somewhat useless. I …
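One model-agnostic way to ask "is there any signal in these covariates at all?" is to estimate the mutual information between each covariate and the target; a sketch on stand-in data (swap in your own matrix):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Stand-in data; replace with your covariates/target.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=4,
                           random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # estimates in nats, per feature
order = np.argsort(mi)[::-1]
for i in order[:5]:
    print(f"feature {i}: {mi[i] / np.log(2):.3f} bits")
# If every covariate is near 0 bits, a low PR-AUC reflects the data, not the model.
```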
Category: Data Science

Does a decision tree classifier calculate entropies before transforming categorical features using OneHotEncoder, or should the transformation be done first?

I am new to machine learning, and I've nearly reached the point of giving up on it, as online tutorials are pretty confusing as well. Entropy and decision trees: one of the confusing tutorials went as follows: Another tutorial was pretty straightforward and comprehensive in terms of how entropies and information gain are calculated, but he didn't split the data. There, the instructor started the entropy calculations and ended up with the following tree: I did understand the calculation …
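As far as scikit-learn is concerned, the order is unambiguous: `DecisionTreeClassifier` only accepts numeric input, so the encoding is a preprocessing step and the entropies are computed on the already-encoded 0/1 columns. A sketch with made-up toy data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy categorical data (made up).
X = pd.DataFrame({"outlook": ["sunny", "rain", "overcast", "rain", "sunny"],
                  "windy":   ["yes", "no", "no", "yes", "no"]})
y = [0, 1, 1, 0, 1]

# Encoding is a preprocessing step; the tree computes entropies on the
# already-encoded 0/1 columns, not on the raw categories.
model = make_pipeline(
    make_column_transformer((OneHotEncoder(), ["outlook", "windy"])),
    DecisionTreeClassifier(criterion="entropy"),
)
model.fit(X, y)
```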
Category: Data Science

What (probabilistic models) can only output decisions when they are certain?

I'm basically looking for approaches, models, algorithms for the following situation (a fault diagnosis problem): I have an input set $\{x_i\}_{i \in \{1..m\}}$ with $n$ binary features of cases (think of "faults" or "alarms" that fired) and $k$ classes. Each case $x_i$ can belong to at least one class and at most $k$ (so I'm dealing with multi-label classification). Now some relations in the data set are utterly boring/uninformative (say, feature $a$ says "Mechanical Error occurred" and label $b$ means …
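A standard name for this is classification with a reject option: output a decision only when the predicted probability clears a confidence threshold, and abstain otherwise. A sketch on a single binary label with stand-in data (for the multi-label case, apply it per label; the 0.9 threshold is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data; replace with the fault/alarm features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

THRESHOLD = 0.9    # arbitrary; in practice tuned so abstentions stay acceptable
p = clf.predict_proba(X)[:, 1]

pred = np.full(len(X), -1)                 # -1 = "abstain / ask a human"
pred[p >= THRESHOLD] = 1                   # confidently positive
pred[p <= 1 - THRESHOLD] = 0               # confidently negative
print(f"abstained on {(pred == -1).mean():.0%} of cases")
```

Well-calibrated probabilities matter here; a raw model's scores may need calibration before thresholding them is meaningful.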
Category: Data Science

Conditional Entropy and Mutual Information - Clustering evaluation

First of all, I am doing clustering and I have the true labels for my data. For evaluation, I am using the weighted average of the entropy values for each predicted cluster. I also came across mutual information as a similar approach while going over the alternatives. On my data, they seem to give similar results. However, there is one issue that puzzles me. Given the predicted cluster set $U$ and true clusters $V$, mutual information was defined as: …
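The similar results are no accident: the weighted average cluster entropy is exactly the conditional entropy $H(V \mid U)$, and $I(U;V) = H(V) - H(V \mid U)$, so the two rank clusterings identically whenever $H(V)$ is fixed. A sketch that checks the identity on made-up labels:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Made-up true labels (V) and predicted clusters (U).
true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])

def entropy_nats(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

# Weighted average entropy of true labels inside each predicted cluster = H(V|U).
h_v_given_u = sum(
    (pred == u).mean() * entropy_nats(true[pred == u]) for u in np.unique(pred)
)

# MI is the same quantity seen from the other side: I(U;V) = H(V) - H(V|U).
print(entropy_nats(true) - h_v_given_u)
print(mutual_info_score(true, pred))   # identical up to floating point
```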
Category: Data Science
