Self-Attention Summation and Loss of Information

In self-attention, the attention for a word is calculated as: $$ A(q, K, V) = \sum_{i} \frac{\exp(q \cdot k^{<i>})}{\sum_{j} \exp(q \cdot k^{<j>})} v^{<i>} $$ My question is: why do we sum over the softmax-weighted value vectors? Doesn't this lose information about which other words in particular are important to the word under consideration? In other words, how does this summed vector point to which words are relevant? For example, consider two extreme scenarios where practically the entire output depends on the attention vector of word $x^{<t>}$, and …
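A minimal NumPy sketch (all vectors made up for illustration) may help separate the two pieces: the softmax weights are where the "which words matter" information lives, while the sum merely mixes the value vectors according to those weights.

```python
import numpy as np

# Minimal sketch: one query attending over three words' keys/values.
# All vectors here are made up for illustration.
q = np.array([1.0, 0.5])                                # query for the word under consideration
K = np.array([[1.0, 0.4], [0.1, -0.2], [0.9, 0.6]])     # keys k^<i>
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])      # values v^<i>

scores = K @ q                                          # q . k^<i> for each i
alpha = np.exp(scores) / np.exp(scores).sum()           # softmax weights, sum to 1
output = alpha @ V                                      # the summed vector A(q, K, V)

print(alpha)   # ~[0.43, 0.13, 0.43]: words 0 and 2 dominate; this is the "which words" info
print(output)  # the weighted sum itself no longer lists the contributing words
```

The output is a convex combination of the value vectors, so which words contributed is explicit in the weights but only implicit in the sum.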
Category: Data Science

Feature selection with information gain (KL divergence) and mutual information yields different results

I'm comparing different techniques for feature selection / feature ranking. Two of the techniques under scrutiny are mutual information (MI) and information gain (IG) as used in decision trees, i.e. the Kullback-Leibler divergence. My data (class and features) is all binary. All sources I could find state that MI and IG are basically "two sides of the same coin", i.e. that one can be transformed into the other via mathematical manipulation. (For example [source 1, source 2]) Yet, …
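For empirical (counted) distributions the two should agree exactly once the log base matches; a small sketch, assuming binary toy data, that checks this:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Toy binary data (made up): one feature x, one class y.
x = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])

def entropy_bits(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Information gain as used in decision trees: H(y) - H(y | x)
ig = entropy_bits(y) - sum(
    (x == v).mean() * entropy_bits(y[x == v]) for v in np.unique(x)
)

# sklearn's mutual_info_score is in nats; divide by ln(2) to get bits.
mi = mutual_info_score(x, y) / np.log(2)

print(ig, mi)  # identical up to floating-point error
```

When the two diverge in practice, the usual culprits are a different log base, smoothing, or an estimator such as `mutual_info_classif`'s kNN-based one rather than plug-in counting.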
Category: Data Science

Shannon Information Content related to Uncertainty?

I'm a data science student currently writing my master's thesis, which revolves around the Cross Entropy (CE) loss function for neural networks. From my understanding, the CE is based on the entropy, which in turn is based on the Shannon Information Content (SIC); however, I struggle to interpret and explain it in such a way that my fellow students can understand it without using concepts of information theory (which itself is already a completely different and complicated area). In the …
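One informal framing that tends to land without an information-theory background: SIC is the "surprise" of a single outcome, entropy is the average surprise, and cross-entropy is the average surprise when you believe the wrong distribution. A small sketch with made-up distributions:

```python
import numpy as np

def sic(p):
    """Shannon information content ('surprise') of an outcome with probability p."""
    return -np.log2(p)

# Rare events are more surprising / more informative:
print(sic(0.5))   # 1 bit (a fair coin flip)
print(sic(0.01))  # ~6.64 bits (an unlikely event tells you a lot)

p = np.array([0.7, 0.2, 0.1])   # true distribution (made up)
q = np.array([0.6, 0.3, 0.1])   # model's predicted distribution (made up)

entropy = (p * sic(p)).sum()        # average surprise under the truth
cross_entropy = (p * sic(q)).sum()  # average surprise if you *believe* q

print(entropy, cross_entropy)  # cross-entropy >= entropy; the gap is KL(p || q)
```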
Category: Data Science

How to calculate the information conveyed in a message for a given dataset

Given the data sets:

Test set:

```
Venue,color,Model,Category,Location,weight,Veriety,Material,Volume
1,6,4,4,4,1,1,1,6
2,5,4,4,4,2,6,1,1
1,6,2,1,4,1,4,2,4
1,6,2,1,4,1,2,1,2
2,6,5,5,5,2,2,1,2
1,5,4,4,4,1,6,2,2
1,3,3,3,3,1,6,2,2
```

Training set:

```
Venue,color,Model,Category,Location,weight,Veriety,Material,Volume
2,6,4,4,4,2,2,1,1
1,2,4,4,4,1,6,2,6
1,5,4,4,4,1,2,1,6
2,4,4,4,4,2,6,1,4
1,4,4,4,4,1,2,2,2
2,4,3,3,3,2,1,1,1
1,5,2,1,4,1,6,2,6
1,2,3,3,3,1,2,1,6
2,6,4,4,4,2,3,1,1
```

I'd like to calculate the message conveyed / information gained via $$ MC = -p_1 \log_2(p_1) - p_2 \log_2(p_2), $$ where $p_1$ and $p_2$ are the probabilities of assigning class 1 or class 2. Ideally, I'd like to do this for $n$ classes: $$ MC = -p_1 \log_2(p_1) - p_2 \log_2(p_2) - \cdots - p_n \log_2(p_n) $$ The step for this calculation is at Step 1: `from numpy.core.defchararray import count` …
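Assuming the first column (Venue) is the class being predicted, a sketch of the $n$-class calculation by counting:

```python
import numpy as np

# Training-set class column (Venue from the question's data).
venue = np.array([2, 1, 1, 2, 1, 2, 1, 1, 2])

def message_content(labels):
    """MC = -sum_i p_i * log2(p_i), for any number of classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

print(message_content(venue))  # ~0.991 bits: 5 of class 1, 4 of class 2
```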
Category: Data Science

Entropy loss from collapsing/merging two categories

Suppose I am counting occurrences in a sequence. For a classical example, let's say I'm counting how many of each kind of car comes down a highway. After keeping tally for a while, I see there are thousands of models. But only a handful show up frequently, whereas many show up once or only a few times (in other words, the histogram resembles exponential decay). When thinking about the statistics of this situation, it hardly seems to matter that …
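The intuition can be checked directly: merging outcomes can only reduce (or preserve) entropy, and if the merged tail carries little probability mass, the reduction is small. A sketch with made-up tallies:

```python
import numpy as np

def entropy_bits(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Made-up tallies with an exponential-ish tail: a few frequent models, many rare ones.
counts = [500, 300, 120, 50, 20] + [3] * 10 + [1] * 40

h_full = entropy_bits(counts)

# Collapse everything below a threshold into a single 'other' bucket.
head = [c for c in counts if c >= 20]
other = sum(c for c in counts if c < 20)
h_merged = entropy_bits(head + [other])

print(h_full, h_merged, h_full - h_merged)  # the difference is the entropy lost
```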
Category: Data Science

What does "S" in Shannon's entropy stands for?

I see many machine learning texts using the following notation to represent Shannon's entropy in classification/supervised learning contexts: $$ H(S) = -\sum_{i \in Y} p_i \log(p_i) $$ Where $p_i$ is the probability of a given point being of class $i$. I just do not understand what $S$ is, because no further explanation about it is provided. Does it have something to do with a feature $S$ in the dataset? $S$ seems to appear again in the Information Gain formula: $$ \operatorname{IG}(S,A) = …
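For context, $S$ conventionally denotes the set of training samples at the current node of the tree, not a feature. A worked example, assuming the classic split of 9 positive and 5 negative samples:

$$ H(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits} $$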
Category: Data Science

A measure of redundancy in mutual information

Mutual information quantifies to what degree $X$ decreases the uncertainty about $Y$. However, to my understanding, it does not quantify "in how many ways" $X$ decreases the uncertainty. E.g., consider the case where $X$ is a 3D vector, and consider $X_1=[Y,0,0]$ vs. $X_2 = [Y,Y^2, 3.5Y]$. Intuitively, $X_2$ contains "more information" about $Y$, or is more redundant with respect to $Y$, than $X_1$; but if I understand correctly, both have the same mutual information. Is there an alternative information-theoretic measure …
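The premise can be checked numerically with a discrete $Y$, where the joint MI is exactly countable; a sketch (all data made up):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=5000)             # discrete Y, H(Y) = 2 bits

# Encode each 3D vector as a single discrete symbol so joint MI is countable.
x1 = [f"{a},0,0" for a in y]                  # X1 = [Y, 0, 0]
x2 = [f"{a},{a**2},{3.5*a}" for a in y]       # X2 = [Y, Y^2, 3.5Y]

# Both equal H(Y): adding redundant copies of Y changes nothing.
print(mutual_info_score(x1, y) / np.log(2))   # ~2 bits
print(mutual_info_score(x2, y) / np.log(2))   # ~2 bits
```

Both come out at $H(Y)$, confirming that plain MI is blind to this kind of redundancy: the joint distribution simply cannot say more about $Y$ than $H(Y)$.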
Category: Data Science

Visualizing mutual information of each convolution layer for image classification problem

I recently came across this paper, where the author has proposed a compression-based theory for understanding the layers of a DNN. To visualize what was going on, the authors showed Figure 2 of the paper, which is also available as a video here. For my image classification problem I want to visualize the mutual information in exactly this format. Can someone kindly explain to me how to calculate this numerically for images passing through conv layers in …
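The estimator behind those information-plane plots is usually a simple binning one: discretize each unit's activation, treat each binned activation pattern as one symbol $T$, and compute $I(T;Y)$ (and analogously $I(X;T)$) from counts. A sketch under those assumptions; the bin count here is hypothetical and materially affects the picture:

```python
import numpy as np

def discrete_mi_bits(t_ids, y):
    """MI (bits) between discretized activation patterns and labels, by counting."""
    joint = {}
    for a, b in zip(t_ids, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    n = len(y)
    pt, py = {}, {}
    for (a, b), c in joint.items():
        pt[a] = pt.get(a, 0) + c
        py[b] = py.get(b, 0) + c
    return sum(
        (c / n) * np.log2((c / n) / ((pt[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )

def layer_mi_with_labels(activations, labels, n_bins=30):
    """Bin each unit's activation, then treat the binned pattern as one symbol.

    activations: (n_samples, n_units) array, e.g. flattened conv feature maps.
    """
    edges = np.linspace(activations.min(), activations.max(), n_bins)
    binned = np.digitize(activations, edges)
    t_ids = [hash(row.tobytes()) for row in binned]   # one symbol per pattern
    return discrete_mi_bits(t_ids, labels)

# Tiny synthetic check: random "activations", 200 samples, 8 units, 2 classes.
# With few samples most patterns are unique, so the estimate saturates at H(Y);
# real use needs many samples and/or coarser bins, per layer and per epoch.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 8))
labels = rng.integers(0, 2, size=200)
print(layer_mi_with_labels(acts, labels))
```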
Category: Data Science

Combining "expert-assigned labels" and "real-observed labels"?

Combining "expert-assigned labels" and "real-observed labels"? That is, if I have a data set, where it's possible to have labels that are "true observations" and also labels that are "the expert strongly believes that these features should result in this label". Then how should these be combined? Particularly, these do give different information, but they also contain, possibly, different problems. Real-observed labels are assumed to be or are true. However, they might not always exist, but instead there may exist …
Category: Data Science

The meaning of the difference of two entropy values

I want to understand the meaning of the difference of two information entropy values. I have the following scenario. Let $x$ be the number of hours a user spends on some video sharing websites. Thus, we may have the sets $X_{A} = \{x_1,x_2,\cdots,x_{n_A}\}$ and $X_{B} = \{x_1,x_2,\cdots,x_{n_B}\}$ that represent the number of hours the users of $A$ and $B$ spent on the websites $A$ and $B$, respectively. Now, we can calculate the Cumulative distribution function (CDF) probability values, described here, …
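If the hours are discretized with a shared set of bins (the entropies are not comparable otherwise), the difference can be read as "how many more bits, on average, are needed to describe a user of B than a user of A". A sketch with made-up samples:

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
hours_a = rng.exponential(2.0, size=500)   # made-up watch-time samples for site A
hours_b = rng.exponential(4.0, size=500)   # made-up watch-time samples for site B

# Use the same bin edges for both, or the entropies are not comparable.
edges = np.histogram_bin_edges(np.concatenate([hours_a, hours_b]), bins=20)
p_a, _ = np.histogram(hours_a, bins=edges)
p_b, _ = np.histogram(hours_b, bins=edges)

h_a = entropy(p_a, base=2)   # scipy normalizes counts to probabilities
h_b = entropy(p_b, base=2)
print(h_b - h_a)  # here positive: B's usage is more spread out / less predictable
```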
Category: Data Science

Given a sequence of inputs/outputs and a set of nodes that modify that input, can you find the topology of a graph?

I am working on a problem where I have to model a graph topology, where the nodes are logic/arithmetic operations that can be applied to the input. The network receives a multi-dimensional input and returns a multi-dimensional processed output with fewer dimensions. The only things I have to work with are a set of input/output pairs and a set of probable nodes. The system itself follows a few constraints: the nodes are unique, so there is a limited amount of …
Category: Data Science

Information bottleneck and deep neural network

I learned about "the information bottleneck view of deep learning." But in a nutshell, what does this tell us? I don't see what the role is of depth in this approach as long as it is larger than 2 or 3. Is there a rigorous theory? Or just some hypothesis or heuristic explanations on deep neural net? I saw the author's talk on YouTube. But, probably my ignorance, I don't really get the main point and the implication is. I …
Category: Data Science

Calculating the entropy of a neural network

I am looking to calculate the information contained in a neural network. I am also looking to calculate the maximum information that can be contained by any neural network in a certain number of bits. These two measures should be comparable (i.e., I can check whether my current neural network has reached the maximum, or how far below it lies). Information is relative, so I define it relative to the real a priori distribution of the data that …
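One crude operationalization, offered only as a starting point: the storage format gives a trivial upper bound (32 bits per float32 parameter), and the entropy of the quantized weight distribution gives a rough, assumption-laden proxy for how much of that capacity is used (it treats weights as i.i.d., which they are not):

```python
import numpy as np

def max_bits(n_params, bits_per_param=32):
    """Trivial upper bound: a float32 network cannot hold more than 32 bits/param."""
    return n_params * bits_per_param

def empirical_weight_entropy_bits(weights, n_bins=256):
    """Entropy of the quantized weight distribution: one crude proxy among many."""
    counts, _ = np.histogram(weights, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum() * len(weights)

# Hypothetical usage: w = np.concatenate([p.ravel() for p in model_params])
w = np.random.default_rng(0).normal(size=10_000)   # stand-in for real weights
print(max_bits(w.size))                            # 320,000 bits
print(empirical_weight_entropy_bits(w))            # far below the bound
```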
Category: Data Science

How to measure the information of covariates in a ML task?

Background: Recently, I worked on two different ML projects. One is Lending Club loan prediction; the other uses a private dataset from the online-experiment field, predicting whether a customer will take the treatment. Both tasks are binary classification with 100+ million observations and about a hundred covariates. However, my Lending Club model has a very high PR-AUC (0.86), which shows good model performance. My online-experiment model suffers, with only a 0.03 PR-AUC; that model is somewhat useless. I …
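One model-agnostic way to ask "is there any signal in these covariates at all?" is to estimate the mutual information between each covariate and the target; a sketch on stand-in data (swap in your own matrix):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Stand-in data; replace with your covariates/target.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=4,
                           random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # estimates in nats, per feature
order = np.argsort(mi)[::-1]
for i in order[:5]:
    print(f"feature {i}: {mi[i] / np.log(2):.3f} bits")
# If every covariate is near 0 bits, a low PR-AUC reflects the data, not the model.
```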
Category: Data Science

Does a decision tree classifier calculate entropies before transforming categorical features using OneHotEncoder, or should the transformation be done first?

I am new to machine learning, and I've nearly reached the point of giving up on it, as online tutorials are pretty confusing as well. Entropy and decision trees: one of the confusing tutorials went as follows: Another tutorial was pretty straightforward and comprehensive in terms of how entropies and information gain are calculated, but he didn't split the data. There, the instructor started the entropy calculations and ended up with the following tree: I did understand the calculation …
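As far as scikit-learn is concerned, the order is unambiguous: `DecisionTreeClassifier` only accepts numeric input, so the encoding is a preprocessing step and the entropies are computed on the already-encoded 0/1 columns. A sketch with made-up toy data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy categorical data (made up).
X = pd.DataFrame({"outlook": ["sunny", "rain", "overcast", "rain", "sunny"],
                  "windy":   ["yes", "no", "no", "yes", "no"]})
y = [0, 1, 1, 0, 1]

# Encoding is a preprocessing step; the tree computes entropies on the
# already-encoded 0/1 columns, not on the raw categories.
model = make_pipeline(
    make_column_transformer((OneHotEncoder(), ["outlook", "windy"])),
    DecisionTreeClassifier(criterion="entropy"),
)
model.fit(X, y)
```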
Category: Data Science

What (probabilistic models) can only output decisions when they are certain?

I'm basically looking for approaches, models, algorithms for the following situation (a fault diagnosis problem): I have an input set $\{x_i\}_{i \in \{1..m\}}$ with $n$ binary features of cases (think of "faults" or "alarms" that fired) and $k$ classes. Each case $x_i$ can belong to at least one class and at most $k$ (so I'm dealing with multi-label classification). Now some relations in the data set are utterly boring/uninformative (say, feature $a$ says "Mechanical Error occurred" and label $b$ means …
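A standard name for this is classification with a reject option: output a decision only when the predicted probability clears a confidence threshold, and abstain otherwise. A sketch on a single binary label with stand-in data (for the multi-label case, apply it per label; the 0.9 threshold is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data; replace with the fault/alarm features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

THRESHOLD = 0.9    # arbitrary; in practice tuned so abstentions stay acceptable
p = clf.predict_proba(X)[:, 1]

pred = np.full(len(X), -1)                 # -1 = "abstain / ask a human"
pred[p >= THRESHOLD] = 1                   # confidently positive
pred[p <= 1 - THRESHOLD] = 0               # confidently negative
print(f"abstained on {(pred == -1).mean():.0%} of cases")
```

Well-calibrated probabilities matter here; a raw model's scores may need calibration before thresholding them is meaningful.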
Category: Data Science

Conditional Entropy and Mutual Information - Clustering evaluation

First of all, I am doing clustering and I have the true labels for my data. For evaluation, I am using the weighted average of the entropy values for each predicted cluster. I also came across mutual information as a similar approach while going over the alternatives. On my data, they seem to give similar results. However, there is one issue that puzzles me. Given the predicted cluster set $U$ and true clusters $V$, mutual information was defined as: …
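The similar results are no accident: the weighted average cluster entropy is exactly the conditional entropy $H(V \mid U)$, and $I(U;V) = H(V) - H(V \mid U)$, so the two rank clusterings identically whenever $H(V)$ is fixed. A sketch that checks the identity on made-up labels:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Made-up true labels (V) and predicted clusters (U).
true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])

def entropy_nats(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

# Weighted average entropy of true labels inside each predicted cluster = H(V|U).
h_v_given_u = sum(
    (pred == u).mean() * entropy_nats(true[pred == u]) for u in np.unique(pred)
)

# MI is the same quantity seen from the other side: I(U;V) = H(V) - H(V|U).
print(entropy_nats(true) - h_v_given_u)
print(mutual_info_score(true, pred))   # identical up to floating point
```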
Category: Data Science
