Learning the Average of a 0/1 Dependent Variable

Suppose I have a matrix $X$ and a dependent vector $y$ whose entries are each in $\{0,1\}$, dependent on the corresponding row of $X$. Given this dataset, I'd like to learn a model so that, given some other dataset $X'$, I could predict the average $\text{avg}(y')$ of the dependent-variable vector $y'$. Note that I'm only interested in the response at the aggregate level of the entire dataset. One way of doing so would be to train a calibrated binary classifier $X \to [0,1]$, apply it to $X'$, …
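One simple baseline for the approach described (a sketch, not an endorsed answer to the question): train a calibrated probabilistic classifier and report the mean predicted probability as the aggregate estimate. The data, function names, and the choice of logistic regression here are illustrative assumptions; at the maximum-likelihood solution of a logistic model with an intercept, the mean predicted probability matches the empirical positive rate on the training data.

```python
import math
import random

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    # One-feature logistic regression fitted by batch gradient descent.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def predict_average(w, b, xs_new):
    # Estimate avg(y') as the mean predicted probability over X'.
    probs = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs_new]
    return sum(probs) / len(probs)

# Toy data: y is more likely to be 1 for larger x.
random.seed(0)
xs = [random.uniform(-2.0, 2.0) for _ in range(400)]
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(-2.0 * x)) else 0 for x in xs]

w, b = train_logistic(xs, ys)
estimate = predict_average(w, b, xs)  # aggregate-level prediction
true_rate = sum(ys) / len(ys)         # actual average of y
```

In practice one would verify calibration (e.g. with a reliability curve) before trusting the averaged probabilities on a shifted dataset $X'$.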
Category: Data Science

Time series test data dilemma

I’m trying to build a model to predict the amount of sales of a product for the next few days. This question is about whether I should use the tail of the series as the test set and train models on the rest of the data, or create a test set by picking dates at random as usual. Reading about classical time series models (ARIMA), they recommend the first approach (using the last days as the test set), but …
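The first approach amounts to a temporal holdout. A minimal sketch (the function name and toy data are illustrative):

```python
def tail_split(series, test_size):
    # Hold out the last `test_size` observations, preserving temporal order,
    # so the test set simulates genuinely unseen future data.
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]

daily_sales = [12, 15, 14, 18, 20, 22, 21, 25, 27, 30]
train, test = tail_split(daily_sales, 3)
```

For a more robust estimate, the split can be repeated with a rolling origin (train on `series[:k]`, test on `series[k:k+h]` for increasing `k`) instead of a single tail split; random date sampling leaks future information into training.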
Category: Data Science

Why does the performance of DL models increase with the volume of data, while that of ML models flattens out or even decreases?

I have read some articles and noticed that many of them claim, for example, that DL is better than ML for large amounts of data. Typically: "The performance of machine learning algorithms decreases as the number of data increases" (source). Another one says the performance of ML models will plateau (source). As far as I understand, the more data, the better: it helps us fit complex models without overfitting, and the algorithms learn the data better, thus inferring decent patterns …
Topic: theory
Category: Data Science

Geometric Deep Learning - G-Smoothing operator on polynomials

(Note: my question revolves around a problem stated in the following lecture video: https://youtu.be/ERL17gbbSwo?t=413) Hi, I hope this is the right forum for this kind of question. I'm currently following the Geometric Deep Learning lectures (geometricdeeplearning.com) and find the topics fascinating. As I want to really dive in, I also wanted to follow up on the questions they pose to the students. In particular, my question revolves around creating invariant functions using the G-Smoothing operator (to enforce invariance, …
Category: Data Science

Creating a map between N images and N labels using CNN

I have seen classification CNNs that train with numerous images for a subset of labels (i.e. number of images >> number of labels); however, is it still possible to use CNNs when the number of images equals the number of labels? Specifically, consider having N settings that you can control to generate a unique image. Is it possible to make a CNN that can describe the mapping? (Is a CNN the right architecture to use?)
Category: Data Science

Proof of GOSS algorithm in lightGBM paper

In the LightGBM paper the authors use a newly developed sampling method, GOSS, to reduce the number of data instances needed for finding the best split of a given feature in a tree node. They give an estimate of the error made by sampling instead of using the entire data (Theorem 3.2 in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf). I am interested in the proof of this theorem, for which the paper refers to "supplementary materials". Where can I find those?
Category: Data Science

Use of multiple models vs training a single model for multiple outputs

So let's say I have data with numerical variables A, B, and C. I believe that the value of A has an effect on B. I also believe that A and B both have an effect on C. I don't think C has an effect on either A or B. I want to use machine learning to predict A, B, and C. I obviously have A and B as training data, and I have other variables as training data too. …
Category: Data Science

How to get the maximum likelihood estimate of the categorical distribution parameters using Lagrange optimization?

Let's say our data is discrete-valued and belongs to one of $K$ classes. The underlying probability distribution is assumed to be a categorical/multinoulli distribution given as $p(\textbf{x}) = \prod_{k = 1}^{K}\mu_{k}^{x_{k}}$, where $\textbf{x}$ is a one-hot vector given as $\textbf{x} = [x_{1} \; x_{2} \; ... \; x_{K}]^{T}$ and $\boldsymbol{\mu} = [\mu_{1} \; ... \; \mu_{K}]^{T}$ are the parameters. Suppose $D = \{\mathbf{x}_{1}, \text{ } \mathbf{x}_{2}, \text{ } ... ,\text{ }\mathbf{x}_{N}\}$ is our data. The log likelihood is: $\log p(D|\boldsymbol{\mu}) = \sum_{k = 1}^{K} …
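For readers wanting the standard completion of this maximization (a sketch using the same symbols as above, writing $N_k = \sum_{n=1}^{N} x_{nk}$ for the count of class $k$): form the Lagrangian with the simplex constraint $\sum_k \mu_k = 1$,

$$\mathcal{L}(\boldsymbol{\mu}, \lambda) = \sum_{k=1}^{K} N_k \log \mu_k + \lambda\left(1 - \sum_{k=1}^{K} \mu_k\right).$$

Setting $\partial \mathcal{L} / \partial \mu_k = N_k / \mu_k - \lambda = 0$ gives $\mu_k = N_k / \lambda$; substituting into the constraint yields $\lambda = \sum_k N_k = N$, hence the MLE is the empirical frequency $\hat{\mu}_k = N_k / N$.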
Category: Data Science

Explanation of inductive bias of Candidate Elimination Algorithm

The definition of inductive bias says: "The inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered." The inductive bias of Candidate Elimination is: "The target concept c is contained in the given hypothesis space H." My question is: how does this inductive bias help us predict the next instance in a given dataset?
Category: Data Science

Lasso (or Ridge) vs Bayesian MAP

This is the first time I have posted here. I am looking for some feedback or perspective on this question. To make it simple, let's just talk about linear models. We know the maximum likelihood solution with an $l_1$ penalty is the same as the Bayesian MAP estimate with a Laplace prior on each parameter. I'll show it here for convenience. For vector $Y$ with $n$ observations, matrix $X$, parameters $\beta$, and noise $\epsilon$ $$Y = X\beta + \epsilon,$$ the …
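For context, the correspondence alluded to can be sketched as follows (assuming Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and an i.i.d. Laplace prior $p(\beta_j) \propto \exp(-|\beta_j|/b)$):

$$\hat{\beta}_{\text{MAP}} = \arg\max_{\beta}\left[\log p(Y \mid X, \beta) + \log p(\beta)\right] = \arg\min_{\beta}\left[\frac{1}{2\sigma^2}\|Y - X\beta\|_2^2 + \frac{1}{b}\|\beta\|_1\right],$$

which is the lasso objective with $\lambda = 2\sigma^2 / b$ (up to the usual scaling conventions). Replacing the Laplace prior with a Gaussian prior turns the $l_1$ term into an $l_2$ term, recovering ridge regression.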
Category: Data Science

Is the hypothesis space spanned by kernel evaluations on datapoints equivalent to the hypothesis space of linear functionals in the feature space?

When studying kernel methods a few years ago, I got a bit confused by the concepts of feature space, hypothesis space, and reproducing kernel Hilbert space. Recently, I thought a little about questions I asked myself back then (with newly acquired math background) and noticed that some things are still unclear to me. I would appreciate help and pointers to good mathematical literature. Let's consider the following learning problem: we are given a training sample $((x_1, y_1), \dots, …
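One standard connection between the two spaces in the title, stated here as a sketch: with feature map $\phi: \mathcal{X} \to \mathcal{F}$ and kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$, the representer theorem says that any minimizer of a regularized empirical risk over the linear hypotheses $f(\cdot) = \langle w, \phi(\cdot) \rangle_{\mathcal{F}}$ admits the form

$$w = \sum_{i=1}^{n} \alpha_i \, \phi(x_i), \qquad f(x) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x),$$

so for the training problem the span of kernel evaluations on the data points suffices, even though the full space of linear functionals on $\mathcal{F}$ may be strictly larger.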
Category: Data Science

Can XGBoost support vector outputs?

I am interested in fitting data (regression rather than classification) whose individual targets are vectors, via an XGBoost-type model. However, currently Python's xgboost.XGBRegressor model only supports scalar targets. Looking at the original paper defining the algorithm, it seems possible that we could extend their methods using a vectorized form: Paper here. Following their notation, if one simply assumed that $f_t(x_i)$ is a vector in $\mathbb{R}^k$, I think the multi-dimensional analogue of equation (6) would be something like: $$\tilde{\mathcal{L}}^{(t)}(q) …
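For what it's worth, under the simplifying assumption that the second-order term stays scalar (per-output-independent loss with a common second derivative $h_i$ per instance), the vectorized analogues of the paper's optimal leaf weight and structure score would be something like

$$\mathbf{w}_j^* = -\frac{\sum_{i \in I_j} \mathbf{g}_i}{\sum_{i \in I_j} h_i + \lambda} \in \mathbb{R}^k, \qquad \tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T}\frac{\left\|\sum_{i \in I_j} \mathbf{g}_i\right\|_2^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T,$$

where $\mathbf{g}_i \in \mathbb{R}^k$ is the per-instance gradient vector. With a full per-instance Hessian matrix $H_i$ the scalar denominators become matrix inverses. This is a sketch under the stated assumption, not the paper's own equations.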
Category: Data Science

Structuring experiment/training data with months in mind

We're using a whole year's data to predict a certain target variable. The model works like: data -> one-hot encoding of the categorical variables -> MinMaxScaler -> PCA (to choose a subset of 2000 components out of the 15k) -> MLPRegressor. When we do ShuffleSplit cross-validation, everything is hunky-dory ($r^2$ scores above 0.9 and low error rates); however, in real life they're not going to use the data in the same format (e.g. a whole year's data), but rather a …
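One way to make the cross-validation match the deployment setting is to hold out whole months instead of shuffled rows. A minimal sketch (the function name and toy records are illustrative; scikit-learn's `GroupKFold` with month as the group achieves the same effect inside a pipeline):

```python
from datetime import date

def split_by_month(records, test_months):
    # Hold out whole (year, month) groups instead of shuffling rows, so the
    # evaluation mirrors how the model will see new months in production.
    train, test = [], []
    for day, row in records:
        (test if (day.year, day.month) in test_months else train).append((day, row))
    return train, test

# Two records per month for one year of toy data.
records = [(date(2023, m, d), [m * d]) for m in range(1, 13) for d in (1, 15)]
train, test = split_by_month(records, {(2023, 11), (2023, 12)})
```

A ShuffleSplit score of 0.9+ can be optimistic here, since rows from the same month (and the same seasonal regime) land in both train and test.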
Category: Data Science

How to use the eval set in catboost appropriately?

Let's say you have a dataset, and you split it into 80% training and 20% testing. Naturally, you want to find the optimal hyperparameters for your model, so with the training set you plan to do cross-validation and search the parameter space. CatBoost has something called the eval set, which is used to help avoid overfitting, but I have a fundamental question about how to use it appropriately. Say you do 10-fold CV. So now we have 10 iterations where 90% …
Category: Data Science

End-to-end machine learning project processes

I've read a book chapter that walks you through all the steps involved in an end-to-end machine learning project. After doing all the practical exercises I'm still not quite sure that my way of thinking about the whole process is right. I've tried to depict it in the following flowchart: Is this the right way of thinking about all the steps in an ML project? Is there something missing?
Category: Data Science

Would all classification models perform similarly in a theoretical and ideal scenario?

Imagine that we have infinite computation power, an infinite amount of data, and an infinite amount of time to wait for a model to learn. In such a scenario, we want some data binary-classified. My question is: would all classification models (we can leave out linear models because they won't be able to learn non-linear boundaries) perform similarly? In other words, are the problems solvable (in principle) by each (non-linear) classification algorithm the …
Category: Data Science

Which neural network is better?

MNIST dataset with 60,000 training samples and 10,000 test samples. Neural network #1: accuracy on the training set 99.53%; accuracy on the test set 99.31%. Neural network #2: accuracy on the training set 100.0%; accuracy on the test set 99.19%. Which neural network is better if other parameters are unknown? I have seen that many studies focus on accuracy on the test set and rarely report accuracy on the training set. The first neural network is better …
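The numbers quoted can be compared on two axes: test accuracy itself and the train/test gap. A tiny sketch of that comparison (the dictionaries simply restate the figures above):

```python
def generalization_gap(train_acc, test_acc):
    # The train/test gap is a rough overfitting signal: at similar test
    # accuracy, the model with the smaller gap generalizes more gracefully.
    return train_acc - test_acc

net1 = {"train": 0.9953, "test": 0.9931}
net2 = {"train": 1.0000, "test": 0.9919}

gap1 = generalization_gap(net1["train"], net1["test"])
gap2 = generalization_gap(net2["train"], net2["test"])
```

Here network #1 has both the higher test accuracy and the smaller gap; network #2's perfect training accuracy suggests it has memorized the training set.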
Category: Data Science

Theoretical basis for neural network "effort"

I might be in danger of having my question closed as "not clear what I'm asking for," but here goes. Suppose we have a simple feedforward network. It has a few layers, each layer has a "reasonable" number of neurons, nothing complicated. Let's say the output has size $n$, and there is no final activation function on the output. The network will have an "easier" time training to produce some outputs relative to others. In particular, outputs close to 0, …
Category: Data Science

Given M binary variables and R samples, what is the maximum number of leaves in a decision tree?

Given M binary variables and R samples, what is the maximum number of leaves in a decision tree? My first assumption was that the worst case would be a leaf for each sample, thus R leaves maximum. Am I wrong, and should there be some kind of connection with the number of variables M? I know that the maximum depth of a decision tree is M, as a variable can appear only once in a branch, but I don't see the …
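Both quantities mentioned do constrain the answer: a tree over M binary variables has at most $2^M$ distinct root-to-leaf paths, and a leaf is only realized if at least one of the R samples reaches it, so the bound is $\min(2^M, R)$. As a one-line sketch:

```python
def max_leaves(num_binary_features, num_samples):
    # Each root-to-leaf path can test a binary feature at most once, giving at
    # most 2**M leaves; every leaf also needs at least one training sample,
    # which independently caps the count at R.
    return min(2 ** num_binary_features, num_samples)
```

So with few features the feature count dominates (e.g. M=3, R=100 gives 8), while with many features the sample count dominates (M=20, R=50 gives 50).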
Category: Data Science

How to input a list into my model and not have it care about order

I'm trying to predict a list of numbers, e.g. [23,55,198,200,64]. The data I have includes multiple things, among them: the numbers from the previous run (these numbers come from scientific experiments), and a list of all previous lists of numbers. So for example, if two runs ago we got [22,24,77,187,21], and the run after that we got [90,22,76,88,29], we would now have a list of [[22,24,77,187,21],[90,22,76,88,29]]. The important thing is that it doesn't matter what order the numbers are in. [22,24,77,187,21] …
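One common way to make a model ignore order is to feed it permutation-invariant summaries of each list rather than the raw sequence. A minimal sketch (the statistics chosen here are illustrative):

```python
def order_invariant_features(values):
    # Permutation-invariant summary statistics: any reordering of `values`
    # produces exactly the same feature vector.
    n = len(values)
    return [sum(values) / n, min(values), max(values), sum(values)]

a = order_invariant_features([22, 24, 77, 187, 21])
b = order_invariant_features([187, 21, 22, 24, 77])  # same numbers, shuffled
```

Alternatives include sorting each list into a canonical order before feeding it in, or using an architecture with permutation-invariant pooling over set elements (sum or mean pooling, as in the Deep Sets line of work).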
Category: Data Science
