Suppose I have a matrix $X$ and a dependent vector $y$ whose entries are each in $\{0,1\}$, dependent on the corresponding row of $X$. Given this dataset, I'd like to learn a model so that, given some other dataset $X'$, I could predict the average of the dependent-variable vector $y'$. Note that I'm only interested in the response at the aggregate level of the entire dataset. One way of doing so would be to train a calibrated binary classifier $X \to y$, apply it to $X'$, …
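For what it's worth, here is a minimal sketch of that approach in scikit-learn; the names (`X`, `y`, `X_new`) and the logistic-regression base model are placeholders, not part of the original question:

```python
# Sketch: train a calibrated classifier, then estimate the aggregate mean of y' on new data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)
X_new = rng.normal(size=(300, 5))          # the "other dataset" X'

# Calibration matters here: we average predicted probabilities, so they must be well scaled.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
clf.fit(X, y)

# Estimate average(y') by averaging predicted P(y = 1 | x) over the rows of X'.
estimated_mean = clf.predict_proba(X_new)[:, 1].mean()
print(f"estimated average of y': {estimated_mean:.3f}")
```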
I’m trying to build a model to predict the amount of sales of a product for the next few days. This question is about whether I should use the tail of the series as the test set and train models on the rest of the data, or whether I should create a test set by picking dates at random, as usual. Reading about classical time-series models (ARIMA), they recommend the first approach (using the last days as the test set), but …
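To make the two strategies concrete, here is a small sketch assuming the daily sales live in an array ordered by date (all names are illustrative); scikit-learn's `TimeSeriesSplit` implements the rolling-origin version of the "tail as test set" idea:

```python
# Sketch: tail hold-out vs. random split for a time-ordered target.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

sales = np.arange(100)                      # stand-in for daily sales, ordered in time

# Option 1: hold out the tail of the series (what the ARIMA literature recommends).
train, test = sales[:-14], sales[-14:]      # last 14 days as the test set

# Option 2: random split; for time series this can leak future information into training.
train_rand, test_rand = train_test_split(sales, test_size=0.2, shuffle=True, random_state=0)

# Rolling-origin evaluation: every test fold comes strictly after its training fold.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(sales):
    print(f"train up to day {train_idx[-1]}, test days {test_idx[0]}-{test_idx[-1]}")
```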
I have read some articles and realized that many of them claim, for example, that DL is better than ML for large amounts of data. Typically: "The performance of machine learning algorithms decreases as the amount of data increases" (source). Another one says the performance of ML models will plateau (source). As far as I understand, the more data, the better: it helps us fit complex models without overfitting, and the algorithms learn the data better, thus inferring decent patterns …
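One way to probe such claims empirically is a learning curve: performance as a function of training-set size. A sketch, assuming a scikit-learn classifier and a synthetic dataset (everything here is illustrative, not from the cited articles):

```python
# Sketch: check the "more data -> better, until a plateau" claim with a learning curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> CV accuracy {score:.3f}")
```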
(Note: my question is about a problem stated in the following lecture video: https://youtu.be/ERL17gbbSwo?t=413.) Hi, I hope this is the right forum for this kind of question. I'm currently following the geometric deep learning lectures from geometricdeeplearning.com and find the topics fascinating. As I want to really dive in, I also wanted to follow up on the questions they pose to the students. In particular, my question revolves around creating invariant functions using the G-smoothing operator (to enforce invariance, …
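To make my mental model of G-smoothing concrete, here is a toy sketch: average an arbitrary function over the orbit of a (small, finite) group, here cyclic shifts of a vector. The function `f` and the choice of group are my own placeholders, not from the lectures:

```python
# Toy G-smoothing: averaging f over the group of cyclic shifts makes it shift-invariant.
import numpy as np

def f(x):
    # Some arbitrary, non-invariant function of a vector.
    return float(np.dot(x, np.arange(len(x))))

def g_smooth(f, x):
    # Average f over all cyclic shifts of x; the result is invariant to cyclic shifts of x.
    return float(np.mean([f(np.roll(x, k)) for k in range(len(x))]))

x = np.array([1.0, 2.0, 3.0, 4.0])
print(f(x), f(np.roll(x, 1)))                       # f is not shift-invariant
print(g_smooth(f, x), g_smooth(f, np.roll(x, 1)))   # the smoothed version is
```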
I have seen classification CNNs that train with numerous images for a subset of labels (i.e. number of images >> number of labels); however, is it still possible to use CNNs when the number of images equals the number of labels? Specifically, consider having N settings that you can control to generate a unique image. Is it possible to make a CNN that can describe the mapping? (Is a CNN the right architecture to use?)
In the LightGBM paper, the authors make use of a newly developed sampling method, GOSS, to reduce the number of data instances needed for finding the best split of a given feature in a tree node. They give an estimate of the error made by sampling instead of using the entire data (Theorem 3.2 in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf). I am interested in the proof of this theorem, for which the paper refers to "supplementary materials". Where can I find those?
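Not the supplementary proof, but to fix notation for anyone answering, this is the GOSS sampling step as I read the paper: keep the top-$a$ fraction of instances by $|$gradient$|$, sample a $b$ fraction of the rest, and up-weight the sampled small-gradient instances by $(1-a)/b$. A sketch with my own variable names:

```python
# Sketch of the GOSS sampling step (my reading of the paper, names are my own).
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=np.random.default_rng(0)):
    n = len(gradients)
    top_k, rand_k = int(a * n), int(b * n)
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:top_k]                          # large-gradient instances, always kept
    rest_idx = rng.choice(order[top_k:], size=rand_k, replace=False)
    idx = np.concatenate([top_idx, rest_idx])
    weights = np.ones(len(idx))
    weights[top_k:] = (1.0 - a) / b                  # compensate for down-sampling the rest
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w[:3], w[-3:])
```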
So let's say I have data with numerical variables A, B and C. I believe that the value of A has an effect on B. I also believe that A and B both have an effect on C. I don't think C has an effect on either A or B. I want to use machine learning to predict A, B and C. I obviously have A and B as training data, and I have other variables as training data too. …
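One way to encode that assumed structure (A → B, {A, B} → C) is to chain models and feed predictions forward. A minimal sketch with plain scikit-learn regressors; the data-generating process and variable names below are made up for illustration:

```python
# Sketch: chained regressors following the assumed causal order A -> B -> C.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
other = rng.normal(size=(500, 3))                     # the "other variables"
A = other @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=500)
B = 2.0 * A + rng.normal(size=500)                    # B depends on A
C = A - 0.5 * B + rng.normal(size=500)                # C depends on A and B

model_A = LinearRegression().fit(other, A)
model_B = LinearRegression().fit(np.column_stack([other, A]), B)
model_C = LinearRegression().fit(np.column_stack([other, A, B]), C)

# At prediction time the chain is applied in causal order, feeding predictions forward.
A_hat = model_A.predict(other)
B_hat = model_B.predict(np.column_stack([other, A_hat]))
C_hat = model_C.predict(np.column_stack([other, A_hat, B_hat]))
```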
Let's say our data is discrete-valued and belongs to one of $K$ classes. The underlying probability distribution is assumed to be a categorical/multinoulli distribution given as $p(\textbf{x}) = \prod_{k = 1}^{K}\mu_{k}^{x_{k}}$, where $\textbf{x}$ is a one-hot vector given as $\textbf{x} = [x_{1}\; x_{2}\; \dots\; x_{K}]^{T}$ and $\boldsymbol{\mu} = [\mu_{1}\; \dots\; \mu_{K}]^{T}$ are the parameters. Suppose $D = \{\mathbf{x}_{1}, \mathbf{x}_{2}, \dots, \mathbf{x}_{N}\}$ is our data. The log likelihood is: $\log p(D|\boldsymbol{\mu}) = \sum_{k = 1}^{K} …
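In case it helps to have the expression written out, my understanding of the standard form of that log likelihood and its maximizer (writing $N_{k} = \sum_{n} x_{nk}$ for the count of class $k$) is:

$$\log p(D|\boldsymbol{\mu}) = \sum_{n = 1}^{N}\sum_{k = 1}^{K} x_{nk}\log\mu_{k} = \sum_{k = 1}^{K} N_{k}\log\mu_{k},$$

and maximizing subject to $\sum_{k}\mu_{k} = 1$ with a Lagrange multiplier $\lambda$ gives

$$\frac{N_{k}}{\mu_{k}} - \lambda = 0 \;\Rightarrow\; \hat{\mu}_{k} = \frac{N_{k}}{N}.$$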
The definition of inductive bias says that "the inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered." The inductive bias of candidate elimination says that "the target concept c is contained in the given hypothesis space H." My question is: how does this inductive bias help us predict the next instance in a given dataset?
This is the first time I have posted here. I am looking for some feedback or perspective on this question. To make it simple, let's just talk about linear models. We know the maximum-likelihood solution of the $l_1$-regularized objective is the same as the Bayesian MAP estimate with a Laplace prior on each parameter. I'll show it here for convenience. For vector $Y$ with $n$ observations, matrix $X$, parameters $\beta$, and noise $\epsilon$, $$Y = X\beta + \epsilon,$$ the …
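For reference, here is how I understand that correspondence written out, assuming Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and a Laplace prior with scale $b$ on each $\beta_j$ (notation beyond $Y$, $X$, $\beta$, $\epsilon$ is mine):

$$p(\beta_j) = \frac{1}{2b}\exp\!\left(-\frac{|\beta_j|}{b}\right), \qquad -\log p(\beta \mid Y, X) = \frac{1}{2\sigma^2}\lVert Y - X\beta\rVert_2^2 + \frac{1}{b}\lVert\beta\rVert_1 + \text{const},$$

so the MAP estimate coincides with the $l_1$-penalized least-squares solution with penalty weight $\lambda = 2\sigma^2/b$ (up to the scaling convention chosen for $\lambda$).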
When studying kernel methods a few years ago, I got a bit confused by the concepts of feature space, hypothesis space and reproducing kernel Hilbert space. Recently, I thought a little about the questions I asked myself back then (with newly acquired math background) and noticed that some things are still unclear to me. I appreciate help and pointers to good - mathematical - literature. Let's consider the following learning problem: we are given a training sample $((x_1, y_1), \dots, …
I am interested in fitting data (regression rather than classification) whose individual targets are vectors via an XGBoost-type model. However, currently Python's xgboost.XGBRegressor model only supports scalar targets. Looking at the original paper defining the algorithm, it seems possible we could just extend their methods using a vectorized form: Paper here. Following their notation, if one simply assumed that $f_t(x_i)$ is a vector in $\mathbb{R}^k$, I think the multi-dimensional analogue of equation (6) would be something like: $$\tilde{\mathcal{L}}^{(t)}(q) …
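Not the vectorized extension described above, but a common workaround in the meantime is to wrap XGBRegressor so that one independent model is fit per output dimension (which does not share tree structure across targets the way the vectorized objective would). A sketch with made-up data:

```python
# Sketch: vector-valued regression via one XGBoost model per output dimension.
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=500),
                     X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=500)])  # targets in R^2

model = MultiOutputRegressor(XGBRegressor(n_estimators=200, max_depth=3))
model.fit(X, Y)
print(model.predict(X[:3]).shape)            # (3, 2): one prediction vector per row
```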
We're using a whole year's data to predict a certain target variable. The pipeline is: data → one-hot encoding of the categorical variables → MinMaxScaler → PCA (to choose a subset of 2000 components out of the 15k) → MLPRegressor. When we do a ShuffleSplit cross-validation, everything is hunky-dory (r^2 scores above 0.9 and low error rates); however, in real life they're not going to use the data in the same format (e.g. a whole year's data), but rather a …
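For reference, here is a sketch of that pipeline wired up so that every preprocessing step is re-fit inside each CV split; the column names, sizes and hyperparameters are placeholders (e.g. 3 PCA components instead of the real 2000):

```python
# Sketch: one-hot -> MinMaxScaler -> PCA -> MLPRegressor, evaluated with ShuffleSplit CV.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat": rng.choice(list("abc"), size=400),
    "num1": rng.normal(size=400),
    "num2": rng.normal(size=400),
})
y = df["num1"] * 2 + (df["cat"] == "a") + 0.1 * rng.normal(size=400)

pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
     ("scale", MinMaxScaler(), ["num1", "num2"])],
    sparse_threshold=0.0,                     # keep the output dense for PCA / MLP
)
pipe = Pipeline([
    ("pre", pre),
    ("pca", PCA(n_components=3)),             # stand-in for the 2000 components
    ("mlp", MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)),
])
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
print(cross_val_score(pipe, df, y, cv=cv, scoring="r2").mean())
```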
Let's say you have a dataset, and you split it into 80% training and 20% testing. Naturally, you want to find the optimal hyperparameters for your model, so with the training set you plan to do cross-validation and search the parameter space. CatBoost has something called the eval set, which is used to help avoid overfitting, but I have a fundamental question on how to use it appropriately. Say you do 10-fold CV. So now we have 10 iterations where 90% …
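To make the question concrete, here is one pattern people commonly use (a sketch, not an official CatBoost recommendation): inside each CV fold, carve a further validation slice out of that fold's training portion and use it only as the eval set for early stopping.

```python
# Sketch: eval_set inside a CV loop, used only for early stopping within each fold.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[train_idx], y[train_idx], test_size=0.15, random_state=0)
    model = CatBoostClassifier(iterations=500, verbose=False)
    model.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=30)
    scores.append(model.score(X[test_idx], y[test_idx]))   # fold accuracy
print(np.mean(scores))
```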
I've read a book chapter that walks you through all the steps involved in an end-to-end machine learning project. After doing all the practical exercises, I'm still not quite sure that my way of thinking about the whole process is right. I've tried to depict it in the following flowchart. Is this the right way of thinking about all the steps in an ML project? Is there something missing?
Imagine that we have infinite computation power, an infinite amount of data, and an infinite amount of time to wait for a model to learn. In such a scenario, we want to binary-classify some data. My question is: would all classification models (we can leave out linear models because they won't be able to learn non-linear boundaries) perform similarly? In other words, are all the problems (in principle) solvable by each (non-linear) classification algorithm the …
MNIST dataset with 60,000 training samples and 10,000 test samples. Neural network #1: accuracy on the training set 99.53%, accuracy on the test set 99.31%. Neural network #2: accuracy on the training set 100.0%, accuracy on the test set 99.19%. Which neural network is better if other parameters are unknown? I have seen that many studies focus on accuracy on the test set and rarely report accuracy on the training set. The first neural network is better …
I might be in danger of having my question closed as "not clear what I'm asking for," but here goes. Suppose we have a simple feedforward network. It has a few layers, each with a "reasonable" number of neurons; nothing complicated. Let's say the output has size $n$, and there is no final activation function on the output. The network will have an "easier" time training to produce some outputs relative to others. In particular, outputs close to 0, …
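A quick numerical check of that intuition (purely illustrative, with my own layer sizes and Xavier-style initialization): at initialization, the raw outputs of such a network concentrate around zero, so targets near zero are "closer" to where training starts.

```python
# Sketch: output distribution of a freshly initialized feedforward net with no output activation.
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    return rng.normal(0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)            # ReLU on hidden layers, none on the output
    return x

layers = [init_layer(32, 64), init_layer(64, 64), init_layer(64, 4)]   # output size n = 4
outputs = forward(rng.normal(size=(10000, 32)), layers)
print(outputs.mean(), outputs.std())          # mean near 0, modest spread
```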
Given M binary variables and R samples, what is the maximum number of leaves in a decision tree? My first assumption was that the worst case would be a leaf for each sample, thus R leaves at most. Am I wrong, and should there be some connection with the number of variables M? I know that the maximum depth of a decision tree is M, as a variable can appear only once in a branch, but I don't see the …
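My current reading of the bound, written out as a one-liner (happy to be corrected): a tree over M binary features cannot distinguish more than 2^M inputs, and it never needs more leaves than there are samples, so the number of leaves would be at most min(2^M, R).

```python
# Illustration of that reasoning with small numbers.
M, R = 5, 100
print(min(2 ** M, R))   # 32 leaves at most in this example
```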
I'm trying to predict a list of numbers, e.g. [23,55,198,200,64]. The data I have includes multiple things, among them: the numbers from the previous run (these numbers come from scientific experiments), and a list of all previous lists of numbers. So, for example, if two runs ago we got [22,24,77,187,21], and in the run after that we got [90,22,76,88,29], we would now have the list [[22,24,77,187,21],[90,22,76,88,29]]. The important thing is that it doesn't matter what order the numbers are in. [22,24,77,187,21] …
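One way to respect the "order doesn't matter" constraint is to represent each run by order-invariant features (sorted values and/or summary statistics) before feeding it to a model. A sketch, with placeholder names:

```python
# Sketch: permutation-invariant representation of each run.
import numpy as np

previous_runs = [[22, 24, 77, 187, 21], [90, 22, 76, 88, 29]]

def invariant_features(run):
    run = np.asarray(run, dtype=float)
    # Sorting (or summary statistics) makes the representation independent of ordering.
    return np.concatenate([np.sort(run), [run.mean(), run.std(), run.min(), run.max()]])

X = np.stack([invariant_features(r) for r in previous_runs])
print(X.shape)                                # one fixed-length, order-invariant row per run
```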