Intro There are several questions on this site about whether or not machine learning can solve specific problems. The answer (in my words) seems to be: "Yes, trivially, if you choose a model to learn your specific problem, but you sometimes may choose a model that can't represent/approximate the correct hypothesis." I would like to choose a neural network model where, a priori, all I know is that the input is a "linear algebra" kind of function. The Problem I …
I am finding it hard to understand the clear difference between Hypothesis and Hyperplane. I know that Hypothesis is a candidate model that maps inputs to outputs after training. And, Hyperplane is the decision boundary in a classification algorithm. But, I can't seem to understand how the two are differentiated in equations. Can someone help me understand their differences in equations with some visualizations?
I'm dealing with modeling small experimental data sets. As most experimental work does not generate thousands of samples, but rather a handful, I need to be inventive about how to deal with this small number of data sets (say 10-20). I've been building a nice framework to do just this, and at this point, I am interested in generating error bars with the predicted values. In a rough outline, this is what happens in the framework (e.g. when applying a …
I'm working on a problem that involves a system of equations. I have a group of people, and each person in the group works on different tasks (e.g. n tasks in total) and completes them. I'd like to estimate the time each type of task takes. I have equations like the one below: # of days person i worked = time(task1) * # of type-1 tasks completed + time(task2) * …
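One way to read the equations above is as an overdetermined linear system: one row per person, one column per task type, and the unknown vector holding the time per task type. The sketch below assumes that reading and uses made-up counts and times (the names `estimate_task_times`, `counts`, and `true_times` are hypothetical); it solves the system in the least-squares sense with `np.linalg.lstsq`.

```python
import numpy as np

def estimate_task_times(task_counts, days_worked):
    """Least-squares estimate of the time per task type.

    task_counts: (people x task types) matrix of completed-task counts.
    days_worked: vector with the number of days each person worked.
    """
    counts = np.asarray(task_counts, dtype=float)
    days = np.asarray(days_worked, dtype=float)
    times, _residuals, _rank, _singular_values = np.linalg.lstsq(counts, days, rcond=None)
    return times

# Hypothetical data: 4 people, 2 task types (numbers invented for illustration).
counts = np.array([[2, 1],
                   [1, 3],
                   [4, 0],
                   [0, 2]])
true_times = np.array([1.5, 2.0])   # assumed days per task of each type
days = counts @ true_times          # noise-free days worked per person

estimated = estimate_task_times(counts, days)
print(estimated)
```

With real data the equations will not hold exactly, so the estimate is the one that minimises the squared mismatch in days worked; it is only well determined if there are at least as many people as task types and the count matrix has full column rank.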
I'm studying PCA and my professor said something about finding the linear regression by taking the dot product of both axes. Could someone explain to me why? The dot product returns a number. What's the relationship between that number and the linear regression? In my example, I have two vectors $stat\_grade = [0,1,3,7,10]$ $physics\_grade = [1,5,8,10,10]$ The first step is normalizing them: $ \frac{physics\_grade - mean(physics\_grade)}{std(physics\_grade)} = [-1.69131435, -0.52489066, 0.34992711, 0.93313895, 0.93313895]$ $ \frac{stat\_grade - mean(stat\_grade)}{std(stat\_grade)} = [-1.11613741, -0.85039041, -0.3188964 …
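A small numerical check may help with the question above. For two vectors standardized with the population standard deviation (NumPy's default, `ddof=0`), the dot product divided by the number of samples equals the Pearson correlation, and that same number is the slope of the least-squares line between the standardized variables. The sketch below uses the two grade vectors from the question; the helper name `standardize` is an assumption for illustration.

```python
import numpy as np

stat_grade = np.array([0, 1, 3, 7, 10], dtype=float)
physics_grade = np.array([1, 5, 8, 10, 10], dtype=float)

def standardize(v):
    # Subtract the mean and divide by the population std (np.std default, ddof=0).
    return (v - v.mean()) / v.std()

z_stat = standardize(stat_grade)
z_phys = standardize(physics_grade)

# Dot product of the standardized vectors, averaged over the samples.
r = z_stat @ z_phys / len(z_stat)

# Least-squares slope of z_phys on z_stat (no intercept needed: both have mean 0).
slope = (z_stat @ z_phys) / (z_stat @ z_stat)

print(r, slope, np.corrcoef(stat_grade, physics_grade)[0, 1])
```

All three printed numbers agree, which is the link between "that number" and the regression line: on standardized axes the fitted line is `z_phys = r * z_stat`.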
So I recently started with Andrew Ng's ML Course and this is the formula that Andrew lays out for calculating gradient descent on a linear model. $$ \theta_j = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{simultaneously update } \theta_j \text{ for all } j$$ As we see, the formula asks us to sum over all the rows in the data. However, the code below doesn't work if I apply np.sum() def gradientDescent(X, y, theta, alpha, num_iters): …
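The asker's code is cut off above, so the following is not a repair of it but a minimal sketch of the update rule itself. The sum over $i$ can be folded into the matrix product `X.T @ errors`, which sums over the rows separately for each $\theta_j$. Calling `np.sum()` without an `axis` argument would instead collapse everything into a single scalar, which is one plausible reason a naive `np.sum()` version misbehaves. The function name `gradient_descent` and the toy data are assumptions.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for linear regression with h_theta(x) = x . theta."""
    m = len(y)
    theta = np.asarray(theta, dtype=float).copy()
    for _ in range(num_iters):
        errors = X @ theta - y           # h_theta(x^(i)) - y^(i), shape (m,)
        gradient = X.T @ errors / m      # sums over the m rows, one entry per theta_j
        theta = theta - alpha * gradient # simultaneous update of every theta_j
    return theta

# Hypothetical toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # first column of ones for the intercept
y = 1.0 + 2.0 * x

theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=5000)
print(theta)
```

An equivalent element-wise form is `np.sum(errors[:, None] * X, axis=0) / m`; the `axis=0` is what keeps one gradient entry per parameter.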
I have a set of values for a surface (in pixels) that becomes bigger over time (exponentially). The surface consists of cells that divide over time. After doing some modelling, I came up with the following formula: $$S(t)=S_{initial}2^{t/a_d},$$ where $a_d$ is the age at which the cell divides. $S_{initial}$ is known. I am trying to estimate $a_d$. I simply tried the $\chi^2$ test: # Range of ages of division. a_range = np.linspace(1, 500, 100) # Set up an empty vector …
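Since the code above is truncated, here is a hedged sketch of two ways $a_d$ could be estimated from the model $S(t)=S_{initial}2^{t/a_d}$. The first mirrors the grid idea (scan `a_range` and keep the value with the smallest unweighted sum of squared residuals, which is a simplification of a proper $\chi^2$ with measurement errors). The second uses the fact that $\log_2(S/S_{initial}) = t/a_d$ is linear in $t$. The synthetic data, `true_a_d`, and the function names are assumptions for illustration only.

```python
import numpy as np

def surface(t, s_initial, a_d):
    # Model from the question: S(t) = S_initial * 2^(t / a_d).
    return s_initial * 2.0 ** (t / a_d)

def fit_a_d_grid(t, s_obs, s_initial, a_range):
    # Sum of squared residuals for every candidate age of division.
    sse = np.array([np.sum((s_obs - surface(t, s_initial, a)) ** 2) for a in a_range])
    return a_range[np.argmin(sse)]

def fit_a_d_loglinear(t, s_obs, s_initial):
    # log2(S / S_initial) = t / a_d, so fit a line through the origin.
    z = np.log2(s_obs / s_initial)
    slope = (t @ z) / (t @ t)
    return 1.0 / slope

# Synthetic, noise-free data (hypothetical values).
a_range = np.linspace(1, 500, 100)
t = np.linspace(0, 200, 21)
s_initial = 50.0
true_a_d = 100.0
s_obs = surface(t, s_initial, true_a_d)

a_grid = fit_a_d_grid(t, s_obs, s_initial, a_range)
a_log = fit_a_d_loglinear(t, s_obs, s_initial)
print(a_grid, a_log)
```

The grid estimate can only be as precise as the spacing of `a_range`, while the log-linear fit weights the data differently (it minimises errors in log space), so the two need not agree exactly on noisy data.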
I'm trying to create a NN whose input is a (length m) array of 3d vectors $$\vec{x}_i = [x_{i,1},x_{i,2},x_{i,3}], \hspace{5mm}i=1:m $$ and whose output is a similarly sized array: $$\vec{h}_{\theta,i} = [h_{\theta,i1},h_{\theta,i2},h_{\theta,i3}], \hspace{5mm}i=1:m $$ BUT, my only training data is not 3d vectors but rather the magnitude/norm of such vectors (with no knowledge of the vector components ($\lambda's$) themselves): $$y_i= ||[\lambda_{i,1},\lambda_{i,2},\lambda_{i,3}]||, \hspace{5mm}i=1:m $$ So, my concept is to use the cost function: $$ J = \frac{1}{2m}\sum (||\vec{h}_{\theta,i}|| - ||y_i||)^2 $$ …
I am trying to perform Face recognition using PCA (eigenfaces). I have a set of N training images (of dimensions M=wxh), which I have pre-processed into a vertical stack of grayscale intensity vectors, a matrix of dimensions NxM. For the facial recognition, I am finding the single nearest neighbour of each test image in both the high-dimensional pixel space and the lower dimensional eigenspace. I am using NearestNeighbor classifier from sklearn. For recognition in the eigenspace, I am contrasting different …
Suppose a support vector machine for separating pluses from minuses finds a plus support vector at x1=(1,0) and a minus support vector at x2=(0,1). Determine the values of w and b.
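A sketch of one way to work this out, under assumptions the question does not state explicitly: a hard-margin linear SVM, these two points being the only support vectors, and the usual canonical scaling $w\cdot x_+ + b = +1$, $w\cdot x_- + b = -1$. In that case $w$ points along $x_+ - x_-$, and subtracting and adding the two constraints gives $w = 2(x_+ - x_-)/\Vert x_+ - x_-\Vert^2$ and $b = -w\cdot(x_+ + x_-)/2$. The function name is hypothetical.

```python
import numpy as np

def svm_from_two_support_vectors(x_plus, x_minus):
    """w and b for a hard-margin SVM defined by one plus and one minus support vector."""
    x_plus = np.asarray(x_plus, dtype=float)
    x_minus = np.asarray(x_minus, dtype=float)
    diff = x_plus - x_minus
    w = 2.0 * diff / (diff @ diff)          # direction x_plus - x_minus, scaled so the margins are +/-1
    b = -(w @ (x_plus + x_minus)) / 2.0     # boundary passes through the midpoint
    return w, b

w, b = svm_from_two_support_vectors([1, 0], [0, 1])
print(w, b)
```

For the points in the question this gives $w = (1, -1)$ and $b = 0$, and both constraints can be checked by hand: $w\cdot(1,0) + b = 1$ and $w\cdot(0,1) + b = -1$.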
The book Deep Learning by Ian Goodfellow states that: Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables. What is meant by "interaction between variables"? How do nonlinear models find it? It would be great if someone could give an intuitive/graphical/geometrical explanation.
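As a rough numerical illustration of what "interaction" can mean, take a target that depends on the product of two inputs, $y = x_1 x_2$: the effect of $x_1$ on $y$ changes with the value of $x_2$. The sketch below (synthetic data, hypothetical helper `r_squared`) fits a plain linear model and then the same model with a hand-added $x_1 x_2$ feature. A nonlinear model would have to learn such a term itself; adding it manually here is only a stand-in for that.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 200)
x2 = rng.uniform(-1, 1, 200)
y = x1 * x2   # pure interaction: no additive effect of x1 or x2 alone

def r_squared(features, target):
    # Ordinary least squares with an intercept, then the coefficient of determination.
    A = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)

r2_linear = r_squared([x1, x2], y)                # w0 + w1*x1 + w2*x2
r2_interaction = r_squared([x1, x2, x1 * x2], y)  # same, plus the product term
print(r2_linear, r2_interaction)
```

The purely linear fit explains almost none of the variance, while the fit with the product feature is essentially exact, which is the sense in which a model linear in $x_1$ and $x_2$ cannot represent their interaction.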
I have been trying to understand the convolution lowering operation shown in the cuDNN paper. I was able to understand most of it by reading through and mapping various parameters to the image below. However, I am unable to understand how the original input data (NCHW) was converted into the Dm matrix shown in red. The ordering of the elements of the Dm matrix does not make sense. Can someone please explain this?
I've been looking for methods to compute a pseudo-inverse of a covariance matrix, and found that one way is to construct a regularized inverse matrix: construct the eigensystem, remove the least significant eigenvalues, and then use the remaining eigenvalues and eigenvectors to form an approximate inverse. Could anyone explain the idea behind this? Thanks in advance.
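The procedure described above can be sketched as follows, assuming a symmetric covariance matrix $C = V \Lambda V^T$: directions with (near-)zero eigenvalues carry no usable variance, and inverting them would blow up noise, so they are dropped and only the remaining eigenvalues are inverted. The function name `truncated_eig_pinv`, the threshold `rel_tol`, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def truncated_eig_pinv(cov, rel_tol=1e-10):
    """Approximate inverse of a symmetric matrix from its largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: for symmetric matrices
    keep = eigvals > rel_tol * eigvals.max()      # drop the least significant eigenvalues
    inv_vals = 1.0 / eigvals[keep]
    V = eigvecs[:, keep]
    return (V * inv_vals) @ V.T                   # V diag(1/lambda) V^T on the kept subspace

# Hypothetical rank-deficient covariance: the third variable is the sum of the first two.
rng = np.random.default_rng(0)
a = rng.normal(size=(50, 2))
data = np.column_stack([a, a[:, 0] + a[:, 1]])
cov = np.cov(data, rowvar=False)

cov_pinv = truncated_eig_pinv(cov)
print(cov_pinv)
```

With a threshold this small the result coincides with the Moore-Penrose pseudo-inverse; raising `rel_tol` removes more eigenvalues and gives a more strongly regularized, lower-rank approximation.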
I can follow classical linear regression steps: $Xw=y$ $X^{-1}Xw=X^{-1}y$ $Iw=X^{-1}y$ $w=X^{-1}y$ However, on implementing in Python, I see that instead of simply using w = inv(X).dot(y) they apply w = inv(X.T.dot(X)).dot(X.T).dot(y) What explains the transposes and the extra multiplications here? I'm confused...
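A short check of why the longer expression appears. The step $w = X^{-1}y$ only works when $X$ is square and invertible, but a data matrix usually has more rows (samples) than columns (features). Multiplying $Xw = y$ by $X^T$ gives the square system $X^TXw = X^Ty$, whose solution $w = (X^TX)^{-1}X^Ty$ is the least-squares fit when $X$ has full column rank. The data below is synthetic and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])   # 6 samples, 2 columns: not square
w_true = np.array([0.5, -2.0])                           # hypothetical coefficients
y = X @ w_true + 0.1 * rng.normal(size=6)

# inv(X) is undefined for a non-square matrix; NumPy raises LinAlgError.
try:
    np.linalg.inv(X)
    inv_failed = False
except np.linalg.LinAlgError:
    inv_failed = True

# Normal equations: the expression from the question, written with the @ operator.
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference least-squares solution.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(inv_failed, w_normal, w_lstsq)
```

In practice `np.linalg.lstsq` (or `np.linalg.solve(X.T @ X, X.T @ y)`) is usually preferred over forming an explicit inverse, for numerical stability.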
As a clarifier, I want to implement cross-correlation, but the machine learning literature keeps referring to it as convolution so I will stick with it. I am trying to implement image convolution using linear algebra. After looking around on the internet and thinking about it, I could come up with two possible solutions for that. The first one: Create an appropriate Toeplitz-like matrix out of the kernel as it is described here. The second one: Instead of the filter, modify …
I have asked this question on Mathematics Stack Exchange, but thought it might be a better fit here: I am currently taking a Data-Analysis course and I learned about both the terms LDA (Linear Discriminant Analysis) and FDA (Fisher's Discriminant Analysis). I almost have the feeling that they are used as synonyms in some places, which obviously is not true. Can someone explain to me how those approaches are related? Since LDA's aim is to reduce dimensionality while preserving …
Background I am currently reading a book called "Mathematics for Machine Learning", and I have read chapter 2, which is about Linear Algebra, in particular subchapter 2.8, which is about Affine Space. From the book, I learned that affine subspaces are points, lines, and planes in $ \mathbb{R}^{3} $, which don't (necessarily) go through the origin. The affine subspace is defined as $$ L = x_{0} + \lambda b_{1} $$ where: $L$ is the affine subspace $x_{0}$ is a support …
I was trying to understand the Lagrangian from the SVM section of Andrew Ng's Stanford CS229 course notes. On pages 17 and 18, he says: Given the problem $$\begin{align} \min_w & \quad f(w) \\ \text{s.t.} & \quad h_i(w)=0, i=1,...,l \end{align}$$, the Lagrangian can be given as follows: $$\mathcal{L}(w,\beta)=f(w)\color{red}{+}\sum_{i=1}^l\beta_ih_i(w)\quad\quad\quad \text{...equation(1)}$$ Here, the $\beta_i$'s are Lagrange multipliers. While referring to the Khan Academy article on Lagrange multipliers, I found it says: Lagrangian is given as: $$ \mathcal{L}(x,y,…,λ)=f(x,y,…)\color{red}{−}λ(g(x,y,…)−c) \quad\quad\quad \text{...equation(2)}$$ Here, $g$ is a constraint and …
I was referring to the SVM section of Andrew Ng's course notes for the Stanford CS229 Machine Learning course. On pages 14 and 15, he says: Consider the picture below: How can we find the value of $\gamma^{(i)}$? Well, $w/\Vert w\Vert$ is a unit-length vector pointing in the same direction as $w$. Since point $A$ represents $x^{(i)}$, we therefore find that the point $B$ is given by $x^{(i)} − \gamma^{(i)}·w/\Vert w\Vert$. But this point lies on the decision boundary, and all points $x$ …
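The quoted passage is cut off, but the geometric step it describes can be checked numerically. Assuming the decision boundary is $w^Tx + b = 0$ and the example is on the positive side ($y = +1$), requiring $B = x - \gamma\, w/\Vert w\Vert$ to lie on the boundary gives $\gamma = (w^Tx + b)/\Vert w\Vert$; multiplying by $y$ covers the negative side as well. The numbers and the function name below are made up for illustration.

```python
import numpy as np

def geometric_margin(w, b, x, y):
    # Signed distance from x to the hyperplane w.x + b = 0, positive when y is on the correct side.
    return y * (w @ x + b) / np.linalg.norm(w)

# Hypothetical hyperplane and a positive example.
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([3.0, 4.0])
y = 1

gamma = geometric_margin(w, b, x, y)
point_b = x - gamma * w / np.linalg.norm(w)   # step back from A along the unit normal
on_boundary = w @ point_b + b                 # should vanish if B is on the boundary
print(gamma, point_b, on_boundary)
```

Here $\Vert w\Vert = 5$ and $w^Tx + b = 20$, so $\gamma = 4$ and $B = (0.6, 0.8)$, which satisfies $w^TB + b = 0$.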