Gradient descent diverges extremely

I manually created a random data set around some mean value and tried to use gradient descent linear regression to predict this simple mean value. I did exactly what the manual describes, yet for some reason my predictor coefficients go to infinity, even though the same approach worked for another case. Why, in this case, can it not predict a simple value of 1.4? clear all; n=10000; t=1.4; sigma_R = t*0.001; min_value_t = t-sigma_R; max_value_t = t+sigma_R; y_data …
Category: Data Science
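
One common cause of this kind of divergence is a step size that is too large. The sketch below is not the asker's code (which is truncated above); it assumes the simplest possible model, a single constant fitted by minimizing mean squared error, and reuses only the values n, t and sigma_R from the question. The function name fit_constant and both learning rates are illustrative assumptions.

```python
import random

random.seed(0)

# Values taken from the question; the sampling scheme (uniform) is an assumption.
n = 10000
t = 1.4
sigma_r = t * 0.001
y_data = [random.uniform(t - sigma_r, t + sigma_r) for _ in range(n)]


def fit_constant(y, learning_rate, steps=100):
    """Fit a single constant theta by gradient descent on (1/2n) * sum((theta - y_i)^2)."""
    theta = 0.0
    for _ in range(steps):
        # Gradient of the loss with respect to theta is theta - mean(y).
        gradient = sum(theta - yi for yi in y) / len(y)
        theta -= learning_rate * gradient
    return theta


# For this loss the error shrinks by a factor |1 - learning_rate| per step,
# so the iteration converges only when 0 < learning_rate < 2.
theta_ok = fit_constant(y_data, learning_rate=0.5)
theta_bad = fit_constant(y_data, learning_rate=2.5)

print("learning_rate=0.5 ->", theta_ok)
print("learning_rate=2.5 ->", theta_bad)
```

With the smaller step the estimate settles near 1.4, while the larger step makes the error grow in magnitude at every iteration. Whether this is what happens in the original code cannot be confirmed from the excerpt.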

Which algorithm to use for transactional data

I'm given a dataset of transactions and asked to find insights for businesses. I'm extremely new to ML / data science and have only been experimenting with KMeans. The dataset has the following features: merchant ID Transaction date Military time Amount card amount paid merchant name Town area code client ID age band gender code province average income 3 months card value spending card tapped. Ignoring NULL data, what type of analysis can I do on this data? I have …
Category: Data Science

What is the most effective unsupervised ML algorithm to use when outliers are present in data set?

I am analyzing a portfolio of about 225 stocks and have gotten data for each of them based on their "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster these stocks based on their attributes into 3 or 4 groups. However, there are substantial outliers in the data set. Instead of removing them altogether I would like to keep them in. What ML algorithm would be best suited for this? I have been told …
Category: Data Science

Finding the tightest (smallest) triangle that fits all points

I'm supposed to find an algorithm that, given a bunch of points on the Euclidean plane, returns the tightest (smallest) origin-centered upright equilateral triangle that contains all the given points, so that if I input some new point, the algorithm returns $+$ if the point is inside the triangle and $-$ if not. Someone suggested that I go over all the possible points and find the point with …
Category: Data Science

Learner Algorithm Time & Sample Complexity

Let $X=R^{2}$. Let $u=\left(\frac{\sqrt{3}}{2},-\frac{1}{2}\right)$, $w=\left(-\frac{\sqrt{3}}{2},-\frac{1}{2}\right)$, $v=\left(0,1\right)$ and $C=H=\left\{h(r)=\left\{(x_{1},x_{2}) \mid (x_{1},x_{2})\cdot u\le r,\ (x_{1},x_{2})\cdot w\le r,\ (x_{1},x_{2})\cdot v\le r\right\}\right\}$ for $r>0$, the set of all origin-centered upright equilateral triangles. Describe a sample complexity algorithm $L$ that learns $C$ using $H$. State the time and sample complexity of your algorithm and prove it. I was faced with this question in a homework assignment and I'm a bit confused. My solution is: Let D be our dataset. Learner Algorithm: maxDistance …
Category: Data Science
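
A minimal sketch of one way to compute the tightest triangle, using the vectors $u$, $w$, $v$ from the question and assuming all three half-plane bounds share the same $r$. This is not the asker's truncated solution; the names support, fit_radius and classify, and the sample points, are hypothetical. The idea is that a point lies in $h(r)$ exactly when its largest dot product with $u$, $w$, $v$ is at most $r$, so the smallest valid $r$ is the maximum of that quantity over the data.

```python
import math

# Normal vectors from the question.
U = (math.sqrt(3) / 2, -0.5)
W = (-math.sqrt(3) / 2, -0.5)
V = (0.0, 1.0)


def support(point):
    """Largest dot product of the point with u, w, v: the smallest r whose triangle contains it."""
    x1, x2 = point
    return max(x1 * n1 + x2 * n2 for n1, n2 in (U, W, V))


def fit_radius(points):
    """Smallest r such that every given point satisfies all three constraints (one pass, O(n))."""
    return max(support(p) for p in points)


def classify(point, r):
    """Return '+' if the point is inside the triangle h(r), '-' otherwise."""
    return "+" if support(point) <= r else "-"


# Hypothetical sample points, for illustration only.
points = [(0.0, 1.0), (0.5, -0.5), (-1.0, 0.0)]
r = fit_radius(points)

print("r =", r)
print(classify((0.0, 0.0), r), classify((0.0, 2.0), r))
```

In a learning setting one would presumably fit r on the positively labeled points only; the sketch fits on every point it is given.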

How does the construction of a decision tree differ for different optimization metrics?

I understand how a decision tree is constructed (in the ID3 algorithm) using criteria such as entropy, Gini index, and variance reduction. But the formulae for these criteria do not care about optimization metrics such as accuracy, recall, AUC, kappa, F1-score, and others. R and Python packages allow me to optimize for such metrics when I construct a decision tree. What do they do differently for each of these metrics? Where does the change happen? Is there a pattern to …
Category: Data Science

Which machine learning algorithms can be used for trajectory classifications?

I am working on a project for clustering air objects based on their trajectories. I would like to train a model on a dataset of different flying objects' trajectories so that later I can predict what type of object it is based on trajectory data. The trajectory data include 4 things (Altitude, Longitude, Latitude, and Time). So based on a set of such data we may be able to classify objects like plane, rocket, missile, etc. What I cannot figure out is …
Category: Data Science

What Framework To Use for Asynchronous Algorithms?

I have a problem with an extremely large dataset (who doesn't?) which is stored in chunks such that there is low variance across chunks (i.e., the chunks are sort of representative). I wanted to play around with algorithms to do some classification in an asynchronous fashion, but I wanted to code it up myself. Sample code would look like: start a master distribute 10 chunks on 10 slaves while some criterion is not met for each s in slave: …
Topic: algorithms
Category: Data Science

Time Complexity notation in Big Data platforms

I am redesigning some of the classical algorithms for the Hadoop/MapReduce framework. I was wondering if there is any established approach for writing Big-O style expressions to measure time complexity. For example, hypothetically, a simple average calculation of n (= 1 billion) numbers is an O(n) + C operation using a simple for loop, or O(log). I am assuming division to be a constant-time operation for the sake of simplicity. If I break this massively parallelizable algorithm up for MapReduce, by dividing data over …
Category: Data Science
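
The averaging example can be sketched in a map/reduce style to make the cost split visible. This is a single-process illustration in plain Python, not Hadoop code; map_chunk, combine and mapreduce_mean are hypothetical names, and the 100-number input stands in for the 1 billion numbers in the question. Under the usual assumption of p equal chunks processed in parallel, each map call does O(n/p) work and the combine step does O(p) work when run sequentially.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarize one chunk as a (sum, count) pair in O(len(chunk))."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce step: merge two partial (sum, count) pairs in O(1)."""
    return (a[0] + b[0], a[1] + b[1])


def mapreduce_mean(chunks):
    """Mean of all numbers across chunks; the single division happens at the end."""
    total, count = reduce(combine, map(map_chunk, chunks), (0, 0))
    return total / count


# Small stand-in data set split into 4 equal chunks.
numbers = list(range(1, 101))
chunks = [numbers[i:i + 25] for i in range(0, len(numbers), 25)]
mean = mapreduce_mean(chunks)

print(mean)
```

Carrying (sum, count) pairs rather than per-chunk averages keeps the result exact even when chunks have different sizes.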

Which algorithms should I use for identifying similar characteristics between data points (the intersections)?

I am working with a dataset that has been coded and categorized, so that each datapoint has a set of coded characteristics. An example data point would be something like the following: Example Data Point: Quality Service & Support Price Each data point can have multiple codes associated with it. What I'm looking to do is identify the "intersections" between the data points so that I can answer questions like the following: When a data point has "Quality" as a …
Category: Data Science
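
One simple starting point for this kind of question is plain co-occurrence counting over sets of codes. The sketch below assumes each data point can be represented as a Python set; only the first data point comes from the question, and the other three, along with the names pair_counts and co_occurring, are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# First entry is the example from the question; the rest are hypothetical.
data_points = [
    {"Quality", "Service & Support", "Price"},
    {"Quality", "Price"},
    {"Service & Support"},
    {"Quality", "Service & Support"},
]

# Count how often each unordered pair of codes appears on the same data point.
pair_counts = Counter()
for codes in data_points:
    for pair in combinations(sorted(codes), 2):
        pair_counts[pair] += 1


def co_occurring(code):
    """Count the other codes that appear on data points carrying the given code."""
    counts = Counter()
    for codes in data_points:
        if code in codes:
            counts.update(codes - {code})
    return counts


print(pair_counts.most_common())
print(co_occurring("Quality"))
```

For larger code vocabularies, frequent itemset mining (for example Apriori-style methods) generalizes the same idea beyond pairs.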

Is it possible to make a label automatically in supervised learning (Machine Learning)?

My background knowledge: Basically, supervised learning is based on labeled data. Using the labeled data, the machine can learn and determine results for unlabeled data. To do that, for example when working with pictures, manpower is needed to crop the raw photos, label them, and put them on the server as the base labeled data. I know it sounds weird, but I'm just curious whether there are any algorithms/systems that make labels automatically for supervised learning.
Category: Data Science

How do I select the "best" unsupervised machine learning algorithm to cluster my specific dataset?

I want to cluster a dataset without prior knowledge of the correct number of clusters. For different algorithms (e.g. k-means, GMM...) I can iterate through different values and try to find the best solution for any given algorithm (e.g. elbow curve, silhouette coefficient etc.). But I get very different results, as expected with different algorithms. K-means is good for spherical clusters, density-based approaches for totally different cluster shapes. Now the actual question: How do I select the "best" unsupervised machine learning …
Category: Data Science

Predicting change of shapes/coordinates

I'm trying to find a way to predict/calculate how a shape (e.g. the outline of a glacier) will change in the future, based on its history (previous shapes) and additional factors (e.g. Δtemperature). In my example: I have the shape/coordinates of a glacier and an average temperature at 1970, 1985, 2000, 2015. How can I estimate what that shape will look like in 2030, based on the previous shapes and a predicted temperature? The shapes would ideally come in …
Category: Data Science

Mixed Data Type Classification / Neighbor Algorithm

Here is a hypothetical simplified dataframe of my problem, which would be low dimensional (20ish features), containing some made-up information about certain dog breeds:

Breed   Min_Weight  Max_Weight  Min_Height  Max_Height  is_friendly  grp
Husky   10          20          30          35          True         working
Poodle  8           17          15          30          False        terrier

The algorithm would receive some information about a dog, and it would need to identify k-closest dog breeds based on the input data. It needs to be high performance. Example: algorithm receives an unknown breed …
Category: Data Science
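
One possible approach for mixed numeric and categorical features is a Gower-style distance: range-scaled absolute differences for numeric columns, simple mismatch for categorical ones, averaged over all columns. The sketch below uses the two breeds from the question; the query dog, the function names, and the choice of Gower-style scaling are assumptions. It is a brute-force search that compares the query against every breed and makes no claim about meeting the performance requirement.

```python
NUMERIC = ["Min_Weight", "Max_Weight", "Min_Height", "Max_Height"]
CATEGORICAL = ["is_friendly", "grp"]

# The two rows from the question's dataframe.
breeds = {
    "Husky": {"Min_Weight": 10, "Max_Weight": 20, "Min_Height": 30,
              "Max_Height": 35, "is_friendly": True, "grp": "working"},
    "Poodle": {"Min_Weight": 8, "Max_Weight": 17, "Min_Height": 15,
               "Max_Height": 30, "is_friendly": False, "grp": "terrier"},
}

# Per-feature range over the known breeds; fall back to 1.0 to avoid dividing by zero.
ranges = {
    f: (max(b[f] for b in breeds.values()) - min(b[f] for b in breeds.values())) or 1.0
    for f in NUMERIC
}


def gower_distance(a, b):
    """Average of range-scaled numeric differences and 0/1 categorical mismatches."""
    total = 0.0
    for f in NUMERIC:
        total += abs(a[f] - b[f]) / ranges[f]
    for f in CATEGORICAL:
        total += 0.0 if a[f] == b[f] else 1.0
    return total / (len(NUMERIC) + len(CATEGORICAL))


def k_nearest(query, k):
    """Names of the k breeds closest to the query under gower_distance."""
    ranked = sorted(breeds, key=lambda name: gower_distance(query, breeds[name]))
    return ranked[:k]


# Hypothetical query dog, not taken from the question.
query = {"Min_Weight": 9, "Max_Weight": 19, "Min_Height": 28,
         "Max_Height": 34, "is_friendly": True, "grp": "working"}

print(k_nearest(query, 1))
```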

Estimating location in a model

I have a big dataset with 10 columns and about 100,000 rows. Every 5 rows represent a person being tracked and the data related to this tracking, such as time, velocity, etc. The last two columns are the longitude and latitude for that person. To test the model, the test set has the fifth row for each person missing in longitude and latitude. What's the best way to approach this problem? For example, the test set looks like: id …
Category: Data Science

How to detect that a sequence of points belongs to some model of a first-order theory?

Assume that every neural network can be recast as a sequence of layers (https://arxiv.org/abs/2106.14587 has a chapter on how to do this). Assume that layer U has N neurons. The set of possible activities of layer U forms an N-dimensional vector space. Each concrete state of layer U (in the sense of activities) can be described by an N-dimensional vector (point) in this space. Assume that the NN functions or learns and assume that some First Order Theory (set of variables and functions and …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.