I see a lot of job opportunities in the field of data science, but I'm not sure of the difference between a data scientist and a deep learning algorithm developer. Can someone explain that to me?
I have manually created a random data set around some mean value, and I have tried to use gradient-descent linear regression to predict this simple mean value. I have done exactly as in the manual, and for some reason my predictor coefficients are going to infinity, even though it worked for another case. Why, in this case, can it not predict a simple 1.4 value? clear all; n=10000; t=1.4; sigma_R = t*0.001; min_value_t = t-sigma_R; max_value_t = t+sigma_R; y_data …
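Since the original MATLAB snippet is truncated, here is a hedged Python reconstruction of the setup that reproduces the symptom: with a squared-error loss, a learning rate that is too large for the data makes the iterates overshoot and blow up, while a smaller rate converges to 1.4. The names and rates below are illustrative, not taken from the question.

```python
import random

random.seed(0)
t = 1.4
sigma = 0.001 * t
y = [t + random.uniform(-sigma, sigma) for _ in range(1000)]

def gradient_descent(y, lr, steps=200):
    """Fit a single constant c to y by minimizing (1/n) * sum (c - y_i)^2."""
    c = 0.0
    n = len(y)
    for _ in range(steps):
        grad = (2.0 / n) * sum(c - yi for yi in y)  # dL/dc
        c -= lr * grad
    return c

# the update is c <- (1 - 2*lr)*c + 2*lr*mean(y), so the multiplier |1 - 2*lr|
# decides convergence: 0.8 for lr=0.1 (converges), -2 for lr=1.5 (diverges)
c_good = gradient_descent(y, lr=0.1)
c_bad = gradient_descent(y, lr=1.5)
```

The same effect is amplified when the features are not rescaled, which is a common reason the "same" code works on one dataset and diverges on another.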
I'm given a dataset of transactions and asked to find insights for businesses. I'm extremely new to ML / data science and have only been experimenting with KMeans. The dataset has the following features: merchant ID, transaction date, military time, amount, card amount paid, merchant name, town, area code, client ID, age band, gender code, province, average income 3 months, card value spending, card tapped. Ignoring NULL data, what type of analysis can I do on this data? I have …
I am analyzing a portfolio of about 225 stocks and have gotten data for each of them based on their "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster these stocks based on their attributes into 3 or 4 groups. However, there are substantial outliers in the data set. Instead of removing them altogether I would like to keep them in. What ML algorithm would be best suited for this? I have been told …
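One option often suggested for outlier-heavy data is k-medoids, since the cluster centers are actual data points and an L1 (Manhattan) distance damps the pull of extremes; a minimal from-scratch sketch with invented stock rows, not the asker's data:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def k_medoids(points, k, iters=20):
    medoids = points[:k]  # naive init: the first k points
    for _ in range(iters):
        # assign each point to its nearest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: manhattan(p, medoids[j]))
            clusters[i].append(p)
        # new medoid = the member minimizing total distance within its cluster
        new = [min(c, key=lambda m: sum(manhattan(m, q) for q in c)) if c else old
               for c, old in zip(clusters, medoids)]
        if new == medoids:
            break
        medoids = new
    return medoids

# made-up (P/E, ROA %, EPS growth %) rows; the last row is a gross outlier
stocks = [(12, 5, 10), (13, 6, 11), (40, 1, 50), (42, 8, 55), (500, -30, 990)]
meds = k_medoids(stocks, k=2)
```

Because medoids must be real data points, a single extreme stock cannot drag a center off into empty space the way it drags a k-means centroid; scaling the three ratios to comparable ranges first would still matter.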
I'm supposed to find an algorithm that, given a bunch of points in the Euclidean plane, returns the tightest (smallest) origin-centered upright equilateral triangle that fits all the given points inside it, such that if I input some random new point, the algorithm will return $+$ if the point is inside the triangle and $-$ if not. Someone has suggested that I go over all the possible points and find the point with …
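One way to make the suggested approach concrete: writing the triangle as an intersection of three half-planes with fixed outward edge normals, the tightest fit is just the largest projection of any training point onto those normals. A hedged sketch; the exact normal vectors are an assumption about the triangle's orientation:

```python
import math

# assumed outward edge normals for an origin-centered upright equilateral
# triangle: each triangle is {x : x·n <= r for all three normals n}
U = (math.sqrt(3) / 2, -0.5)
W = (-math.sqrt(3) / 2, -0.5)
V = (0.0, 1.0)
NORMALS = (U, W, V)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def fit_triangle(points):
    """Smallest r whose triangle contains every given point."""
    return max(dot(p, n) for p in points for n in NORMALS)

def classify(point, r):
    """'+' if the point lies inside the fitted triangle, '-' otherwise."""
    return '+' if all(dot(point, n) <= r for n in NORMALS) else '-'

r = fit_triangle([(0, 1), (0.5, -0.2), (-0.3, 0.1)])
```

Fitting is a single O(n) pass and each query is O(1), so no search over candidate triangles is needed.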
Let $X=\mathbb{R}^{2}$. Let $u=\left(\frac{\sqrt{3}}{2},-\frac{1}{2}\right),\ w=\left(-\frac{\sqrt{3}}{2},-\frac{1}{2}\right),\ v=\left(0,1\right)$ and $C=H=\left\{h\left(r\right)=\left\{\left(x_{1},x_{2}\right)\mid\left(x_{1},x_{2}\right)\cdot u\le r,\ \left(x_{1},x_{2}\right)\cdot w\le r,\ \left(x_{1},x_{2}\right)\cdot v\le r\right\}\right\}$ for $r>0$, the set of all origin-centered upright equilateral triangles. Describe an algorithm $L$ that learns $C$ using $H$. State the time and sample complexity of your algorithm and prove them. I was faced with this question in a homework assignment and I'm a bit confused. My solution is: let $D$ be our dataset. Learner algorithm: maxDistance …
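Since the class is nested in the single parameter $r$, one standard pattern (analogous to learning a threshold on the line) is the tightest-fit ERM learner; a hedged sketch assuming realizability, with $\hat{r}$ and $m$ being my notation rather than the assignment's:

```latex
% Tightest-fit learner: the smallest hypothesis consistent with sample D
\hat{r} \;=\; \max_{(x,\,+)\in D}\ \max\left\{\, x\cdot u,\ x\cdot w,\ x\cdot v \,\right\}
```

Each example is processed in constant time, so the learner runs in $O(m)$ time on $m$ samples. Because $h(\hat{r})\subseteq h(r^{*})$ for the target $r^{*}$, the learner errs only one-sidedly, on points whose largest projection falls in $(\hat{r}, r^{*}]$; an interval-style argument then suggests a sample complexity on the order of $\frac{1}{\varepsilon}\ln\frac{1}{\delta}$.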
I understand how a decision tree is constructed (in the ID3 algorithm) using criteria such as entropy, the Gini index, and variance reduction. But the formulae for these criteria do not care about optimization metrics such as accuracy, recall, AUC, kappa, F1-score, and others. R and Python packages allow me to optimize for such metrics when I construct a decision tree. What do they do differently for each of these metrics? Where does the change happen? Is there a pattern to …
When using the MATLAB command 'fitctree' for classification, if I change the order of the attributes I do not get the same tree, and thus not the same classification error. Why? Does the CART algorithm take into account the order in which the attributes are introduced?
I am working on a project for clustering air objects based on their trajectories. I would like to train a model on a dataset of different flying objects' trajectories so that later I can predict what type of object it is from its trajectory data. The trajectory data include 4 things (altitude, longitude, latitude, and time). Based on a set of such data we may be able to classify objects as planes, rockets, missiles, etc. What I cannot figure out is …
I have a problem with an extremely large dataset (who doesn't?) which is stored in chunks such that there is low variance across chunks (i.e., the chunks are sort of representative). I wanted to play around with algorithms to do some classification in an asynchronous fashion, but I wanted to code it up myself. A sample code would look like:

start a master
distribute 10 chunks on 10 slaves
while some criterion is not met
    for each s in slave: …
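A minimal runnable sketch of the loop above, assuming a toy model (a single least-squares weight) and using threads as stand-ins for the 10 slaves; all names, data, and rates here are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
import random

random.seed(1)
TRUE_W = 3.0
# 10 representative chunks of (x, y) pairs with y ~= TRUE_W * x
chunks = [[(x, TRUE_W * x + random.gauss(0, 0.01)) for x in range(20)]
          for _ in range(10)]

def chunk_gradient(w, chunk):
    """Gradient of mean squared error (1/n) * sum (w*x - y)^2 on one chunk."""
    n = len(chunk)
    return sum(2 * (w * x - y) * x for x, y in chunk) / n

w = 0.0
with ThreadPoolExecutor(max_workers=10) as pool:
    for _ in range(100):  # "while some criterion is not met"
        # each "slave" computes a gradient on its own chunk
        grads = list(pool.map(lambda c: chunk_gradient(w, c), chunks))
        # the master averages the gradients and takes a step
        w -= 0.001 * (sum(grads) / len(grads))
```

This version synchronizes every round; a truly asynchronous variant would let the master apply each slave's gradient as it arrives, at the cost of stale updates.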
I am redesigning some of the classical algorithms for the Hadoop/MapReduce framework. I was wondering if there is any established approach for denoting Big-O-style expressions to measure time complexity? For example, hypothetically, a simple average calculation of n (= 1 billion) numbers is an O(n) + C operation using a simple for loop, or O(log n); I am assuming division to be a constant-time operation for the sake of simplicity. If I break this massively parallelizable algorithm up for MapReduce, by dividing data over …
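To make the map/reduce decomposition of the average concrete, here is a toy sequential simulation: each of p mappers touches n/p numbers (O(n/p) wall-clock per mapper when run in parallel), and the reduce combines p partial (sum, count) pairs, which is O(p) sequentially or O(log p) as a tree reduction. The data and p are arbitrary:

```python
def map_phase(chunk):
    """Emit one (sum, count) pair per chunk: O(len(chunk)) work."""
    return (sum(chunk), len(chunk))

def reduce_phase(partials):
    """Combine p partial pairs into the global average: O(p) work."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

data = list(range(1, 101))               # 1..100, true average 50.5
p = 10
chunks = [data[i::p] for i in range(p)]  # split across p "mappers"
avg = reduce_phase([map_phase(c) for c in chunks])
```

So a reasonable parallel cost notation for this job is O(n/p + p), or O(n/p + log p) with a combiner tree, which is the kind of expression the PRAM and BSP literature uses.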
I am working with a dataset that has been coded and categorized, so that each data point has a set of coded characteristics. An example data point would be something like the following: Quality; Service & Support; Price. Each data point can have multiple codes associated with it. What I'm looking to do is identify the "intersections" between the data points so that I can answer questions like the following: when a data point has "Quality" as a …
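A small sketch of one way to compute those intersections with plain pair counting (the example points are invented): count each unordered pair of codes per data point, and conditional rates such as "of the points coded Quality, what fraction are also coded Service & Support" fall out of the counts:

```python
from itertools import combinations
from collections import Counter

# each data point is the set of codes assigned to it
points = [
    {"Quality", "Service & Support", "Price"},
    {"Quality", "Price"},
    {"Service & Support"},
    {"Quality", "Service & Support"},
]

code_counts = Counter()
pair_counts = Counter()
for codes in points:
    code_counts.update(codes)
    # count every unordered pair of codes appearing together on one point
    pair_counts.update(frozenset(p) for p in combinations(sorted(codes), 2))

def cooccurrence_rate(a, b):
    """Fraction of points containing code `a` that also contain code `b`."""
    return pair_counts[frozenset((a, b))] / code_counts[a]
```

For larger code vocabularies the same idea generalizes to association-rule mining (support/confidence/lift, e.g. the Apriori algorithm).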
My background knowledge: basically, supervised learning is based on labeled data. Using the labeled data, the machine can study and determine results for unlabeled data. To do that, for example, if we are handling an image problem, manpower is essentially needed to crop the raw photos, label them, and upload them to the server as the fundamental labeled data. I know it sounds weird, but I'm just curious whether there are any algorithms/systems to create labels automatically for supervised learning.
I want to cluster a dataset without prior knowledge of the correct number of clusters. For different algorithms (e.g. k-means, GMM, ...) I can iterate through different values and try to find the best solution for any given algorithm (e.g. elbow curve, silhouette coefficient, etc.). But I get very different results, as expected, with different algorithms: k-means is good for spherical clusters, density-based approaches for totally different cluster shapes. Now the actual question: how do I select the "best" unsupervised machine learning …
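One common way to compare clusterings produced by different algorithms on the same data is an internal index such as the silhouette coefficient, which is computed from the labels and distances alone, regardless of which algorithm produced the labels; a minimal from-scratch sketch on toy 2-D points:

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean_dist(p, group):
    return sum(dist(p, q) for q in group) / len(group)

def silhouette(points, labels):
    """Mean silhouette over all points; near 1 is good, below 0 is bad."""
    score = 0.0
    for p, lab in zip(points, labels):
        own = [q for q, l in zip(points, labels) if l == lab and q is not p]
        a = mean_dist(p, own) if own else 0.0          # cohesion
        b = min(mean_dist(p, [q for q, l in zip(points, labels) if l == o])
                for o in set(labels) if o != lab)      # separation
        score += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return score / len(points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(points, [0, 0, 1, 1])  # respects the two tight groups
bad = silhouette(points, [0, 1, 0, 1])   # mixes clusters across the gap
```

Caveat: internal indices embed their own geometric bias (silhouette favors compact, separated clusters), so they complement rather than replace domain judgment when comparing, say, k-means against a density-based method.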
I'm trying to find a way to predict/calculate how a shape (e.g. the outline of a glacier) will change in the future, based on its history (previous shapes) and additional factors (e.g. Δtemperature). In my example: I have the shape/coordinates of a glacier and an average temperature at 1970, 1985, 2000, 2015. How can I give an estimate of what that shape will look like in 2030, based on the previous shapes and a predicted temperature? The shapes would ideally come in …
Here is a hypothetical simplified dataframe of my problem, which would be low-dimensional (20-ish features), containing some made-up information about certain dog breeds:

Breed   Min_Weight  Max_Weight  Min_Height  Max_Height  is_friendly  grp
Husky   10          20          30          35          True         working
Poodle  8           17          15          30          False        terrier

The algorithm would receive some information about a dog, and it would need to identify the k-closest dog breeds based on the input data. It needs to be high performance. Example: the algorithm receives an unknown breed …
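A hedged sketch of the k-closest lookup with brute-force Euclidean distance over the numeric columns (the extra breed row is made up; real use would normalize the features and might swap in a KD-tree or an approximate-nearest-neighbor index for speed):

```python
# made-up (Min_Weight, Max_Weight, Min_Height, Max_Height) rows per breed
breeds = {
    "Husky":  (10, 20, 30, 35),
    "Poodle": (8, 17, 15, 30),
    "Beagle": (9, 14, 33, 38),  # invented extra row for illustration
}

def k_closest(query, k=2):
    """Return the k breed names nearest to the query feature vector."""
    def d(feats):
        return sum((a - b) ** 2 for a, b in zip(query, feats)) ** 0.5
    return sorted(breeds, key=lambda name: d(breeds[name]))[:k]

nearest = k_closest((10, 19, 31, 34), k=2)
```

With 20-ish features and a modest number of breeds, even brute force is microseconds per query; categorical columns like is_friendly and grp would need encoding (or a mixed-type distance such as Gower) before they can contribute.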
I have a big dataset with 10 columns and about 100,000 rows. Every 5 rows represent a person being tracked and the data related to this tracking, such as time, velocity, etc. The last two columns are the longitude and latitude for that person. To test the model, the test set has the fifth row for each person missing its longitude and latitude. What's the best way to approach this problem? For example, the test set looks like: id …
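One simple baseline for the missing fifth row, sketched under the assumption that each person's position drifts roughly linearly over their five tracked rows (the times and longitudes below are invented):

```python
def linear_extrapolate(ts, ys, t_new):
    """Least-squares line through (ts, ys), evaluated at t_new."""
    n = len(ts)
    mt = sum(ts) / n
    my = sum(ys) / n
    slope = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
             / sum((t - mt) ** 2 for t in ts))
    return my + slope * (t_new - mt)

# four observed rows for one person; the fifth row's longitude is missing
times = [1, 2, 3, 4]
lons = [30.0, 30.2, 30.4, 30.6]
lon5 = linear_extrapolate(times, lons, 5)
```

A learned model (e.g. gradient boosting or an RNN over the four known rows, with velocity as a feature) can then be judged against this per-person extrapolation baseline.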
df11[['COMPONENT_ID','FIRMWARE','SERIAL','CRP0_VDDN']].head() Consider that I have these four columns to analyse. I want to form, say, 3-5 clusters of COMPONENT_IDs with similar characteristics. I want this to happen based on the remaining features, or just CRP0_VDDN in relation to the COMPONENT_IDs. How can I do this?
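A hedged sketch of one route, with invented values: aggregate CRP0_VDDN per COMPONENT_ID (the mean here) and run a tiny 1-D k-means on those aggregates; in practice one would standardize and run sklearn's KMeans over all the remaining numeric features instead of this from-scratch loop:

```python
# invented (COMPONENT_ID, CRP0_VDDN) rows standing in for df11
rows = [("C1", 1.01), ("C1", 0.99), ("C2", 1.02), ("C2", 1.00),
        ("C3", 3.10), ("C3", 3.05), ("C4", 5.00), ("C4", 5.10)]

# aggregate the measurement per component (mean over its rows)
per_comp = {}
for cid, v in rows:
    per_comp.setdefault(cid, []).append(v)
means = {cid: sum(vs) / len(vs) for cid, vs in per_comp.items()}

def kmeans_1d(values, centers, iters=20):
    """Plain 1-D k-means: assign to nearest center, recompute, repeat."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for v in values:
            groups[min(centers, key=lambda c: abs(c - v))].append(v)
        centers = sorted(sum(g) / len(g) if g else c
                         for c, g in groups.items())
    return centers

centers = kmeans_1d(list(means.values()), centers=[0.0, 3.0, 6.0])
assignment = {cid: min(range(len(centers)),
                       key=lambda i: abs(centers[i] - m))
              for cid, m in means.items()}
```

The assignment dict then gives one cluster label per COMPONENT_ID, which can be joined back onto df11.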
I am trying to interpret a black box model. This model is a random forest that I am using to make predictions. I have read that LIME is a way to interpret black box models, but I don't quite know how to interpret the following graphs: If someone could help me to interpret them or tell me how to do it, it would be of great help. Thank you.
Assume that every neural network can be recast as a sequence of layers (https://arxiv.org/abs/2106.14587 has a chapter on how to do this). Assume that layer U has N neurons. The set of possible activities of layer U forms an N-dimensional vector space. Each concrete state of layer U (in the sense of activities) can be described by an N-dimensional vector (point) in this space. Assume that the NN performs inference or learns, and assume that some first-order theory (set of variables and functions and …