I have a dataset of a couple of EV charging stations (10 min frequency) over 1 year. This data consists of lots of 0s, since there is no continuous flow of cars coming to charge, but rather recurring charging events that appear as peaks (for example, 7-9 am seems to be a frequent charging timeframe, when people arrive at the office). I have also aggregated weather and weekday/holiday data to be used as features. I now wish to predict the energy …
I'm wondering whether the approach I have in mind could even work. I want to use dictionary learning for image classification. The first step would be to learn a dictionary from a set of similar yet different images, so that the background can be extracted from an image. For example, I have a set of images (e.g., 500 photos) of the same object, but the scenes differ (lighting, the angle the photo was taken at, etc.). Basically, the main object is the …
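As a rough illustration of the first step described above, here is a minimal sketch of learning a patch dictionary with scikit-learn's MiniBatchDictionaryLearning. The image array, patch size, number of atoms and all other parameter values are placeholders, not anything implied by the question:

    # Hypothetical sketch: learn a dictionary of 8x8 patches from a stack of
    # grayscale images; "images", patch size and all parameters are placeholders.
    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning
    from sklearn.feature_extraction.image import extract_patches_2d

    images = [np.random.rand(64, 64) for _ in range(20)]  # stand-in for the ~500 photos

    patches = np.vstack([
        extract_patches_2d(img, (8, 8), max_patches=200, random_state=0).reshape(-1, 64)
        for img in images
    ])
    patches -= patches.mean(axis=0)  # centre the patches

    dico = MiniBatchDictionaryLearning(n_components=100, alpha=1.0,
                                       batch_size=256, random_state=0)
    dictionary = dico.fit(patches).components_  # 100 learned atoms of size 8x8

    # Sparse codes for a new image's patches; a reconstruction from these codes
    # could then be compared against the original to separate the background.
    codes = dico.transform(patches[:10])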
I have a pandas data frame with about a million rows and 3 columns. The columns are of 3 different datatypes: NumberOfFollowers is numerical, UserName is categorical, and Embeddings is a categorical-set type.

df:

    Index  NumberOfFollowers  UserName  Embeddings     Target Variable
    0      15                 name1     [0.5 0.3 0.2]  0
    1      4                  name2     [0.4 0.2 0.4]  1
    2      8                  name3     [0.5 0.5 0.0]  0
    3      10                 name1     [0.1 0.0 0.9]  0
    ...    ...                ...       ...            ..

I would …
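A minimal sketch of the frame described above, with one possible way of turning it into purely numeric features (expanding the Embeddings column and one-hot encoding UserName); the expansion step is an assumption about the intent, since the question is truncated:

    # Toy reproduction of the described frame, with an assumed preprocessing step.
    import pandas as pd

    df = pd.DataFrame({
        "NumberOfFollowers": [15, 4, 8, 10],
        "UserName": ["name1", "name2", "name3", "name1"],
        "Embeddings": [[0.5, 0.3, 0.2], [0.4, 0.2, 0.4], [0.5, 0.5, 0.0], [0.1, 0.0, 0.9]],
        "Target": [0, 1, 0, 0],
    })

    # Expand the embedding vectors into separate numeric columns (assumed intent).
    emb = pd.DataFrame(df["Embeddings"].tolist(),
                       columns=[f"emb_{i}" for i in range(3)])
    X = pd.concat([df[["NumberOfFollowers"]],
                   pd.get_dummies(df["UserName"], prefix="user"),
                   emb], axis=1)
    y = df["Target"]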
Suppose I have a continuous response variable y and a very large matrix of Boolean sparse predictor variables X. What would be the best regression method to use?
I have been reading about weight sparsity and activity sparsity with regard to convolutional neural networks. Weight sparsity I understand as having more trainable weights be exactly zero, which essentially means having fewer connections, allowing for a smaller memory footprint and quicker inference on test data. Additionally, it should help against overfitting (which I understand in terms of smaller weights leading to simpler models / Ockham's razor). From what I understand now, activity sparsity is analogous in that it would lead …
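One way to make the distinction concrete is the standard L1 regularization mechanism in Keras: a penalty on the weights encourages weight sparsity, while a penalty on a layer's output encourages activity sparsity. The sketch below is only an illustration of that difference; the layer sizes and penalty strengths are arbitrary placeholders:

    # Illustrative only: L1 on the kernel pushes weights toward zero (weight
    # sparsity); L1 on the layer output pushes activations toward zero
    # (activity sparsity).
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    model = keras.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1),
                      kernel_regularizer=regularizers.l1(1e-4)),    # weight sparsity
        layers.Conv2D(32, 3, activation="relu",
                      activity_regularizer=regularizers.l1(1e-4)),  # activity sparsity
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")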
Hopefully I'm in the right place for my question: I'm looking for suggestions for models to classify multivariate time series. I'm trying to find a way of classifying the behaviour of motors into "good" or "bad" based on current measurements. I found many possible examples to use (for example, in the library sktime), but my biggest problem is that the dataset I have captured is incredibly small because of difficulties in the testing environment. The dataset …
I am experimenting with a dimensionality reduction step prior to clustering for a pretty large sparse binary matrix of almost 3,000 columns and 50,000 rows. My idea is to embed the 3,000 dimensions into a two-dimensional space with UMAP and then cluster the resulting 50,000 two-dimensional points with HDBSCAN. I've found that UMAP accepts a number of options, such as the metric, n_neighbors, min_dist and spread, but I cannot figure out what the best combination would be to give me distinct …
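For reference, a bare-bones sketch of the pipeline described above with the umap-learn and hdbscan packages. The parameter values are only starting-point guesses (e.g., the Jaccard metric is often suggested for binary data), not known-good settings, and the toy matrix stands in for the real 50,000 x 3,000 data:

    # Sketch of UMAP -> HDBSCAN on a binary matrix; all parameters are guesses.
    import numpy as np
    import umap
    import hdbscan

    X = np.random.binomial(1, 0.05, size=(5000, 300))  # toy stand-in

    embedding = umap.UMAP(n_components=2,
                          metric="jaccard",   # a common choice for binary features
                          n_neighbors=30,
                          min_dist=0.0,
                          random_state=42).fit_transform(X)

    labels = hdbscan.HDBSCAN(min_cluster_size=50,
                             min_samples=10).fit_predict(embedding)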
I have a matrix with sparse data; a small extract from it is shown below. The columns represent years and the rows represent different race tracks. The feature values are velocities on that specific track in a specific year. Generally the velocity increases with the year, but that is not necessarily true. As seen below, the matrix is sparse, and for some tracks I only have values for a single year. How can one most accurately predict the missing values? I …
I have a sparse matrix, $X$, created by TfidfVectorizer; its size is $(500000, 200000)$. I want to convert $X$ to a data frame, but I always get a memory error. I tried pd.DataFrame(X.toarray(), columns=tokens) and pd.read_csv(X.toarray().astype("float32"), columns=tokens, chunksize=...), and it seems that whenever I convert $X$ to a numpy array using X.toarray(), I get an error. Can someone tell me an easy solution for this? Is there any way I can create a sparse dataframe from $X$ without …
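Pandas does support sparse-backed frames built directly from a scipy matrix, which avoids materializing the dense array; a minimal sketch (assuming a reasonably recent pandas and scikit-learn, and toy documents in place of the real corpus):

    # Sketch: build a sparse-backed DataFrame from the scipy matrix without toarray().
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["some example text", "another example document"]  # placeholder corpus
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)            # scipy CSR matrix
    tokens = vec.get_feature_names_out()

    df = pd.DataFrame.sparse.from_spmatrix(X, columns=tokens)
    print(df.sparse.density)               # fraction of explicitly stored values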
There are 4 datasets (all in csv format), each with a uniqueID column by which each record can be identified. The image and text datasets are dense (they need to be converted to ndarrays). Can someone suggest how to use all 4 of these datasets to build a regression model? This is how the datasets look. Metadata, with some input features and the target variable (views):

    uniqueID  ad_blocked  embed  duration  language  hour  views
    1         True        True   68        3         10    244
    2         False       True   90        1         …
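One straightforward way to use all four tables is to join them on uniqueID before modelling; a rough sketch with pandas, where the file names and the non-metadata column contents are placeholders, not anything given in the question:

    # Hypothetical sketch: join the four csv files on uniqueID, then split X/y.
    import pandas as pd

    meta  = pd.read_csv("metadata.csv")         # includes the target "views"
    image = pd.read_csv("image_features.csv")   # placeholder for the image table
    text  = pd.read_csv("text_features.csv")    # placeholder for the text table
    other = pd.read_csv("other_features.csv")   # placeholder for the fourth table

    df = (meta.merge(image, on="uniqueID", how="inner")
              .merge(text,  on="uniqueID", how="inner")
              .merge(other, on="uniqueID", how="inner"))

    y = df["views"]
    X = df.drop(columns=["uniqueID", "views"]).to_numpy()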
I have a big dataset with a column "clientid" and a categorical column "choice". I want to find out which clients have strange (i.e., less frequent) combinations of choices, and to be able to identify new strange combinations of future clients immediately.

    clientid  choice
    cl1       a
    cl2       b
    cl2       c
    cl3       d
    cl4       b
    cl4       c

If I transpose the table by clientid, I have a row for each client and different columns based on …
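For the pivoting step described above, a small sketch of how the transposed table might be built with pandas and how often each exact combination occurs could then be counted; the rarity-by-frequency part is only one possible reading of the truncated question:

    # Sketch: one row per client, one indicator column per choice, then count
    # how many clients share each exact combination so rare ones can be flagged.
    import pandas as pd

    df = pd.DataFrame({"clientid": ["cl1", "cl2", "cl2", "cl3", "cl4", "cl4"],
                       "choice":   ["a",   "b",   "c",   "d",   "b",   "c"]})

    wide = (pd.crosstab(df["clientid"], df["choice"]) > 0).astype(int)

    # Frequency of each combination across clients (low counts = unusual).
    combo_counts = wide.groupby(list(wide.columns)).size()
    print(wide)
    print(combo_counts)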
I'm following a guide here to implement image segmentation in Keras. One thing I'm confused about is these lines:

    # Ground truth labels are 1, 2, 3. Subtract one to make them 0, 1, 2:
    y[j] -= 1

The ground truth targets are .png files with either 1, 2 or 3 at a particular pixel position to indicate the following pixel annotations: 1: Foreground, 2: Background, 3: Not classified. When I remove this -1, my sparse_categorical_crossentropy values come out as nan during …
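For context, sparse_categorical_crossentropy expects integer class indices in the range [0, num_classes); a tiny illustration of what the shift does (the out-of-range class "3" landing outside the 3 softmax outputs is a plausible source of the nan losses, though the guide itself does not spell that out):

    # With 3 classes, Keras expects labels 0, 1, 2; the raw masks use 1, 2, 3.
    import numpy as np

    y = np.array([[1, 2], [3, 1]], dtype="uint8")   # raw mask values 1..3
    y = y - 1                                       # now 0..2, valid class indices
    assert y.min() >= 0 and y.max() <= 2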
I am looking for a metric for comparing gene count tables. These are long columns of data (a few million genes by a few dozen samples), with all non-negative entries, about 90% of which are zeros. The goal is to compare the performance of several tools/algorithms that these tables originate from, by comparing the resulting tables among themselves or with the expected counts (in the case of simulated data). In principle, one compares on a sample-by-sample basis, but comparing different …
Given that I have a very sparse data matrix with continuous features, like this dataframe for example:

    Feature_A  Feature_B  Feature_C  ...  Feature_Z
    0.3        0          0.1             0
    0.5        0.5        0               0
    0          0          1.0             0
    1.0        0          0               0
    0.7        0          0               0
    1.0        0          0               0
    0.1        0          0.22            0.43

what is the best way to perform unsupervised anomaly detection on this kind of data? My initial idea was to perform some kind of dimensionality reduction first (e.g., SVD or NMF) and then …
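A small sketch of the first step described above, using TruncatedSVD (which operates on sparse input directly), followed by one possible detector on the reduced representation; IsolationForest is only an example choice, not something implied by the truncated question:

    # Sketch: reduce the sparse matrix, then score points in the reduced space.
    from scipy import sparse
    from sklearn.decomposition import TruncatedSVD
    from sklearn.ensemble import IsolationForest

    X = sparse.random(1000, 26, density=0.1, format="csr", random_state=0)  # toy data

    Z = TruncatedSVD(n_components=10, random_state=0).fit_transform(X)
    scores = IsolationForest(random_state=0).fit(Z).decision_function(Z)
    # lower scores = more anomalous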
I really do not understand what this code does:

    M = sparse.coo_matrix(([1]*n, (Y, range(n))), shape=(k,n)).toarray()

The code is related to calculating the sparse term in this equation, but I am really confused; I do not know how it iterates, or what the following are:

1. sparse.coo_matrix
2. (Y, range(n))
3. shape=(k,n)).toarray()

Also, what exactly does this term mean in the equation, and how should it be interpreted in code? Thank you, and please forgive my poor English.
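A small worked example of the expression: coo_matrix((data, (rows, cols)), shape) places data[i] at position (rows[i], cols[i]), so this builds a k x n matrix with a 1 at (Y[i], i), i.e., column i is the one-hot encoding of label Y[i]. If the missing term in the equation is the usual indicator 1{y(i) = j} from a softmax/cross-entropy loss (the image did not survive here, so this is an assumption), the matrix is exactly that indicator with classes as rows and samples as columns:

    import numpy as np
    from scipy import sparse

    Y = np.array([2, 0, 1, 2])   # class label of each of the n samples
    n = len(Y)                   # n = 4 samples
    k = 3                        # k = 3 classes

    # Places a 1 at (Y[i], i) for every i; everything else stays 0.
    M = sparse.coo_matrix(([1] * n, (Y, range(n))), shape=(k, n)).toarray()
    print(M)
    # [[0 1 0 0]
    #  [0 0 1 0]
    #  [1 0 0 1]]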
I am attempting to train an autoencoder on data that is extremely sparse. Each datapoint consists only of zeros and ones and contains ~3% ones. Because the data is mostly zero, the autoencoder learns to guess zero every time. Is there a way to prevent this from happening? For context, this is extremely sparse data when you consider that the number of features is over 865,000.
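One commonly suggested adjustment for this situation (included here only as an illustrative sketch, not a confirmed fix) is a reconstruction loss that up-weights the rare 1s, so predicting all zeros is no longer a cheap solution. The weight of 30 (roughly 1/0.03) and the tiny layer sizes below are placeholders:

    # Sketch: autoencoder with a positively weighted binary cross-entropy.
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    def weighted_bce(pos_weight=30.0):
        def loss(y_true, y_pred):
            eps = keras.backend.epsilon()
            y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
            # Errors on the 1s cost pos_weight times more than errors on the 0s.
            per_element = -(pos_weight * y_true * tf.math.log(y_pred)
                            + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
            return tf.reduce_mean(per_element, axis=-1)
        return loss

    inputs = keras.Input(shape=(1000,))            # stand-in for the ~865k features
    encoded = layers.Dense(64, activation="relu")(inputs)
    decoded = layers.Dense(1000, activation="sigmoid")(encoded)
    autoencoder = keras.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss=weighted_bce(30.0))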
I'm very new to data science and still trying to get to grips with it. The problem I'm trying to tackle is this: we have a pool of footballers from a league, and data objects representing a group of 11 footballers for a given match and the number of goals scored by that team in that match. The goal is to estimate the number of goals that could potentially be scored by any random line-up of footballers from this pool. This …
I was reading this article, https://www.di.ens.fr/~aspremon/PDF/CovSelSIMAX.pdf, whose goal is to estimate the covariance matrix from the sample covariance matrix drawn from a distribution $X$: "Given a sample covariance matrix, we solve a maximum likelihood problem penalized by the number of nonzero coefficients in the inverse covariance matrix. Our objective is to find a sparse representation of the sample data and to highlight conditional independence relationships between the sample variables." The likelihood problem is only for the case where …
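For reference, and as I read the paper's setup (constants dropped, so this is a paraphrase rather than a quoted formula), the penalized maximum-likelihood problem it describes is roughly

$$\max_{X \succ 0} \; \log\det X - \operatorname{tr}(SX) - \rho\,\mathbf{Card}(X),$$

where $S$ is the sample covariance matrix, $X$ the estimate of the inverse covariance matrix, $\mathbf{Card}(X)$ the number of nonzero entries of $X$, and $\rho > 0$ the penalty weight; the paper then works with a convex relaxation that replaces $\mathbf{Card}(X)$ by $\sum_{ij} |X_{ij}|$.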
I have a largely uncorrelated feature space of about 40 dichotomous features, with which I'm trying to predict a continuous target variable. Some of these features are very sparse (active less than 10% of the time, with zeros the rest of the time), but the few times that these features are active they may be really good predictors of the target. In most algorithms, these features will be mostly ignored because of how sparse they are, despite their predictive ability. What …