Importing Excel format data into R/R Studio and using glmnet package?

I have no problem importing Excel-formatted data into R/RStudio and using it with all the other R packages I work with. But when I use the glmnet package to develop a regularization model, I invariably run into the following error (after specifying my regularization model and attempting to run it): Error in storage.mode(y) <- "double": (list) object cannot be coerced to type 'double'. Here is what I have already tried to resolve this: De-format the numbers in Excel (no …
Category: Data Science

sklearn FutureWarning message when running a CNN model

When I run my model, I receive the following warning message: FutureWarning: Pass classes=[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24], y=[20 16 4 ... 2 2 2] as keyword args. From version 0.25 passing these as positional arguments will result in an error. I assume I need to pass them as keyword args. I am new to the …
Category: Data Science

Making Sense of this Error Message

I am using a book and a video to learn how to use the KNN method to classify movies according to their genres. This is my code: import numpy as np import pandas as pd r_cols = ['user_id', 'movie_id', 'rating'] ratings = pd.read_csv('C:/Users/dell/Downloads/DataScience/DataScience-Python3/ml-100k/u.data', sep='\t', engine='python', names=r_cols, usecols=range(3)) # The file is u.data from MovieLens print(ratings.head()) movieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]}) print(movieProperties.head()) movieNumRatings = pd.DataFrame(movieProperties['rating']['size']) movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x))) print(movieNormalizedNumRatings.head()) movieDict = {} with open('C:/Users/dell/Downloads/DataScience/DataScience-Python3/ml-100k/u.item') as …
Category: Data Science

How to create a system to detect text structure of a file?

Let's say I want to create a machine learning system trained on a large number of log files of a few known types (F1, F2, ..., Fn), and I get a new log file that may contain errors or missing data. How do I classify it into one of these types, or flag it as an anomaly if it doesn't belong to any of them? I thought about anomaly detection but couldn't figure out how to parse structure information from the text classes like (F1, …
Category: Data Science

Characterizing errors in rotation and translation while estimating camera pose from images

Has anyone characterized the errors in rotation and translation while estimating camera pose of natural images using SFM or visual odometry? I mean, when estimating camera pose, what is the typical amount of error in rotation and translation that one can expect? Any references on errors in odometry sensors are also welcome.
Category: Data Science

How to include the sudden peaks/bursts in LSTM based time-series model's training

I am using an LSTM for time-series prediction, taking the past 50 values as my input. The predictions are only passable and not accurate, especially for the peaks. How can I train my model to take the peaks into account so that it predicts more accurately (if not exactly)? The model summary and the results are as below:
Category: Data Science

k nearest neighbors method, temporal trend in error

I have a set of data for which I was asked to build a $k$-nearest neighbors algorithm, which I have just finished building. There is one question about the data that I do not understand: Do you notice any spatial or temporal trends in error? I am not sure how to proceed in answering that question. Any suggestions would be appreciated.
Category: Data Science

ImportError: Pandas requires version '0.3.0' or newer of 's3fs'

I'm trying to read files from S3 using boto3 and pandas in Anaconda, but I get the following error: ImportError: Pandas requires version '0.3.0' or newer of 's3fs' (version '0.1.6' currently installed). How can I update the s3fs version? This is my code: import boto3 import pandas as pd s3 = boto3.resource('s3') bucket= s3.Bucket('bucketname') files = list(bucket.objects.all()) files objects = bucket.objects.filter(Prefix='bucketname/') objects = bucket.objects.filter(Prefix="Teste/") file_list = [] for obj in objects: df = pd.read_csv(f's3://bucketname/{obj.key}') file_list.append(df) final_df = pd.concat(file_list) print (final_df.head(4))
Category: Data Science

Comparing RMSEs of multiple test sets having different sizes

The data I have is time series data (stock returns), and I am training a Random Forest Regressor on it. Total observations = 2499. To better evaluate the performance, I have implemented rolling-window testing with training window sizes = 500, 700, 900, ..., 2100. Though instinctively it would seem obvious to choose the window size that produced the lowest RMSE, how can I be sure that the comparison is fair? I mean with increasing window size, the test set size …
Category: Data Science
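
One possible way to make such a comparison fairer, sketched here under assumptions the question does not state: score every window size only on the test observations that all window sizes can predict, so each RMSE averages over the same points. The function names, the dictionary layout, and the toy error values are all hypothetical.

```python
import math


def rmse(errors):
    """Root mean squared error of a sequence of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rmse_on_common_period(errors_by_window):
    """Score each window size on the time indices shared by all of them.

    errors_by_window maps a training-window size to a dict of
    {time_index: prediction_error} (a hypothetical layout).
    """
    common = set.intersection(*(set(errs) for errs in errors_by_window.values()))
    if not common:
        raise ValueError("no test observations shared by all window sizes")
    return {
        window: rmse([errs[t] for t in sorted(common)])
        for window, errs in errors_by_window.items()
    }


# Toy errors (made up): the smaller window can also predict earlier points,
# but only indices 700 and 701 are predicted by both window sizes.
errors_by_window = {
    500: {500: 0.2, 501: -0.1, 700: 0.3, 701: -0.3},
    700: {700: 0.1, 701: -0.2},
}
scores = rmse_on_common_period(errors_by_window)
```

The trade-off is that the shared period is only as long as the shortest test set, so the estimate is based on fewer observations.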

Operands Could not be Broadcast with Shapes (19,)(0,)

I have googled and read about problems similar to mine, but I still do not know how to fix the error I get from this particular code: import operator def getNeighbors(movieID, K): distances = [] for movie in movieDict: if (movie != movieID): dist = ComputeDistance(movieDict[movieID], movieDict[movie]) distances.append((movie, dist)) distances.sort(key=operator.itemgetter(1)) neighbors = [] for x in range(K): neighbors.append(distances[x][0]) return neighbors K = 10 avgRating = 0 neighbors = getNeighbors(1, K) ValueError: operands could not be broadcast together …
Category: Data Science
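
A minimal pure-Python sketch of the same nearest-neighbor lookup, not the asker's actual code. The shapes (19,) and (0,) in the title suggest that one of the two vectors being compared is empty, so this version checks lengths before computing a distance. The Euclidean `compute_distance`, the `features` dictionary, and its values are hypothetical stand-ins.

```python
import math
import operator


def compute_distance(a, b):
    """Hypothetical stand-in: Euclidean distance between equal-length vectors."""
    if len(a) != len(b):
        # Fail early with a readable message instead of a broadcast error.
        raise ValueError(f"feature vectors differ in length: {len(a)} vs {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def get_neighbors(item_id, k, features):
    """Return the ids of the k items closest to item_id."""
    distances = [
        (other, compute_distance(features[item_id], features[other]))
        for other in features
        if other != item_id
    ]
    distances.sort(key=operator.itemgetter(1))
    # Slicing avoids an IndexError when fewer than k other items exist.
    return [other for other, _ in distances[:k]]


# Made-up feature vectors keyed by item id.
features = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [0.0, 3.0], 4: [5.0, 5.0]}
neighbors = get_neighbors(1, 2, features)
```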

Xgboost fit won't recognize my custom eval_metric. Why?

Do you know why my custom_eval_metric doesn't work? I get the error: XGBoostError: [07:56:32] C:\Users\Administrator\workspace\xgboost-win64_release_1.4.0\src\metric\metric.cc:49: Unknown metric function custom_eval_metric def custom_eval_metric(preds, dtrain): labels = dtrain.get_label() preds = preds.reshape(-1, 3) preds_binary = [] for element in range(0,len(preds)): tmp = [] tmp = preds[element][2] preds_binary.append(tmp) labels_adj = [0 if x == 1 else x for x in labels] labels_adj = [1 if x == 2 else x for x in labels_adj] preds_binary = np.asarray([preds_binary]) labels_adj = np.asarray([labels_adj]) return 'ndcg score', metrics.ndcg_score(new_items, preds) …
Category: Data Science

Multiclass classification oob error

I'm implementing a random forest for a 6-class classification and witnessing a strange phenomenon. I have 10 percent of my set sectioned out as a pseudo validation set. I'm training each tree on a randomly selected 50 percent of the training items (training items being 90 percent of the whole set). Now my OOB error is almost the mirror image of my validation error. I'm using averaged F1 error (i.e. the average of the F1 error per class). As more trees are …
Category: Data Science
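
For reference, a small sketch of what "average of the F1 error per class" could mean, assuming the error is defined as 1 minus the macro-averaged F1 score. That definition is an assumption, as are the function name and the toy labels.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members gets F1 = 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
f1_error = 1.0 - macro_f1(y_true, y_pred)  # assumed definition of "F1 error"
```

Computing the metric with the same function on both the out-of-bag predictions and the validation predictions rules out a difference in metric definition as the cause of diverging curves.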

Why does the MAE still remain, at all?

This may seem to be a silly question, but I just wonder why the MAE doesn't reduce to values close to 0. It's the result of an MLP with 2 hidden layers and 6 neurons per hidden layer, trying to estimate one output value from three input values. Why is the NN (simple feedforward and backprop, nothing special) not able to overfit and match the desired training values? Cost function = $0.5 (Target - Modeloutput)^2$ EDIT: Indeed I found …
Category: Data Science

PySpark: java.io.EOFException

System: 1 name node, 4 cores, 16 GB RAM 1 master node, 4 cores, 16 GB RAM 6 data nodes, 4 cores, 16 GB RAM each 6 worker nodes, 4 cores, 16 GB RAM each around 5 Terabytes of storage space The data nodes and worker nodes exist on the same 6 machines and the name node and master node exist on the same machine. In our docker compose, we have 6 GB set for the master, 8 GB set …
Category: Data Science

Python - Logistic (Logit) Regression - why am I getting an Endog error?

I'm running the following code: X = dataset[['X1 transaction date', 'X2 house age', 'X3 distance to the nearest MRT station', 'X4 number of convenience stores', 'X5 latitude', 'X6 longitude', 'X7 distance to Xindian Ditsrict Office', 'X8 distance to Cardinal Tien Hospital', 'X9 distance to Shih Hsin University']] y = dataset['Y house price of unit area'] model = sm.Logit(y, X).fit() print(model.summary()) I'm using a CSV dataframe with information about 414 different residential properties in the Xindian District of Taiwan. My goal …
Category: Data Science

recognizing the correct word & "Set type is unordered"-error in python-pandas

My Data Set (CSV):
CL1,CL2,CL3
Hello Worrld,Hello ! World,Snack
Hello % World,Hello World,Vol 8.5% Alc
Hello World,Good! Hello,Hello World
Good Morning,Airplane,Good Morning
JK^KJ,Good Morning,Talueas
My Goal:
1- I would like to search and find the similar values between all columns (CL1-CL3) and sort in a new column (SIM).
2- I would like to find the non-similar values between columns and sort in another column (NON-SIM).
What I Would Like: Actually, I would like to use it in supervised learning for …
Category: Data Science

Implementation of reliable rule learning

I want to perform "reliable rule learning", i.e. mining a set of rules with a very low number of false negatives. I recently read the paper "Reliable agnostic learning" by Kalai et al. (https://doi.org/10.1016/j.jcss.2011.12.026) and they basically describe what I want: Rules are determined to reliably classify data points, and the reliability is partly reached by allowing "I don't know" as an additional answer. Sadly, their paper is purely theoretical and I could not find a corresponding implementation. Is there …
Category: Data Science

TypeError: '<' not supported between instances of 'int' and 'str'

I have the following code rf = RandomForestClassifier() rf.fit(X_train, Y_train) print("Features sorted by their score:") print(sorted(zip(map(lambda x: round(x, 2), rf.feature_importances_), X_train), reverse=True)) and I get the following error: TypeError Traceback (most recent call last) <ipython-input-109-c48c3ffd74e2> in <module>() 2 rf.fit(X_train, Y_train) 3 print ("Features sorted by their score:") ----> 4 print (sorted(zip(map(lambda x: round(x, 2), rf.feature_importances_), X_train), reverse=True)) TypeError: '<' not supported between instances of 'int' and 'str' I …
Category: Data Science
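
A likely explanation, offered as an assumption since the excerpt is truncated: `sorted` compares tuples element by element, so when two rounded importances tie, Python falls back to comparing the column labels, and that fails if some labels are integers and others are strings. The sketch below sorts on the score alone. The importance values and column labels are made up.

```python
# Made-up rounded importances and a mix of int and str column labels.
importances = [0.25, 0.5, 0.25]
columns = [3, "age", "income"]

pairs = list(zip((round(x, 2) for x in importances), columns))

# Sorting on the score alone means tied scores never trigger a comparison
# between an int label and a str label.
ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
```

Because Python's sort is stable, tied features keep their original column order. Converting every label to a string before sorting would be another option.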

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.