Remove all characters following a certain character in a column of a dataset

I have a data set like the following, and the first column contains the groupings. However, some are labelled slightly differently. I need to remove all characters following the punctuation used (bracket, semicolon, comma).

groups <- c("Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex" )

I would like this to present all of these just as Group1 (so they would all appear the same as the first two), that is, to remove all characters in the string following the punctuation. I then need …
Category: Data Science
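A minimal sketch of one way to do this in base R, assuming the only delimiters that matter are the opening bracket, semicolon, and comma named in the question. The name `cleaned` is introduced here for illustration only.

```r
# Sample vector taken from the question.
groups <- c("Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex")

# Delete the first "(", ";" or "," and everything after it.
# Inside a character class, "(" is a literal and needs no escaping.
cleaned <- sub("[(;,].*$", "", groups)

print(cleaned)  # all five elements become "Group1"
```

For a data frame column, the same call can be applied to the column and assigned back, e.g. `df$group <- sub("[(;,].*$", "", df$group)`, where `df` and `group` are hypothetical names.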

Customer Segmentation: Should I use a variable, representing a product, that is unpopular in the dataset for K-Means Clustering?

I am working with a data set that, besides customer age and income, gives the balance a customer has in different types of bank accounts: Checking, Shares, Investment, Savings, Deposit, Mortgage, Loan, and Certificates. For accounts other than Checking, 0 means that the account does not exist for the customer. There are 9800 customer observations with roughly 6000 checking accounts and 4000 savings accounts. For the others, there are fewer than 300 observations. I have to use K-Means Clustering analysis …
Category: Data Science

A clear visualization of a two-way ANOVA

To provide a full yet simple picture of a 3-level, one-way ANOVA, I use the following visualization, where variation within each group (the filled circles) and variation between the groups (black arrows) are easy to understand. But I'm wondering whether the current visualization could be extended to a 2 x 3 two-way ANOVA (adding a second factor with two groups to the current visualization)? (Note: the dashed vertical lines denote each group's mean)
Category: Data Science

Nice summary table in R

I have a vector (a column from an imported csv file) that I'd like to compute some summary statistics for, put them in a table, and include that table in a small report. Can R do this? Basically, I have 12 columns (one for each dataset; in the created table I want them as rows), and for each I'd like to calculate mean, min, max, coefficient of variation, sd, kurtosis, etc. What is a good way to do this?
Topic: r
Category: Data Science
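A rough base-R sketch of such a table, under stated assumptions: the data frame `dat` and its three columns are made-up stand-ins for the 12 imported columns, `summarise_col` is a hypothetical helper, and kurtosis is computed as the plain fourth-moment ratio (not excess kurtosis), which may differ from what a dedicated package reports.

```r
# Hypothetical stand-in for the imported csv (the real data has 12 columns).
set.seed(1)
dat <- data.frame(a = rnorm(50, mean = 10), b = runif(50), c = rexp(50))

# Summary statistics for one numeric vector.
summarise_col <- function(x) {
  m <- mean(x)
  s <- sd(x)
  c(mean = m,
    min = min(x),
    max = max(x),
    sd = s,
    cv = s / m,                                      # coefficient of variation
    kurtosis = mean((x - m)^4) / mean((x - m)^2)^2)  # moment-based, not excess
}

# sapply() returns one column per dataset; t() turns the datasets into rows.
summary_table <- t(sapply(dat, summarise_col))
print(round(summary_table, 3))
```

The resulting matrix has one row per original column, so it can be passed to a table-rendering function of your choice when building the report.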

In a Time Series Problem, is it possible to forecast quantities by learning the patterns of other items? What are my options?

Suppose I own a store that sells a variety of apples and I have the following stats each month: Report Date, Type of Apple (TA), Quantity Available (QA), Quantity Sold in the Past 30 Days (QS30), Quantity Shipping In (QSI), and Quantity Needed to Order (QN). Let's make the following assumptions/givens: There are three types of apples: red apples, green apples and yellow apples. T(1) denotes the first month and T(60) denotes the 60th month. QA @ T(i + 1) = QA@T(i) + QSI@T(i) …
Category: Data Science

How to assign costs to the confusion matrix

I am trying to assign costs to the confusion matrix. That is, in my problem, a FP does not have the same cost as a FN, so I want to assign to these cases a cost "x" so that the algorithm learns based on those costs. I will explain my case a little more with an example: When we want to detect credit card fraud, it does not have the same cost to predict that it is not fraud when …
Category: Data Science

How to cluster time series of ordered data?

There are a few hundred time series of a large set of different locations (irregularly distributed) with the following properties: ordered factor (5 levels); between 5 and 25 observations per series; lots of missing values within each series; temporal and spatial autocorrelation; (unknown) temporal frequency. The objective is to spatially cluster the time series based on their similarity (of observed value per point in time). What would be adequate methods? The analysis will be carried out in R.
Category: Data Science

Loss function to prevent estimator bias

I have a regression problem I'm trying to build a model for: predicting sales per person (>= 0) depending on some variables. I'm running different model types and gave deep neural networks a try. The loss functions I'm using are mean squared error and mean absolute error (or sometimes a mix). I often run into the issue, though, that although MSE and MAE are being optimized, I end up with a very strong bias in the prediction, e.g. sum(training_all_predictions) / …
Category: Data Science

Understanding output stepAIC

I am using the stepAIC function in R to do a bi-directional (forward and backward) stepwise regression. I do not understand what each return value from the function means. The output is:

         Df Sum of Sq    RSS     AIC
<none>                350.71 -5406.0
- aaa     1     0.283 350.99 -5405.9
- bbb     1     0.339 351.05 -5405.4
- ccc     1     0.982 351.69 -5400.5
- ddd     1     0.989 351.70 -5400.5

Question: Are the values listed under Df, Sum of Sq, RSS, and AIC the values …
Category: Data Science

Customising Objective Function in R

I wondered whether there are R packages that allow the user to customise the loss function. For example, suppose a random forest package like ranger has a loss function that minimises OOB MSE. Is it somehow possible to customise this into a model that minimises negative log likelihood? I would appreciate it if someone knew of examples/code for doing this on a toy dataset.
Category: Data Science

Error loading tidyverse in R Studio - No broom package

I'm getting numerous errors when trying to install packages, namely tidyverse and ggplot. The errors are always of the form: > library(tidyverse) Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : there is no package called ‘broom’ Error: package or namespace load failed for ‘tidyverse’ I have already tried installing the package broom independently with dependencies = TRUE, per the question "Having trouble installing and loading tidyverse- No DIB package". I've also tried restart …
Topic: rstudio r
Category: Data Science

Importing Excel format data into R/R Studio and using glmnet package?

I have no problem importing Excel formatted data into R/R Studio and using it with all the other R packages that I use. But when I want to use the glmnet package to develop a regularization model, I invariably run into the following error (after specifying my regularization model and attempting to run it): Error in storage.mode(y) <- "double": (list) object cannot be coerced to type 'double' Here is what I have already tried to resolve this: De-format the numbers in Excel (no …
Category: Data Science

How to make a dataframe with lists or vectors as its elements

This is something I have been wondering about for ages, but I have never been able to get an answer. I am trying to understand how to make a dataframe in R where each element of the dataframe is itself a vector or a matrix. For example, let's say we have a regular vector V whose elements are real numbers. Then to access any number we would use V[3], which gives the third element of said vector. Now I want to …
Topic: code data r
Category: Data Science

Getting below error in time and day split in R

#data transformation
#datetime: split on date and hours
data$eu_indicator<-as.factor(data$eu_indicator)
data$hour<-hour(data$calc_created)
data$day<-date(data$calc_created)

Error messages:
Error in hour(data$calc_created) : could not find function "hour"
Error in date(data$calc_created) : unused argument (data$calc_created)
Category: Data Science
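The two errors are consistent with the lubridate package not being loaded: `hour()` is not a base R function, and base R's `date()` takes no arguments. Loading lubridate with `library(lubridate)` is one likely fix. Alternatively, here is a hedged base-R sketch of the same split; the sample timestamps and the UTC time zone are assumptions, since the real `calc_created` column is not shown.

```r
# Hypothetical sample data standing in for the real calc_created column.
data <- data.frame(
  calc_created = as.POSIXct(c("2022-05-01 08:15:00", "2022-05-02 17:40:00"),
                            tz = "UTC")
)

# Hour of day as an integer (0-23), without lubridate.
data$hour <- as.integer(format(data$calc_created, "%H"))

# Calendar date; tz is passed explicitly so the day boundary is unambiguous.
data$day <- as.Date(data$calc_created, tz = "UTC")

print(data)
```

If `calc_created` was read in as character rather than as a date-time, it would need converting with `as.POSIXct()` first.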

Relating changes of a value in time to known events

I work with two datasets. The first dataset contains fluor values measured every minute. The second dataset contains certain events and their time. We know that these events cause peaks in fluor values shortly before and shortly after the event time. A simplified reproducible example in R: Here I provide a simplified version of the R code I use to relate the fluor values to events. I have a series of fluor values measured every minute. Next I have a …
Topic: time-series r
Category: Data Science

How to arrange web scraped data in a table using R?

Original Code:

library(netstat)
library(RSelenium)
library(tidyverse)
obj<-rsDriver(browser="chrome",chromever="101.0.4951.15",verbose=F,port=free_port())
remDr<-obj$client
remDr$navigate('https://www.imdb.com/search/title/?year=2022&title_type=feature&')
Title<-remDr$findElements(using='css','.lister-item-header a')
lapply(Title,function(x) { x$getElementText()%>% unlist() })

Output:

[[1]]
[1] "Doctor Strange in the Multiverse of Madness"
[[2]]
[1] "Senior Year"

My attempts to arrange the data in tabular form:

1. movies=data.frame(Title,stringsAsFactors=FALSE) view(movies)
   Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ‘structure("webElement", package = "RSelenium")’ to a data.frame
2. movies=data.frame(x,stringsAsFactors=FALSE) view(movies)
   Error in data.frame(X, stringsAsFactors = FALSE) : object 'X' not found
3. Part of the original code tweaked: lapply(Title,function(x) { t<-list(x$getElementText()%>% unlist()) }) l=data.frame("movie"=t,stringsAsFactors …
Category: Data Science

faster alternatives to sparse.model.matrix?

I have a large dataset that is entirely categorical. I'm trying to train with it using xgboost, so I must first convert this categorical data to numerical. So far I've been using sparse.model.matrix() in the Matrix library, but it is far too slow. I found a great solution here; however, the sparse matrix it returns is not the same one that sparse.model.matrix returns. I know there is a way to force sparse.model.matrix to return identical output as the solution in …
Category: Data Science

confidence interval around standardised regression coefficient?

I have computed a simple linear regression model as below, but am confused as to whether the confint() function is sufficient to provide 95% confidence intervals around the standardised regression coefficient in the linear model (beta)? Has anyone else run into this issue or is confint() sufficient to extract the 95% confidence interval (i.e., +/-1.96 standard errors of the standardised regression coefficient)? h1a <- lm(formula = var1~ var2, data = df) # estimate value of intercept (b0) and slope (b1) …
Category: Data Science

Fix first two levels of decision tree?

I am trying to build a regression tree with 70 attributes where the business team wants to fix the first two levels namely country and product type. To achieve this, I have two proposals: Build a separate tree for each combination of country and product type and use subsets of the data accordingly and pass on to respective tree for prediction. Seen here in comments. I have 88 levels in country and 3 levels in product type so it will …
Category: Data Science

classify blocks of time series of unknown section lengths based on slope and smoothed diff

What would be the best approach to classify blocks of a univariate time series that contains fuel tank level data (captured every 30 seconds)? The slope of the curve would be an important feature, but only after removing noise. An example TSD looks like this. Marked in red is "Sudden drain", classified manually. I need to classify blocks of the time series into the following 5 categories: Idle (no change in fuel level); Sudden drain (very fast reduction …
Topic: time-series r
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.