Feature Selection on Aggregated Target Data

I have a question about feature selection on a dataset where the target variable is aggregated by the sum of different data points. I want to predict the number of sales depending on a variety of features such as: week, price per unit, store country, store city, 2-3 other categorical metadata fields, and other features. I am aware that this data should be interpreted as a time series, but because of the lack of available historical data, no model can compete with the naive …
Category: Data Science

Supervised learning on sources of information with different importance

I am trying to classify customer support sessions using supervised machine learning. In each customer support session I have 3 bags of information:
1. The title of the customer's complaint
2. Information about the device the customer was using
3. Text of the chat session with the customer support agent
In each customer support session, there are 6 different classes. Is it better to:
1. Train a classifier on each bag of information and have them vote on which class …
Category: Data Science

Learning the Average of a 0/1 Dependent Variable

Suppose I have a matrix $X$ and a dependent vector $y$ whose entries are each in {0,1}, dependent on the corresponding row of $X$. Given this dataset, I'd like to learn a model, so that given some other dataset $X'$, I could predict average($y'$) of the dependent-variable vector $y'$. Note that I'm only interested in the response on the aggregate level of the entire dataset. One way of doing so would be to train a calibrated binary classifier $X \to y$, apply it to $X'$, …
Category: Data Science

How can we predict a value after several rows of data?

I have a regression problem in which I have several rows for each week (the number of rows varies between weeks, i.e. one week might have 1800 rows and another might have 5000 rows). My target is to predict a value at the end of each week's data. Here's an example of what I need: x is a feature, y is the target. Week 1 ; x1, x2, x3.. x90 Week 1 ; v1, v2, v3... v90 .... 100 more rows Week 1 ; …
Category: Data Science

How to deal with a potentially multiple categorical variable

I'm building a model that has, as inputs, some categorical variables. I have dealt with this sort of data before, and applied different techniques such as creating dummy variables and factor scoring. However, I now have a different type of problem for which I cannot see an obvious best answer. For each individual we can have multiple instances of this categorical variable $X$. When such cases happen with numerical variables I usually take the max/mean/min depending on context. …
Category: Data Science

Tableau: keeping results independent of view / filter

I am using Tableau Desktop 2021.1.4. Suppose that my source sales data consists of 4 columns: Region (dimension with values: N, E, W, S), Type (dimension with values: Furniture, Electronics, Appliances), Year (dimension with values: 2021, 2020, 2020), and sales ($). I would like to generate a Calculated Field, say "Sum of Sales", where the summation:
- is always over all the regions and all the types, regardless of what is in the view
- can also be over the different years or can be …
Category: Data Science

Aggregating transactional data for customer segmentation

I have item-level transactional data where each row in the data represents a different item bought by a customer in a transaction (so if two different items were bought in the same transaction by the same customer, there would be two rows where the customer_id and the transaction_id columns have the same value). E.g.:

Customer_id  transaction_id  item_bought  quantity
a            00001           cheese       2
b            00002           ham          1
b            00002           pepsi        2

In this case customer b bought two items in the …
Category: Data Science
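
One common way to prepare item-level rows like these for segmentation is to roll them up to one row per customer. The sketch below is only an illustration: the three rows come from the example in the question, while the feature names (`n_transactions`, `n_distinct_items`, `total_quantity`) are hypothetical choices, not something the question specifies.

```python
import pandas as pd

# Item-level rows copied from the example in the question.
items = pd.DataFrame({
    "customer_id": ["a", "b", "b"],
    "transaction_id": ["00001", "00002", "00002"],
    "item_bought": ["cheese", "ham", "pepsi"],
    "quantity": [2, 1, 2],
})

# Roll up to one row per customer. The feature names are illustrative only.
customers = items.groupby("customer_id").agg(
    n_transactions=("transaction_id", "nunique"),  # distinct baskets
    n_distinct_items=("item_bought", "nunique"),   # distinct products
    total_quantity=("quantity", "sum"),            # units bought overall
)
print(customers)
```

Counting distinct transaction ids (rather than rows) keeps a two-item basket from being counted as two purchases.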

Python Pandas agg error

I am trying to generate descriptive statistics using the agg function in Pandas. I am having trouble with one line that uses a lambda function. The statements work when I run them as separate lines of code, but when I combine them into a single line I get errors. Any guidance is much appreciated. The following two lines of code work when I run them individually. First line of code: bh_df.groupby('CAT.MEDV').agg(avg_Nox=('NOX', 'mean')) Second line, with the lambda function: bh_df.groupby('CAT.MEDV').agg(rng=("NOX", lambda x: (max(x) …
Category: Data Science
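
For reference, a minimal sketch of combining a built-in aggregation and a custom one in a single named-aggregation call. The data frame below is a made-up stand-in for `bh_df`, and because the lambda in the question is truncated, `value_range` (max minus min) is an assumed completion, not the asker's actual function.

```python
import pandas as pd

# Toy stand-in for bh_df; the values are invented for illustration.
bh_df = pd.DataFrame({
    "CAT.MEDV": [0, 0, 1, 1],
    "NOX": [0.40, 0.70, 0.50, 0.55],
})

def value_range(x):
    """Range of a Series: max minus min (assumed intent of the lambda)."""
    return x.max() - x.min()

# Both statistics in one agg call, each as an (input column, function) pair.
summary = bh_df.groupby("CAT.MEDV").agg(
    avg_Nox=("NOX", "mean"),
    rng=("NOX", value_range),
)
print(summary)
```

A named function is used here instead of a lambda mainly for readability; each keyword argument takes an (input column, function) pair, and the function can be a string name or any callable.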

Labeling and aggregating features issue

I am trying to build a simple binary classifier (some tree-based algorithm for now) and my training data will have features aggregated at the user level, so I'll have one unique record for each user. These aggregated features are things like "number of logged in sessions", "number of times profile button was clicked", etc.; essentially these are website browsing behavior features. What I am trying to predict is whether someone would be interested in subscribing or not. Some users might …
Category: Data Science

Concatenating Data in two years

I have to use a machine learning model to predict electricity consumption and carbon emissions based on some buildings' features (area, year of construction, ...). Here is the link to the data. The problem is that I have data from two years, 2015 and 2016; for each year I have some buildings and the mean consumption and emissions. I'm wondering what the best way to concatenate the data is, since there are some buildings that are registered only …
Category: Data Science

How do you aggregate features of lists (pooling alternatives)?

Is it possible to reduce non-correlated multi-dimensional data over features to 1D data? A working option is pooling (mean/min/max) over an embedding vector (n samples of embeddings of m dimensions). E.g. it converts many embeddings (n × m) to a list of means (1 × m). However, these all lose a lot of information (especially the relationships between features in single embeddings). This doesn't have to be a reduction (i.e. the resulting 1D vector can be larger than m). If it's …
Category: Data Science

How to aggregate data inserted by users to avoid outliers?

I'm developing a new application based on machine learning. In this application users can insert new data to improve the prediction system. As you may guess, users could insert data that doesn't make sense, generating in this way outliers that may harm the prediction accuracy. I'm pretty new to this field so I would like to ask you: do you know any strategy to mitigate this? Maybe by implementing a voting or aggregating system? In that case, do you have …
Category: Data Science

MongoDB Groupby Rank

I'm working with MongoDB and want to write a query using the aggregate function. The query is: each city has several zip codes. Find the city in each state with the largest number of zip codes, and rank those cities along with the states using the city populations. The documents are in the following format: { "_id": "10280", "city": "NEW YORK", "state": "NY", "pop": 5574, "loc": [ -74.016323, 40.710537 ] } I was able to count the number of zip codes for each state …
Category: Data Science

Aggregating standard deviations

Imagine I have a collection of data, let's say the travel time for a road segment. On this collection I want to calculate the mean and the standard deviation. Nothing hard so far. Now imagine that instead of having my collection of values for one road segment, I have multiple collections of values that correspond to the multiple sub segments that compose the road segment. For each of these collections, I know the average and the standard deviation. From that, …
Topic: aggregation
Category: Data Science
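
The excerpt is cut off before the actual question, so the following is only one possible reading: if the travel time of the whole segment is the sum of the sub-segment travel times, and those times are assumed to be independent, then the means add and the variances (not the standard deviations) add. The function name and the numbers are hypothetical.

```python
import math

def combine_segments(means, stds):
    """Mean and std of a sum of independent quantities.

    Assumes independence between sub-segments. If travel times are
    correlated, covariance terms are needed, and those cannot be
    recovered from per-segment means and standard deviations alone.
    """
    total_mean = sum(means)
    total_std = math.sqrt(sum(s ** 2 for s in stds))
    return total_mean, total_std

# Hypothetical sub-segments: 30 s (std 3 s) and 45 s (std 4 s).
total_mean, total_std = combine_segments([30.0, 45.0], [3.0, 4.0])
print(total_mean, total_std)  # 75.0 5.0
```

Note that the combined standard deviation (5 s) is smaller than the sum of the individual ones (7 s); simply adding standard deviations would overstate the spread under the independence assumption.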

Using R to organize/rearrange CSV - group by multiple columns?

I have a CSV that I need to clean up / organize in a usable way using R. I need to group by the property ID and then want to take all the unique years for the defor year column and make each year into a separate column with the amount of deforestation for that year. My data frame / CSV looks like this:

Prop_ID  deforYear  deforHA
1        2010       15
1        2011       0
1        2012       10
2        2010       35
2        …
Topic: aggregation r
Category: Data Science

R: Calculations based on frequencies / grouped / aggregate data

I am trying to do simple calculations in R when no raw data is available, only grouped data with frequencies. This is the case when I have a large number of records in a database, say a large SQL table, and then for given reasons use GROUP BY and COUNT to aggregate instead of downloading the original table for analysis in R. As I understand it, one could say in R that I'm talking about data in a table format. To …
Category: Data Science
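
The question is about R, but the arithmetic behind statistics on grouped data is language-independent. As a rough sketch in Python (with invented value/count pairs and a hypothetical helper name), the mean and sample standard deviation can be computed from (value, frequency) pairs without expanding them back into raw records:

```python
import math

def grouped_stats(value_counts):
    """Count, mean and sample std from (value, frequency) pairs."""
    n = sum(count for _, count in value_counts)
    mean = sum(value * count for value, count in value_counts) / n
    # Each squared deviation is weighted by how often the value occurs.
    ss = sum(count * (value - mean) ** 2 for value, count in value_counts)
    return n, mean, math.sqrt(ss / (n - 1))

# Hypothetical result of a GROUP BY value, COUNT(*) query.
n, mean, std = grouped_stats([(1, 2), (2, 3), (3, 5)])
print(n, mean, std)
```

The results match what one would get by repeating each value according to its count and computing the statistics on the expanded list.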

How to get a (descriptive) overview of a large database?

I'm facing a data framework with ~20k observations and 151 variables across 2078 subjects. At first I am primarily interested in what the data looks like in relation to a single parameter. But I cannot plot 2078 subjects on the x-axis and make a bar plot out of it or so. What would be useful methods for such a situation? I prefer some visualizations but I think they won't be applicable. I'm afraid even non-visualization methods are not really …
Category: Data Science

Heatmap of large 2D array using datashader and plotly

I’m trying to show a heatmap of a large 2D array (160x250000 entries). This should go into a dash app, so I'm using plotly to deal with graphics, and my idea was to use datashader for performance, but I’m having trouble getting it right. However, independently of dash, I’m already having problems with plotly + datashader (see code below). There is probably something very basic I’m not understanding in this process. It would be great if someone could tell me …
Category: Data Science

Data system that manages aggregates over time intervals

I would like to know if there is a data system that handles the following use case. To keep it simple, the data is a set of homogeneous entities E. E contains named numeric properties that the app code increments as the case may be over the life cycle of the application. I will want to query the state of E for a set of time intervals. To keep it simple, let's say this is today, this month, this year …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.