How do you maximize a right-skewed variable in a dataset?

I analyzed a dataset of tourist expenses for a country here: https://rpubs.com/lovepeacejoy404/tourist_spending_tanzania and I noticed that the variable total_cost (how much the tourist spent in total) is skewed right: most tourists spend as little as possible, and a few spend a lot. If I am interested in identifying which aspects of tourism are more profitable and worth investing in, should I consider for the categorical variables the highest value of the …
Category: Data Science

Optimal representative participation in the age of AI/ML?

I just answered the 2022 Stack Overflow Developer Survey and earned the Census badge. I can see the number of completed surveys, and as an enthusiastic beginner in data science I am curious what the optimal representative participation would be given 17,866,773 total users. I did some quick research but got stuck on the following sentence: "Once the population exceeds 20,000, your sample size will not change very much anymore." Please help me to understand this (agree/disagree …
Category: Data Science
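
The quoted sentence can be checked numerically. The sketch below assumes Cochran's sample-size formula with a finite population correction, a 95% confidence level (z = 1.96), a 5% margin of error, and the most conservative proportion p = 0.5. None of these settings come from the question; the function name `sample_size` is hypothetical.

```python
import math


def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Required sample size for estimating a proportion (assumed settings)."""
    # Sample size for an infinitely large population (Cochran's formula).
    n0 = z**2 * p * (1 - p) / margin**2
    # Finite population correction: shrinks n0 when the population is small.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


for population in (1_000, 20_000, 1_000_000, 17_866_773):
    print(population, sample_size(population))
```

Under these assumptions the required sample is 377 for a population of 20,000 and 385 for 17,866,773, which is why the sample size "will not change very much" beyond that point: the correction term `(n0 - 1) / population` approaches zero as the population grows.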

How do I get the divided values of two columns that are a result from a groupby method

I currently have a dataframe that was made by the following example code:

df.groupby(['col1', 'col2', 'Count'])[['Sum']].agg('sum')

which looks like this:

col1  col2      Count  Sum
DOG   HUSKY     600    1500
CAT   CALICO    200    3000
BIRD  BLUE JAY  1500   4500

I would like to create a new column which outputs the division of df['Sum'] and df['Count']. The expected data frame would look like this:

col1  col2      Count  Sum   Average
DOG   HUSKY     600    1500  2.5
CAT   CALICO    200    3000  15
BIRD  BLUE JAY  1500   …
Category: Data Science
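
One possible approach, sketched with pandas on the sample rows from the question: because `Count` is one of the groupby keys, it ends up in the index of the result rather than in a column, so `reset_index()` is called before dividing. The names `raw`, `grouped`, and `result` are illustrative.

```python
import pandas as pd

# Sample rows taken from the question.
raw = pd.DataFrame({
    "col1": ["DOG", "CAT", "BIRD"],
    "col2": ["HUSKY", "CALICO", "BLUE JAY"],
    "Count": [600, 200, 1500],
    "Sum": [1500, 3000, 4500],
})

# Same aggregation as in the question; the three keys become a MultiIndex.
grouped = raw.groupby(["col1", "col2", "Count"])[["Sum"]].agg("sum")

# Move the keys back into ordinary columns, then divide element-wise.
result = grouped.reset_index()
result["Average"] = result["Sum"] / result["Count"]
print(result)
```

An alternative that keeps the MultiIndex is `grouped["Sum"] / grouped.index.get_level_values("Count")`.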

Predicting Customer Activity Absence

Could you please assist me with the following question? I have a customer activity dataframe that looks like this: It contains at least 500,000 customers and a "timeseries" of 42 months. The ones and zeroes represent customer activity. If a customer was active during a particular month then there will be a 1, if not, a 0. I need to determine those customers that most likely (+ probability) will not be active during the next 6 months (2018 July-December). Could you …
Category: Data Science

Getting below error in time and day split in R

data$eu_indicator<-as.factor(data$eu_indicator)
data$hour<-hour(data$calc_created)
data$day<-date(data$calc_created)
#data transformation
#datetime: split on date and hours

error msg:
Error in hour(data$calc_created) : could not find function "hour"
Error in date(data$calc_created) : unused argument (data$calc_created)
Category: Data Science

How to return the number of values that have a specific count

I would like to find how many occurrences of a specific value count a column contains. For example, based on the data frame below, I want to find how many values in the ID column are repeated twice.

| ID |
| -------- |
| 000001 |
| 000001 |
| 000002 |
| 000002 |
| 000002 |
| 000003 |
| 000003 |

The output should look something like this:

Number of ID's repeated twice: 2

The ID's …
Category: Data Science
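
A minimal pandas sketch of one way to do this, using the ID values from the question: `value_counts()` gives the number of occurrences per ID, and a boolean filter keeps only the IDs whose count is exactly 2. The variable names are illustrative.

```python
import pandas as pd

# ID column from the question (kept as strings to preserve leading zeros).
df = pd.DataFrame({
    "ID": ["000001", "000001", "000002", "000002", "000002", "000003", "000003"]
})

# Occurrences of each distinct ID.
counts = df["ID"].value_counts()

# IDs that appear exactly twice.
repeated_twice = sorted(counts[counts == 2].index)

print(f"Number of ID's repeated twice: {len(repeated_twice)}")
print(f"The ID's: {repeated_twice}")
```

Changing `counts == 2` to another comparison (for example `counts >= 2`) answers the same question for a different count.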

Determine the effect on margins of a price increase

I hope you can help guide me in the right direction! Any advice is appreciated! Situation: I'm currently analyzing the effect of a price increase from a retailer on a few hundred products. I'm interested in understanding the effect of the price increase on volume, sales value, and margin. The data I have available is weekly product-level data in terms of sales value, volume, and margin for products that had a price increase and for products that did not have …
Category: Data Science

Data analytics: how to read an ECDF graph

Hi there, my question is about how to read ECDF graphs. I am still quite unsure what the jumps / zig-zags in the graph mean, what is happening when there is a horizontal line, and so on. I would be happy if someone could explain to me how I am supposed to read this graph and what information I can get from it. Thank you
Category: Data Science
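
Computing an ECDF by hand may make the shape easier to read. In the sketch below (NumPy only; the function name `ecdf` and the sample values are made up for illustration), each vertical jump sits at an observed value and its height is the share of observations equal to that value, while a horizontal stretch means no observations fall in that interval.

```python
import numpy as np


def ecdf(values):
    """Return the distinct observed values and the ECDF height at each."""
    x = np.asarray(values, dtype=float)
    # Distinct values (sorted) and how many observations each one has.
    distinct, counts = np.unique(x, return_counts=True)
    # Cumulative share of observations less than or equal to each value.
    heights = np.cumsum(counts) / x.size
    return distinct, heights


# Hypothetical sample: one observation at 1, two at 2, one at 5.
distinct, heights = ecdf([1, 2, 2, 5])
for value, height in zip(distinct, heights):
    print(f"F({value:g}) = {height:.2f}")
```

Here the curve jumps by 0.25 at 1, by 0.50 at 2 (two tied observations), stays flat between 2 and 5 because nothing was observed there, and reaches 1.0 at the largest value. Reading across from a height such as 0.75 gives the value below which 75% of the data lies.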

Are there any open datasets for commercial use?

I am creating a bootcamp for data analysts and I have spent 2 days looking for good datasets fit for commercial use that I can use to create Tableau and Power BI tutorials. Even on Kaggle some datasets are licensed as CC0, but when you track back the company the data was scraped from, it states that the data shouldn't be used for commercial use (e.g. the Zomato dataset). Are there any good data sources which I can use …
Category: Data Science

Unable to generate useful insights on high-cardinality data

I'm working on CRM data. I did some cleaning and encoding, ran a decision tree classifier, and plotted a feature_importance graph. From that I found that the salesperson column is one of the important features, and it is a high-cardinality column (around 1300+ categories/salespeople). Now I'm trying to generate some insights on this column with respect to the target column (binary values). In general, how do you create insights from such a large categorical column? P.S.: Other columns are …
Category: Data Science

Find how the properties of an entity affect a certain property of its surrounding

We have a set of things (physical entities): ($A_1$, $A_2$, $A_3$, $A_4$,...). Each of those has certain attributes that we can measure at time $t = 0$. Let $B_i$ represent those attributes, so each of $A_j$ has its own set of attributes ($B_1$, $B_2$, $B_3$,...). With time, the $A_j$ tend to affect a certain property of their surrounding environment. Let $C$ represent that property. Change in $C$ with respect to time can be measured. What we have: Identity (name) of …
Category: Data Science

Approaches on grouping/clustering network device data

I come from more of a computer science background, and recently I have been trying to find a solution to a data-centered problem. I would like to try experimenting with different data-science methods on my dataset, but I'd like to decide which ones are the most interesting and, most importantly, why (and why some are not interesting for that case). Basically, the more you tell me about your thought process, the better; I'm trying to learn from it! …
Category: Data Science

How to measure retention statistics?

I have a dataset with the ID, name, joining date, and leaving date as features. I was asked to measure employee retention and its health. What can I derive from these? What are some recent trends and examples relating to these? Thanks. I know this is a discussion, but given that I couldn't find this with a Google search, it would be helpful for someone.
Category: Data Science

counter vector fit transform cosine similarity memory error

count_matrix = count.fit_transform(off_data3['bag_of_words'])

count_matrix.shape is (476147, 482824)

cosine_sim = cosine_similarity(count_matrix, count_matrix)

I think the matrix is too big, which causes this memory error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
in
~/venv/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
   1034
   1035     K = safe_sparse_dot(X_normalized, Y_normalized.T,
-> 1036                         dense_output=dense_output)
   1037
   1038     return K

~/venv/lib/python3.6/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    135     """
    136     if sparse.issparse(a) or sparse.issparse(b):
--> 137         ret = a * b
    138         if dense_output and hasattr(ret, "toarray"):
    139 …
Category: Data Science
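
For scale: a dense 476147 × 476147 similarity matrix of 64-bit floats would need roughly 1.8 TB, so the full matrix cannot be materialized on typical hardware. One common workaround, sketched below under the assumption that only the few most similar rows per document are needed, is to normalize the sparse rows once and compute similarities in row batches, keeping the top k per row. The function `top_k_cosine` and its parameters are hypothetical, not part of scikit-learn.

```python
import numpy as np
from scipy import sparse


def top_k_cosine(matrix, k=5, batch_size=256):
    """Indices and scores of the k most similar other rows, for every row."""
    matrix = sparse.csr_matrix(matrix, dtype=np.float64)
    # L2-normalize rows so that a dot product equals cosine similarity.
    norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    normalized = sparse.diags(1.0 / norms) @ matrix

    n = normalized.shape[0]
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k))
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Only a (batch_size x n) dense block is held in memory at a time.
        block = (normalized[start:stop] @ normalized.T).toarray()
        # Exclude each row's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        order = np.argsort(-block, axis=1)[:, :k]
        top_idx[start:stop] = order
        top_sim[start:stop] = np.take_along_axis(block, order, axis=1)
    return top_idx, top_sim


# Tiny hypothetical example: rows 0 and 1 are identical, row 2 is orthogonal.
demo = sparse.csr_matrix(np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]))
indices, scores = top_k_cosine(demo, k=1, batch_size=2)
print(indices[:2].ravel(), scores[:2].ravel())
```

`k` must be smaller than the number of rows, and `batch_size` should be chosen so that a `batch_size × n_rows` dense block fits in memory. For very large inputs an approximate nearest-neighbor index may be a better fit than this exhaustive loop.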

Matching new set of data with pre-defined sets

I have sets of data describing sets of levels of requirements needed for certain sets of tasks. The following is a tabulated example: Note that the data values are on a scale from 0 to 10. My problem here is that I have a set of employees whose skills (analysis, patience, comprehension ...) have been analyzed, like the following employee: Analysis --> 8.5 Patience --> 5 Comprehension --> 7 Communication --> 7.5 Creativity --> 8 How to match this employee …
Category: Data Science

How to re-generate dataset df at time t2 while having df at time t1 and cross-sectional dataset df' at time t1 and t2?

I have a travel survey dataset df' collected in 2017 and also 2019. Note that individuals (households here) are not necessarily identical in 2017 and 2019 but their features are.

Dataset df' in 2017:

household  income  size  delivery
A          100K    2     2
B          150K    4     0

Dataset df' in 2019:

household  income  size  delivery
C          75K     1     1
D          100K    4     5

Now I have another travel survey df (in 2017 only) that has some features in common with df': …
Category: Data Science

How to test likelihood hypothesis on dataset?

How to test the following hypothesis? The larger the fare, the more likely the customer is to be traveling alone. Using the data below, how would one be able to test the hypothesis?

import seaborn as sns
# dataset
df = sns.load_dataset('titanic')
df[['fare','alone']].head()

      fare  alone
0   7.2500  False
1  71.2833  False
2   7.9250   True
3  53.1000  False
4   8.0500   True

UPDATE

#subset for alone = True
alone = df['fare'].loc[df['alone'] == True]
#import Wilcoxon test
from scipy.stats import wilcoxon
#run wilcoxon test …
Category: Data Science
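
A note on the test choice: `scipy.stats.wilcoxon` is the signed-rank test for paired samples, whereas fares of passengers traveling alone and not alone form two independent groups. A Mann-Whitney U test is one option for that setting. The sketch below uses small made-up fare arrays rather than the Titanic data, and the one-sided alternative is an assumption about how the hypothesis should be phrased.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical fares, invented for illustration (not the Titanic data).
fare_alone = np.array([7.9, 8.1, 13.0, 26.6, 30.5, 52.0])
fare_not_alone = np.array([7.3, 11.1, 16.7, 21.1, 53.1, 71.3])

# One-sided test: do solo travelers tend to have higher fares?
stat, p_value = mannwhitneyu(fare_alone, fare_not_alone, alternative="greater")
print(f"U = {stat}, p = {p_value:.3f}")
```

With the question's dataframe, the two groups could be taken as `df.loc[df['alone'], 'fare']` and `df.loc[~df['alone'], 'fare']`. Since the hypothesis is phrased as "the larger the fare, the more likely alone", a logistic regression of `alone` on `fare` would be another way to frame it.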

Looking for data analysis techniques and approach

I'm new to ML and I need to do a data analysis on a dataset which I created myself, but I don't know exactly which techniques I should use. Namely, I have a dataset with the following attributes: sensor_id,date,time,lat,log,temperature,noise,noise_dba,pm10,humidity,pm25,relative_humidity,wind_speed,sea_level_pressure,solar_elevation_angle,solar_radiation,pressure,snow,uv,wind_direction,visibility,clouds. It's a dataset from IoT devices that measure noise, air pollution (pm10, pm25) and some other weather characteristics. What I want to achieve now is to prove whether the increased noise means increased air pollution (both pm10 and pm25 increased) or …
Category: Data Science
