data-analysis

How do you maximize a left skewed variable in a dataset?

gianni404

June 1, 2022 23:45

I analyzed a dataset of tourist expenses for a country here: https://rpubs.com/lovepeacejoy404/tourist_spending_tanzania and I noticed that the variable total_cost (how much the tourist spent in total) is skewed left because tourists tend to spend as little as possible and then there are few who spend a lot. If I am interested in identifying which aspects of tourism are more profitable and in which it is worthwhile to invest, should I consider for the categorical variables the highest value of the …

Topic: data-analysis

Category: Data Science

Optimal representative participation in the age of AI/ML?

eapo

June 1, 2022 20:54

I just answered the great questions of the 2022 Stack Overflow Developer Survey and earned the Census badge. There I can see the number of completed surveys, and as an enthusiastic beginner in data science I am curious about the optimal representative participation given 17,866,773 total users. I did some quick research but got stuck on the following sentence: "Once the population exceeds 20,000, your sample size will not change very much anymore." Please help me to understand this (agree/disagree …

Topic: ai data-analysis machine-learning

Category: Data Science
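The quoted sentence can be checked numerically. The following is a minimal sketch, assuming Cochran's sample-size formula with a finite population correction, a 95% confidence level (z = 1.96), a 5% margin of error, and p = 0.5; none of these settings come from the original post, and the function name `sample_size` is hypothetical.

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction (assumed settings)."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2          # size for an infinite population
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # shrink for a finite population

# Compare a few population sizes, including the user count quoted in the question.
for population in (1_000, 20_000, 1_000_000, 17_866_773):
    print(population, sample_size(population))
```

Under these assumptions the required sample grows only from 377 at a population of 20,000 to 385 at 17,866,773, which is consistent with the quoted statement.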

How do I get the divided values of two columns that are a result from a groupby method

jibecat

June 1, 2022 11:50

I currently have a dataframe that was made by the following example code df.groupby(['col1', 'col2', 'Count'])[['Sum']].agg('sum') which looks like this col1 col2 Count Sum DOG HUSKY 600 1500 CAT CALICO 200 3000 BIRD BLUE JAY 1500 4500 I would like to create a new column which outputs the division of df['Sum'] and df['Count'] The expected data frame would look like this col1 col2 Count Sum Average DOG HUSKY 600 1500 2.5 CAT CALICO 200 3000 15 BIRD BLUE JAY 1500 …

Topic: data-analysis pandas python

Category: Data Science
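One possible approach, sketched under the assumption that `Count` ended up in the index because it was one of the groupby keys: reset the index so `Count` becomes a regular column again, then divide. The frame `raw` below is a hypothetical reconstruction of the sample values shown in the question.

```python
import pandas as pd

# Hypothetical input rebuilt from the sample rows in the question.
raw = pd.DataFrame({
    "col1": ["DOG", "CAT", "BIRD"],
    "col2": ["HUSKY", "CALICO", "BLUE JAY"],
    "Count": [600, 200, 1500],
    "Sum": [1500, 3000, 4500],
})

# Same groupby as in the question: 'Count' becomes part of the index.
grouped = raw.groupby(["col1", "col2", "Count"])[["Sum"]].agg("sum")

# Move the index levels back into columns, then divide element-wise.
result = grouped.reset_index()
result["Average"] = result["Sum"] / result["Count"]
print(result)
```

If the index should be kept as is, `grouped["Sum"] / grouped.index.get_level_values("Count")` is an alternative that avoids the reset.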

Predicting Customer Activity Absence

Andrei

June 1, 2022 06:04

Could you please assist me with the following question? I have a customer activity dataframe that looks like this: It contains at least 500,000 customers and a "timeseries" of 42 months. The ones and zeroes represent customer activity. If a customer was active during a particular month there is a 1; if not, a 0. I need to determine those customers that most likely (+ probability) will not be active during the next 6 months (2018 July-December). Could you …

Topic: data-analysis dataframe prediction pandas python

Category: Data Science

make new feature based on number of 'likes' and 'release date'

DarthVader8848

May 30, 2022 14:02

I must create a new feature based on the number of likes and the release date. Is it a good idea to estimate likes per day, and after that define some ranges for video popularity? I think that if a video was released a long time ago and has 5 likes, the likes-per-day coefficient will be high. How can I calculate this coefficient properly?

Topic: data-analysis feature-selection

Category: Data Science
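A minimal sketch of a likes-per-day feature, assuming the age of a video is measured in whole days up to a chosen reference date. The function name `likes_per_day` and all dates below are hypothetical.

```python
from datetime import date

def likes_per_day(likes, release_date, as_of):
    """Likes divided by the video's age in days (at least one day, to avoid dividing by zero)."""
    age_days = max((as_of - release_date).days, 1)
    return likes / age_days

as_of = date(2022, 5, 30)                                # hypothetical reference date
old_video = likes_per_day(5, date(2020, 5, 30), as_of)   # released two years earlier
new_video = likes_per_day(5, date(2022, 5, 25), as_of)   # released five days earlier

print(f"old video: {old_video:.4f} likes/day")
print(f"new video: {new_video:.4f} likes/day")
```

With this definition, a video that is old and has only 5 likes gets a low rate, while a recent video with the same 5 likes gets a much higher one, so the rate rewards likes that arrive quickly.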

Getting below error in time and day split in R

ahm

May 28, 2022 21:12

data$eu_indicator<-as.factor(data$eu_indicator) data$hour<-hour(data$calc_created) data$day<-date(data$calc_created) #data transformation #datetime: split on date and hours error msg:::Error in hour(data$calc_created) : could not find function "hour" Error in date(data$calc_created) : unused argument (data$calc_created)

Topic: data-analysis r

Category: Data Science

How to return the number of values that has a specific count

jibecat

May 28, 2022 19:08

I would like to find how many occurrences of a specific value count a column contains. For example, based on the data frame below, I want to find how many values in the ID column are repeated twice | ID | | -------- | | 000001 | | 000001 | | 000002 | | 000002 | | 000002 | | 000003 | | 000003 | The output should look something like this Number of ID's repeated twice: 2 The ID's …

Topic: data-analysis pandas python

Category: Data Science
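One way this could be done, sketched with the sample IDs from the question: count how often each ID occurs, then keep only the IDs whose count equals 2. The variable names are illustrative.

```python
import pandas as pd

# Sample column taken from the question (IDs kept as strings to preserve leading zeros).
df = pd.DataFrame({"ID": ["000001", "000001", "000002", "000002",
                          "000002", "000003", "000003"]})

counts = df["ID"].value_counts()   # occurrences per ID
twice = counts[counts == 2]        # keep IDs that appear exactly twice

print("Number of IDs repeated twice:", len(twice))
print("IDs:", sorted(twice.index))
```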

Determine the effect on margins of a price increase

Modvinden

May 26, 2022 19:56

I hope you can help guide me in the right direction! Any advice is appreciated! Situation: I'm currently analyzing the effect of a retailer's price increase on a few hundred products. I'm interested in understanding the effect of the price increase on volume, sales value, and margin. The data I have available is weekly product-level data in terms of sales value, volume, and margin for products that had a price increase and for products that did not have …

Topic: data-analysis regression time-series

Category: Data Science

Data Analytics how to read ECDF graph

Yavuz Bozkurt

May 25, 2022 18:08

Hi there, my question is about how to read ECDF graphs. I am still quite unsure what the jumps / zig-zags in the graph mean, what is happening when there is a horizontal line, and so on. I would be happy if someone could explain to me how I am supposed to read this graph and what information I can get from it. Thank you

Topic: data-analysis data graphs

Category: Data Science
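Computing an ECDF by hand can make the shape easier to read. The sketch below assumes the usual definition (the fraction of observations less than or equal to x); the sample values are hypothetical.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative fraction of observations at each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)   # each observation adds 1/n
    return x, y

sample = [2, 3, 3, 7, 10]                   # hypothetical data
x, y = ecdf(sample)
for xi, yi in zip(x, y):
    print(f"x = {xi:4.1f}  ->  fraction <= x: {yi:.1f}")
```

Each vertical jump sits at an observed value and has height 1/n, so the repeated value 3 produces a jump twice as tall. A horizontal stretch, such as between 3 and 7 here, means no observations fall in that range.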

Are there any open datasets for commercial use?

hyeri

May 24, 2022 16:06

I am creating a bootcamp for data analysts, and I have spent 2 days looking for good datasets fit for commercial use that I can use to create Tableau and Power BI tutorials. Even on Kaggle some datasets are licensed as CC0, but when you track back the company the data was scraped from, it states that the data shouldn't be used for commercial use (e.g. Zomato dataset). Are there any good data sources which I can use …

Topic: data-analysis powerbi data tableau

Category: Data Science

Unable to generate useful insights on a highly cardinal data

dark_rush

May 24, 2022 06:21

I'm working on CRM data. I did some cleaning and encoding, then ran a decision tree classifier and plotted a feature_importance graph. From that I found that the Sales person column is one of the important features, and it is a high-cardinality column (around 1300+ categories/sales persons). Now I'm trying to generate some insights on this column with respect to the target column (binary values). I would like to know, in general, how to create insights from such a large categorical column. P.S: Other columns are …

Topic: data-science-model data-analysis visualization python machine-learning

Category: Data Science

Find how the properties of an entity affect a certain property of its surrounding

Hitanshu Sachania

May 20, 2022 21:53

We have a set of things (physical entities): ($A_1$, $A_2$, $A_3$, $A_4$,...). Each of those has certain attributes that we can measure at time $t = 0$. Let $B_i$ represent those attributes, so each of $A_j$ has its own set of attributes ($B_1$, $B_2$, $B_3$,...). With time, the $A_j$ tend to affect a certain property of their surrounding environment. Let $C$ represent that property. Change in $C$ with respect to time can be measured. What we have: Identity (name) of …

Topic: data-analysis clustering machine-learning

Category: Data Science

Approaches on grouping/clustering network device data

Tahaga

May 12, 2022 15:25

So I come from more of a computer science background, and recently have been trying to find a solution to a data-centered problem. I would like to try experimenting with different data-science methods on my dataset, but I'd like to decide which ones are the most interesting and, most importantly, why (and why some are not interesting for this case): basically, the more you tell me about your thought process, the better, as I'm trying to learn from it! …

Topic: data-analysis data data-mining

Category: Data Science

How can data science teams inside businesses measure costs and efficiency of their technical work?

Guest

May 10, 2022 01:06

How can data science teams measure and improve costs of their technical work, when they often don't know the monetary value of the datasets and insights they are producing? Are they using industry based benchmarks for technical development, and some subjective measurement for business insight creation?

Topic: data-analysis

Category: Data Science

How to measure retention statistics?

yawwml

May 1, 2022 21:03

I have a dataset with ID, name, joining date, and leaving date as features. I was asked to measure employee retention and its health. What can I derive from these features? What are some recent trends and examples I can find relating to these? Thanks. I know this is a discussion question, but given that I couldn't find this through a Google search, it would be helpful for someone.

Topic: data-analysis dataset data-mining

Category: Data Science

counter vector fit transform cosine similarity memory error

slowmonk

May 1, 2022 11:01

count_matrix = count.fit_transform(off_data3['bag_of_words']) I have count_matrix shape with count_matrix.shape (476147, 482824) cosine_sim = cosine_similarity(count_matrix, count_matrix) I think the matrix is too big, which causes this memory error --------------------------------------------------------------------------- MemoryError Traceback (most recent call last) in ~/venv/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output) 1034 1035 K = safe_sparse_dot(X_normalized, Y_normalized.T, -> 1036 dense_output=dense_output) 1037 1038 return K ~/venv/lib/python3.6/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output) 135 """ 136 if sparse.issparse(a) or sparse.issparse(b): --> 137 ret = a * b 138 if dense_output and hasattr(ret, "toarray"): 139 …

Topic: data-analysis cosine-distance nlp machine-learning

Category: Data Science
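A full 476147 x 476147 similarity matrix stored as dense 64-bit floats would need roughly 1.8 TB, so one common workaround is to compute similarities in row chunks and keep only the top matches per row. The sketch below is one possible way to do that with SciPy sparse matrices; the function name `top_k_cosine`, the parameters `k` and `chunk_size`, and the toy matrix are all hypothetical, and it is not the approach from the original post.

```python
import numpy as np
from scipy import sparse

def top_k_cosine(matrix, k=5, chunk_size=100):
    """Yield (row, neighbour_indices, scores) without building the full similarity matrix."""
    matrix = sparse.csr_matrix(matrix, dtype=np.float64)
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0                       # leave all-zero rows unchanged
    unit = (sparse.diags(1.0 / norms) @ matrix).tocsr()
    unit_t = unit.T.tocsr()
    for start in range(0, unit.shape[0], chunk_size):
        # Only a (chunk_size x n_rows) block is ever held densely in memory.
        block = (unit[start:start + chunk_size] @ unit_t).toarray()
        for offset, row in enumerate(block):
            i = start + offset
            row[i] = -np.inf                      # exclude the row itself
            best = np.argsort(row)[::-1][:k]      # indices of the k largest scores
            yield i, best, row[best]

# Toy example: rows 0 and 1 are identical, row 2 shares no terms with them.
toy = sparse.csr_matrix([[1, 0, 1], [1, 0, 1], [0, 1, 0]])
results = {i: (idx.tolist(), scores.tolist())
           for i, idx, scores in top_k_cosine(toy, k=1, chunk_size=2)}
print(results)
```

The chunk size controls peak memory: each dense block holds `chunk_size` times the number of rows as 64-bit floats, so it would need tuning for a matrix of the size in the question.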

Matching new set of data with pre-defined sets

Georgio Sayegh

April 24, 2022 05:04

I have sets of data describing sets of levels of requirements needed for certain sets of tasks. The following is a tabulated example: Note that the data values are on a scale from 0 to 10. My problem here is that I have a set of employees whose skills (analysis, patience, comprehension ...) have been analyzed, like the following employee: Analysis --> 8.5 Patience --> 5 Comprehension --> 7 Communication --> 7.5 Creativity --> 8 How to match this employee …

Topic: data-analysis regression

Category: Data Science

How to re-generate dataset df at time t2 while having df at time t1 and cross-sectional dataset df' at time t1 and t2?

Armin Mir

April 21, 2022 21:34

I have a travel survey dataset df' collected in 2017 and also 2019. Note that individuals (households here) are not necessarily identical in 2017 and 2019 but their features are. Dataset df' in 2017: household income size delivery A 100K 2 2 B 150K 4 0 Dataset df' in 2019: household income size delivery C 75K 1 1 D 100K 4 5 Now I have another travel survey df (in 2017 only) that has some features in common with df': …

Topic: data-analysis dataset machine-learning

Category: Data Science

How to test likelihood hypothesis on dataset?

IOIOIOIOIOIOI

April 18, 2022 13:38

How to test the following hypothesis? The larger the fare the more likely the customer is to be traveling alone. Using the data below, how would one be able to test the hypothesis? import seaborn as sns # dataset df= sns.load_dataset('titanic') df[['fare','alone']].head() fare alone 0 7.2500 False 1 71.2833 False 2 7.9250 True 3 53.1000 False 4 8.0500 True UPDATE #subset for alone = True alone = df['fare'].loc[df['alone'] == True] #import Wilcoxon test from scipy.stats import wilcoxon #run wilcoxon test …

Topic: hypothesis-testing data-analysis probability python

Category: Data Science
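`scipy.stats.wilcoxon` is the signed-rank test for paired samples. For two independent groups (alone vs. not alone), the Mann-Whitney U test is a common alternative. The sketch below uses synthetic fares rather than the Titanic data, so the group sizes, distributions, and variable names are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic, hypothetical fares: solo travellers are drawn with a higher typical fare.
fare_alone = rng.lognormal(mean=3.0, sigma=1.0, size=300)
fare_not_alone = rng.lognormal(mean=2.5, sigma=1.0, size=300)

# One-sided test: do fares of solo travellers tend to be larger?
stat, p_value = mannwhitneyu(fare_alone, fare_not_alone, alternative="greater")
print(f"U = {stat:.1f}, p = {p_value:.3g}")
```

With real data, the two arrays would be the `fare` values split by the `alone` flag. A logistic regression of `alone` on `fare` is another option when the size of the effect matters, not just its direction.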

Looking for data analysis techniques and approach

anthino12

April 10, 2022 15:58

I'm new to ML and I need to do a data analysis on a dataset which I created myself, but I don't know exactly which techniques I should use. Namely, I have a dataset with the following attributes: sensor_id,date,time,lat,log,temperature,noise,noise_dba,pm10,humidity,pm25,relative_humidity,wind_speed,sea_level_pressure,solar_elevation_angle,solar_radiation,pressure,snow,uv,wind_direction,visibility,clouds. It's a dataset from IoT devices that measure noise, air pollution (pm10, pm25) and some other weather characteristics. What I want to achieve now is to prove whether the increased noise means increased air pollution (both pm10 and pm25 increased) or …

Topic: data-analysis correlation machine-learning

Category: Data Science
