Suppose I want to predict a certain numerical value, but the data set containing the correct labels is very small. However, I am also given a large data set with a label that is correlated with the one I want to predict. I read that transfer learning can be used to exploit this larger data set when predicting the desired label from the smaller one. Could someone elaborate a bit on this?
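One minimal sketch of the idea, assuming made-up data and a simple linear model: train on the large, proxy-labelled set first, then reuse that model's output as an extra feature when fitting on the small, true-labelled set (a simple stacking-style form of transfer; with neural networks one would instead pretrain and fine-tune the weights).

```python
# Hedged sketch: all data and coefficients below are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Large dataset with a correlated proxy label.
X_large = rng.normal(size=(1000, 5))
y_proxy = X_large @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=1000)

# Small dataset with the label we actually care about (correlated with the proxy).
X_small = rng.normal(size=(50, 5))
y_true = X_small @ np.array([1.1, 0.4, 0.1, -0.6, 1.9]) + rng.normal(scale=0.1, size=50)

# Step 1: learn the proxy task on the large dataset.
proxy_model = LinearRegression().fit(X_large, y_proxy)

# Step 2: append the proxy model's prediction as an extra feature.
X_small_aug = np.hstack([X_small, proxy_model.predict(X_small)[:, None]])

# Step 3: fit the final model on the small dataset.
final_model = LinearRegression().fit(X_small_aug, y_true)
print(X_small_aug.shape)  # (50, 6)
```

Whether this helps depends entirely on how strongly the proxy label correlates with the target.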
Let's say I am predicting house selling prices (continuous) and therefore have multiple independent variables (numerical and categorical). Is it common practice to balance the dataset when the categorical independent variables are imbalanced? The ratio is not higher than 1:100. Or do I only balance the data when the dependent variable is imbalanced? Thanks
I have a dataset of employee data with around 9,500 rows, and have to predict whether the target is 0 or 1. Some of my features are the department of an employee, gender, salary, review_score (numerical), average_number_of_hours per month, bonus (1 or 0), number of projects an employee is involved in, and tenure. My question is whether number of projects (3,4,5,6) and tenure (2,3,4,5,6,7,8,9,10,11,12) should be treated as 'categories' rather than numerical values. I can make them ordinal. However, I am not …
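For the ordinal option, a minimal pandas sketch (the column name is taken from the question; the values are made up) keeps the order of the levels without implying that the gaps between them are equal:

```python
# Hedged sketch: store a small-integer feature as an ordered categorical.
import pandas as pd

df = pd.DataFrame({"number_of_projects": [3, 5, 4, 6, 3]})
df["number_of_projects"] = pd.Categorical(
    df["number_of_projects"], categories=[3, 4, 5, 6], ordered=True
)

# The ordinal codes respect the declared category order.
print(df["number_of_projects"].cat.codes.tolist())  # [0, 2, 1, 3, 0]
```

Tree-based models are usually insensitive to this choice; for linear models it matters more.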
As input I have a CSV with 5,000 lines (and growing) and 20 fixed columns, each containing a number from 1-80. A row may look like this. Is it possible, using Orange3, to analyze each row and find out which pairs, triples, quads, quints, etc. occur most often in a row? The output I am looking for is "these 2 numbers occur the most often on a row", "These 3 numbers occur the most often on a row", "These …
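Outside of Orange3, the counting itself is a few lines of plain Python (the rows below are made up): count co-occurring pairs with `itertools.combinations` and a `Counter`; triples, quads, etc. work the same way by changing `r`.

```python
# Hedged sketch: frequent pair counting over rows of numbers.
from collections import Counter
from itertools import combinations

rows = [
    [3, 17, 42, 80],
    [3, 17, 55, 61],
    [3, 17, 42, 9],
]

pair_counts = Counter()
for row in rows:
    # Sort so (3, 17) and (17, 3) count as the same pair.
    pair_counts.update(combinations(sorted(row), 2))

print(pair_counts.most_common(1))  # [((3, 17), 3)]
```

Note the combinatorics: with 20 columns there are 190 pairs per row but 15,504 quints, so higher-order counts grow quickly.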
I know how to separate numerical and categorical data as follows: num_data = [cname for cname in df.columns if df[cname].dtypes in ['int64', 'float64']] cat_data = [cname for cname in df.columns if df[cname].dtypes == 'object'] Now I want to separate my numerical variables into discrete and continuous. How do I do that?
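There is no dtype that distinguishes discrete from continuous, so any split is a heuristic. A minimal sketch, assuming made-up columns, is to treat integer columns as discrete (some people instead use a threshold on `nunique()`, which is equally a judgment call):

```python
# Hedged sketch: heuristic discrete/continuous split of numeric columns.
import pandas as pd

df = pd.DataFrame({
    "rooms": [2, 3, 3, 4],          # integer-valued: treated as discrete
    "price": [1.5, 2.7, 3.1, 4.8],  # float-valued: treated as continuous
})

num_cols = df.select_dtypes(include="number").columns
discrete = [c for c in num_cols if pd.api.types.is_integer_dtype(df[c])]
continuous = [c for c in num_cols if c not in discrete]
print(discrete, continuous)  # ['rooms'] ['price']
```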
I have a categorical variable with 4 levels ('8 c', '6 c', 'NAN', 'Others') and I want to convert it to numerical form. An obvious way is to simply remove the 'c' part of the first two categories and replace 'NAN' with 0. However, I was wondering about the 'Others' level: what would be the best way to transform it? Please note that the variable represents the number of cylinders of a given car.
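One option, sketched below, is an explicit mapping where 'Others' gets a value between the known cylinder counts; the choice of 7 here is an assumption, not the answer (the median of known counts, or a separate indicator column, are alternatives):

```python
# Hedged sketch: hand-written mapping for the four levels in the question.
mapping = {"8 c": 8, "6 c": 6, "NAN": 0, "Others": 7}  # 7 = midpoint guess

values = ["6 c", "Others", "8 c", "NAN"]
encoded = [mapping[v] for v in values]
print(encoded)  # [6, 7, 8, 0]
```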
To analyze a banking dataset I have both numerical and categorical values, which I transform so that I can analyze them with k-prototypes. The original dataset: The modified dataset: E.g. Job (coded from 1 to 12, because there are 12 levels). Should I scale the dataset before running k-prototypes? How could I determine the optimal "k" (in code)? I thought of executing: library(clustMixType); lbd <- lambdaest(BPor); kpres <- kproto(BPor, 5, lambda = lbd) # change '5' for every possible value of k; print(kpres) …
I have a dataset (42000, 10) which contains 7 categorical features and 3 numerical ones. I would like to separate the numerical and categorical features into 2 different data frames, i.e. one containing only numerical data (42000, 3) and the other only categorical data (42000, 7), perform some pre-processing on both of them, and lastly concatenate them back into one data frame. So, my question is how do I separate my initial dataframe into …
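A minimal sketch with `select_dtypes`, assuming the categorical columns are stored as object/category dtypes and the rest as numbers (the toy frame below stands in for the real one):

```python
# Hedged sketch: split a frame by dtype, process, then recombine.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 41],
    "salary": [50.0, 64.5, 70.2],
    "dept": ["hr", "it", "it"],
})

num_df = df.select_dtypes(include="number")
cat_df = df.select_dtypes(exclude="number")

# ...pre-process each frame here...

combined = pd.concat([num_df, cat_df], axis=1)
print(num_df.shape, cat_df.shape, combined.shape)  # (3, 2) (3, 1) (3, 3)
```

`axis=1` concatenation aligns on the index, so avoid resetting the index of only one of the two frames in between.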
I am trying to convert continuous data points to categorical ones by binning. I know two techniques: (i) equal-width bins, (ii) bins with an equal number of elements. My question is: which type of binning is appropriate for which kind of problem? I use pandas for my data analysis tasks and it has the pd.cut method for arbitrary binning, which I use for equal-width bins, and the pd.qcut method for bins with an equal number of elements. The second function always produces …
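The difference between the two is easiest to see on skewed data (the series below is made up): with an outlier, equal-width bins from `pd.cut` pile almost everything into one bin, while `pd.qcut`'s quantile bins keep the counts even.

```python
# Hedged sketch: pd.cut (equal width) vs pd.qcut (equal frequency).
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

width_bins = pd.cut(s, bins=3)  # widths of ~33 each; dominated by the outlier
freq_bins = pd.qcut(s, q=3)     # three elements per bin

print(width_bins.value_counts(sort=False).tolist())  # [8, 0, 1]
print(freq_bins.value_counts(sort=False).tolist())   # [3, 3, 3]
```

Roughly: equal-width bins preserve the scale (good when bin boundaries have real-world meaning), equal-frequency bins preserve rank information and are robust to skew.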
I have a linear numerical array source and I want to find/match a test array as a pattern: source = [39,36,23,21,28,36,30,22,34,37] test = [36,23,21,28] We can use brute force or a similar method to find the exact match, checking the test array from index 0 to len(source)-len(test), but in our problem we can accept this pattern too (order is important): test = [36,24,21,28] // changed 23 to 24 Since we have many different ways of solving this problem (maybe …
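A brute-force sliding window is easy to relax into a fuzzy match: accept a window when every element differs from the pattern by at most some tolerance (the tolerance of 1 below is an assumption; a total-distance budget over the window is a common alternative).

```python
# Hedged sketch: sliding-window match with a per-element tolerance.
source = [39, 36, 23, 21, 28, 36, 30, 22, 34, 37]
test = [36, 24, 21, 28]  # 23 changed to 24

def fuzzy_find(source, pattern, tol=1):
    n, m = len(source), len(pattern)
    matches = []
    for i in range(n - m + 1):
        window = source[i:i + m]
        if all(abs(a - b) <= tol for a, b in zip(window, pattern)):
            matches.append(i)
    return matches

print(fuzzy_find(source, test))  # [1]
```

With tol=0 this degenerates to the exact brute-force search described above.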
Nothing to do with number recognition in the classical 'hand-written' sense (disclaimer to avoid this being counted as a duplicate). I have a selection of 96 serial numbers and a separate selection of >220 serial numbers. The smaller set typically resides within the larger set (though not always), which also contains ~120 incorrect numbers. See below for an example; for the record, I have matched things up as best I can... the correct number is first, the …
Basically, I was looking for a normalization function in sklearn, which is useful later for logistic regression. Since I have negative values, I chose MinMaxScaler with feature_range=(0, 1) as a parameter: x = MinMaxScaler(feature_range=(0, 1)).fit_transform(x) Then, using the sm.Logit trainer, I got an error: import statsmodels.api as sm logit_model=sm.Logit(train_data_numeric_final,target) result=logit_model.fit() print(result.summary()) ValueError: endog must be in the unit interval. I presume my values are out of the (0,1) range, which is the case: np.unique(np.less_equal(train_data_numeric_final.values, 1)) array([False, True]) How come? then how …
I was reading through a notebook tutorial working with the Titanic dataset, linked here, and noticed that it strongly favored ordinal over continuous data. For example, it converted both the Age and Fare features into ordinal data bins. I understand that categorizing data like this is helpful when doing data analytics manually, as fewer categories make the data easier to understand from a human perspective. But intuitively, I would think that doing this causes our data to lose precision, …
Is it better to encode features like month and hour as factors or numerics in a machine learning model? On the one hand, I feel numeric encoding might be reasonable, because time is a forward-progressing process (the fifth month is followed by the sixth month); but on the other hand, I think categorical encoding might be more reasonable because of the cyclic nature of years and days (the 12th month is followed by the first one). Is there …
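A third option that directly addresses the cyclic concern is a sin/cos encoding: map the month (or hour) onto a circle so that December and January end up adjacent. A minimal sketch:

```python
# Hedged sketch: cyclic encoding of a periodic feature.
import math

def cyclic_encode(value, period):
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

dec = cyclic_encode(12 % 12, 12)  # month 12 wraps to 0
jan = cyclic_encode(1, 12)

# December and January are close in the encoded space:
dist = math.dist(dec, jan)
print(round(dist, 3))  # ~0.518, vs. a naive numeric gap of 11
```

This preserves both ordering (within the cycle) and adjacency across the wrap-around, at the cost of two columns per feature.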
I need to do a cluster analysis with the following variables: a trick-question answer (good/wrong), a count variable (range 0-9), time in minutes, and another count variable. Number of observations: 3300. Since I am new to cluster algorithms, I'm struggling to choose the best one. I have read about the following methods: k-prototypes, k-means with Gower's distance, and the PAM algorithm. For the cluster analysis I need to use R. Can someone advise which method suits the data best? Since …
Conceptually, I understand how a numerical method like Monte Carlo is used to solve a definite integral: because the integral of a function is the area bounded by the curve, the ratio of random points that land under the curve to the total number of points estimates the value of the integral. Can someone explain, for a non-math person, how we can conceptually solve a PDE/ODE using Monte Carlo?
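For reference, the hit-or-miss integration described above fits in a few lines; the sketch below estimates the integral of x² on [0, 1] (true value 1/3), assuming nothing beyond the standard library:

```python
# Hedged sketch: hit-or-miss Monte Carlo integration of f(x) = x^2 on [0, 1].
import random

random.seed(42)
N = 100_000

hits = 0
for _ in range(N):
    x, y = random.random(), random.random()
    if y <= x * x:  # point lands under the curve
        hits += 1

# The enclosing unit square has area 1, so the hit fraction is the integral.
estimate = hits / N
print(estimate)  # close to 1/3
```

The PDE/ODE case builds on the same idea of averaging over random samples, but the randomness is over paths (e.g. random walks) rather than points under a curve.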
I want to replace the values in a data set (sample in the picture) with numbers instead of words, e.g., 1 instead of D, -1 instead of R, and 0 for all other values. How can I do it with a loop? I know it can be done like this instead (suppose d is the data frame name): d[d$Response == "R",]$Response = -1 d[d$Response == "D",]$Response = 1 ... (code every other value and assign) = 0
I am trying to develop an algorithm with sklearn and TensorFlow to predict which car can be offered to each customer. To do that, I have a database with the answers of 1,000 customers to a survey. Examples of questions/[answers] are: Color/[Green,Red,Blue] NumberOfPax/[2,4,5,6,7] HorsePower/[Integer] InsuranceIncluded/[yes/no/Don't know] As you can see, all answers are pre-coded, and where the answer could be open I validate the value to be an integer or a radio button. The purpose of …
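Before either sklearn or TensorFlow can consume answers like these, the categorical ones need a numeric encoding. A minimal sketch with `pd.get_dummies`, using the question's column names and made-up rows:

```python
# Hedged sketch: one-hot encode the categorical survey answers.
import pandas as pd

survey = pd.DataFrame({
    "Color": ["Green", "Red", "Blue"],
    "NumberOfPax": [2, 4, 7],
    "HorsePower": [90, 120, 150],
    "InsuranceIncluded": ["yes", "no", "Don't know"],
})

encoded = pd.get_dummies(survey, columns=["Color", "InsuranceIncluded"])
print(encoded.shape)  # (3, 8): 2 numeric columns + 3 + 3 dummy columns
```

For a pipeline that must also encode unseen data at prediction time, sklearn's `OneHotEncoder` (which remembers the categories it was fitted on) is the more robust choice.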
For the sake of learning the finer details of a deep learning neural network, I have coded my own library with everything (optimizer, layers, activations, cost function) homemade. It seems to work fine when benchmarking it on the MNIST dataset using only sigmoid activation functions. Unfortunately, I seem to get issues when replacing these with ReLUs. This is what my learning curve looks like for 50 epochs on a training dataset of ~500 examples: Everything is fine for the …
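Without seeing the library, one common cause when swapping sigmoid for ReLU is initialization/learning-rate scale: a unit whose pre-activation goes negative for every example outputs 0 and receives zero gradient ("dying ReLU"). He initialization is the usual remedy; a minimal numpy sketch (layer sizes are assumptions):

```python
# Hedged sketch: He initialization and the ReLU gradient's dead zone.
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # Variance 2/fan_in keeps activation variance stable through ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 0 wherever z <= 0: a unit stuck there never recovers.
    return (z > 0).astype(z.dtype)

W = he_init(784, 128)
x = rng.normal(size=(32, 784))
a = relu(x @ W)
print(a.shape, float((a == 0).mean()))  # roughly half the activations are zero
```

If the curve collapses after a few good epochs, also try lowering the learning rate, since ReLU nets tolerate smaller steps than sigmoid nets at the same scale.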