Dealing with little available data: transfer learning

Suppose I want to predict a numerical value, but the data set that contains the correct labels is very small. However, I also have a large data set with a label that is correlated with the one I want to predict. I read that transfer learning can make use of this larger data set when predicting the desired label from the smaller one. Could someone elaborate on this?
Category: Data Science

Over-sampling when predicting a continuous variable

Let's say I am predicting house selling prices (continuous) and therefore have multiple independent variables (numerical and categorical). Is it common practice to balance the dataset when the categorical independent variables are imbalanced? The ratio is no higher than 1:100. Or do I only balance the data when the dependent variable is imbalanced? Thanks
Category: Data Science

Separating numerical and categorical features in a binary classification problem

I have a dataset of employee data with around 9500 rows, and I have to predict whether the target is 0 or 1. Some of my features are the department of an employee, gender, salary, review_score (numerical), average_number_of_hours per month, bonus (1 or 0), the number of projects an employee is involved in, and tenure. My question is whether number of projects (3,4,5,6) and tenure (2,3,4,5,6,7,8,9,10,11,12) should be treated as 'categories' rather than numerical values. I can make them ordinal. However, I am not …
Category: Data Science

With a 5000x20 CSV as data input discover the most common occurrences of numbers in a row

As input I have a CSV with 5000 lines (and growing) and 20 fixed columns, each containing a number from 1-80. A row may look like this. Is it possible using Orange3 to analyze each row and find out which pairs, triples, quads, quints, etc. occur most often on a row? The output I am looking to get is "these 2 numbers occur the most often on a row", "These 3 numbers occur the most often on a row", "These …
Category: Data Science
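
Outside Orange3, the counting described in this question can be sketched in plain Python. This is only an illustration under assumptions: the function name `most_common_combos` and the three short sample rows are made up, and each row is treated as a set of numbers, so a combination counts once per row regardless of column order.

```python
from collections import Counter
from itertools import combinations


def most_common_combos(rows, size, top=3):
    """Count how often each combination of `size` numbers shares a row."""
    counts = Counter()
    for row in rows:
        # sorted(set(row)) gives every combination one canonical ordering
        # and counts it at most once per row.
        counts.update(combinations(sorted(set(row)), size))
    return counts.most_common(top)


# Hypothetical sample rows (the real file has 20 columns per row).
rows = [
    [1, 5, 9, 12],
    [5, 9, 30, 44],
    [1, 5, 9, 80],
]

print(most_common_combos(rows, 2, top=1))  # [((5, 9), 3)]
print(most_common_combos(rows, 3, top=1))  # [((1, 5, 9), 2)]
```

With 20 numbers per row, a single row already yields 15504 five-number combinations, so the counter grows quickly as the combination size and the number of rows increase.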

Separate discrete and continuous variables

I know how to separate numerical and categorical data as follows: num_data = [cname for cname in df.columns if df[cname].dtypes in ['int64', 'float64']] cat_data = [cname for cname in df.columns if df[cname].dtypes == 'object'] Now I want to separate my numerical variables into discrete and continuous. How do I do that?
Category: Data Science
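
pandas has no dtype that marks a column as discrete, so any split is a heuristic. A minimal sketch, assuming that "discrete" means integer-valued with few distinct values; the helper name `split_discrete_continuous`, the `max_unique` threshold of 20, and the sample frame are all assumptions made for illustration:

```python
import pandas as pd


def split_discrete_continuous(df, max_unique=20):
    """Heuristically split numeric columns into discrete and continuous."""
    numeric = df.select_dtypes(include="number")
    discrete, continuous = [], []
    for name in numeric.columns:
        col = numeric[name].dropna()
        # Integer-valued with few distinct values -> treat as discrete.
        is_integer_valued = (col % 1 == 0).all()
        if is_integer_valued and col.nunique() <= max_unique:
            discrete.append(name)
        else:
            continuous.append(name)
    return discrete, continuous


# Hypothetical sample data.
df = pd.DataFrame({
    "rooms": [2, 3, 3, 4],
    "price": [101.5, 250.0, 199.9, 320.25],
    "city": ["A", "B", "A", "C"],
})

discrete, continuous = split_discrete_continuous(df)
print(discrete)    # ['rooms']
print(continuous)  # ['price']
```

The threshold is a judgment call: an integer ID column with thousands of distinct values would land in the continuous list under this rule, so the result is worth checking by hand.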

Transforming Categorical to Numerical variable

I have a categorical variable with 4 levels ('8 c', '6 c', 'NAN', 'Others') and I want to convert it to numerical form. An obvious way is to simply remove the 'c' part from the first two categories and replace NAN with 0. However, I was wondering about the 'Others' level. What would be the best way to transform this level? Please note that the variable represents the number of cylinders for a given car.
Category: Data Science

How to choose the optimal k in k-prototypes?

I have a banking dataset with both numerical and categorical values, and I transformed it for analysis with k-prototypes. The original dataset: The modified dataset: E.g.: Job (coded 1 to 12 because there are 12 levels) Should I scale the dataset before running k-prototypes? How could I determine the optimal "k" in code? I thought of executing: library(clustMixType) lbd <- lambdaest(BPor) kpres <- kproto(BPor, 5, lambda = lbd) #Change '5' for every possible value of k. print(kpres) …
Category: Data Science

Separate numerical and categorical variables

I have a dataset (42000, 10) which contains 7 categorical features and 3 numerical ones. I would like to separate the numerical and categorical features into 2 different data frames, i.e. one containing only numerical data (42000, 3) and the other only categorical data (42000, 7), perform some pre-processing on both of them, and lastly concatenate them into one data frame. So, my question is how do I separate my initial dataframe into …
Category: Data Science
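
One way to do the split, pre-process, and recombine round trip is pandas' `select_dtypes`. A minimal sketch with a small made-up frame; the column names and the two pre-processing steps (standardizing and casting to `category`) are placeholders, not something taken from the question:

```python
import pandas as pd

# Hypothetical sample data standing in for the (42000, 10) dataset.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [30000.0, 52000.0, 81000.0],
    "job": ["admin", "tech", "admin"],
    "married": ["yes", "no", "yes"],
})

# Split by dtype: numeric columns in one frame, everything else in the other.
num_df = df.select_dtypes(include="number")
cat_df = df.select_dtypes(exclude="number")

# Placeholder pre-processing: standardize numbers, cast text to category.
num_df = (num_df - num_df.mean()) / num_df.std()
cat_df = cat_df.astype("category")

# Both frames keep the original index, so a column-wise concat realigns them.
combined = pd.concat([num_df, cat_df], axis=1)
print(combined.shape)  # (3, 4)
```

Because the row index is preserved in both halves, the concatenation lines rows up correctly as long as neither pre-processing step drops or reorders rows.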

Problem with binning

I am trying to convert continuous data points to categorical ones by binning. I know two techniques: i) equal-width bins, and ii) bins with an equal number of elements. My questions are: Which type of binning is appropriate for which kind of problem? I use pandas for my data analysis task; it has the pd.cut method for arbitrary binning, which I use for equal-width bins, and the pd.qcut method for bins with an equal number of elements. The second function always produces …
Category: Data Science
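
The difference between the two techniques shows up most clearly on skewed data. A small sketch with a made-up series containing one outlier; the values and bin counts are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical data: nine small values and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width bins: the outlier stretches the range, so almost
# everything falls into the first bin.
equal_width = pd.cut(values, bins=3)

# Equal-frequency bins: edges are placed at quantiles, so each bin
# receives about the same number of points.
equal_count = pd.qcut(values, q=2)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

Here nine of the ten values end up in the first equal-width bin, while the two quantile bins hold five values each.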

partial numerical array - pattern matching

I have a linear numerical array, source, and I want to find where a test array matches it as a pattern: source = [39,36,23,21,28,36,30,22,34,37] test = [36,23,21,28] We can use brute force or a similar method to find the exact match by checking the test array from index 0 to len(source)-len(test), but in our problem we can accept this pattern too (order is important): test = [36,24,21,28] // changed 23 to 24 Since we have many different ways of solving this problem ( maybe …
Category: Data Science
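
The brute-force scan mentioned in the question extends to near matches by allowing a small difference per element. A minimal sketch, assuming that "acceptable" means every element differs by at most a fixed tolerance; the function name `find_approximate_matches` and the default tolerance of 1 are assumptions:

```python
def find_approximate_matches(source, pattern, tolerance=1):
    """Return start indices where pattern matches source within tolerance."""
    hits = []
    for start in range(len(source) - len(pattern) + 1):
        window = source[start:start + len(pattern)]
        # Order matters: compare position by position.
        if all(abs(s - p) <= tolerance for s, p in zip(window, pattern)):
            hits.append(start)
    return hits


source = [39, 36, 23, 21, 28, 36, 30, 22, 34, 37]

print(find_approximate_matches(source, [36, 23, 21, 28], tolerance=0))  # [1]
print(find_approximate_matches(source, [36, 24, 21, 28]))               # [1]
```

Other notions of closeness (total absolute difference, allowed number of mismatches) would only change the condition inside the loop.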

5 digit number mis-reads analysis

Nothing to do with number recognition in the classical 'hand-written' sense. (Disclaimer above to avoid this being counted as a repeat.) I have a selection of 96 serial numbers, and a separate selection of >220 serial numbers. Within the larger set typically resides the smaller set (not always though), but also ~ 120 incorrect numbers. See below for an example - for the record I have matched things up as best as I can... the correct number is first, the …
Category: Data Science

MinMaxScaler returned values greater than one

Basically, I was looking for a normalization function in sklearn to use later for logistic regression. Since I have negative values, I chose MinMaxScaler with feature_range=(0, 1) as a parameter: x = MinMaxScaler(feature_range=(0, 1)).fit_transform(x) Then, using sm.Logit, I got an error: import statsmodels.api as sm logit_model=sm.Logit(train_data_numeric_final,target) result=logit_model.fit() print(result.summary()) ValueError: endog must be in the unit interval. I presume my values are out of the (0,1) range, which is the case: np.unique(np.less_equal(train_data_numeric_final.values, 1)) array([False, True]) How come? then how …
Category: Data Science

Purpose of converting continuous data to categorical data

I was reading through a notebook tutorial working with the Titanic dataset, linked here, and noticed that they highly favored ordinal data over continuous data. For example, they converted both the Age and Fare features into ordinal data bins. I understand that categorizing data like this is helpful when doing data analytics manually, as fewer categories make data easier to understand from a human perspective. But intuitively, I would think that doing this would cause our data to lose precision, …
Category: Data Science

Encoding features like month and hour as categorical or numeric?

Is it better to encode features like month and hour as factor or numeric in a machine learning model? On the one hand, I feel numeric encoding might be reasonable, because time is a forward-progressing process (the fifth month is followed by the sixth month), but on the other hand I think categorical encoding might be more reasonable because of the cyclic nature of years and days (the 12th month is followed by the first one). Is there …
Category: Data Science
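
A third option that is often suggested for cyclic features is to map them onto a circle with sine and cosine, which keeps both the ordering and the wrap-around. A minimal sketch; the helper name `encode_cyclic` is made up, and whether this beats factor or plain numeric encoding depends on the model:

```python
import math


def encode_cyclic(value, period):
    """Map a cyclic value (e.g. month 1-12) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


december = encode_cyclic(12, 12)
january = encode_cyclic(1, 12)
june = encode_cyclic(6, 12)

# December ends up next to January and far from June,
# unlike with the raw numbers 12, 1 and 6.
print(math.dist(december, january))  # roughly 0.52
print(math.dist(december, june))     # 2.0 up to floating-point error
```

The same function works for hours with `period=24`.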

Cluster method with binary variable

I need to do a cluster analysis for the following variables: Trickquestion answer: Good/Wrong count variable : range 0-9 time in minutes count variable Number of observations: 3300 Since I am new to cluster algorithms, I'm struggling with choosing the best one. I have read about the following methods: k-prototypes, k-means with Gower's distance, and the PAM algorithm. For the cluster analysis I need to use R. Can someone give advice about which method suits the data best? Since …
Category: Data Science

What is the intuition behind using Monte Carlo to solve a differential equation

Conceptually, I understand how a numerical method like Monte Carlo is used to solve a definite integral. Because the integral of a function is the area bounded by the curve, the ratio of random points that land inside the curve to the total number of points, multiplied by the area of the region being sampled, estimates the value of the integral. Conceptually, can someone explain to a non-math person how we can solve a PDE/ODE using Monte Carlo?
Category: Data Science
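
The hit-or-miss idea for a definite integral described in the question can be sketched as follows. The integrand x^2 on [0, 1], the unit-square sampling box, and the function name are choices made purely for illustration; the PDE/ODE case the question asks about is not covered here.

```python
import random


def hit_or_miss_integral(f, a, b, height, n=100_000, seed=0):
    """Estimate the integral of f over [a, b], assuming 0 <= f(x) <= height."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.uniform(a, b)
        y = rng.uniform(0, height)
        if y <= f(x):  # the point landed under the curve
            hits += 1
    box_area = (b - a) * height
    # Fraction of hits times the area of the sampling box.
    return box_area * hits / n


estimate = hit_or_miss_integral(lambda x: x * x, 0.0, 1.0, height=1.0)
print(estimate)  # close to the exact value 1/3
```

The ratio of hits alone equals the integral only when the sampling box has area 1, as it does in this example.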

Replacing words by numbers in multiple columns of a data frame in R

I want to replace the values in a data set (sample in the picture) using numbers instead of words, e.g., 1 instead of D, -1 instead of R, 0 for all other values. How can I do it with a loop? I know it can be done doing this instead: (suppose d is index name) d[d$Response == "R",]$Response = -1 d[d$Response == "D",]$Response = 1 ... (other values code it and assign value of) = 0
Category: Data Science

Convert nominal to numeric variables?

I am trying to develop an algorithm with sklearn and Tensorflow to predict which car can be offered to each customer. To do that I have a database with the answers of 1000 customers to a survey. Examples of questions/[Answers] are: Color/[Green,Red,Blue] NumberOfPax/[2,4,5,6,7] HorsePower/[Integer] InsuranceIncluded[yes/no/Don't know] As you can see, the answers to all questions are predefined, and where the answer can be open I validate the value to be an integer or a radio button. The purpose of …
Category: Data Science

Homemade deep learning library: numerical issue with relu activation

For the sake of learning the finer details of a deep learning neural network, I have coded my own library with everything (optimizer, layers, activations, cost function) homemade. It seems to work fine when benchmarking it on the MNIST dataset using only sigmoid activation functions. Unfortunately, I seem to get issues when replacing these with ReLUs. This is what my learning curve looks like for 50 epochs on a training dataset of ~500 examples: Everything is fine for the …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.