Effect of Multicollinear Predictors on a Model

I know that multicollinear predictors in a model aren't ideal because they make the model sensitive to very minor changes, which then reduces our ability to interpret the effect of each predictor from its coefficient. However, I don't understand why the model becomes sensitive and how the estimated coefficients can vary wildly from just a very minor change in the dataset. Also, do multicollinear predictors affect the accuracy / error of a prediction? Or does it purely affect …
Category: Data Science
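
As a rough illustration of the sensitivity described in the question above, here is a minimal sketch in plain Python. The helper `fit_two_predictors` and the toy data are assumptions made for this example, not part of the original question. It fits a two-predictor least-squares model without an intercept, then shifts each target value by 0.05 along the direction in which the two predictors differ.

```python
def fit_two_predictors(x1, x2, y):
    """Least squares for y ~ b1*x1 + b2*x2 (no intercept, illustrative helper).

    Solves the 2x2 normal equations with Cramer's rule.
    """
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    # The determinant is close to zero when x1 and x2 are nearly collinear,
    # so small changes in t1 and t2 are amplified when dividing by it.
    det = s11 * s22 - s12 * s12
    b1 = (t1 * s22 - t2 * s12) / det
    b2 = (s11 * t2 - s12 * t1) / det
    return b1, b2


def predict(coefs, x1, x2):
    b1, b2 = coefs
    return [b1 * a + b2 * b for a, b in zip(x1, x2)]


# Toy data (made up): x2 is x1 plus a tiny wiggle, so the two are almost collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
wiggle = [0.01, -0.01, 0.01, -0.01, 0.01]
x2 = [a + w for a, w in zip(x1, wiggle)]

y_clean = [a + b for a, b in zip(x1, x2)]               # true coefficients are (1, 1)
y_noisy = [c + 5 * w for c, w in zip(y_clean, wiggle)]  # each target moves by 0.05

coef_clean = fit_two_predictors(x1, x2, y_clean)
coef_noisy = fit_two_predictors(x1, x2, y_noisy)

pred_gap = max(
    abs(p - q)
    for p, q in zip(predict(coef_clean, x1, x2), predict(coef_noisy, x1, x2))
)

print("clean coefficients:", coef_clean)   # approximately (1, 1)
print("noisy coefficients:", coef_noisy)   # approximately (-4, 6)
print("largest change in fitted values:", pred_gap)  # approximately 0.05
```

In this toy case the coefficients swing from about (1, 1) to about (-4, 6) while the fitted values move by only about 0.05. That suggests the instability mainly hurts interpretation of individual coefficients, with predictions on similar data changing far less. Real datasets will not behave this cleanly.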

Inspecting misclassified data points

Recently, I was able to train a simple classification algorithm (my first ML project) and I even got a pretty satisfying precision score. Now I am looking for a way to inspect which data points in my train_data have been misclassified. My basic idea was something like: If y_train != y_pred Then: (get indices of y_train) (look up the data in my csv and try to find a pattern) My main problem is that the train_test_split function provides me with a …
Category: Data Science
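
The excerpt above is cut off, so the following is only a sketch of the "compare and look up" idea under one assumption: that the original CSV row numbers of the training samples are still available after the split. With scikit-learn this is typically the case if the data is split as a pandas DataFrame (the index is preserved) or if an array of row numbers is passed to `train_test_split` as an extra array. The function name and the labels below are made up for illustration.

```python
def misclassified_rows(row_ids, y_true, y_pred):
    """Return (row_id, true_label, predicted_label) for every wrong prediction."""
    return [
        (rid, t, p)
        for rid, t, p in zip(row_ids, y_true, y_pred)
        if t != p
    ]


# Hypothetical data: CSV row numbers that ended up in the training split,
# with their true labels and the model's predictions on the same rows.
train_row_ids = [7, 2, 9, 4, 0]
y_train = ["cat", "dog", "dog", "cat", "dog"]
y_pred = ["cat", "cat", "dog", "dog", "dog"]

wrong = misclassified_rows(train_row_ids, y_train, y_pred)
print(wrong)  # [(2, 'dog', 'cat'), (4, 'cat', 'dog')]
```

The returned row numbers can then be used to look up the corresponding lines in the CSV file.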

Beginner needs guidance. Machine Learning, preparing training data

I am trying to dip my toes into the field of computer vision and want to avoid mistakes along the way. The problem I have to solve: Classify images of 3D dental scans. For example: I wrote a script to create images of these files in Blender so I have full control over the image dimensions, quality, resolution etc. Now to my questions: What's the best way to prepare a training dataset if you have full control over the process? Higher …
Category: Data Science

Training, Validation, and Testing Data in Supervised Learning

I've come up with some simple definitions for training, testing and validation data in supervised learning. Can anyone verify/improve upon my answers? Training Data - Used by the model to learn parameters and 'fit' to the data (usually involves multiple models fit at once) Validation Data - Used by the model to either a) determine the best hyperparameter(s) for a given model or b) determine the best performing model out of a given selection or c) determine the best hyperparameters …
Category: Data Science
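
To make the three roles in the question above concrete, here is a minimal sketch of a three-way split over row indices using only the standard library. The function name, the 60/20/20 proportions and the fixed seed are assumptions chosen for illustration.

```python
import random


def three_way_split(n_rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle row indices and cut them into train / validation / test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(n_rows * test_fraction)
    n_val = int(n_rows * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(10)
print(len(train_idx), len(val_idx), len(test_idx))  # 6 2 2
```

Parameters are fitted on the training indices, hyperparameters and model choice are compared on the validation indices, and the test indices are held back for a final performance estimate.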

Various models giving 99% accuracy for KDDcup 99 dataset for Intrusion Detection, is there some sort of data leak I am missing?

Student who is quite new to all this here. I am currently working with the KDDcup 99 data for intrusion detection using various ML models (and ANN). My problem is that I often get 99% accuracy. At the moment I am focusing mostly on binary classification (normal vs attack). I have identified problems in my data preprocessing methods and after fixing them I am more confident in the validity of my input data but I am still getting …
Category: Data Science

How to apply class weight to a multi-output model?

I have a model with 2 categorical outputs. The first output layer can predict 2 classes: [0, 1] and the second output layer can predict 3 classes: [0, 1, 2]. How can I apply different class weight dictionaries for each of the outputs? For example, how could I apply the dictionary {0: 1, 1: 10} to the first output, and {0: 5, 1: 1, 2: 10} to the second output? I've tried to use the following class weights dictionary weight_class={'output1': …
Category: Data Science

Is NLP suitable for my legal contract parsing problem?

My company has a product that involves the extraction of a variety of fields from legal contract PDFs. The current approach is very time consuming and messy, and I am exploring if NLP is a suitable alternative. The PDFs that need to be parsed usually follow one of a number of "templates". Within a template, almost all of the documents are the same, except for 20 or so specific fields we are trying to extract. That being said, there are …
Category: Data Science

How do I fine-tune model performance after the initial run? (Scikit-Learn)

I've just started learning regression using scikit-learn and stumbled upon a problem. For a given dataset, let's say that I've imputed the missing data and one-hot encoded all categorical features. This point is where it starts getting confusing for me. After one-hot encoding categorical features, I usually end up with a lot of columns. How do I know that all of these columns benefit the model's performance? If not, how can I determine which columns/features to keep? Is there a method …
Category: Data Science

How do CNNs use a model and find the object(s) desired?

Background: I'm studying CNNs outside of my undergraduate CS course on ML. I have a few questions related to CNNs. 1) When training a CNN, we desire tightly bounded/cropped images of the desired classes, correct? I.e. if we were trying to recognize dogs, we would use thousands of images of tightly cropped dogs. We would also feed images of non-dogs, correct? These images are scaled to a specific size, i.e. 255x255. 2) Let's say training is complete. Our model's accuracy …
Category: Data Science

For a student who is a beginner in quantitative research and statistics, which is the better statistical tool to start with: R or IBM SPSS? Why?

Currently, I am writing my research design. However, I am still undecided about which statistical tool I should use for the data analysis. I tried looking it up on the internet and there are disparate answers to my question. I have noticed that R (Programming Language) and IBM Statistical Package for the Social Sciences are two of the recurring tools that are mentioned when it comes to this question. So, which is better? I need some insights so I can settle …
Category: Data Science

Group related items by their description and tag each group. [Pen, Eraser] : Stationery

So I have a list of data similar to the table below. It will be captured by a chatbot so I expect natural language but in the form of a structured command: Add {Qty} {item description} to {location} ID Owner Item Description Location Qty Image 1 Somenick Green apple fridge 1 1.jpg 2 Somenick Jewelry toy box bedroom 2 2.jpg 3 Somenick 12kg rubber quoted grey kettlebell bedroom 1 3.jpg 4 Astrod 60cm never used helmet closet 1 4.jpg 5 Atrod …
Category: Data Science

Does bias always decrease when complexity increases?

(I'm just starting to learn about ML stuff, so please don't be rude if the following question is too stupid or totally wrong) I'm reading about the bias-variance tradeoff and I don't understand the (probably) most important part: why is it a tradeoff? I totally get that the generalization error can be decomposed into 3 parts, an irreducible error due to the noise in our data, a Bias term and a Variance term. In some cases I have a model with …
Category: Data Science
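
For reference, the three-part decomposition mentioned in the question above is usually written as follows for squared error. This assumes data generated as y = f(x) + ε with zero-mean noise of variance σ², and expectations taken over training sets and noise. The symbols are the conventional ones and are not taken from the original post.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{irreducible noise}}
  + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
```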

k-means clustering for recommendation

I have a file with 50000 rows from a library platform. Each row represents a user and shows the order in which the user has selected books. The books could be from various categories (e.g. roman, history, etc.). There are a total of 10 categories. The categories that a user has selected could be for example: 334664. This means this user has selected a book from categories 3, 4 and 6. How can I use this data to build a recommendation …
Category: Data Science
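
One possible first step for the data described above is to turn each selection string into a fixed-length vector that a clustering algorithm such as k-means can consume. The sketch below is only illustrative: the function name is made up, and it assumes each category is encoded as a single digit '0' to '9', which the excerpt does not confirm. It also discards the order of selections.

```python
from collections import Counter

N_CATEGORIES = 10  # stated in the question; the single-digit encoding is an assumption


def category_profile(selection):
    """Map a string such as "334664" to the share of selections per category."""
    if not selection:
        return [0.0] * N_CATEGORIES
    counts = Counter(selection)
    total = len(selection)
    return [counts.get(str(c), 0) / total for c in range(N_CATEGORIES)]


profile = category_profile("334664")
print(profile)  # categories 3, 4 and 6 each get one third, all others 0.0
```

Users with similar profiles would then land close together, and books popular among a user's nearest neighbours or cluster could serve as candidate recommendations.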

I can't figure out how to improve accuracy for tweet sentiment

I'm making a first attempt at tweet sentiment analysis (positive, neutral, negative). So far I have cleaned the data and used a BoW to get some feeling of the data (>2.5k tweets). I also made bigrams to try to get clearer sentiment insight. The data is severely skewed so I tried both upsampling and downsampling to view the difference. I finally passed it all through a Random Forest Classifier and I get an accuracy of 0.7 for the upsampled data …
Category: Data Science

Model with 2 datasets: combine time series data and statistics

I am new to data science modelling, so apologies in advance if I use the wrong terminology. I have a standard time series dataset of historical prices which is used to train/test a simple Random Forest classifier model which predicts the returns direction (+/-). I also have a few general statistics for 'day of the week direction', e.g. frequency counts: Monday UP=120, Monday DOWN=90, Tuesday UP=67, Tuesday DOWN=50, Friday UP=55, Friday DOWN=181. How can I combine the results from the time series …
Category: Data Science
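
One simple way to bring the day-of-week counts from the question above into the same model is to convert them into an "up rate" per weekday and append that rate as an extra feature to each row. The sketch below uses the counts quoted in the excerpt. The function names and the 0.5 fallback for weekdays without counts are assumptions. If the counts were computed over a period that overlaps the test data, this feature would leak information.

```python
# (UP, DOWN) counts per weekday, as quoted in the question.
day_counts = {
    "Monday": (120, 90),
    "Tuesday": (67, 50),
    "Friday": (55, 181),
}


def up_rate(up, down):
    """Fraction of observed days on which the direction was UP."""
    return up / (up + down)


day_up_rate = {day: up_rate(up, down) for day, (up, down) in day_counts.items()}


def add_day_feature(features, weekday, default=0.5):
    """Return a copy of the feature row with the weekday's up rate appended.

    The default of 0.5 for unknown weekdays is an arbitrary illustrative choice.
    """
    return list(features) + [day_up_rate.get(weekday, default)]


row = add_day_feature([0.012, -0.004], "Friday")
print(round(day_up_rate["Monday"], 4))  # 0.5714
print(round(row[-1], 4))                # 0.2331
```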

Tableau: Trying to determine the category of one table based on the dynamic aggregate of another table for a Tableau Dashboard

I have one table that contains unique rows for all my quote requests: Quote_ID 1234 1235 1236 1237 1238. In a second table that I've joined (1-0*) with a relationship, I have referrals. These referrals represent reasons why the quote should be referred to an expert, but could represent any other attribute for the sake of the problem. Every referral has a key to the Quote_ID, a unique Referral_ID and a name: Quote_ID Referral_ID Referral_name 12345 1 too many X …
Category: Data Science

Activity in fermenter has increased suddenly after 2 weeks, why?

I'm a total beginner and really appreciative of any advice here. I'm making a 1 gallon batch of IPA from a kit. I brewed 16 days ago, and in the first 48 hours of the wort being in the fermenter it was bubbling away like crazy. It then settled down and I could see the liquid become less opaque and darker, as the yeast cake formed at the bottom. Today I went to look again as I had planned to …
Category: Mac

Caramel Coffee Mead

I'm looking into brewing a mead in the near future. I have exactly 0 experience brewing anything. I'm going to buy a store-bought brewing kit in the next few weeks if everything lines up. I'm wondering if anyone's ever made a Caramel Coffee Mead, and if so do you have a recipe, or any tips on how to make it work. I want at least an 8% ABV, but no more than 16%. I'm looking for it to be a …
Category: Mac

Data science without knowledge of a specific topic, is it worth pursuing as a career?

I had a conversation with someone recently and mentioned my interest in data analysis and how I intended to learn the necessary skills and tools. They suggested to me that while it is great to learn the tools and build the skills, there is little point in doing so unless I have specialized knowledge in a specific field. They basically summed it up by saying that I'd just be like a builder with a pile of tools who could build a few …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.