So I plan on making a mobile app that lets students predict their final grade based on their mock exam results. I can train my model with previous years' results: X = the 5 mock results, Y = the final grade obtained. However, I have the issue that sometimes, or most of the time, the user may be using the app without having taken ALL the mock exams yet; they may want to see if they are on track and use it once …
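One common way around the missing-mocks problem is to train on a summary of whichever mocks are available rather than on a fixed-length vector. A minimal sketch under that assumption (the historical data and the single-feature linear model below are made up purely for illustration):

```python
# Hypothetical sketch: fit final_grade ≈ a * mean(mocks) + b on complete
# historical records, then predict from the mean of whichever mocks the
# student has actually taken so far. All numbers are invented examples.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, pure Python."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Historical data: 5 mock scores per student plus the final grade.
history = [([60, 62, 65, 63, 70], 68),
           ([40, 45, 50, 48, 55], 52),
           ([80, 78, 85, 90, 88], 90)]

xs = [sum(mocks) / len(mocks) for mocks, _ in history]
ys = [final for _, final in history]
a, b = fit_line(xs, ys)

def predict(mocks_taken):
    """Works with any non-empty subset of the 5 mocks."""
    return a * (sum(mocks_taken) / len(mocks_taken)) + b
```

An alternative is to train one model per number-of-mocks-completed, which keeps per-mock information at the cost of five models.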
In a data classification problem (with supervised learning), what should the ideal difference between training set accuracy and testing set accuracy be? What would be an acceptable range? Is a difference of 5% between the accuracies of the training and testing sets okay, or does it signify overfitting?
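There is no universal standard for this gap; it depends on dataset size, variance, and the metric. As a purely illustrative rule of thumb (the 5% threshold below is an assumption, not an official cutoff):

```python
# Illustrative only: compare train and test accuracy and treat a large gap
# as a *possible* sign of overfitting. The threshold is an assumption.

def overfit_gap(train_acc, test_acc, threshold=0.05):
    gap = round(train_acc - test_acc, 6)
    return gap, gap > threshold

gap_a, flag_a = overfit_gap(0.97, 0.92)   # 5-point gap: borderline
gap_b, flag_b = overfit_gap(0.99, 0.80)   # 19-point gap: likely overfitting
```

In practice, cross-validation variance across folds is a more reliable signal than any single train/test gap.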
I am going to build a machine learning algorithm to identify fake tweets. The dataset contains a huge number of retweets, which I think might be an issue. Given that the focus is on original tweets, do you think it is better to remove all the retweets? Thank you,
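If you do decide to drop them, a minimal filtering sketch, assuming retweets are marked either by a boolean field in your data or by the conventional "RT @" text prefix (the tweet records below are made-up examples):

```python
# Drop retweets before training, keeping only original tweets.
# Field names here are assumptions about the dataset's schema.

tweets = [
    {"text": "Breaking: something happened", "is_retweet": False},
    {"text": "RT @news: Breaking: something happened", "is_retweet": True},
    {"text": "My own take on the story", "is_retweet": False},
]

def is_retweet(t):
    return t.get("is_retweet") or t["text"].startswith("RT @")

originals = [t for t in tweets if not is_retweet(t)]
```

Note that retweet counts can still be useful as a feature (popularity signal) even if the retweet texts themselves are removed as duplicates.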
I have two databases with around 60,000 samples each. Both have the same features (same column names), which represent particular things as text or categories (turned into numbers). Each sample within a database is assumed to refer to a different particular thing. But some objects are represented in both databases, yet with somewhat different values in the same-name columns (like different free-text descriptions, or being classified under another category). The aim is to train a machine learning model …
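Before training, it may help to flag the cross-database duplicates explicitly (a record-linkage step). A minimal sketch using fuzzy matching on the free-text column, where the records and the 0.8 threshold are assumptions for illustration:

```python
import difflib

# Flag likely cross-database duplicates by fuzzy-matching the free-text
# description columns; pairs above the threshold are treated as the same
# underlying object. All records below are invented examples.

db1 = [{"id": 1, "desc": "red sports car, two doors"},
       {"id": 2, "desc": "wooden dining table"}]
db2 = [{"id": "a", "desc": "red sport car with two doors"},
       {"id": "b", "desc": "office chair, black"}]

def similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

matches = [(r1["id"], r2["id"])
           for r1 in db1 for r2 in db2
           if similarity(r1["desc"], r2["desc"]) > 0.8]
```

At 60,000 rows per side, an all-pairs comparison is expensive; blocking on a shared categorical column first keeps it tractable.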
I'm handling a very conventional supervised classification task with three (mutually exclusive, non-ordinal) target categories: class1 class2 class2 class1 class3 and so on. Actually, in the raw dataset the categories are already represented as integers rather than strings as in my example, but randomly assigned ones: 5 99 99 5 27 I'm wondering whether it is required/recommended to re-assign zero-based sequential integers to the classes as labels instead of the ones above, like this: 0 1 1 0 2 …
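The re-mapping itself is a one-liner either way; what matters is that it is consistent and invertible (the particular order 5→0, 27→1, 99→2 below is arbitrary). A minimal sketch of what scikit-learn's `LabelEncoder` does internally:

```python
# Re-map arbitrary integer class labels (5, 99, 27) to zero-based
# sequential ones, keeping the inverse mapping for decoding predictions.

raw = [5, 99, 99, 5, 27]
classes = sorted(set(raw))              # [5, 27, 99]
to_index = {c: i for i, c in enumerate(classes)}

encoded = [to_index[c] for c in raw]    # zero-based sequential labels
decoded = [classes[i] for i in encoded] # recovers the original labels
```

Tree-based models genuinely don't care which integers you use, but some frameworks (e.g. losses that index into a logits array) require labels in `0..K-1`, so the zero-based form is the safer default.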
I'm using a dataset containing about 1.5M documents. Each document comes with some keywords describing its topics (thus it is multi-labelled). Each document belongs to several authors (not just one author per document). I want to find out the topics each author is interested in by looking at the documents they write. I'm currently looking at an LDA variation (Labeled LDA, proposed by D. Ramage: https://www.aclweb.org/anthology/D/D09/D09-1026.pdf). I'm using all the documents in my dataset to train a model and using the model to …
I am new to Machine Learning and Data Science. By spending some time online, I was able to understand the perceptron learning rule fairly well, but I am still clueless about how to apply it to a set of data. For example, we may have the following values of $x_1$, $x_2$ and $d$ respectively: \begin{align}&(0.6 , 0.9 , 0)\\ &(-0.9 , 1.7 , 1)\\ &(0.1 , 1.4 , 1)\\ &(1.2 , 0.9 , 0)\end{align} I can't think of how to …
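A minimal sketch of running the perceptron learning rule on exactly those four samples: weights and bias start at zero, the activation is a step function, and after each misclassified sample we update $w \leftarrow w + \eta(d - y)x$ and $b \leftarrow b + \eta(d - y)$ (the learning rate 0.1 is an arbitrary choice):

```python
# Perceptron learning rule applied to the four (x1, x2, d) samples
# from the question. The data is linearly separable, so this converges.

data = [((0.6, 0.9), 0), ((-0.9, 1.7), 1), ((0.1, 1.4), 1), ((1.2, 0.9), 0)]

def step(z):
    return 1 if z >= 0 else 0

w = [0.0, 0.0]
b = 0.0
lr = 0.1
for epoch in range(100):
    errors = 0
    for (x1, x2), d in data:
        y = step(w[0] * x1 + w[1] * x2 + b)
        if y != d:                       # update only on mistakes
            w[0] += lr * (d - y) * x1
            w[1] += lr * (d - y) * x2
            b += lr * (d - y)
            errors += 1
    if errors == 0:                      # a clean pass means convergence
        break
```

On this data the loop converges after a couple of epochs to a line separating the two $d=1$ points from the two $d=0$ points.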
I am working on a relation extraction and classification problem. The data is in the form of text files and is imbalanced. I want to use the focal loss function to address the class imbalance in the data. My question is: can focal loss be utilized for an extraction and classification task to increase accuracy? Focal loss has been applied to object detection and image classification tasks; the link is below. I want to use it on text …
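Focal loss itself is just a reweighting of cross-entropy and is not tied to images, so nothing stops it from being dropped into a text classifier that outputs probabilities. A minimal binary sketch following Lin et al.'s formulation (the `alpha=0.25`, `gamma=2.0` defaults are the ones commonly quoted from the paper):

```python
import math

# Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
# Easy, well-classified examples are down-weighted by (1 - p_t)^gamma,
# which is the mechanism that helps with class imbalance.

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """p: predicted probability of the positive class; y: 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct example contributes far less than a hard one:
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

In a deep learning framework you would implement the same expression on logits for numerical stability, but the weighting logic is identical.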
I have a database holding 10-ish features that describe different breeds of dogs. They are mostly categorical features, but some provide ranges of values. Here's a demo representation of the database, showing the mixture:

|Breed|Min_Height|Max_Height|Min_Weight|Max_Weight|sub_cat|is_friendly|
|-----|----------|----------|----------|----------|-------|-----------|
|Dober|20        |20        |40        |52        |sport  |FALSE      |
|Pood |15        |25        |35        |45        |water  |TRUE       |

...

As you can see, the data is mixed and the ranges have some overlap from entry to entry. Say I receive an input of:

|height|weight|sub_cat|is_friendly|
|------|------|-------|-----------|
|16    |43 …
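One simple baseline before reaching for a learned model: score each breed by how many of the query's fields fall inside the stored ranges or match the categorical values, and return the best scorer. The equal-weight scoring scheme below is an assumption for illustration:

```python
# Rank breeds against a query: +1 for each range the query falls inside
# and each categorical field that matches. Records mirror the demo table.

breeds = [
    {"breed": "Dober", "min_h": 20, "max_h": 20, "min_w": 40, "max_w": 52,
     "sub_cat": "sport", "is_friendly": False},
    {"breed": "Pood", "min_h": 15, "max_h": 25, "min_w": 35, "max_w": 45,
     "sub_cat": "water", "is_friendly": True},
]

def score(b, q):
    s = 0
    s += b["min_h"] <= q["height"] <= b["max_h"]   # height in range
    s += b["min_w"] <= q["weight"] <= b["max_w"]   # weight in range
    s += b["sub_cat"] == q["sub_cat"]              # category match
    s += b["is_friendly"] == q["is_friendly"]      # boolean match
    return s

query = {"height": 16, "weight": 43, "sub_cat": "water", "is_friendly": True}
best = max(breeds, key=lambda b: score(b, query))
```

A learned classifier would effectively tune those weights, but the interval-membership encoding (is the value inside [min, max]?) is a reasonable way to featurize the ranges either way.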
I'm a beginner and I have a question. Can clustering results based on probability be used for supervised learning? I have manufacturing data with 80,000 rows. It is not labeled, but I know that the defect rate is 7.2%. Can the result of clustering, with hyperparameters tuned based on the defect rate, be applied to supervised learning? Is there a paper on this? Is this method a big problem from a data perspective? When using this method, what is the verification …
The response variable in a regression problem, $Y$, is modeled using a data matrix $X$. In notation, this means: $Y \sim X$. However, $Y$ can be separated into different components that can be modeled independently: $$Y = Y_1 + Y_2 + Y_3$$ Under what conditions would $M$, the overall prediction, perform better or worse than $M_1 + M_2 + M_3$, the sum of the individual models? To provide more background, the model used is a GBM. I was surprised …
I am trying to design an algorithm that takes in a new user with variables such as department, location, job_role, etc., and I want a machine learning algorithm to decide what software and hardware this new user would need. I am racking my brain thinking about how I could get this to work. I could use a supervised learning approach and train a model on a dataset of already employed users and the software and hardware they use; however, the variables in …
I’m using supervised learning with an LSTM network to predict forex prices. To achieve this I’m using the deeplearning4j library, but I have doubts about several points of my implementation. I turned off the mini-batch feature, then created many trading indicators from the forex data. The point is to provide random chunks of data to the neural network on every epoch and to ensure that after every epoch the network state is cleared. To achieve this I created a dataset iterator …
I'm using supervised learning on monthly activity data to predict when a customer buys a particular product. This product is typically bought infrequently and at the moment my target variable is whether the customer buys the product in the next twelve months. Assume that for every customer I get a set of features every month, $x_1,x_2,\ldots,x_n$. The goal is to use these features to predict whether $y=0$ or $y=1$ ($y$ is 1 if the customer did buy the product in …
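A concrete way to set this up is to build the label per customer-month from the purchase history: $y=1$ at month $m$ if a purchase occurs anywhere in months $m+1,\ldots,m+12$. A minimal sketch (the toy purchase history is invented for illustration):

```python
# Build a 12-month-forward binary target from a customer's purchase months.
# Each month's row of features x_1..x_n would be paired with labels[m].

def make_labels(purchase_months, n_months, horizon=12):
    """purchase_months: set of month indices in which the customer bought."""
    labels = []
    for m in range(n_months):
        future = range(m + 1, m + 1 + horizon)
        labels.append(1 if any(f in purchase_months for f in future) else 0)
    return labels

# Example: a customer who buys in month 14 of a 24-month history.
y = make_labels({14}, 24)
```

One caveat with this construction: consecutive months share most of their 12-month windows, so rows are highly correlated and the train/test split should be done by customer (or by time), not by row.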
Disclaimer: Mathematicians, please don't be mad at me for the use of some of the terminology in this post. I am an engineer. :-) Background: I am currently working on a problem where I have to generate a time-series sequence of a process in which n actors are moving in a 2D space, the process being learned by some machine learning model M. But I don't know if this is even possible. BTW, I have never worked with …
I am currently working on an LBSN (location-based social network) system and I need to predict users' age and gender. Every time a user enters a venue, the system creates a "check-in" with the user, the venue and the datetime. Every venue is categorized using Foursquare Venue Categories. The system generates a weighted concept hierarchy to represent the interest level between a user and a venue category. Is it possible to predict the user's age and gender using the …
Let's say I have 100 values in my dataset and split it 80% train / 20% test. When predicting the last value, is the prediction based on the previous 99 values (80 train + 19 already-predicted values) or only on the original 80 train values? For example, if a kd-tree is used, is every data point inserted into the tree during prediction? Is it possible to use kNN for the following scenario? I have 20 train values; when I add a new observation, I …
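With standard implementations the answer is "only the original train values": `fit` freezes the training set, and `predict` does not insert queries into the tree. The incremental behaviour asked about has to be built explicitly. A minimal sketch of that variant (1-D data and the self-labelling scheme are assumptions for illustration):

```python
# "Incremental" k-NN: each observed point is appended to the training set
# after prediction, so later predictions see all earlier points -- unlike
# a standard fit/predict workflow, where the training set is frozen.

def knn_predict(train, x, k=3):
    """train: list of (value, label) pairs; x: scalar query."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [lbl for _, lbl in nearest]
    return max(set(labels), key=labels.count)   # majority vote

train = [(i, 0) for i in range(10)] + [(i, 1) for i in range(50, 60)]
stream = [12, 13, 55]
for x in stream:
    y = knn_predict(train, x)
    train.append((x, y))      # the new observation becomes training data
```

Note the risk in this scheme: a wrong self-assigned label is fed back as ground truth, so errors can compound over the stream.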
I have been doing anomaly detection recently. One of the methods is using an AE (autoencoder) model to learn the pattern of normal samples, and determining a sample to be abnormal if it doesn't match that pattern. I train the AE without labels, but we need to use labels to determine which samples are normal or abnormal. I am wondering what kind of training this is: supervised, semi-supervised, or unsupervised learning?
My current dataset has a shape of 5300 rows by 160 columns, with a numeric target variable in the range [641, 3001]. That's no big dataset, but it should in general be enough for decent regression quality. The columns are features from different consecutive process steps. The project goal is to predict the numerical variable, with the objective of being very precise in the range up to 1200, which covers 115 rows (2.1%). For target values above 1200 the precision can be lower …
I have a large batch of email data that I want to analyse. In order to do that, I need to first prepare the data, as the messages are quite often >80% noise. Generally speaking, my dataset's structure is nowhere near that of the ENRON dataset. I need to get rid of signatures, headers and, most importantly, automatically appended legal / security disclaimers. I have been doing some research and so far I've seen two supervised learning approaches to this …
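Before (or alongside) a supervised approach, a rule-based first pass often removes most of the boilerplate. A rough sketch that cuts everything from a signature delimiter or a disclaimer keyword onwards; the patterns are assumptions and would need tuning on the actual corpus:

```python
import re

# Rule-based email cleanup: find the earliest occurrence of any noise
# marker and truncate the body there. Patterns are illustrative guesses.

CUT_PATTERNS = [
    r"(?m)^--\s*$",                           # conventional signature delimiter
    r"(?i)this e-?mail .{0,40}confidential",  # typical legal disclaimer opener
    r"(?im)^disclaimer\b",
]

def strip_noise(body):
    cut = len(body)
    for pat in CUT_PATTERNS:
        m = re.search(pat, body)
        if m:
            cut = min(cut, m.start())         # earliest marker wins
    return body[:cut].rstrip()

msg = ("Hi team, the meeting moved to 3pm.\n"
       "-- \n"
       "John Doe\n"
       "This email and any attachments are confidential.")
clean = strip_noise(msg)
```

Since appended disclaimers are usually identical across a whole organisation's mail, counting exact duplicate trailing paragraphs across the batch is another cheap, corpus-specific way to find them before training anything.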