Rigor Theory. I wish to learn the scientific method and how to apply it in machine learning: specifically, how to verify that a model has captured the pattern in the data, and how to rigorously reach conclusions based on well-justified empirical evidence. Verification in Practice. My colleagues in both academia and industry tell me that measuring the model's accuracy on test data is sufficient, but I am not confident that this criterion alone is enough. Data Science Books. I have picked up multiple data …
A common method for conducting data science projects is CRISP-DM - https://www.datascience-pm.com/crisp-dm-2/. However, job descriptions often combine data science with agile methods, mostly Scrum. How does data science fit together with Agile and Scrum? I get that CRISP-DM and Scrum both use an iterative cycle as the way to approach the end result, but there is a lot of differing methodology and terminology. Any ideas or hints for further reading?
I am asking myself whether there is a good method other than deep artificial neural networks for classifying data with many (>100) labels. Are there any suggestions? For example, logistic regression does not seem to fit, as in its basic form it only supports two classes, does it?
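For what it's worth, the two-class limitation applies only to the textbook formulation; most implementations generalize it. Below is a minimal sketch, assuming scikit-learn and purely synthetic data, of multinomial (softmax) logistic regression on a 100-class problem:

```python
# Minimal sketch: scikit-learn's LogisticRegression handles many classes out
# of the box via a multinomial (softmax) formulation; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up 100-class dataset, purely for illustration.
X, y = make_classification(n_samples=5000, n_features=50, n_informative=30,
                           n_classes=100, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)  # one softmax over all 100 classes
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Other natural baselines for high-cardinality classification include tree ensembles (random forests, gradient boosting) and k-nearest neighbours, none of which are limited by the number of classes.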
I am a little confused between these two parts of the Keras Sequential model's functions. Could someone explain what exactly the job of each one is? I mean, does compile do the forward pass and calculate the cost function, then pass it to fit to do the backward pass, calculate the derivatives, and update the weights? Or what? I have seen in some code that they only used the compile function for some of their LSTMs and fit for some other ones! So I need to know each …
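As I understand it (a sketch with made-up shapes and data, not the asker's model): compile() does no computation on data at all; it only configures training. The forward pass, loss, backpropagation, and weight updates all happen inside fit():

```python
# Sketch of the division of labor in Keras; shapes and data are made up.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# compile() only configures training: optimizer, loss, and metrics.
# No data is seen and no weights change here.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() runs the actual training loop: forward pass, loss computation,
# backpropagation, and weight updates, repeated per batch and per epoch.
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=(100, 1))
model.fit(X, y, epochs=3, batch_size=16)
```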
For processes of discrete events occurring in continuous time with a time-independent rate, we can use count models like Poisson or negative binomial regression. For discrete events that can occur once per sample in continuous time, with a time-dependent rate, we have survival models like Cox proportional hazards. What can we use for discrete event data in continuous time where there is an explicit time dependence that we want to learn? I understand that sometimes people use sequential models where each node is …
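One candidate worth mentioning (an assumption about what is being asked, not a definitive answer) is a Cox model with time-varying covariates, which learns how a covariate's current value shifts the hazard over time. A sketch with the lifelines package, assuming a long-format dataframe of (start, stop] intervals and made-up toy data:

```python
# Sketch, assuming the `lifelines` package and toy data in long format:
# each row is an interval (start, stop] during which covariates are constant.
import pandas as pd
from lifelines import CoxTimeVaryingFitter

df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 4, 4],                 # subject identifier
    "start": [0, 4, 0, 3, 0, 0, 6],
    "stop":  [4, 9, 3, 7, 5, 6, 10],
    "x":     [0.2, 0.8, 0.5, 0.9, 0.1, 0.3, 0.7],   # time-varying covariate
    "event": [0, 1, 0, 1, 0, 0, 1],                 # 1 = event occurred at `stop`
})

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")
ctv.print_summary()
```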
I'm somewhat new to machine learning and have learned to apply many of the basic regression and classification methods using Python and various packages. However, approaching this problem has me stumped. To illustrate the problem, I created a fictitious scenario where a guidance counselor wants to predict test scores for a student after disciplinary action. Suppose they have data available like the mock-up below. Column definitions:
- Student - Student Identification #
- Gender - Male/Female
- Age - Current Age
- Athlete - …
I am very much concerned about abiding by research ethics in my work, especially issues to do with plagiarism. I came across a recent research paper in my field of study that applies state-of-the-art tools (deep learning architectures) using a publicly available dataset. I am impressed by their work and feel I should apply the same methodology they used, but on my own (private) dataset. Would this be considered a plagiarised version of their work?
From examples outside of the documentation, I thought I understood the .where() method: basically, it seems to be just another way to filter a dataframe. However, when I checked the documentation itself for an example of how to use .where(), it was counterintuitive. The documentation provides this example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.where(lambda x: x > 4, lambda x: x + 10)

[output]: A B C 0 …
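My reading of the semantics (a sketch reproducing the documentation's example): .where(cond, other) keeps each value where cond is True and substitutes other where cond is False, which is roughly the opposite of the filtering intuition; both arguments may be callables applied to the dataframe:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Keep values where x > 4 holds; everywhere it fails, substitute x + 10.
result = df.where(lambda x: x > 4, lambda x: x + 10)
print(result)
#     A   B  C
# 0  11  14  7
# 1  12   5  8
# 2  13   6  9
```

For actual filtering, boolean indexing such as df[df['B'] > 4] is the usual tool; .mask() is the mirror image of .where(), replacing values where the condition is True.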
I want to drop a range of rows and columns of a dataframe. I did it as follows:

df.drop(df.columns[3:], axis=1, inplace=True)
df.drop(df.index[3:], axis=0, inplace=True)

Can I do the two operations in one method call instead of two? Or is there a more efficient way to accomplish this?
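One way (a sketch on made-up data): drop() accepts the index and columns keywords together, so both ranges can go in a single call; for purely positional ranges, slicing with iloc is simpler still:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(6, 6))  # made-up 6x6 frame

# Single drop() call: pass both `index` and `columns` instead of two
# axis-specific calls.
trimmed = df.drop(index=df.index[3:], columns=df.columns[3:])

# Equivalent for positional ranges: keep the first 3 rows and columns.
trimmed_alt = df.iloc[:3, :3]
```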
I've got time-series data (let's denote it y) and some feature (let's denote it x). y depends on x, but x is often equal to 0. Even then, y is not 0, so we can assume that there's a base level in y which is independent of x. Additionally, we can observe some seasonality in y. I need to decompose y into the base level and the effect of x, and I need some hint about the methodology. …
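One simple starting point (an assumption about the structure, not a general answer): if the series is roughly additive, y_t = base + seasonality_t + beta * x_t + noise_t, then regressing y on x plus seasonal dummies separates the two parts. A sketch with statsmodels on made-up monthly data:

```python
# Hedged sketch assuming an additive structure:
#   y_t = base + seasonality_t + beta * x_t + noise_t
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up monthly series where x is often zero, purely for illustration.
n = 120
rng = np.random.default_rng(0)
month = np.arange(n) % 12
x = rng.choice([0, 0, 0, 1, 2], size=n)
y = 10 + 2 * np.sin(2 * np.pi * month / 12) + 3 * x + rng.normal(0, 0.5, n)

# Seasonal dummy columns plus x, plus an intercept.
X = pd.get_dummies(month, prefix="m", drop_first=True).astype(float)
X["x"] = x
X = sm.add_constant(X)

fit = sm.OLS(y, X).fit()
effect_of_x = fit.params["x"] * x              # learned effect of x
base_level = fit.fittedvalues - effect_of_x    # intercept + seasonality
```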
Vector autoregressive (VAR) models are used at economics faculties all around the world. They are just another statistical model for forecasting, although one that uncovers the interdependence between series in considerable depth. Yet to my surprise, I see no evidence of them being used outside the pure economics domain, namely to solve business problems like we Data Scientists do. Can you share either your experience with applying VAR to solve a business problem, or a scenario in which it could hypothetically be …
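For concreteness, here is a minimal sketch of fitting a VAR on business-style series with statsmodels; the series names and the data-generating process are made up:

```python
# Sketch: a VAR on two made-up "business" series generated from a known
# lag-1 process, fit with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
A = np.array([[0.5, 0.2],
              [0.1, 0.4]])                     # made-up lag-1 coefficients
y = np.zeros((200, 2))
for t in range(1, 200):
    y[t] = A @ y[t - 1] + rng.normal(scale=0.5, size=2)
data = pd.DataFrame(y, columns=["sales", "traffic"])

results = VAR(data).fit(maxlags=5, ic="aic")   # lag order chosen by AIC
forecast = results.forecast(data.values[-results.k_ar:], steps=4)
print(forecast)
```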
I have a dataset for a supervised learning task. Each row is a vector of grayscale pixel values in the range [0, 255], and each vector is labeled with a character. I have to assign a character to each vector. My question: what are some methods I can try to pre-process the data to gain better accuracy?
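A few common options (a sketch with stand-in data; which one helps depends on the downstream classifier): rescaling to [0, 1], per-feature standardization, and optional dimensionality reduction:

```python
# Common pre-processing steps for [0, 255] grayscale pixel vectors.
# The data below is a random stand-in, purely for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.randint(0, 256, size=(1000, 64)).astype(float)

X_unit = X / 255.0                                   # 1) rescale pixels to [0, 1]
X_std = StandardScaler().fit_transform(X)            # 2) or standardize each column
X_pca = PCA(n_components=0.95).fit_transform(X_std)  # 3) optionally keep 95% variance
```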
I am trying to stitch together multiple packages and tools from multiple languages (R, Python, C, etc.) in a single analysis workflow. Is there any standard way to do this? Preferably (but not necessarily) in Python.
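One pragmatic pattern (a sketch; the script and binary names are hypothetical) is to orchestrate each tool from Python via subprocess and exchange data through files; dedicated workflow managers such as Snakemake, Nextflow, and Luigi formalize the same idea with dependency tracking:

```python
# Sketch: drive an R script and a C binary from Python, passing data as CSV.
# "clean_data.R" and "./transform" are hypothetical names, not real tools.
import subprocess

import pandas as pd

# Step 1: an R script reads input.csv and writes cleaned.csv.
subprocess.run(["Rscript", "clean_data.R", "input.csv", "cleaned.csv"], check=True)

# Step 2: a compiled C binary turns cleaned.csv into features.csv.
subprocess.run(["./transform", "cleaned.csv", "features.csv"], check=True)

# Step 3: finish the analysis in Python.
features = pd.read_csv("features.csv")
print(features.head())
```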
Mondobrain proposes a "big data" technology with: "a new generation of algorithms based on A. Grothendieck's scheme theory (Fields Medal) that extract knowledge and rules from data without any model or distance, and that can explore every part of multi-dimensional spaces independently." What is behind this method that uses no model or distance? Is it related to methods like Topological Data Analysis (brought up on Math Stack Exchange)?
I'm falling in love with data science and I'm spending a lot of time studying it. It seems that a common data science workflow is:
1. Frame the problem
2. Collect the data
3. Clean the data
4. Work on the data
5. Report the results
I'm struggling to connect the dots when it comes to working on the data. I'm aware that step 4 is where the fun happens, but I don't know where to begin. What are the steps taken when you work on …
I have the following problem that I don't know how to solve. I have data for different market baskets, each with a corresponding class. So, for example, I know:
Student - {beer, milk, water}
Professional - {nuts, pizza, bananas}
Student - {oranges, tomatoes, beer}
...
Is there a method to create a classification model so that I can use the contents of a market basket to determine the corresponding class (Student, Professional, ...)? Thank you!
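One straightforward approach (a sketch mirroring the toy example above, with an arbitrary classifier choice): one-hot encode the basket contents so each distinct item becomes a binary feature, then fit any standard classifier:

```python
# Sketch: encode baskets as binary item-indicator vectors, then classify.
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

baskets = [
    {"beer", "milk", "water"},
    {"nuts", "pizza", "bananas"},
    {"oranges", "tomatoes", "beer"},
]
labels = ["Student", "Professional", "Student"]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(baskets)        # one binary column per distinct item

clf = BernoulliNB().fit(X, labels)    # BernoulliNB is an arbitrary choice here
print(clf.predict(mlb.transform([{"beer", "water"}])))
```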
What are the best-known data science methodologies today? By methodology I mean a step-by-step, phased process that can be used for framing and guidance, although I will be grateful for anything close too. To help clarify: there are methodologies in the programming world, like Extreme Programming, Feature Driven Development, Unified Process, and many more. I am looking for their equivalents, if they exist. A Google search did not turn up much, but I find it hard to believe there is …
I came across an SVM predictive model where the author used the probability distribution of the target variable as a feature in the feature set. For example: the author built a model for each gesture of each player to guess which gesture would be played next. Calculated over 1000 games played, the distribution might look like (20%, 10%, 70%). These numbers were then used as feature variables to predict the target variable during cross-fold validation. Is that legitimate? That …