How can I learn and apply the scientific method in machine learning?

Rigor Theory. I wish to learn the scientific method and how to apply it in machine learning: specifically, how to verify that a model has captured the pattern in the data, and how to rigorously reach conclusions based on well-justified empirical evidence. Verification in Practice. My colleagues in both academia and industry tell me that measuring the accuracy of the model on test data is sufficient, but I am not confident that such a criterion is enough. Data Science Books. I have picked up multiple data …
Category: Data Science

Does the Data Science process (CRISP-DM) comply with the Agile methodology?

A common method for conducting data science projects is CRISP-DM - https://www.datascience-pm.com/crisp-dm-2/. However, job descriptions often combine data science with Agile methods, mostly Scrum. How does data science fit together with Agile and Scrum? I get that CRISP-DM and Scrum both use a cycle as a way to approach the end result, but there is a lot of different methodology and terminology. Any ideas or hints for further reading?
Topic: methods
Category: Data Science

Good classifiers when having many labels

I am asking myself whether there is another good method, besides deep artificial neural networks, for classifying data with many (>100) labels. Are there any suggestions? For example, logistic regression does not seem to fit, as in its basic form it only supports two labels, does it?
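Worth noting: logistic regression does generalize beyond two labels via its multinomial (softmax) form, which most libraries expose directly. A minimal sketch with scikit-learn, using synthetic data (the class count here is small for speed, but the same call works for >100 classes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with 10 classes; the identical code handles hundreds of classes.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=30,
                           n_classes=10, random_state=0)

# Multinomial (softmax) logistic regression fits one weight vector per class
# and normalizes the scores, so many labels are supported out of the box.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))
```

Other classifiers that scale naturally to many labels include gradient-boosted trees and k-nearest neighbours; the one-vs-rest wrapper also turns any binary classifier into a multi-class one.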
Category: Data Science

What do "compile", "fit", and "predict" do in Keras sequential models?

I am a little confused between these parts of the Keras sequential model API. Can someone explain exactly what the job of each one is? Does compile do the forward pass and calculate the cost function, which is then passed to fit for the backward pass, computing derivatives and updating the weights? Or what? I have seen code where compile was used for some LSTMs and fit for others! So I need to know each …
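For orientation: compile() only configures training (optimizer, loss, metrics) and runs no passes at all; fit() runs the full training loop (forward pass, loss, backward pass, weight updates); predict() runs only the forward pass. A minimal sketch on toy data (layer sizes and data here are illustrative):

```python
import numpy as np
from tensorflow import keras

# Toy data: 100 samples, 8 features, binary labels.
X = np.random.rand(100, 8).astype("float32")
y = np.random.randint(0, 2, size=(100,))

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# compile() only *configures* training: it attaches the optimizer, loss, and
# metrics to the model. No forward or backward pass happens here.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() runs the training loop: forward pass, loss computation, backward pass
# (gradients), and weight updates, repeated over the data for `epochs` rounds.
model.fit(X, y, epochs=2, batch_size=32, verbose=0)

# predict() runs only the forward pass on new inputs; no weights change.
preds = model.predict(X[:5], verbose=0)
print(preds.shape)  # (5, 1)
```

So code that "only uses compile" has merely configured a model; nothing is learned until fit() is called.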
Category: Data Science

Modeling count data with time-dependent rate

For processes of discrete events occurring in continuous time with time-independent rate, we can use count models like Poisson or Negative Binomial. For discrete events that can occur once per sample in continuous time, with a time-dependent rate, we have survival models like Cox Proportional Hazards. What can we use for discrete event data in continuous time where there is an explicit time-dependence that we want to learn? I understand that sometimes people use sequential models where each node is …
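One standard option the question gestures at is an inhomogeneous Poisson model: keep the count likelihood but let the log-rate depend on time (or any covariate). A minimal sketch with scikit-learn's PoissonRegressor on simulated data, where the true rate is λ(t) = exp(0.5 + 0.8·t) (the coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Simulate counts whose rate grows with time: lambda(t) = exp(0.5 + 0.8 * t).
rng = np.random.default_rng(0)
t = np.linspace(0, 2, 500)
counts = rng.poisson(np.exp(0.5 + 0.8 * t))

# Poisson regression with time as a covariate models log-lambda as a linear
# function of t, i.e. it learns the time-dependence of the rate.
reg = PoissonRegressor(alpha=0.0).fit(t.reshape(-1, 1), counts)
print(reg.intercept_, reg.coef_)  # roughly 0.5 and [0.8]
```

Splines or basis expansions of t give more flexible rate shapes within the same GLM framework.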
Category: Data Science

Predicting High-School test scores after a disciplinary action

I'm somewhat new to machine learning and have learned to apply many of the basic regression and classification methods using python and various packages. However, approaching this problem has me stumped. To illustrate the problem, I created a fictitious scenario where a guidance counselor wants to predict test scores for a student after disciplinary action. Suppose they have data available like the mock-up below: Column definition: Student - Student Identification # Gender - Male/Female Age - Current Age Athlete - …
Category: Data Science

Previous work Replication and Research ethics

I am very much concerned about abiding by research ethics in my work, especially issues to do with plagiarism. I came across a recent research paper in my field of study that applies state-of-the-art tools (deep learning architectures) using a publicly available dataset. I am impressed by their work and feel I should apply the same methodology they used, but using my own (private) dataset. Would this be considered a plagiarised version of their work?
Category: Data Science

Interpreting DataFrame.where() documentation

From examples outside of the documentation, I thought I understood the .where() method. Basically, it seems to be another way to filter a dataframe. However, when I checked the documentation itself for an example of how to use .where(), it was counterintuitive. The documentation provides this example: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}) df.where(lambda x: x > 4, lambda x: x + 10) [output]: A B C 0 …
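The key point of confusion: DataFrame.where is element-wise masking, not row filtering. It keeps each value where the condition is True and substitutes the second argument where it is False (both arguments may be callables applied to the frame). Reproducing the documentation's example:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Values > 4 are kept; every other value is replaced by itself + 10.
# No rows are removed, which is why it feels unlike a filter.
out = df.where(lambda x: x > 4, lambda x: x + 10)
print(out)
```

So in column A every value fails the condition and becomes 11, 12, 13, while column C (all > 4) passes through untouched. Row filtering would instead be boolean indexing, e.g. df[df['A'] > 1].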
Topic: methods pandas
Category: Data Science

drop columns and rows in one line in pandas

I want to drop a range of rows and columns of a dataframe, and I did it as follows: df.drop(df.columns[3:], axis=1, inplace=True) df.drop(df.index[3:], axis=0, inplace=True) Can I do the two operations in one call instead of two? Or is there a more efficient way to accomplish this?
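Since both drops are range-based, selecting what to keep with iloc does the same job in one step. A small sketch (the 6x6 frame is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36).reshape(6, 6))

# Two separate drops, as in the question (without inplace, for comparison):
trimmed = df.drop(df.columns[3:], axis=1).drop(df.index[3:], axis=0)

# One step with iloc: keep the first three rows and first three columns.
trimmed_iloc = df.iloc[:3, :3]

print(trimmed.equals(trimmed_iloc))  # True
```

drop() also accepts index= and columns= keywords in a single call, e.g. df.drop(index=df.index[3:], columns=df.columns[3:]).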
Topic: methods pandas
Category: Data Science

Time-series decomposition to a base level and an effect of another feature

I've got time-series data (let's denote it y) and some feature (let's denote it x). y depends on x, but x is often equal to 0. Even then, y is not 0, so we can assume there's a base level in y that is independent of x. Additionally, we can observe some seasonality in y. I need to decompose y into the base level and an effect of x, and I need some hint about methodology. …
Category: Data Science

Can you provide examples of business application of vector autoregressive model?

Vector autoregressive (VAR) models are taught at economics faculties all around the world. They are just another statistical model for forecasting, although one that uncovers complexity in a deep way. Yet to my surprise, there is no evidence they have been used outside the pure economics domain, namely to solve business problems like we Data Scientists do. Can you share either your experience applying VAR to a business problem, or a scenario in which it could hypothetically be …
Category: Data Science

What is behind "A. Grothendieck scheme theory" in Mondobrain?

Mondobrain proposes a "big data" technology with: "a new generation of algorithms based on A. Grothendieck scheme theory (Fields Medal) that extract knowledge and rules from data without any model or distance, and that can explore every part of multi-dimensional spaces independently." What is behind this method that uses no "model or distance"? Is it related to methods like Topological Data Analysis (discussed on Math Stack Exchange)?
Category: Data Science

How do you define the steps to explore the data?

I'm falling in love with data science and I'm spending a lot of time studying it. It seems that a common data science workflow is: 1) frame the problem, 2) collect the data, 3) clean the data, 4) work on the data, 5) report the results. I'm struggling to connect the dots when it comes to working on the data. I'm aware that step 4 is where the fun happens, but I don't know where to begin. What are the steps taken when you work on …
Category: Data Science

Perform classification on market basket analysis

I have the following problem that I don't know how to solve: I have data for different market baskets with a corresponding class. So, for example, I know: Student - {beer, milk, water} Professional - {nuts, pizza, bananas} Student - {oranges, tomatoes, beer} ... Is there a method to create a classification model so that I can use the contents of a market basket to determine the corresponding class (Student, Professional, ...)? Thank you!
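One standard approach: one-hot encode the basket contents so each product becomes a binary feature, then train any ordinary classifier. A minimal sketch with scikit-learn (the baskets and labels below mirror the question and are purely illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.naive_bayes import BernoulliNB

# Toy baskets with their classes, as in the question.
baskets = [{"beer", "milk", "water"},
           {"nuts", "pizza", "bananas"},
           {"oranges", "tomatoes", "beer"},
           {"milk", "water"},
           {"pizza", "nuts"}]
labels = ["Student", "Professional", "Student", "Student", "Professional"]

# One-hot encode basket contents: one binary column per product seen.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(baskets)

# Bernoulli naive Bayes suits binary presence/absence features; any
# classifier (logistic regression, trees, ...) would plug in the same way.
clf = BernoulliNB().fit(X, labels)
print(clf.predict(mlb.transform([{"beer", "water"}])))
```

With many distinct products the encoded matrix becomes sparse, which linear models and naive Bayes handle well.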
Category: Data Science

Data Science Methodologies

What are the best-known data science methodologies today? By methodology I mean a step-by-step, phased process that can be used for guidance in framing a project, although I will be grateful for anything close. To clarify: there are methodologies in the programming world, like Extreme Programming, Feature Driven Development, Unified Process, and many more. I am looking for their equivalents, if they exist. A Google search did not turn up much, but I find it hard to believe there is …
Topic: methods
Category: Data Science

Can distribution values of a target variable be used as features in cross-validation?

I came across an SVM predictive model in which the author used the probabilistic distribution of the target variable as a feature in the feature set. For example: the author built a model for each gesture of each player to guess which gesture would be played next. Calculated over 1000 games, the distribution might look like (20%, 10%, 70%). These numbers were then used as feature variables to predict the target variable during cross-validation. Is that legitimate? That …
Category: Data Science
