Data Science Podcasts?

What are some podcasts which are related to data science? This is a similar question to the reference request question on CrossValidated. Details/rules: The podcasts (the theme and the episodes) should be related to data science. (For example: A podcast which is about some other domain, with an episode which speaks about data science in that domain, is not a good reference/answer.) Personal opinions/reviews (if any) would be very helpful too.
Category: Data Science

How to remove outliers properly?

I was wondering what is the best practice for removing outliers from data. Plotting a boxplot for each feature (column of the dataset) and removing data that fall outside the whiskers seems like a naive and problematic approach. For example, say you have many individuals with a 'gender' label and an 'income' label. Also assume that there are many more men in the dataset than women. Unfortunately, due to income disparity we may see that women receive a lower wage …
Category: Data Science
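
The concern raised in the excerpt above can be made concrete with a small sketch. It assumes the conventional whisker rule of 1.5 × IQR beyond the quartiles (the question does not state which rule is meant) and uses a made-up toy dataset; the names `whisker_bounds` and `keep_mask` are hypothetical helpers, not part of any library. Applying the rule to the pooled 'income' column flags every woman as an outlier, while applying it within each 'gender' group flags no one. This is an illustration of the problem, not a recommended procedure.

```python
from statistics import quantiles


def whisker_bounds(values, k=1.5):
    """Return (lower, upper) whisker limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def keep_mask(values, k=1.5):
    """True for every value that lies inside the whiskers."""
    lower, upper = whisker_bounds(values, k)
    return [lower <= v <= upper for v in values]


# Hypothetical toy data: eight men, two women with lower incomes.
genders = ["m"] * 8 + ["f"] * 2
incomes = [50, 52, 54, 56, 58, 60, 62, 64, 30, 31]

# Naive approach: one boxplot rule over the whole income column.
pooled = keep_mask(incomes)
removed_pooled = [g for g, keep in zip(genders, pooled) if not keep]

# Same rule, applied separately inside each gender group.
removed_grouped = []
for group in sorted(set(genders)):
    group_incomes = [x for g, x in zip(genders, incomes) if g == group]
    removed_grouped += [group for keep in keep_mask(group_incomes) if not keep]

print("removed (pooled): ", removed_pooled)
print("removed (grouped):", removed_grouped)
```

With this toy data the pooled rule discards the entire minority group, which is exactly the kind of side effect the question worries about.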

Techniques to increase the evaluation speed of a neural network

This is somewhat of an open-ended question and in some respects a literature request (I would love to be pointed to a survey paper if one exists). Suppose I am constructing a neural network to make some arbitrary prediction (categorical or numeric, it doesn't matter). With this network I am primarily concerned with speed of evaluation. Obviously, I want the network's predictions to be as accurate as possible, but I'm more than willing to sacrifice some accuracy if it …
Category: Data Science

What are the possible applications of a Data Scientist in the design phase of an Aerospace or Railway Engineering industry?

I have been trying to understand this for a long time, but this information proves to be incredibly elusive online. What are possible jobs that a pure Data Scientist, without much background knowledge, could be hired for in an Engineering team? I am aware, for instance, that supply chain can get some involvement. I don't mean the Business Intelligence positions; I want to get more involved with the engineering team, working on the products themselves (especially Aerospace or Railway). By …
Category: Data Science

What is the best practice to test an ETL pipeline?

In traditional software development practice, before going into production, a piece of code should go through various stages of testing (unit test, integration test, user acceptance test) to ensure the stability of the software. An ETL pipeline, as a piece of code, should also go through these testing steps to build a healthy system. However, due to the nature of the ETL process, traditional testing techniques may not be applicable. Is there any reference or guideline that specifically focuses on testing on …
Category: Data Science

Which book is a standard for introduction to genetic algorithms?

I have heard of genetic algorithms, but I have never seen practical examples and I've never had a systematic introduction to them. I am now looking for a textbook which introduces genetic algorithms in detail and gives practical examples of how they are used, what their strengths are compared to other solution methods, and what their weaknesses are. Is there any standard textbook for this?
Category: Data Science

Low dimensional manifold in a high dimensional space and Geodesic distance

It is a common assumption that high-dimensional objects lie on low-dimensional manifolds. This constitutes a foundation for manifold learning or dimensionality reduction techniques (a way to beat the curse of dimensionality). My question is: assuming this is valid, how can one utilize this assumption in doing something such as manifold learning? I think the general goal is to find a nonlinear representation of these high-dimensional objects using a small number of degrees of freedom. However, we know neither …
Category: Data Science

Beginner math books for Machine Learning

I'm a Computer Science engineer with no background in statistics or advanced math. I'm studying the book Python Machine Learning by Raschka and Mirjalili, but when I tried to understand the math behind machine learning, I wasn't able to follow the great book that a friend suggested to me, The Elements of Statistical Learning. Do you know any easier statistics and math books for Machine Learning? If you don't, how should I proceed?
Category: Data Science

Rate of convergence - comparison of supervised ML methods

I am working on a project with sparse labelled datasets, and am looking for references regarding the rate of convergence of different supervised ML techniques with respect to dataset size. I know that, in general, boosting algorithms and other models that can be found in Scikit-learn, like SVMs, converge faster than neural networks. However, I cannot find any academic papers that explore, empirically or theoretically, the difference in how much data different methods need before they reach n% accuracy. I …
Category: Data Science

References/tutorials about data mining and machine learning

I am learning data analytics and I wonder if there are some good references and tutorials about machine learning, data analytics and data mining. What I'm searching for is an understandable reference/tutorial which is neither very technical nor very basic; in other words, material that begins with the basic steps and works towards the advanced ones. Thank you.
Category: Data Science

Data science / machine learning books for mathematicians

I have found other requests for references here, in particular: Where to start, which books and Books about the "Science" in Data Science? I have glanced at: Artificial Intelligence: A Modern Approach (Russell & Norvig); Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Flach); Learning From Data (Abu-Mostafa et al.); Introduction to Statistical Learning (James et al.); Elements of Statistical Learning (Hastie et al.); Pattern Recognition and Machine Learning (Bishop). Now it …
Category: Data Science

Who invented the concept of over-fitting?

I list the references that I have found so far. In short, the first appearance of the term was in 1670, the first appearance with a close meaning was in 1827, the first appearance in a biological paper was in 1923, and the first appearance in statistics was in 1935. However, the references indicate that there are gaps in this chronology. The earliest reference I found was The flying pen-man; or, The art of short-writing by William Hopkins (teacher of stenography) in 1670. However, it is …
Category: Data Science

Machine learning for circular sequences

My data are sequences of real numbers $a_0,a_1,...,a_{n-1}$. The length of a sequence is fixed and equals $n$. Each sequence is mapped to a real number $y$ and I want to predict $y$ given the sequence. The arrangement of the elements within a sequence is important. However, the sequences are circular, meaning that $a_0$ is not the first element, and $a_{n-1}$ is not the last one. The sequence $a_0,a_1,...,a_{n-1}$ is indistinguishable from the sequence $a_k, a_{k+1}, ..., a_{n-1}, a_0, ..., …
Category: Data Science
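
To illustrate what "circular" means in the excerpt above, here is a minimal sketch of one representation that does not change when a sequence is rotated: the magnitudes of its discrete Fourier transform. This is only one possible shift-invariant feature map, shown under the assumption that such features would be fed to an ordinary regressor for $y$; it is not presented as the approach the question or its answers settle on, and it discards phase information. The helper names `rotate` and `dft_magnitudes` are hypothetical.

```python
import cmath


def rotate(seq, k):
    """Return the circular shift a_k, ..., a_{n-1}, a_0, ..., a_{k-1}."""
    k %= len(seq)
    return seq[k:] + seq[:k]


def dft_magnitudes(seq):
    """Magnitudes |A_f| of the discrete Fourier transform of seq.

    A circular shift only multiplies each A_f by a unit-modulus phase,
    so the magnitudes are identical for all rotations of a sequence.
    """
    n = len(seq)
    return [
        abs(sum(a * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, a in enumerate(seq)))
        for f in range(n)
    ]


# Made-up example sequence of length n = 6.
a = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
features = dft_magnitudes(a)

# Largest feature difference over every rotation of a.
max_diff = max(
    abs(x - y)
    for k in range(len(a))
    for x, y in zip(features, dft_magnitudes(rotate(a, k)))
)
print("largest difference across rotations:", max_diff)
```

The printed difference is zero up to floating-point rounding, so any model trained on these features treats all rotations of a sequence identically.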

What is the difference between ICR and OCR?

I've just found the term "Intelligent Character Recognition" (ICR) on Wikipedia and other pages. According to Wikipedia: In computer science, intelligent character recognition (ICR) is an advanced optical character recognition (OCR) or — rather more specific — handwriting recognition system that allows fonts and different styles of handwriting to be learned by a computer during processing to improve accuracy and recognition levels. Is this just a marketing stunt or are there actually techniques which are specified as OCR and other …
Category: Data Science

Why do gradient methods work for finding the parameters of neural networks?

After reading quite a lot of papers (20-30 or so), I feel that I still don't quite understand things. Let us focus on supervised learning (for example). Given a set of data $\mathcal{D}_{train}=\{(x_i^{train},y_i^{train})\}$ and $\mathcal{D}_{test}=\{(x_i^{test},y_i^{test})\}$ where we assume $y_i^{test}$ are unknown, the goal is to find a function $$ f_\theta(x), \qquad \text{such that} \quad f_\theta(x_i^{test}) \approx y_i^{test}. $$ To do this, we need a model for $f$. Neural networks are frequently employed. Thus we have $$ f_\theta(x) = …
Category: Data Science
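
For readers who want the setup in the excerpt above in concrete form, the following is a minimal sketch of a gradient method fitting $f_\theta$ to training pairs. Several things here are assumptions, not taken from the question: the model is a linear function $f_\theta(x) = \theta_0 + \theta_1 x$ standing in for a neural network, the loss is the mean squared error, the data are made up, and the step size and iteration count are arbitrary choices. For this convex toy problem plain gradient descent recovers the generating parameters; the sketch says nothing about why the same procedure works for non-convex neural network losses, which is the actual question.

```python
# Hypothetical training set generated from y = 1 + 2x (no noise).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [1.0, 3.0, 5.0, 7.0]


def f(theta, x):
    """Stand-in model: f_theta(x) = theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x


def mse_gradient(theta):
    """Gradient of the mean squared error over the training set."""
    n = len(x_train)
    errors = [f(theta, x) - y for x, y in zip(x_train, y_train)]
    grad0 = 2.0 * sum(errors) / n
    grad1 = 2.0 * sum(e * x for e, x in zip(errors, x_train)) / n
    return grad0, grad1


theta = [0.0, 0.0]      # initial parameters
learning_rate = 0.05    # assumed step size, small enough to converge here

for _ in range(2000):
    g0, g1 = mse_gradient(theta)
    # Gradient step: move against the gradient of the loss.
    theta = [theta[0] - learning_rate * g0, theta[1] - learning_rate * g1]

print("fitted parameters:", theta)
```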

Where to upload large (0.5 GB) weights anonymously?

I need to anonymously upload a number of checkpoints for ConvNets (weights + optimizers, all dicts of PyTorch tensors), each about 0.5 GB. I don't want to use Google Drive. I trained the models on the university cluster (if that's relevant). Where can I upload these files anonymously? The files must be publicly available, but my identity must remain anonymous.
Category: Data Science
