What are some podcasts which are related to data science? This is a similar question to the reference request question on CrossValidated. Details/rules: The podcasts (the theme and the episodes) should be related to data science. (For example: A podcast which is about some other domain, with an episode which speaks about data science in that domain, is not a good reference/answer.) Personal opinions/reviews (if any) would be very helpful too.
I was reading Modern Optimization with R (Use R!) and wondering whether a similar book exists for Python. To be precise, I am looking for something that covers stochastic gradient descent and other advanced optimization techniques. Many thanks!
Is there a fastText embedding in 50 dimensions? I'm aware that GloVe embeddings come in 50, 100, 200, and 300 dimensions. I am trying to do sentiment analysis with a very small dataset. If there is one, could anyone please provide a reference?
I was wondering what is the best practice for removing outliers from data. Plotting a boxplot for each feature (column of the dataset) and removing data that fall outside the whiskers seems like a naive and problematic approach. For example, say you have many individuals with a 'gender' label and an 'income' label. Also assume that there are many more men in the dataset than women. Unfortunately, due to income disparity we may see that women receive a lower wage …
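The concern described above can be illustrated with a minimal sketch. The toy incomes, the 12-to-3 group split, and the helper name `whisker_mask` are all made up for illustration, and the per-group variant is only one possible alternative, not an established best practice. The sketch applies the usual 1.5 × IQR whisker rule once to the whole column and once within each gender group.

```python
import numpy as np

def whisker_mask(values, k=1.5):
    """Return True for values inside the boxplot whiskers [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# Hypothetical toy data: 12 men with incomes near 50, 3 women with incomes near 30.
income = np.array([48, 49, 50, 50, 51, 51, 52, 49, 50, 51, 50, 52, 30, 31, 29], dtype=float)
gender = np.array(["m"] * 12 + ["f"] * 3)

# Naive rule: whiskers computed on the whole column, dominated by the majority group.
global_keep = whisker_mask(income)

# Assumed alternative: compute the whiskers separately within each group.
group_keep = np.zeros(len(income), dtype=bool)
for g in np.unique(gender):
    idx = gender == g
    group_keep[idx] = whisker_mask(income[idx])

print("kept by global rule:   ", global_keep.sum(), "of", len(income))
print("kept by per-group rule:", group_keep.sum(), "of", len(income))
```

With this toy data the global rule drops every row from the minority group, while the per-group rule keeps them. Whether conditioning on a label is appropriate depends on the analysis, so this is a way to see the problem, not a recommendation.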
This is somewhat of an open-ended question and in some respects a literature request (I would love to be pointed to a survey paper if one exists). Suppose I am constructing a neural network to make some arbitrary prediction (either categorical or numeric, it doesn't matter). With this network I am concerned primarily with speed of evaluation. Obviously, I want the network to give predictions that are as accurate as possible, but I'm more than willing to sacrifice some accuracy if it …
I have been trying to understand this for a long time, but this information proves to be incredibly elusive online. What are possible jobs that a pure Data Scientist, without much background knowledge, could be hired for in an Engineering team? I am aware, for instance, that supply chain can get some involvement. I don't mean the Business Intelligence positions; I want to get more involved with the engineering team, working on the products themselves (especially Aerospace or Railway). By …
In traditional software development practice, before going into production, a piece of code should go through various stages of testing (unit test, integration test, user acceptance test) to secure the stability of the software. An ETL pipeline, as a piece of code, should also go through these testing steps to build a healthy system. However, due to the nature of the ETL process, traditional testing techniques may not be applicable. Is there any reference or guideline that specifically focuses on testing on …
I have heard of genetic algorithms, but I have never seen practical examples and I've never had a systematic introduction to them. I am now looking for a textbook that introduces genetic algorithms in detail and gives practical examples of how they are used, what their strengths are compared to other solution methods, and what their weaknesses are. Is there any standard textbook for this?
It is a common assumption that high-dimensional objects lie on low-dimensional manifolds. This assumption is the foundation of manifold learning and dimensionality reduction techniques (a way to beat the curse of dimensionality). My question is: assuming this is valid, how can one utilize this assumption in doing something such as manifold learning? I think the general goal is to find a nonlinear representation of this high-dimensional object using a small number of degrees of freedom. However, we know neither …
I'm a Computer Science engineer with no background in statistics or advanced math. I'm studying the book Python Machine Learning by Raschka and Mirjalili, but when I tried to understand the math behind machine learning, I wasn't able to follow the great book that a friend suggested to me, The Elements of Statistical Learning. Do you know any easier statistics and math books for machine learning? If you don't, how should I proceed?
I am working on a project with sparsely labelled datasets, and am looking for references regarding the rate of convergence of different supervised ML techniques with respect to dataset size. I know that, in general, boosting algorithms and other models that can be found in Scikit-learn, like SVMs, converge faster than neural networks. However, I cannot find any academic papers that explore, empirically or theoretically, the difference in how much data different methods need before they reach n% accuracy. I …
I am learning data analytics and I wonder if there are some good references and tutorials about machine learning, data analytics and data mining. What I'm searching for is an understandable reference or tutorial that is neither very technical nor very basic; in other words, material that starts with the basics and progresses to advanced topics. Thank you.
I have found other requests for references here, in particular: Where to start, which books and Books about the "Science" in Data Science? I have glanced at: Artificial Intelligence: A Modern Approach (Russell & Norvig); Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Flach); Learning From Data (Abu-Mostafa et al.); Introduction to Statistical Learning (James et al.); Elements of Statistical Learning (Hastie et al.); Pattern Recognition and Machine Learning (Bishop). Now it …
I am aware of Troubleshooting Deep Neural Networks by Josh Tobin and A Recipe for Training Neural Networks by Andrej Karpathy, but I am interested in other resources that can give me some guidelines or steps to setting up and debugging neural networks.
I list the references that I have found so far. In short, the first appearance of the term was in 1670, the first appearance with a close meaning was in 1827, the first appearance in a biological paper was in 1923, and the first appearance in statistics was in 1935. However, the references indicate that there are gaps in this chronology. The earliest reference I found was The flying pen-man; or, The art of short-writing by William Hopkins (teacher of stenography) in 1670. However, it is …
My data are sequences of real numbers $a_0,a_1,...,a_{n-1}$. The length of a sequence is fixed and equals $n$. Each sequence is mapped to a real number $y$ and I want to predict $y$ given the sequence. The arrangement of the elements within a sequence is important. However, the sequences are circular, meaning that $a_0$ is not the first element, and $a_{n-1}$ is not the last one. The sequence $a_0,a_1,...,a_{n-1}$ is indistinguishable from the sequence $a_k, a_{k+1}, ..., a_{n-1}, a_0, ..., …
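One way to make the circular structure concrete is sketched below. It uses the fact that the magnitude of the discrete Fourier transform does not change under a circular shift, so it can serve as a shift-invariant feature vector. This is only one possible approach and not necessarily what the question is after: it discards phase, so some distinct sequences map to the same features. The function name and the sample sequence are hypothetical.

```python
import numpy as np

def shift_invariant_features(a):
    """Magnitude of the real DFT; unchanged when `a` is circularly shifted."""
    return np.abs(np.fft.rfft(np.asarray(a, dtype=float)))

# Hypothetical sequence of length n = 6.
a = np.array([0.5, -1.0, 2.0, 3.5, 0.0, 1.25])

# The same circular sequence, read from a different starting index.
shifted = np.roll(a, 2)

same = np.allclose(shift_invariant_features(a), shift_invariant_features(shifted))
print("features identical under shift:", same)
```

A regressor trained on these features would give the same prediction for every rotation of a sequence. Other options, such as augmenting the training set with all rotations or using circular padding in a convolutional model, keep more information at a higher cost.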
I've just found the term "Intelligent Character Recognition" (ICR) on Wikipedia and other pages. According to Wikipedia: In computer science, intelligent character recognition (ICR) is an advanced optical character recognition (OCR) or — rather more specific — handwriting recognition system that allows fonts and different styles of handwriting to be learned by a computer during processing to improve accuracy and recognition levels. Is this just a marketing stunt or are there actually techniques which are specified as OCR and other …
After reading quite a lot of papers (20-30 or so), I feel that I am not quite understanding things. Let us focus on supervised learning (for example). Given a set of data $\mathcal{D}_{train}=\{(x_i^{train},y_i^{train})\}$ and $\mathcal{D}_{test}=\{(x_i^{test},y_i^{test})\}$ where we assume $y_i^{test}$ are unknown, the goal is to find a function $$ f_\theta(x), \qquad \text{such that} \quad f_\theta(x_i^{test}) \approx y_i^{test}. $$ To do this, we need a model for $f$. Neural networks are frequently employed. Thus we have $$ f_\theta(x) = …
I vaguely remember that there was a study / blog post which made a strong point against 3D bar charts. Do you have a source at hand which compares the two - 2D bar charts and 3D bar charts?
I need to upload a number of checkpoints for ConvNets (weights + optimizers, all dicts of PyTorch tensors), each about 0.5 GB, anonymously. I don't want to use Google Drive. I trained the models on the university cluster (if it's relevant). Where can I upload these files anonymously? The files must be publicly available, but my identity must remain anonymous.