I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that:
- It holds all your data in a central, accessible location
- It updates all dependent data sets when data is added to or changed in a data set
- It can run any transformation, as long as it runs in a Docker container and accepts a file as input and outputs a file as a result
- It versions all …
I was reading Modern Optimization with R (Use R!) and wondering whether a similar book exists for Python. To be precise, something that covers stochastic gradient descent and other advanced optimization techniques. Many thanks!
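For reference, this is the kind of method meant here: a minimal sketch of stochastic gradient descent for least-squares linear regression in NumPy. The data and hyperparameters are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)
lr = 0.05
for epoch in range(50):
    for i in rng.permutation(len(X)):        # one random sample per step
        grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of the squared error
        w -= lr * grad

print(w)  # should land close to true_w
```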
I'm a founder at a data-science-heavy startup, and I'm currently functioning as the entire dev team. Before I know it we'll have people working together on a project that I currently work on completely alone. So:
- What are some must-have things data scientists need to work together in a production setting?
- What are some things data scientists expect to have done outside their scope of work?
- What would make their lives easier and more productive?
- What are some …
I recently started a new position as a data scientist at an e-commerce company. The company was founded about 4-5 years ago and is new to many data-related areas. Specifically, I'm their first data science employee, so I have to take care of data analysis tasks as well as bring new technologies to the company. They have used Elasticsearch (and Kibana) for reporting dashboards on their daily purchases and users' interactions on their e-commerce website. They also …
I've written a research tool that allows users to write arbitrary expressions to define time series calculated from a set of primary data sources. Many of the functions I provide carry state derived from previous values, such as EMA. For example: EMA(GetData("Foo"), 280). State contained in the component functions of these expressions can be saved and resumed via AST node labeling at compile time. This allows a series to be resumed later when any of its root data sources, which …
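A sketch of the kind of resumable state described above: an EMA whose internal value can be serialized and restored so the series picks up exactly where it left off. The class and method names are illustrative, not the tool's actual API.

```python
import json


class EMA:
    def __init__(self, period, state=None):
        self.alpha = 2.0 / (period + 1)
        self.value = state  # None until the first observation arrives

    def update(self, x):
        self.value = x if self.value is None else (
            self.alpha * x + (1 - self.alpha) * self.value
        )
        return self.value

    def save(self):
        # Serialize the carried state, e.g. keyed by the AST node label.
        return json.dumps({"value": self.value})

    @classmethod
    def load(cls, period, blob):
        return cls(period, state=json.loads(blob)["value"])


ema = EMA(280)
for x in [1.0, 2.0, 3.0]:
    ema.update(x)

blob = ema.save()
resumed = EMA.load(280, blob)
assert resumed.update(4.0) == ema.update(4.0)  # identical continuation
```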
I use RStudio for R programming. I remember solid IDEs from other technology stacks, like Visual Studio or Eclipse. I have two questions: What IDEs other than RStudio are used (please consider providing a brief description of them)? Do any of them have noticeable advantages over RStudio? I mostly mean debug/build/deploy features, beyond coding itself (so text editors are probably not a solution).
I am trying to connect Zoho Analytics and Python to import data from Zoho Analytics. I have already run !pip install zoho-analytics-connector. What should I do next? I am new to integrating with other BI tools, so I am unable to find a better solution. Can you guide me on this? I am following the instructions from https://pypi.org/project/zoho-analytics-connector/ and https://www.zoho.com/analytics/api/#python-library.
from __future__ import with_statement
from ReportClient import ReportClient
import sys
Now I am getting an error: Traceback (most recent call …
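The import line is the likely culprit: the PyPI package installs as zoho_analytics_connector, so ReportClient lives inside its report_client module rather than at the top level. A minimal sketch follows; the constructor argument names are taken from the project README and should be checked against the linked PyPI page for your installed version, and the credential values are hypothetical placeholders.

```python
# Import from the installed package instead of a bare ReportClient module.
from zoho_analytics_connector.report_client import ReportClient

# Placeholder credentials; substitute your own Zoho OAuth values.
rc = ReportClient(
    token="YOUR_REFRESH_TOKEN",
    clientId="YOUR_CLIENT_ID",
    clientSecret="YOUR_CLIENT_SECRET",
)
```

Note that `from __future__ import with_statement` is only needed on very old Python 2 versions and can simply be dropped on Python 3.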
I have some text corpora to share with non-programming clients (~50K documents, ~100M tokens) who would like to perform operations like regex searches, collocations, named-entity recognition, and word clustering. The tool AntConc is nice and can do some of these things, but it comes with severe size limitations and crashes on these corpora even on powerful machines. What cloud-based tools with a web interface would you recommend for this kind of task? Is there an open-source tool or a cloud service …
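To make the workload concrete, here is a small sketch of two of the operations named above (regex search and named-entity recognition) in Python with spaCy; any hosted tool would need to run something equivalent at corpus scale. It assumes the en_core_web_sm model is installed (python -m spacy download en_core_web_sm), and the sample text is made up.

```python
import re

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Acme Corp. opened a new office in Berlin in 2019."

# Regex search over raw text
capitalized = re.findall(r"\b[A-Z][a-z]+\b", text)

# Named-entity recognition
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(capitalized, entities)
```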
In most of my projects, I come up with models and want to visualize how some property $x$ varies as a function of a subset of parameters $p_1$, $p_2$, etc. So I'll often end up with "parameter scan" figures like this. Those are very helpful for explaining a model, a process, or a dataset. The problem is: I put an inordinate amount of work into producing the data necessary to generate these figures. Most of it wasted …
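One way to avoid redoing that work is to cache scan results keyed by the parameter values. A minimal sketch, assuming the model is exposed as a function f(p1, p2) -> x; the function, grid, and cache file name are all illustrative.

```python
import itertools
import json
import pathlib

import numpy as np


def f(p1, p2):
    # Stand-in for an expensive model evaluation.
    return np.sin(p1) * np.exp(-p2)


cache = pathlib.Path("scan_cache.json")
results = json.loads(cache.read_text()) if cache.exists() else {}

for p1, p2 in itertools.product(np.linspace(0, 3, 10), np.linspace(0, 1, 5)):
    key = f"{p1:.6g},{p2:.6g}"
    if key not in results:  # reuse earlier work instead of recomputing
        results[key] = float(f(p1, p2))

cache.write_text(json.dumps(results))
```

Re-running the scan with a finer grid then only evaluates the new points, so the figures can be regenerated cheaply.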
Often, when I am learning new machine learning methods or experimenting with a data analysis algorithm, I need to generate a series of 2D points. Teachers also do this often when preparing a lesson or tutorial. In some cases I just create a function, add some noise, and plot it, but there are many times when I wish I could just click my mouse on a graph to generate points. For instance, when I want to generate a fairly complex …
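Matplotlib can do exactly this out of the box: plt.ginput collects mouse clicks on an open figure. A minimal sketch; the axis limits are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.set_title("Click to add points; press Enter to finish")

pts = plt.ginput(n=-1, timeout=0)  # unlimited clicks until Enter
pts = np.array(pts)
if len(pts):
    ax.plot(pts[:, 0], pts[:, 1], "o")
    plt.show()
```

The resulting array can then be saved or fed straight into whatever algorithm is being demonstrated.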
We run a games platform with millions of users (~150,000,000 gameplays/month). We want to find tools or set up a data stack to:
- collect basic metrics for a specific game, such as average gameplay time, 1-day return rate, 7-day return rate, ...
- segment these data by any dimension that we pass along (e.g. by country, by network speed, by ...)
- generate more advanced insights for a specific game, e.g. this is the distribution …
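To pin down the first metric: a sketch of computing 1-day and 7-day return rates from raw gameplay logs with pandas, assuming a table of (user_id, game_id, played_at) rows; the column names and tiny sample are illustrative.

```python
import pandas as pd

plays = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "game_id": ["g1"] * 5,
    "played_at": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-08", "2024-01-01"]
    ),
})

# Days elapsed since each user's first play
first = plays.groupby("user_id")["played_at"].min().rename("first_play")
joined = plays.join(first, on="user_id")
joined["day_offset"] = (joined["played_at"] - joined["first_play"]).dt.days

cohort_size = joined["user_id"].nunique()
d1 = joined.loc[joined["day_offset"] == 1, "user_id"].nunique() / cohort_size
d7 = joined.loc[joined["day_offset"] == 7, "user_id"].nunique() / cohort_size
print(f"1-day return rate: {d1:.0%}, 7-day return rate: {d7:.0%}")
```

Segmenting by country, network speed, etc. is then a groupby over whatever dimension columns accompany the events.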
Dear DS StackExchange community, I'm currently searching the interwebs for a (near-)ready-to-use solution to perform a qualitative evaluation of features extracted from video data. In my head the tool looks something like the screenshot below (taken from the annotation tool Prodigy): a video is displayed at the top, and underneath it one would see a plot of a corresponding feature (selected e.g. via a drop-down menu) extracted from the video. This includes (nearly) every kind of data …
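A bare-bones version of such a viewer can be had in matplotlib: a frame on top, the feature trace below, and a slider to scrub through frames. The frames and feature here are synthetic stand-ins for real decoded video and an extracted feature.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.widgets import Slider

n_frames = 100
frames = np.random.rand(n_frames, 64, 64)      # stand-in for decoded video
feature = np.sin(np.linspace(0, 6, n_frames))  # stand-in extracted feature

fig, (ax_img, ax_feat) = plt.subplots(2, 1, figsize=(6, 6))
im = ax_img.imshow(frames[0], cmap="gray")
ax_feat.plot(feature)
cursor = ax_feat.axvline(0, color="red")

ax_slider = fig.add_axes([0.15, 0.01, 0.7, 0.03])
slider = Slider(ax_slider, "frame", 0, n_frames - 1, valinit=0, valstep=1)


def update(val):
    i = int(slider.val)
    im.set_data(frames[i])       # swap the displayed frame
    cursor.set_xdata([i, i])     # move the cursor on the feature plot
    fig.canvas.draw_idle()


slider.on_changed(update)
plt.show()
```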
I'm running an experiment where I need to collect and analyse participants' browsing and search histories. The design of the experiment is similar to an "instrumented user panel", described here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.8971&rep=rep1&type=pdf In the classic case, participants must install some kind of logger on their computers, which collects and sends browsing data to the researcher behind the scenes. Finding such tools is where I get stuck. I could, of course, just ask my participants to export their browsing histories and send them …
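If participants do send their local histories, those files are often directly readable: Chrome, for instance, keeps its history in an SQLite file named "History". A minimal read sketch, assuming a copy of that file (the path and schema are Chrome-specific, and the live file is locked while the browser runs, so work from a copy).

```python
import sqlite3

conn = sqlite3.connect("History")  # a copy of the participant's file
rows = conn.execute(
    "SELECT url, title, visit_count FROM urls "
    "ORDER BY visit_count DESC LIMIT 10"
).fetchall()
for url, title, visits in rows:
    print(visits, title, url)
conn.close()
```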
Google Data Studio has connectors to MySQL, PostgreSQL, etc., but these connections come with default names. I couldn't find out how to set the names of data sources in Google Data Studio. Is it even possible?
Is there a way to test out simple filters before committing to coding them? For example, if I want to estimate the feasibility of recognizing certain features in images, or to estimate the effort and sophistication of the required methods, can I try something out in Photoshop in order to discover "where to look", prior to coding?
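Prototyping in Python can be nearly as quick as Photoshop for this. A sketch with scikit-image applying a couple of standard filters side by side to judge whether a feature stands out; "sample.png" is a placeholder path for an RGB image.

```python
import matplotlib.pyplot as plt
from skimage import color, feature, filters, io

img = color.rgb2gray(io.imread("sample.png"))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(img, cmap="gray")
axes[0].set_title("original")
axes[1].imshow(filters.sobel(img), cmap="gray")
axes[1].set_title("sobel edges")
axes[2].imshow(feature.canny(img, sigma=2), cmap="gray")
axes[2].set_title("canny")
for ax in axes:
    ax.axis("off")
plt.show()
```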
There are numerous tools available for data science tasks, and it's cumbersome to install everything and build up a perfect system. Is there a Linux/macOS image with Python, R, and other open-source data science tools installed and available for people to use right away? An Ubuntu or other lightweight OS with the latest versions of Python and R (including IDEs) and other open-source data visualization tools installed would be ideal. I haven't come across one in my quick …
I don't have a lot of training data, and I'm looking for tools in Python, or an executable program like LabelImg, that do heavy augmentation on images; even better if they also change the bounding-box coordinates accordingly. Any help will be appreciated!
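The albumentations library does exactly this: geometric transforms are applied to the bounding boxes along with the image. A minimal sketch; the image, box, and label below are synthetic placeholders.

```python
import albumentations as A
import numpy as np

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),
        A.Rotate(limit=15, p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = np.zeros((256, 256, 3), dtype=np.uint8)
boxes = [[30, 40, 120, 160]]          # x_min, y_min, x_max, y_max
out = transform(image=image, bboxes=boxes, labels=["cat"])
print(out["bboxes"], out["labels"])   # boxes updated to match the transform
```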
I'm working at a small company. The company sells products on a website, and they have a Python script that runs every day to assign a score to each product based on a set of parameters (Google Analytics events, similar products' popularity, price, etc.). The problem is that the scoring outcome is not satisfactory, and requiring developers to edit this script arbitrarily, based on business people's assumptions, is time-consuming and not a proper way to achieve what the business …
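One common fix is to move the scoring weights out of the script into a config file that business users can edit without a developer. A sketch under assumed parameter names (the file "score_weights.json" and the metric keys are hypothetical, standing in for whatever the current script computes).

```python
import json

# score_weights.json, editable by non-developers:
# {"ga_events": 0.5, "similar_popularity": 0.3, "price": -0.2}
with open("score_weights.json") as f:
    weights = json.load(f)


def score(product):
    # product: dict of the same metrics the current script already computes
    return sum(weights[k] * product.get(k, 0.0) for k in weights)


print(score({"ga_events": 120, "similar_popularity": 0.8, "price": 20}))
```

The daily job then only reads the file, so tuning the score becomes a config change rather than a code change.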
I'm looking for an open-source tool to help my colleagues and me label images for a machine learning application. We don't actually need bounding boxes or anything that pinpoints regions within each image; we need solely global image classifications (e.g. whether the image is of a cityscape, a rural setting, etc.). The mission-critical functionality we're looking for is:
- image classification (both radio boxes and checklists)
- the ability to nest labels, e.g. if label1=cityscape then label2 is required …
I have a large-ish data set (400K records) composed of two fields (both strings). I am looking for a tool that will enable me to cluster the data, e.g. around the first column, using either exact matches or some kind of string-proximity function like Levenshtein distance. I would also like to be able to find all duplicate records and merge them into one. OpenRefine looks ideal for my purposes, but it is so slow when clustering my data or …
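A scripted alternative that stays responsive at this scale: exact-match grouping with pandas plus Levenshtein-style fuzzy lookup with rapidfuzz. A minimal sketch; the column names and tiny sample are illustrative.

```python
import pandas as pd
from rapidfuzz import fuzz, process

df = pd.DataFrame({
    "name": ["ACME Inc", "Acme Inc.", "Widget Co", "ACME Inc"],
    "value": ["a", "b", "c", "d"],
})

# Exact-duplicate clusters after light normalization of the key column
df["key"] = df["name"].str.lower().str.replace(r"[^a-z0-9]", "", regex=True)
exact_clusters = df.groupby("key")["name"].apply(list)

# Fuzzy lookup: nearest neighbours of one record by Levenshtein-based ratio
candidates = process.extract(
    "ACME Inc", df["name"].tolist(), scorer=fuzz.ratio, limit=3
)

print(exact_clusters.to_dict())
print(candidates)
```

Merging a cluster is then a groupby-aggregate over the cluster key, which avoids OpenRefine's interactive overhead.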