What is the difference between Pachyderm and Git?

I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that: it holds all your data in a central, accessible location; it updates all dependent data sets when data is added to or changed in a data set; it can run any transformation, as long as it runs in a Docker container and accepts a file as input and outputs a file as result; it versions all …
Category: Data Science

What's your ideal work environment?

I'm a founder in a data-science-heavy startup, and I'm currently functioning as the entire dev team. Before I know it, we'll have people working together on a project I currently work on completely alone. So: What are some must-have things data scientists need to work together in a production setting? What are some things data scientists expect to have done outside their scope of work? What would make their lives easier and more productive? What are some …
Topic: tools
Category: Data Science

What data/analytics tools do I need to use at my current e-commerce workplace?

I recently started a new position as a data scientist at an e-commerce company. The company was founded about 4-5 years ago and is new to many data-related areas. Specifically, I'm their first data science employee, so I have to take care of data analysis tasks as well as bringing new technologies to the company. They have used Elasticsearch (and Kibana) for reporting dashboards on daily purchases and users' interactions on their e-commerce website. They also …
Category: Data Science

Memoizing arbitrary resumable time series, trying not to re-invent the wheel

I've written a research tool that allows users to write arbitrary expressions to define time series calculated from a set of primary data sources. Many of the functions I provide carry state derived from previous values, such as an EMA (exponential moving average). For example: EMA(GetData("Foo"), 280). State contained in the component functions of these expressions can be saved and resumed via AST node labeling at compile time. This allows a series to be resumed later when any of its root data sources, which …
Category: Data Science
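
For the resumable time-series question above, here is a minimal sketch of one way to checkpoint and resume a stateful series function such as an EMA. The ResumableEMA class and the pickle-based checkpoint file are illustrative assumptions, not the asker's tool or an existing library.

```python
# Sketch only: a stateful EMA node whose state can be saved and resumed,
# so new samples from a root data source do not force a full recompute.
import pickle


class ResumableEMA:
    def __init__(self, period: int):
        self.alpha = 2.0 / (period + 1)
        self.value = None        # state derived from previous inputs
        self.last_index = -1     # position in the source series already consumed

    def update(self, samples):
        """Consume only the samples that arrived after the last checkpoint."""
        for i, x in enumerate(samples):
            if i <= self.last_index:
                continue
            self.value = x if self.value is None else (
                self.alpha * x + (1 - self.alpha) * self.value
            )
            self.last_index = i
        return self.value

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    def load(self, path):
        with open(path, "rb") as f:
            self.__dict__.update(pickle.load(f))
        return self


# Usage: compute, checkpoint, then resume when the root data source grows.
ema = ResumableEMA(period=280)
ema.update([1.0, 2.0, 3.0])
ema.save("ema_foo.ckpt")

resumed = ResumableEMA(period=280).load("ema_foo.ckpt")
resumed.update([1.0, 2.0, 3.0, 4.0, 5.0])  # only the two new samples are processed
```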

IDE alternatives for R programming (RStudio, IntelliJ IDEA, Eclipse, Visual Studio)

I use RStudio for R programming. I remember solid IDEs from other technology stacks, like Visual Studio or Eclipse. I have two questions: What IDEs other than RStudio are used (please consider providing a brief description of them)? Do any of them have noticeable advantages over RStudio? I mostly mean debug/build/deploy features, beyond coding itself (so text editors are probably not a solution).
Category: Data Science

How to integrate Zoho Analytics with Jupyter Notebook?

I am trying to connect Zoho Analytics and Python to import data from Zoho Analytics. I have already run !pip install zoho-analytics-connector. What should I do next? I am new to integrating with other BI tools, so I am unable to find a solution. Can you guide me on this? I am following the instructions from https://pypi.org/project/zoho-analytics-connector/ and https://www.zoho.com/analytics/api/#python-library. My code so far: from __future__ import with_statement; from ReportClient import ReportClient; import sys. Now I am getting an error: Traceback (most recent call …
Category: Data Science
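
A hedged note on the question above: a traceback at from ReportClient import ReportClient is usually an import-path problem, because the pip-installed package does not provide a top-level ReportClient module. The module path below is an assumption based on the package layout described on PyPI; verify it against the zoho-analytics-connector documentation before relying on it.

```python
# Assumed import paths; check the zoho-analytics-connector docs to confirm.
try:
    # pip-installed package (assumed module layout)
    from zoho_analytics_connector.report_client import ReportClient
except ImportError:
    # Zoho's standalone ReportClient.py, if it is placed next to the notebook
    from ReportClient import ReportClient

# The constructor arguments (OAuth client id / secret / refresh token vs. the
# older auth-token style) differ between the two libraries, so consult the
# linked documentation before instantiating, e.g. rc = ReportClient(...)
```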

Cloud-based visual tool to perform NLP on text corpora

I have some text corpora to share with non-programming clients (~50K documents, ~100M tokens) who would like to perform operations like regex searches, collocations, named-entity recognition, and word clustering. The tool AntConc is nice and can do some of these things, but it comes with severe size limitations and crashes on these corpora even on powerful machines. What cloud-based tools with a web interface would you recommend for this kind of task? Is there an open-source tool or a cloud service …
Topic: corpus nlp tools
Category: Data Science
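
Not a cloud recommendation, but to make the requested operations concrete, here is a small local sketch of what they look like in Python: regex search, named-entity recognition with spaCy, and collocations with NLTK. The sample text and the en_core_web_sm model are just common defaults, not requirements.

```python
import re

import spacy
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = "Barack Obama visited Paris. The Paris office reported strong growth."

# Regex search
matches = re.findall(r"\bParis\b", text)

# Named-entity recognition (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Collocations over the tokenized text
tokens = [t.text.lower() for t in doc if t.is_alpha]
finder = BigramCollocationFinder.from_words(tokens)
top_collocations = finder.nbest(BigramAssocMeasures().pmi, 5)

print(matches, entities, top_collocations, sep="\n")
```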

Reusable parameter scans wrapper

In most of my projects, I come up with models and want to visualize how some property $x$ varies as a function of a subset of parameters $p_1, p_2, \dots$ So I often end up with "parameter scan" figures that look like this. Those are very helpful for explaining a model / process / dataset. The problem is: I put an inordinate amount of work into producing the data necessary to generate these figures. Most of it wasted …
Topic: tools
Category: Data Science
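
For the parameter-scan question above, a minimal sketch of one way to avoid recomputation: cache the expensive evaluation on disk with joblib, keyed by its arguments, and rebuild the scan grid cheaply for plotting. run_model is a hypothetical stand-in for the expensive computation, not the asker's code.

```python
import itertools

import numpy as np
from joblib import Memory

memory = Memory("scan_cache", verbose=0)


@memory.cache
def run_model(p1: float, p2: float) -> float:
    # placeholder for an expensive simulation producing the property x
    return np.sin(p1) * np.cos(p2)


def scan(p1_values, p2_values):
    """Return a grid of x over the Cartesian product of the parameters."""
    grid = np.empty((len(p1_values), len(p2_values)))
    for (i, p1), (j, p2) in itertools.product(
        enumerate(p1_values), enumerate(p2_values)
    ):
        grid[i, j] = run_model(p1, p2)  # cached: reruns only new points
    return grid


x = scan(np.linspace(0, 1, 20), np.linspace(0, 2, 30))
```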

Tool to Generate 2D Data via Mouse Clicking

Often, when I am learning new machine learning methods or experimenting with a data analysis algorithm, I need to generate a series of 2D points. Teachers also do this often when making a lesson or tutorial. In some cases I just create a function, add some noise, and plot it, but there are many times when I wish I could just click my mouse on a graph to generate points. For instance, when I want to generate a fairly complex …
Topic: data tools
Category: Data Science
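
For the point-clicking question above, matplotlib already covers this interactively: ginput collects mouse clicks on an open figure and returns them as (x, y) pairs. A minimal sketch:

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.set_title("Click to add points; middle mouse button to finish")

# n=-1 allows unlimited clicks, timeout=0 disables the time limit
points = np.array(plt.ginput(n=-1, timeout=0))

if len(points):
    ax.scatter(points[:, 0], points[:, 1])
    np.savetxt("clicked_points.csv", points, delimiter=",", header="x,y")
plt.show()
```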

Tools / tech stack for generating metrics and insights for games

We run a games platform with millions of users (roughly 150,000,000 gameplays per month). We want to find tools or set up a data stack to: collect basic metrics for a specific game, such as average gameplay time, 1-day return rate, 7-day return rate, ...; segment these data by any dimension that we pass along (e.g. by country, by network speed, by ...); and generate more advanced insights for a specific game, e.g. this is the distribution …
Category: Data Science
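
To make the basic metrics above concrete, here is a rough pandas sketch of average gameplay time and a 1-day return rate computed from a raw play-event table. Column names such as user_id, started_at, and duration_s are assumptions about the schema, not a prescription for any particular stack.

```python
import pandas as pd

plays = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "game_id": ["g1"] * 5,
    "started_at": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01", "2024-01-09"]
    ),
    "duration_s": [120, 200, 90, 300, 60],
    "country": ["BE", "BE", "NL", "BE", "BE"],
})

# Average gameplay time per game, segmented by an arbitrary dimension
avg_time = plays.groupby(["game_id", "country"])["duration_s"].mean()

# 1-day return rate: share of users who play again within 1 day of first play
first_play = plays.groupby("user_id")["started_at"].min().rename("first_play")
joined = plays.join(first_play, on="user_id")
returned = (
    (joined["started_at"] > joined["first_play"])
    & (joined["started_at"] <= joined["first_play"] + pd.Timedelta(days=1))
)
day1_return_rate = (
    joined.loc[returned, "user_id"].nunique() / plays["user_id"].nunique()
)

print(avg_time, day1_return_rate, sep="\n")
```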

UI-based Tool for Qualitative Evaluation of Data Quality

Dear DS StackExchange community, I'm currently searching the web for a (near-)ready-to-use solution to perform a qualitative evaluation of features extracted from video data. In my head the tool looks something like the screenshot below (taken from the annotation tool Prodigy), in the sense that a video is displayed at the top and underneath it one would see a plot of a corresponding feature (selected e.g. via a drop-down menu) extracted from the video. This includes (nearly) every kind of data …
Category: Data Science

What tools are out there to collect participants' browsing and/or search data as part of an experiment?

I'm running an experiment where I need to collect and analyse participants' browsing and search histories. The design of the experiment is similar to an "instrumented user panel", described here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.8971&rep=rep1&type=pdf In the classic case, participants must install some kind of logger on their computers, which collects and sends browsing data to the researcher behind the scenes. Finding such tools is where I get stuck. I could, of course, just ask my participants to export their browsing histories and send them …
Topic: tools
Category: Data Science

Is there a way to test out simple filters before committing to coding them?

Is there a way to test out simple filters before committing to coding them? For example, if I want to estimate the feasibility of recognizing certain features in images, or to estimate the effort and sophistication of the required methods, can I try something out in Photoshop or a similar tool in order to discover where to look, prior to coding?
Category: Data Science
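
Beyond Photoshop, a few lines of scikit-image give a quick, code-light way to eyeball whether simple filters expose the feature of interest before committing to a full pipeline. A minimal sketch using a bundled sample image (swap in your own):

```python
import matplotlib.pyplot as plt
from skimage import color, data, filters

image = color.rgb2gray(data.astronaut())  # any sample or own image

candidates = {
    "original": image,
    "gaussian blur": filters.gaussian(image, sigma=3),
    "sobel edges": filters.sobel(image),
    "otsu threshold": image > filters.threshold_otsu(image),
}

fig, axes = plt.subplots(1, len(candidates), figsize=(12, 3))
for ax, (name, result) in zip(axes, candidates.items()):
    ax.imshow(result, cmap="gray")
    ax.set_title(name)
    ax.axis("off")
plt.show()
```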

VM image for data science projects

There are numerous tools available for data science tasks, and it's cumbersome to install everything and build up a perfect system. Is there a Linux/macOS image with Python, R, and other open-source data science tools installed and available for people to use right away? An Ubuntu or other lightweight OS with the latest versions of Python and R (including IDEs) and other open-source data visualization tools installed would be ideal. I haven't come across one in my quick …
Topic: python r tools
Category: Data Science

Data science tools for easing a business's participation in their scoring system

I'm working in a small company. The company sells products on a website, and they have a Python script that runs every day to assign a score to each product based on a set of parameters (Google Analytics events, similar products' popularity, price, etc.). The problem is that the scoring outcome is not satisfactory, and requiring developers to edit this script arbitrarily, based on business people's assumptions, is time-consuming and not a proper way to achieve what the business …
Category: Data Science
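
One common pattern for the scoring question above is to move the weights (and any feature transforms) out of the script and into a plain config that business people can edit, keeping the code a thin evaluator. The field names below are illustrative assumptions, not the company's actual parameters.

```python
import json

# e.g. stored in scoring_config.json and editable by non-developers
config = {
    "weights": {
        "ga_events": 0.5,
        "similar_popularity": 0.3,
        "price_attractiveness": 0.2,
    }
}


def score_product(product: dict, cfg: dict) -> float:
    """Weighted sum of (already normalised) product features."""
    return sum(
        weight * product.get(feature, 0.0)
        for feature, weight in cfg["weights"].items()
    )


product = {"ga_events": 0.8, "similar_popularity": 0.4, "price_attractiveness": 0.9}
print(score_product(product, config))

# persist the config so the business can tweak it via a PR or an admin UI
with open("scoring_config.json", "w") as f:
    json.dump(config, f, indent=2)
```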

Suggestions for Open-Source Tool for Image Classifications (with Nesting)

I'm looking for an open-source tool to help my colleagues and me label images for a machine learning application. We don't actually need bounding boxes or anything to pinpoint regions within each image; instead we need only global image classifications (e.g. whether the image is of a cityscape, a rural setting, etc.). The mission-critical functionality we're looking for is: image classification (both radio boxes and checklists); the ability to nest labels, e.g. if label1=cityscape then label2 is required …
Category: Data Science

Tool for clustering and cleansing data set

I have a large-ish data set (400K records) composed of two fields (both strings). I am looking for a tool that will enable me to cluster the data e.g. around the first column, either using exact matches or some kind of string proximity function like Levenshtein distance. I would also like to be able to find all duplicate records and merge them into one. OpenRefine looks ideal for my purposes but it is so slow when clustering my data or …
Category: Data Science
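
For the clustering/merging question above, a minimal pandas sketch of the core steps: normalise the string field into a key, cluster keys greedily with a Levenshtein-style similarity (rapidfuzz here; difflib.SequenceMatcher from the standard library also works), then keep one record per cluster. For 400K rows you would add blocking (e.g. only compare rows sharing a key prefix) to avoid an all-pairs comparison.

```python
import pandas as pd
from rapidfuzz import fuzz

df = pd.DataFrame({
    "name": ["Acme Ltd", "ACME Ltd.", "Acme Limited", "Widget Co", "Widget Company"],
    "value": ["a", "b", "c", "d", "e"],
})

# normalised key, similar in spirit to OpenRefine's key-collision method
df["key"] = df["name"].str.lower().str.replace(r"[^a-z0-9]", "", regex=True)


def assign_clusters(keys, threshold=80):
    """Greedy clustering: attach each key to the first representative it matches."""
    reps, labels = [], []
    for key in keys:
        for cid, rep in enumerate(reps):
            if fuzz.ratio(key, rep) >= threshold:
                labels.append(cid)
                break
        else:
            reps.append(key)
            labels.append(len(reps) - 1)
    return labels


df["cluster"] = assign_clusters(df["key"].tolist())

# merge duplicates: keep one representative record per cluster
merged = df.groupby("cluster", as_index=False).first()
print(merged)
```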
