I've written a research tool that allows users to write arbitrary expressions defining time series calculated from a set of primary data sources. Many of the functions I provide carry state derived from previous values, such as EMA. For example: EMA(GetData("Foo"), 280). State contained in the component functions of these expressions can be saved and resumed via AST node labeling at compile time. This allows a series to be resumed later when any of its root data sources, which …
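The save/resume idea above can be illustrated with a minimal incremental EMA whose entire state fits in one small serializable object. This is a sketch under assumed names (`EMAState`, `ema_step` are hypothetical, not from the tool described):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EMAState:
    """Serializable state for an incremental EMA node (hypothetical)."""
    period: int
    value: Optional[float] = None  # last EMA value; None until the first sample

def ema_step(state: EMAState, sample: float) -> float:
    """Advance the EMA by one sample, mutating the state, and return the new value."""
    alpha = 2.0 / (state.period + 1)
    if state.value is None:
        state.value = sample  # seed with the first observation
    else:
        state.value = alpha * sample + (1 - alpha) * state.value
    return state.value

# Resuming: persist `state` (e.g. pickled/JSON-encoded, keyed by the AST
# node label), reload it later, and keep feeding new samples.
```

Because the state is just `(period, value)`, saving it per labeled AST node is enough to resume the whole expression without replaying history.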
I am working on a backend for structuring and submitting data to an ML model. I have three questions about this process: (1) What is the best method to feed the model continuous data (updated at 30-minute intervals)? (2) What is the best method to deliver the datasets? (There are two being compared; should I consolidate the data on my side, or leave that to the model?) (3) How should the results be exported from the model and input into my …
I'm looking for a corpus of toy tabular datasets that can be used to test data profiling, machine learning, data manipulation, and similar software. Some example attributes:
- Strange column names (empty strings, very long names, duplicate names, names with spaces, periods, syntax characters, escaped delimiters and tokens)
- Non-rectangular layouts
- Mixed scientific notation in floats, inf literals
- Row-empty or column-empty
- Mixed file encodings
- Numeric and string values designed to overflow memory buffers, cause truncation, or round to int
- Ambiguous and invalid dates
- Diacritics, emojis

I was going to …
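If no ready-made corpus turns up, one fallback is to generate such pathological files programmatically. A minimal sketch (the field values here are illustrative, not from any existing corpus) covering a few of the attributes above:

```python
import csv
import io

# Build a tiny "torture" CSV: empty and duplicate column names, an embedded
# delimiter in a header, an inf literal, ambiguous/invalid dates, an embedded
# newline, non-ASCII text, and a non-rectangular row.
rows = [
    ["", "name", "name", "value, with comma", "date"],  # odd headers
    ["1", "naïve 🚀", "dup", "1e308", "03/04/05"],       # ambiguous date
    ["2", "line\nbreak", "dup", "inf", "2020-13-45"],    # invalid date
    ["3"],                                               # non-rectangular row
]
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
print(buf.getvalue())
```

`csv.QUOTE_MINIMAL` quotes only the fields that need it (the embedded comma and newline), which itself exercises a parser's quoting logic.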
I have been asked to unit-test my machine learning model (not the code that made the model). Since we wouldn't actually know in advance what predictions the model makes, how do we carry out unit tests that check the model's predictions? How is this done? EDIT 1: The machine learning model I have is trained on tabular patient data. Let's take the example of cancer prediction (I am not allowed to disclose the actual one, but this example is very close). It takes multiple …
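One common answer to "we don't know the exact predictions" is to test behavioral properties rather than exact outputs: the output range, invariance to irrelevant fields, and directional expectations. A minimal sketch with a stand-in scoring function (the function, feature names, and weights are all hypothetical, just so the tests can run):

```python
def predict_risk(features: dict) -> float:
    """Stand-in for the real trained model; returns a risk score in [0, 1]."""
    # Hypothetical scoring rule, only here so the tests below are runnable.
    score = 0.02 * features["age"] + 0.3 * features["tumor_size_cm"]
    return min(1.0, max(0.0, score / 5.0))

def test_output_range():
    # Predictions must always be valid probabilities/scores.
    p = predict_risk({"age": 55, "tumor_size_cm": 2.0, "patient_id": 7})
    assert 0.0 <= p <= 1.0

def test_invariance_to_irrelevant_field():
    # Changing an irrelevant field (patient ID) must not change the prediction.
    a = predict_risk({"age": 55, "tumor_size_cm": 2.0, "patient_id": 7})
    b = predict_risk({"age": 55, "tumor_size_cm": 2.0, "patient_id": 99})
    assert a == b

def test_monotone_in_tumor_size():
    # A larger tumour should not lower the predicted risk.
    small = predict_risk({"age": 55, "tumor_size_cm": 1.0})
    large = predict_risk({"age": 55, "tumor_size_cm": 3.0})
    assert large >= small
```

The same three test shapes (range, invariance, monotonicity) transfer to the real model by swapping `predict_risk` for its prediction call.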
I am new to deployment and have a basic question about deploying my ML code on a client's VM. I have built a Python project which collects data from the client site, processes it, makes predictions, and displays the results in a dashboard. I have to use client VMs for deployment. Is there a way for me to hide the code, or do something to it, so that the client cannot see my code and reuse it for other purposes? Might sound trivial but …
[I agree this is a very opinionated question, so moderators should feel free to vote to close it if they feel that is right; but since I find endless pros and cons on the Internet, I've decided to ask the community here.] Surface Pro 6 or MacBook Pro for a Data Scientist job? About 8 years ago I was a Windows user. The most annoying part was that it was quite unstable. Note that I was not a developer …
I have built an application with Tkinter in Python 3 and I want to package it with all its dependencies. I want to build a .exe from my Python script that installs Python 3 and the required packages/dependencies, and installs my script as a .exe. I have heard of py2exe, but is that recommended? How should I do this, and which software is recommended for it? I do not have experience in packaging and distributing …
After searching on Google for quite some time, I could not find suitable software or a toolbox that can manage neural network training runs. I am thinking of a program that combines visualization techniques without the need to write code, can compare several training runs of neural networks, and can store them easily. Does a program like this exist? Regards, Lukas
I have the below sets of data per application; you can call them software metrics. These metrics vary with the size of an application: Bugs, CodeSmells, Vulnerabilities. The size of an application is measured by LOC (lines of code). How can I showcase the complexity of each app relative to its lines of code if I visualize each of these parameters? Example:

App       Bugs      LOC
SweetApp    10    10000
SourApp    120  5660000
SaltyApp    55     1500

How do I visualize Bugs …
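One standard way to make the apps comparable is to normalize each count by size, e.g. bugs per thousand lines of code (defect density). A small sketch using the example figures from the question (the bugs/KLOC convention is an assumption about what "relative to LOC" should mean):

```python
# Normalize raw defect counts by size: bugs per thousand lines of code (KLOC).
apps = {
    "SweetApp": (10, 10_000),     # (bugs, LOC)
    "SourApp": (120, 5_660_000),
    "SaltyApp": (55, 1_500),
}

for name, (bugs, loc) in apps.items():
    density = bugs / (loc / 1000)  # bugs per KLOC
    print(f"{name:10s} {density:8.3f} bugs/KLOC")
```

On these numbers, SaltyApp has by far the highest density despite the fewest raw bugs, which is exactly the distortion a raw bug count would hide; the densities are then directly comparable in a single bar chart.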
Short question: I want to learn how to construct data science packages on top of core packages. Is there a list of excellent data science packages I can learn from? Long question: I recently came across an excellent video where Joel Grus live-codes a neural network library in Python. As an inexperienced data scientist without a software engineering background, this was the first time I saw the construction of a "complete" data science package from scratch. My data analysis …
I'm from a programming background. I'm now learning analytics: concepts from basic statistics up to model building, like linear regression, logistic regression, time-series analysis, etc. Since my previous experience is entirely in programming, I would like to do some analysis on the data a programmer has. Say we have the details below (I'm using an SVN repository): person name, code check-in date, file checked in, number of times checked in, branch, check-in date and time, build version, number of defects, defect date, file that has …
I'm working on a consulting project for a tech client, and caught myself scratching my head about the best way to present an advanced analytics workflow. What will be shown to the panel will focus on results, but in this particular case it is warranted to show a visual of the process behind the scenes. Specifically, I need to show the following: 1) some raw data file is used as input to a cleaning script, which performs …
I have been working on a project as part of my master's degree, in partnership with a firm. Over the past few months I developed a predictive model that is essentially a document classification model. The biggest limitation of the research and the model is the lack of data available for training: I have a small data set of 300 documents, whereas the features number in excess of 15,000 terms (before feature selection). How do we identify or estimate the …
I often use Nose, Tox, or unittest when testing my Python code, especially when it has to be integrated with other modules or other pieces of code. However, now I find myself using R more than Python for ML modelling and development, and I've realized that I don't really test my R code (and, more importantly, I don't know how to do it well). So my question is: what are good packages that allow you to test R code …
Twitter is a popular source of data for many applications, especially those involving sentiment analysis and the like. I have some things I'm interested in doing with Twitter data, but here's the issue: to get all Tweets, you have to get special permission from Twitter (which, as I understand it, is never granted) or pay big bucks to Gnip or the like. OTOH, Twitter's API documentation says: "Few applications require this level of access. Creative use of a combination of other …
We are currently developing customer relationship management software for SMEs. What I'd like to structure for our future CRM is a social-based approach (Social CRM). We will therefore allow our users (SMEs) to integrate the CRM with their social network accounts. The CRM will also enhance intercorporate communication within the owning company. All the processes I've indicated above will certainly generate lots of unstructured data. I am wondering how we can integrate big data and data mining …