We are processing a fairly large number of JSON objects (hundreds of thousands daily) and need to gather insights about them. For each field of the processed JSON objects we want to collect the following analytics: the percentage of objects in which the field is present or missing (null/empty can be treated as equivalent to missing); the possible values and, for high-frequency values, the percentage of objects using each; and the possible numbers of elements in an array and the percentage at which each count occurs. Because the …
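The per-field counters described above can be accumulated in a single pass. This is a minimal sketch for flat JSON objects (the class name and method names are my own; nested objects would need recursive flattening first):

```python
from collections import Counter, defaultdict

class FieldStats:
    """Accumulate per-field presence, value-frequency, and array-length
    statistics over a stream of flat JSON objects (illustrative sketch)."""

    def __init__(self):
        self.total = 0
        self.present = Counter()                 # field -> count present
        self.values = defaultdict(Counter)       # field -> value -> count
        self.array_lens = defaultdict(Counter)   # field -> len -> count

    def add(self, obj):
        self.total += 1
        for key, value in obj.items():
            if value is None or value == "" or value == []:
                continue  # treat null/empty as missing, per the question
            self.present[key] += 1
            if isinstance(value, list):
                self.array_lens[key][len(value)] += 1
            else:
                self.values[key][value] += 1

    def presence_pct(self, field):
        return 100.0 * self.present[field] / self.total
```

For the "high-frequency values" requirement, `values[field].most_common(n)` gives the top values with their counts; dividing by `total` yields the percentages.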
I have a process that (simply put) starts every 5 minutes, collects data, and puts that data into the database. In more detail: the process starts, collects data (which takes some time), and puts it on a Kafka topic (which takes some time). Finally, data from the Kafka topic is consumed into the database (which also takes some time). Every record in the database has its insertedOn time rounded up to the second. When I count records (for 4 hours) by …
I'm dealing with outlier detection in data streams. I'm looking for a way to summarize my data and obtain important statistics such as the mean, variance, etc. I want to know whether cluster features (micro-clusters) are suitable for this or not.
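For context, a cluster feature (CF) in the BIRCH/CluStream sense is the triple (N, LS, SS): the count, linear sum, and squared sum of the points it summarizes. It supports incremental updates and recovers exactly the statistics mentioned above. A minimal one-dimensional sketch:

```python
class ClusterFeature:
    """CF / micro-cluster summary (N, LS, SS) for 1-D values: constant
    memory, incremental updates, and exact recovery of mean and variance."""

    def __init__(self):
        self.n = 0      # N:  number of points absorbed
        self.ls = 0.0   # LS: linear sum of values
        self.ss = 0.0   # SS: sum of squared values

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def mean(self):
        return self.ls / self.n

    def variance(self):
        m = self.mean()
        return self.ss / self.n - m * m
```

Two CFs can also be merged by adding their components, which is what makes them attractive for distributed or windowed stream summarization.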
I would look in an arbitrary window (already relative, so unbased), count each element (its repetitions), then rank them, continuously looking, grouping, and summing similar "strings". There are two options: keep a limited bag of results, or an unlimited one. The logical problem I see with a limited bag of ranked strings, say the top 10, is that if I remove all keywords with no repetitions, or those with fewer repetitions, I lose track of them and would have to re-count them from zero. Unlimited bag …
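The "re-count from zero" problem with a limited bag is exactly what the Space-Saving heavy-hitters algorithm addresses: an evicted item's counter is reassigned rather than reset, so a returning item inherits the evicted minimum as a bounded over-estimate instead of restarting at zero. A minimal sketch (not an allusion to any specific library):

```python
class SpaceSaving:
    """Space-Saving heavy-hitters: a bounded bag of counters. When full,
    the minimum counter is handed to the new item and incremented, so no
    item's count is ever reset to zero; the error per item is bounded by
    the minimum counter value at eviction time."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Reassign the smallest counter to the incoming item.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

A useful invariant: the counters always sum to the total number of items seen, so frequent strings cannot be silently lost the way they would be with plain evict-and-recount.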
I am doing research in Continuous Machine Learning / Lifelong Learning. Two of the use cases I came across were predicting failures and anomaly detection using log stream data. However, there are already log analytics tools that can do anomaly detection and trigger alerts. I couldn't find any tools that can predict failures. Are there tools that can predict failures too? Is it possible to predict failures using log data (e.g., system failures)? Do we need machine learning to predict …
Well, I posted the same question on the main stack before finding the right place, sorry. A friend of mine is working with more than 100 videos as samples for his neural network. Each video lasts more than a couple of minutes at around 24 frames per second. The objective, using deep learning, is to detect movement across all the samples. The problem for him is the quantity of data he is dealing with. The training part requires/consumes too …
Can anyone recommend a method for summarizing and processing high-dimensional data streams efficiently and effectively for anomaly detection? I investigated the different methods for data stream summarization (sampling, histograms, sketches, wavelets, sliding windows) and am confused about the choice. I noticed that sampling and sliding windows are general-purpose and keep the raw data, while the others are task-specific and transform the data. I am interested in the first case, but it may …
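Of the raw-data-preserving options mentioned above, reservoir sampling is the classic baseline: it keeps a uniform random sample of fixed size from a stream of unknown length, in constant memory, without any transformation of the points. A sketch of Algorithm R:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: keep a uniform random sample of k raw items from a
    stream of unknown length, using O(k) memory. The i-th item replaces
    a reservoir slot with probability k/(i+1), which yields uniformity."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir
```

Because the sample consists of unmodified high-dimensional points, any downstream anomaly detector can be run on it directly; a sliding window would be the alternative when recency matters more than uniformity.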
I am trying to find a vector that would describe the effect of wind on a multirotor. I have a bunch of data logs from a single multirotor frame and am of a mind to dig in. The idea is that during flight a multirotor has two forces to fight against: gravity, to stay in the air and hold altitude, and wind. In a world without wind the multirotor would only fight against gravity when hovering, and when flying in some …
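Under the two-force picture above, a steady hover gives a simple estimator: thrust, gravity, and wind force sum to zero, so the wind force is the negative of the residual after removing gravity from the logged thrust vector. A sketch under those assumptions (z-up frame; the function name and log fields are hypothetical, and real logs would also need attitude and drag modelling):

```python
import numpy as np

def wind_force_estimate(thrust_vec, mass, g=9.81):
    """At equilibrium hover: F_thrust + F_gravity + F_wind = 0, so
    F_wind = -(F_thrust + F_gravity). Assumes a z-up world frame with
    thrust_vec already rotated into it; a rough sketch, not a flight-
    dynamics model."""
    gravity = np.array([0.0, 0.0, -mass * g])
    return -(thrust_vec + gravity)
```

In still air the estimate comes out near zero; any lateral component the vehicle commands to hold position shows up as an equal-and-opposite wind force vector.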
For fun, I want to design a convolutional neural net to recognize enemy NPCs in a first-person shooter. I have captured 100 JPEGs of the NPCs as well as 100 JPEGs of non-NPCs. I have successfully trained a really simple ConvNet to identify NPCs. This was really easy because the game actually highlights the NPCs with a red marker to let humans identify them, which makes it super easy for a machine learning algorithm to find them. Great, so …
I would like to ask whether Savitzky-Golay filtering can be implemented on real-time data. I have used it on a fixed-size array, but would like to extend it to output values for real-time sensor data. Can anyone refer me to an appropriate implementation or hint at an online implementation? Thanks.
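One common way to make Savitzky-Golay causal is to fit the polynomial over a trailing buffer and evaluate it at the newest sample, accepting the extra lag/edge bias that comes with one-sided fitting. A minimal sketch (the class is my own, not a library API; SciPy's `savgol_coeffs` can also produce one-sided coefficients via its `pos` argument):

```python
from collections import deque
import numpy as np

class StreamingSavGol:
    """One-sided (causal) Savitzky-Golay smoother: fit a low-degree
    polynomial to the most recent `window` samples and evaluate it at
    the newest one. Illustrative sketch for streaming sensor data."""

    def __init__(self, window=7, degree=2):
        self.window = window
        self.degree = degree
        self.buf = deque(maxlen=window)

    def update(self, x):
        self.buf.append(x)
        if len(self.buf) < self.window:
            return x  # not enough history yet; pass the sample through
        t = np.arange(self.window)
        coeffs = np.polyfit(t, np.fromiter(self.buf, float), self.degree)
        return np.polyval(coeffs, self.window - 1)
```

Since the fit coefficients for a fixed window length reduce to a fixed dot product with the buffer, the per-sample cost can be made O(window) by precomputing them once.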
Let's say I have a supervised learning problem with a sequence of features and labels. First, I learn on the training data, and then I decide to stream in data point by point and do online learning. Is it possible to update the weights or derive the feature importances as each data point comes in? Also, which online learning algorithms would allow me to do this, and can this be done in Python?
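Yes, this is the standard SGD setting: each incoming point triggers one gradient step, and the current weight vector is inspectable at any time. In Python, scikit-learn's `SGDClassifier.partial_fit` does exactly this; the mechanics can be sketched in a few lines of NumPy (the class and method names here are my own):

```python
import numpy as np

class OnlineLogReg:
    """Minimal online logistic regression: one SGD step per incoming
    (x, y) pair, with weights available after every update. Sketch of
    the idea behind e.g. sklearn's SGDClassifier.partial_fit."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def partial_fit(self, x, y):
        p = 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))  # P(y=1 | x)
        g = p - y                                         # log-loss gradient
        self.w -= self.lr * g * x
        self.b -= self.lr * g

    def feature_importances(self):
        # For a linear model, |weight| is a rough per-feature importance
        # (assuming standardized features).
        return np.abs(self.w)
```

Other natural fits for this pattern are passive-aggressive classifiers, online gradient-boosted stubs, and the Hoeffding trees found in streaming libraries such as River.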
Popular counting sketches (LogLog, HyperLogLog, etc.) feature natural union operations. Are there any known counting sketches that feature natural intersection operations?
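Two standard answers: HyperLogLog intersections are usually done indirectly via inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|, while MinHash signatures support intersection more directly through the Jaccard estimate J ≈ (matching signature slots) / k, giving |A ∩ B| ≈ J · |A ∪ B|. A small MinHash sketch (salted-hash construction; names are my own):

```python
import hashlib

def minhash(items, k=256):
    """k-permutation MinHash signature, simulating the k permutations
    with salted MD5 hashes. Illustrative sketch, not a tuned library."""
    return [
        min(int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16)
            for x in items)
        for i in range(k)
    ]

def jaccard_est(sig_a, sig_b):
    """Fraction of matching slots is an unbiased estimate of Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The union signature is the element-wise minimum of two signatures, so an intersection cardinality can be estimated entirely from sketches: Jaccard from slot agreement, union size from a companion cardinality sketch.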
From a data stream I'm receiving a pair of measurements every second, consisting of a current consumption and a charge percentage. Accumulating the consumption over time will eventually represent the maximum capacity once the percentage has gone from 100% to 0%. I want to predict the maximum capacity in (almost) real time using linear regression with a small sample-size window of two percent. However, when I compare the models of these local regressions over every two percent with …
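The per-window estimate described above reduces to one slope: if cumulative consumption is regressed against the remaining percentage, the slope is (negative) capacity-per-percent, and scaling by 100 gives the full-capacity prediction. A sketch with a hypothetical helper name:

```python
import numpy as np

def capacity_from_window(pcts, cum_consumption):
    """Estimate maximum capacity from one small sliding window:
    regress cumulative consumption on remaining percentage; the slope
    is -capacity/100, so scale it back up. Hypothetical helper, assuming
    pcts decreases as cum_consumption grows."""
    slope, _ = np.polyfit(pcts, cum_consumption, 1)
    return -100.0 * slope
```

On noiseless data every two-percent window yields the same capacity; on real data the window-to-window disagreement you observe reflects sensor noise and non-linearity of the percentage gauge, which is why the local models diverge.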
I'm totally new to the topic of real-time bidding, in which I know machine learning algorithms are used pretty often. Can somebody explain the system to me in plain language, i.e., for a non-technical person? What is the bidding? Who bids on what? Where does machine learning get involved? What is cookie matching mainly about?
I want to implement streaming Naive Bayes in a distributed system. What is the best approach to choosing a framework? Should I choose: Storm alone, implementing streaming Naive Bayes myself in a Storm topology; Storm + Trident-ML; Storm + SAMOA; or Spark Streaming + MLlib? What is the best framework set to choose and start working with? Any suggestion would be of great help.
Consider a stream containing tuples (user, new_score) representing users' scores in an online game. The stream could have 100-1,000 new elements per second, and the game has 200K-300K unique players. I would like to run some standing queries such as: which players posted more than x scores in a sliding window of one hour, and which players gained x% score in a sliding window of one hour. My question is: which open-source tools can I employ to jumpstart this project? …
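For scale intuition, the first standing query only needs a per-user deque of recent timestamps; at 200K-300K players and 1,000 events/second that fits comfortably in one process's memory. A sketch of the windowed count (class and method names are my own; engines like Flink or Kafka Streams provide this as built-in sliding windows):

```python
from collections import defaultdict, deque

class SlidingScoreCounter:
    """Count score postings per user over the last `window_sec` seconds.
    Illustrative in-memory sketch of the 'more than x scores per hour'
    standing query."""

    def __init__(self, window_sec=3600):
        self.window = window_sec
        self.events = defaultdict(deque)  # user -> timestamps in window

    def add(self, user, ts):
        q = self.events[user]
        q.append(ts)
        while q and q[0] <= ts - self.window:
            q.popleft()  # evict timestamps that fell out of the window

    def over_threshold(self, x, now):
        hot = []
        for user, q in self.events.items():
            while q and q[0] <= now - self.window:
                q.popleft()
            if len(q) > x:
                hot.append(user)
        return hot
```

The x%-gain query is the same shape but keeps (timestamp, score) pairs and compares the oldest in-window score against the newest.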
(Note: I pulled this question from the list of questions on Area 51, but I believe the question is self-explanatory. That said, I believe I get its general intent and, as a result, am likely able to field any questions about it that might pop up.) Which big data technology stack is most suitable for processing tweets, extracting/expanding URLs, and pushing (only) new links into a third-party system?