We are processing a fairly large number of JSON objects (hundreds of thousands daily) and need to gather insights about them. For each field of the processed JSON objects we want to collect the following analytics: the percentage of objects in which the field is present or missing (null/empty can be treated as equivalent to missing); the possible values and, for high-frequency values, the percentage of objects using each; and the possible numbers of elements in an array and the percentage at which each count occurs. Because the …
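The per-field counters described above can be accumulated in a single pass. This is a minimal sketch for flat JSON objects (the class name and method names are my own; nested objects would need recursive flattening first):

```python
from collections import Counter, defaultdict

class FieldStats:
    """Accumulate per-field presence, value-frequency, and array-length
    statistics over a stream of flat JSON objects (illustrative sketch)."""

    def __init__(self):
        self.total = 0
        self.present = Counter()                 # field -> count present
        self.values = defaultdict(Counter)       # field -> value -> count
        self.array_lens = defaultdict(Counter)   # field -> len -> count

    def add(self, obj):
        self.total += 1
        for key, value in obj.items():
            if value is None or value == "" or value == []:
                continue  # treat null/empty as missing, per the question
            self.present[key] += 1
            if isinstance(value, list):
                self.array_lens[key][len(value)] += 1
            else:
                self.values[key][value] += 1

    def presence_pct(self, field):
        return 100.0 * self.present[field] / self.total
```

For the "high-frequency values" requirement, `values[field].most_common(n)` gives the top values with their counts; dividing by `total` yields the percentages.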
I have a process that (simply put) starts every 5 minutes, collects data, and puts that data into the database. In more detail: the process starts, collects data (which takes some time), and puts it on a Kafka topic (which takes some time). Finally, data from the Kafka topic is consumed into the database (which also takes some time). Every record in the database has its insertedOn time rounded up to the second. When I count records (for 4 hours) by …
I'm dealing with outlier detection in data streams. I'm looking for a way to summarize my data and obtain important statistics such as the mean, variance, etc. I want to know whether cluster features (micro-clusters) are suitable for this or not.
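For context, a cluster feature (CF) in the BIRCH/CluStream sense is the triple (N, LS, SS): the count, linear sum, and squared sum of the points it summarizes. It supports incremental updates and recovers exactly the statistics mentioned above. A minimal one-dimensional sketch:

```python
class ClusterFeature:
    """CF / micro-cluster summary (N, LS, SS) for 1-D values: constant
    memory, incremental updates, and exact recovery of mean and variance."""

    def __init__(self):
        self.n = 0      # N:  number of points absorbed
        self.ls = 0.0   # LS: linear sum of values
        self.ss = 0.0   # SS: sum of squared values

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def mean(self):
        return self.ls / self.n

    def variance(self):
        m = self.mean()
        return self.ss / self.n - m * m
```

Two CFs can also be merged by adding their components, which is what makes them attractive for distributed or windowed stream summarization.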
I would look in an arbitrary window (already relative, so unbased), count each element (its repetitions), then rank them, continuously looking, grouping, and summing similar "strings". There are two options: keep a limited bag of results, or an unlimited one. The logical problem I see with a limited bag of ranked strings, say the top 10, is that if I remove all keywords with no repetitions, or those with fewer repetitions, I lose track of them and would have to re-count them from zero. Unlimited bag …
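The "re-count from zero" problem with a limited bag is exactly what the Space-Saving heavy-hitters algorithm addresses: an evicted item's counter is reassigned rather than reset, so a returning item inherits the evicted minimum as a bounded over-estimate instead of restarting at zero. A minimal sketch (not an allusion to any specific library):

```python
class SpaceSaving:
    """Space-Saving heavy-hitters: a bounded bag of counters. When full,
    the minimum counter is handed to the new item and incremented, so no
    item's count is ever reset to zero; the error per item is bounded by
    the minimum counter value at eviction time."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Reassign the smallest counter to the incoming item.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

A useful invariant: the counters always sum to the total number of items seen, so frequent strings cannot be silently lost the way they would be with plain evict-and-recount.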
I am doing research in Continuous Machine Learning / Lifelong Learning. Two of the use cases I came across were predicting failures and anomaly detection using log stream data. However, there are already log analytics tools that can do anomaly detection and trigger alerts. I couldn't find any tools that can predict failures. Are there tools that can predict failures too? Is it possible to predict failures using log data (e.g., system failures)? Do we need machine learning to predict …
Well, I posted the same question on the main stack before finding the right place, sorry. A friend of mine is working with more than 100 videos as samples for his neural network. Each video lasts more than a couple of minutes at around 24 frames per second. The objective, using deep learning, is to detect movement across all the samples. The problem for him is the quantity of data he is dealing with. The training part requires/consumes too …
Can anyone recommend a method for summarizing and processing high-dimensional data streams efficiently and effectively for anomaly detection? I investigated the different methods for data stream summarization (sampling, histograms, sketches, wavelets, sliding windows) and am confused about the choice. I noticed that sampling and sliding windows are general-purpose and keep the raw data, while the others are task-specific and transform the data. I am interested in the first case, but it may …
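Of the raw-data-preserving options mentioned above, reservoir sampling is the classic baseline: it keeps a uniform random sample of fixed size from a stream of unknown length, in constant memory, without any transformation of the points. A sketch of Algorithm R:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: keep a uniform random sample of k raw items from a
    stream of unknown length, using O(k) memory. The i-th item replaces
    a reservoir slot with probability k/(i+1), which yields uniformity."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir
```

Because the sample consists of unmodified high-dimensional points, any downstream anomaly detector can be run on it directly; a sliding window would be the alternative when recency matters more than uniformity.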
I am trying to find a vector that would describe the effect of wind on a multirotor. I have a bunch of data logs from a single multirotor frame and am of a mind to dig in. The idea is that during flight a multirotor has two forces to fight against: gravity, to stay in the air and hold altitude, and wind. In a world without wind the multirotor would only fight against gravity when hovering, and when flying in some …
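Under the two-force picture above, a steady hover gives a simple estimator: thrust, gravity, and wind force sum to zero, so the wind force is the negative of the residual after removing gravity from the logged thrust vector. A sketch under those assumptions (z-up frame; the function name and log fields are hypothetical, and real logs would also need attitude and drag modelling):

```python
import numpy as np

def wind_force_estimate(thrust_vec, mass, g=9.81):
    """At equilibrium hover: F_thrust + F_gravity + F_wind = 0, so
    F_wind = -(F_thrust + F_gravity). Assumes a z-up world frame with
    thrust_vec already rotated into it; a rough sketch, not a flight-
    dynamics model."""
    gravity = np.array([0.0, 0.0, -mass * g])
    return -(thrust_vec + gravity)
```

In still air the estimate comes out near zero; any lateral component the vehicle commands to hold position shows up as an equal-and-opposite wind force vector.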
For fun, I want to design a convolutional neural net to recognize enemy NPCs in a first-person shooter. I have captured 100 JPEGs of the NPCs as well as 100 JPEGs of non-NPCs. I have successfully trained a really simple ConvNet to identify NPCs. This was really easy because the game actually highlights the NPCs with a red marker to let humans identify them, which makes it super easy for a machine learning algorithm to find them. Great, so …
I would like to ask whether Savitzky-Golay filtering can be implemented on real-time data. I have used it on a fixed-size array, but would like to extend it to output values for real-time sensor data. Can anyone refer me to an appropriate implementation or hint at an online implementation? Thanks.
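One common way to make Savitzky-Golay causal is to fit the polynomial over a trailing buffer and evaluate it at the newest sample, accepting the extra lag/edge bias that comes with one-sided fitting. A minimal sketch (the class is my own, not a library API; SciPy's `savgol_coeffs` can also produce one-sided coefficients via its `pos` argument):

```python
from collections import deque
import numpy as np

class StreamingSavGol:
    """One-sided (causal) Savitzky-Golay smoother: fit a low-degree
    polynomial to the most recent `window` samples and evaluate it at
    the newest one. Illustrative sketch for streaming sensor data."""

    def __init__(self, window=7, degree=2):
        self.window = window
        self.degree = degree
        self.buf = deque(maxlen=window)

    def update(self, x):
        self.buf.append(x)
        if len(self.buf) < self.window:
            return x  # not enough history yet; pass the sample through
        t = np.arange(self.window)
        coeffs = np.polyfit(t, np.fromiter(self.buf, float), self.degree)
        return np.polyval(coeffs, self.window - 1)
```

Since the fit coefficients for a fixed window length reduce to a fixed dot product with the buffer, the per-sample cost can be made O(window) by precomputing them once.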
Let's say I have a supervised learning problem with a sequence of features and labels. First, I learn on the training data, and then I decide to stream in data point by point and do online learning. Is it possible to update the weights or derive the feature importances as each data point comes in? Also, which online learning algorithms would allow me to do this, and can this be done in Python?
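Yes, this is the standard SGD setting: each incoming point triggers one gradient step, and the current weight vector is inspectable at any time. In Python, scikit-learn's `SGDClassifier.partial_fit` does exactly this; the mechanics can be sketched in a few lines of NumPy (the class and method names here are my own):

```python
import numpy as np

class OnlineLogReg:
    """Minimal online logistic regression: one SGD step per incoming
    (x, y) pair, with weights available after every update. Sketch of
    the idea behind e.g. sklearn's SGDClassifier.partial_fit."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def partial_fit(self, x, y):
        p = 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))  # P(y=1 | x)
        g = p - y                                         # log-loss gradient
        self.w -= self.lr * g * x
        self.b -= self.lr * g

    def feature_importances(self):
        # For a linear model, |weight| is a rough per-feature importance
        # (assuming standardized features).
        return np.abs(self.w)
```

Other natural fits for this pattern are passive-aggressive classifiers, online gradient-boosted stubs, and the Hoeffding trees found in streaming libraries such as River.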
Popular counting sketches (LogLog, HyperLogLog, etc.) feature natural union operations. Are there any known counting sketches that feature natural intersection operations?
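Two standard answers: HyperLogLog intersections are usually done indirectly via inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|, while MinHash signatures support intersection more directly through the Jaccard estimate J ≈ (matching signature slots) / k, giving |A ∩ B| ≈ J · |A ∪ B|. A small MinHash sketch (salted-hash construction; names are my own):

```python
import hashlib

def minhash(items, k=256):
    """k-permutation MinHash signature, simulating the k permutations
    with salted MD5 hashes. Illustrative sketch, not a tuned library."""
    return [
        min(int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16)
            for x in items)
        for i in range(k)
    ]

def jaccard_est(sig_a, sig_b):
    """Fraction of matching slots is an unbiased estimate of Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The union signature is the element-wise minimum of two signatures, so an intersection cardinality can be estimated entirely from sketches: Jaccard from slot agreement, union size from a companion cardinality sketch.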
From a data stream I'm receiving a pair of measurements every second, consisting of a current consumption and a charge percentage. Accumulating the consumption over time will eventually represent the maximum capacity once the percentage has gone from 100% to 0%. I want to predict the maximum capacity in (almost) real time using linear regression with a small sample-size window of two percent. However, when I compare the models of these local regressions over every two percent with …
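The per-window estimate described above reduces to one slope: if cumulative consumption is regressed against the remaining percentage, the slope is (negative) capacity-per-percent, and scaling by 100 gives the full-capacity prediction. A sketch with a hypothetical helper name:

```python
import numpy as np

def capacity_from_window(pcts, cum_consumption):
    """Estimate maximum capacity from one small sliding window:
    regress cumulative consumption on remaining percentage; the slope
    is -capacity/100, so scale it back up. Hypothetical helper, assuming
    pcts decreases as cum_consumption grows."""
    slope, _ = np.polyfit(pcts, cum_consumption, 1)
    return -100.0 * slope
```

On noiseless data every two-percent window yields the same capacity; on real data the window-to-window disagreement you observe reflects sensor noise and non-linearity of the percentage gauge, which is why the local models diverge.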
I'm totally new to the topic of real-time bidding, in which I know machine learning algorithms are used pretty often. Can somebody explain the system to me in plain language, i.e., for a non-technical person? What is the bidding? Who bids on what? Where does machine learning get involved? What is cookie matching mainly about?
I want to implement streaming Naive Bayes in a distributed system. What is the best approach to choosing a framework? Should I choose: Storm alone, implementing streaming Naive Bayes myself in a Storm topology; Storm + Trident-ML; Storm + SAMOA; or Spark Streaming + MLlib? What is the best framework set to choose and start working with? Any suggestion would be of great help.
Consider a stream containing tuples (user, new_score) representing users' scores in an online game. The stream could have 100-1,000 new elements per second, and the game has 200K-300K unique players. I would like to run some standing queries such as: which players posted more than x scores in a sliding window of one hour, and which players gained x% score in a sliding window of one hour. My question is: which open-source tools can I employ to jumpstart this project? …
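For scale intuition, the first standing query only needs a per-user deque of recent timestamps; at 200K-300K players and 1,000 events/second that fits comfortably in one process's memory. A sketch of the windowed count (class and method names are my own; engines like Flink or Kafka Streams provide this as built-in sliding windows):

```python
from collections import defaultdict, deque

class SlidingScoreCounter:
    """Count score postings per user over the last `window_sec` seconds.
    Illustrative in-memory sketch of the 'more than x scores per hour'
    standing query."""

    def __init__(self, window_sec=3600):
        self.window = window_sec
        self.events = defaultdict(deque)  # user -> timestamps in window

    def add(self, user, ts):
        q = self.events[user]
        q.append(ts)
        while q and q[0] <= ts - self.window:
            q.popleft()  # evict timestamps that fell out of the window

    def over_threshold(self, x, now):
        hot = []
        for user, q in self.events.items():
            while q and q[0] <= now - self.window:
                q.popleft()
            if len(q) > x:
                hot.append(user)
        return hot
```

The x%-gain query is the same shape but keeps (timestamp, score) pairs and compares the oldest in-window score against the newest.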
(Note: I pulled this question from the list of questions on Area 51, but I believe the question is self-explanatory. That said, I believe I get its general intent and, as a result, am likely able to field any questions about it that might pop up.) Which big data technology stack is most suitable for processing tweets, extracting/expanding URLs, and pushing (only) new links into a third-party system?