data-mining

How to calculate lexical cohesion and semantic informativeness for a given dataset?

J Cena

June 4, 2022 14:00

In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures', the authors write: "There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain." However, the review does not describe how to calculate or derive these measures. …

Topic: text-mining nlp statistics data-mining

Category: Data Science
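For the lexical cohesion question above, one commonly used unithood score is pointwise mutual information (PMI) between adjacent words. The quoted review does not name a specific measure, so PMI here is an assumption, and the toy corpus and the function name `bigram_pmi` are made up for illustration. A minimal sketch:

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # observed probability of the pair
        p_w1 = unigrams[w1] / n_uni    # marginal probability of each word
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus, invented purely for illustration.
tokens = "data mining finds patterns and data mining scales while text data grows".split()
scores = bigram_pmi(tokens)
for pair, pmi in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(pmi, 3))
```

PMI is known to favor rare pairs, so in practice it is often combined with a minimum frequency threshold.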

Data science internship/career problem

Djellal Mohamed Aniss

June 1, 2022 15:46

I'm a data science student doing an internship at company X. Since I joined the company, no task has been assigned directly to me. I asked my tutor to give me a task, so he told me to check the current model and see if I could make it better. I did that in ~2 weeks: I read up on the idea behind the model, read his code, coded my approach, and added the evaluation. When I finished doing it, I …

Topic: career data-mining

Category: Data Science

Understanding output stepAIC

universalkernel

May 31, 2022 16:47

I am using the stepAIC function in R to do a bi-directional (forward and backward) stepwise regression. I do not understand what each return value from the function means. The output is:

         Df  Sum of Sq     RSS      AIC
<none>                  350.71  -5406.0
- aaa     1      0.283  350.99  -5405.9
- bbb     1      0.339  351.05  -5405.4
- ccc     1      0.982  351.69  -5400.5
- ddd     1      0.989  351.70  -5400.5

Question: Are the values listed under Df, Sum of Sq, RSS, and AIC the values …

Topic: feature-selection r data-mining

Category: Data Science
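For the stepAIC question above: in this kind of table, each `- term` row describes the model obtained by dropping that term, with RSS the residual sum of squares of that reduced model and AIC its criterion value. For linear models, R computes the AIC in this table, up to an additive constant, as n * log(RSS / n) + 2 * edf. The sketch below applies that formula to the RSS values from the excerpt. The sample size `n` and the parameter count `edf_full` are hypothetical, since the question does not state them, so the printed numbers will not match the table exactly.

```python
import math

def aic_from_rss(n, rss, edf):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * edf."""
    return n * math.log(rss / n) + 2 * edf

n = 2600       # hypothetical sample size, not given in the question
edf_full = 5   # hypothetical: intercept plus the four terms aaa..ddd

full = aic_from_rss(n, 350.71, edf_full)          # the <none> row
drop_aaa = aic_from_rss(n, 350.99, edf_full - 1)  # the "- aaa" row
drop_ccc = aic_from_rss(n, 351.69, edf_full - 1)  # the "- ccc" row

# Dropping a term raises RSS but saves 2 units of penalty per parameter.
print(round(drop_aaa - full, 3), round(drop_ccc - full, 3))
```

A row whose AIC is lower than the `<none>` row would be a candidate for removal at that step.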

How would you describe cluster 2 from this output of a run of the EM program?

Shroomy

May 28, 2022 23:00

My description: Cluster 2 consists of 9511 instances, and the age is around 42 (ranging between 29.7207 and 54.5257). Considering Age, Cluster 2 is very well separated from Cluster 1, with a distance of 18.9513. On the other hand, Cluster 2 and Cluster 0 are very close: their centroids are within a distance of around 0.8248. What else could be added?

Topic: expectation-maximization clustering data-mining machine-learning

Category: Data Science

Interpreting cluster variables - raw vs scaled

The Great

May 27, 2022 12:35

I have already referred to these posts here and here. I also posted here, but since there was no response, I am posting here. Currently, I am working on customer segmentation using purchase data. My data has the info below for each customer. Based on the above linked posts, I see that for clustering we have to scale the variables if they are in different units etc. But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information …

Topic: predictive-modeling k-means clustering data-mining machine-learning

Category: Data Science
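On the raw-versus-scaled question above, one common approach is to cluster on standardized values and then map the resulting centroids back to the original units for interpretation, since z-scoring is an invertible transform. The sketch below shows only that round trip. The column meanings, the data, and the helper names are hypothetical, and the "centroid" is simply the mean of two rows rather than the output of an actual clustering run.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Return per-column means and population standard deviations."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) for c in cols]

def scale(row, means, stds):
    """Z-score one row so every column has comparable units."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def unscale(row, means, stds):
    """Invert the z-score to recover values in the original units."""
    return [x * s + m for x, m, s in zip(row, means, stds)]

# Hypothetical customers: [number of orders, total spend].
customers = [[2, 150.0], [10, 900.0], [4, 300.0], [8, 650.0]]
means, stds = fit_scaler(customers)
scaled = [scale(r, means, stds) for r in customers]

# Stand-in for a cluster centroid computed in scaled space.
centroid_scaled = [mean(c) for c in zip(*scaled[:2])]
centroid_raw = unscale(centroid_scaled, means, stds)
print(centroid_raw)
```

Because the transform is linear, a centroid computed in scaled space maps back to the mean of the same rows in raw units.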

making logical inferences from simulation-generated data

timothy

May 26, 2022 00:03

I have data collected from a computer simulation of football games, which seems to have recurring patterns of the following form. If Madrid plays Arsenal and the match ends with under 3 goals, then in their next match against each other, Madrid will win. If Madrid happens to lose and then plays against Chelsea next, they will win 90% of the time. How do I find such inferences from simulation-generated data like this? There are other forms of hidden patterns …

Topic: data-mining machine-learning

Category: Data Science

multi dimensional time series and matrix profile method

user18602524

May 25, 2022 08:07

I have a time series of the following format:

time  product1  product2  product3  product4
t1    40        50        68        47
t2    55        60        70        100
t3    606       20        500       52
...

Values are sales: for example, how much money customers spent on product1 on day t1. I want to do time series clustering on this dataset. I am trying to use matrix profile distance. However, this method assumes only 1-dimensional data. Is there any way to work around this?

Topic: time-series data-mining machine-learning

Category: Data Science

Rapidminer and decision tree weights

Qwerto

May 25, 2022 04:02

In Rapidminer, are the decision tree's weights a measure of the "importance" of attributes in the splitting procedure? If yes, why is it useful to know these weights? Are there better methods to find the most discriminant features in a data set?

Topic: rapidminer decision-trees feature-selection data-mining machine-learning

Category: Data Science

matrix profile distance measure characterization

user18602524

May 24, 2022 04:05

If there are various types of distance measures for time series, such as Euclidean, DTW, and shape-based ones, how can we characterize the matrix profile distance measure? As a profiling one?

Topic: distance clustering data-mining machine-learning

Category: Data Science

How to apply entropy discretization to a dataset

Evan Gertis

May 22, 2022 12:03

I have a simple dataset that I'd like to apply entropy discretization to. The program needs to discretize an attribute based on the following criteria: when either condition “a” or condition “b” is true for a partition, that partition stops splitting. a- The number of distinct classes within a partition is 1. b- The ratio of the minimum to maximum frequencies among the distinct values for the attribute Class in the partition is <0.5 and the number of …

Topic: pandas python data-mining

Category: Data Science
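The core step of the entropy discretization described above can be sketched as follows: for one numeric attribute, try each boundary between distinct sorted values and keep the one with the lowest weighted class entropy. Only stopping condition “a” (a single class in the partition) is shown, because condition “b” is cut off in the excerpt. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the class labels in one partition."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def is_pure(labels):
    """Stopping condition "a": only one distinct class remains."""
    return len(set(labels)) == 1

def best_split(values, labels):
    """Return (threshold, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            # Place the cut halfway between the two neighboring values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score

# Hypothetical attribute values and class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(values, labels)
print(threshold, score)
```

A full discretizer would apply `best_split` recursively to each side until a stopping condition holds.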

Detect unusual slope increase

Robin Chen

May 21, 2022 12:01

I have a response variable series that is generated randomly within a fixed interval [0-100] every second. I want to detect the event when newly generated data is significantly greater than the data from the previous second and send an alarm message to myself. So, I calculate the lag-1 difference of the response variable divided by the difference in time (the slope), then use bootstrapping to construct the 95% confidence interval of the response's 90th percentile, if the new …

Topic: time-series data-mining

Category: Data Science
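The procedure described above (lag-1 slopes, then a bootstrap confidence interval for the 90th percentile) can be sketched as follows. The excerpt is cut off before the alarm rule, so raising an alarm when a new slope exceeds the upper bound of the interval is an assumption, as are the function names, the nearest-rank percentile, and the simulated history.

```python
import random

def slopes(values, dt=1.0):
    """Lag-1 differences divided by the time step."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def percentile(sorted_vals, q):
    """Simple nearest-rank style percentile of pre-sorted values."""
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]

def bootstrap_percentile_ci(data, q=0.90, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap (1 - alpha) confidence interval for the q-th percentile."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resample = sorted(rng.choices(data, k=len(data)))
        stats.append(percentile(resample, q))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Simulated history, one reading per second (values invented for illustration).
rng = random.Random(42)
history = [rng.randint(0, 50) for _ in range(300)]
lower, upper = bootstrap_percentile_ci(slopes(history))

new_slope = 90.0                # hypothetical newest lag-1 slope
is_alarm = new_slope > upper    # assumed alarm rule
print(lower, upper, is_alarm)
```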

How can deep learning be applied to association rule mining?

user3352632

May 18, 2022 12:01

Association rule mining is considered to be an old technique of AI. Rules are mined on statistical support. How can deep learning be applied to this? What are approaches for structured data (in a graph format like XML)? XML documents are structured by tags. My goal is to extract a rule that says that tag x is often combined with tag y and z. Then, I later want to apply these rules and if a tag y and z is …

Topic: knowledge-graph structured-data association-rules deep-learning data-mining

Category: Data Science

Detecting abundance of a certain periodic pattern in a time series?

May 18, 2022 10:02

I am really stumped at the moment about how to solve a particular problem. I have many time series like this: This represents the number of hours a person spends on a website each day throughout the year. Any days where they are not seen to be using the website have zero values, rather than missing values. What I really want to do is to calculate a metric telling me to what extent there is a consistent "1 hour per …

Topic: forecasting anomaly-detection correlation time-series data-mining

Category: Data Science

Finding data with transformation applied

okok

May 18, 2022 02:35

Is there a way to measure relatedness between data and the data obtained after some transformation is applied to it? I.e., given a dataset, I need to find the most related data (most of the values in that data can be obtained) that can be found by applying some transformation to the original data. I tried but couldn't find a proper answer; most of the discussion that I found is about linear or log transformations, but I want to find a way …

Topic: transformation data-mining

Category: Data Science

How to plot using facet_wrap over multiple pages as .pdf files in R

Shivy b

May 17, 2022 18:06

I am using ggplot to compare 114 unique studies for a particular variable I'm interested in. This is what I have used: ggplot(steps, aes(x=factor(edu))) + geom_bar(aes(y = (..count..), group = id_study,)) + facet_wrap(~id_study,) Whilst this works, all 114 studies are plotted on one page and the formatting is all squashed. How do I split this over 4x4 pages? Many thanks, S. Edit: As there are 114 unique studies, I have 5 pages in total 1) ggplot(steps, aes(x=factor(edu))) + …

Topic: plotting ggplot2 data r data-mining

Category: Data Science

How to train a model on a data where there are multiple data inside a data point?

Bhaskar Dhariyal

May 17, 2022 17:08

I'm trying to predict the capacity column; however, each data point consists of more data. Each data point represents the data of one cycle. Each cycle has a capacity. Each cycle runs for some time duration, and during that time some data is collected on which capacity depends. I tried exploding the dataset and copying the capacity values to each row, but that shouldn't be the case because each row will get a different predicted capacity. Is there a way to …

Topic: data-mining machine-learning

Category: Data Science

Which data mining or machine learning algorithm would be appropriate for learning ordered frequent patterns?

user3243499

May 17, 2022 02:09

I have a dataset of the form (var1, var2, out), where the ordered pair <var1, var2> gives out. Most frequent pattern mining algorithms, like Apriori and FP-growth, do not preserve the order of var1 and var2. What are some of the available pattern mining algorithms (possibly also a NN trick) for finding association rules between the ordered pair <var1, var2> and the output variable out? Thanks.

Topic: association-rules data-mining machine-learning

Category: Data Science
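With exactly two ordered inputs, as in the question above, a simple baseline is to treat the ordered pair itself as a single item, so that (x, y) and (y, x) are counted separately, and then compute support and confidence directly. This is a counting sketch rather than a named pattern mining algorithm. The records, the `min_support` threshold, and the function name are hypothetical.

```python
from collections import Counter

def ordered_rules(records, min_support=2):
    """Map ((var1, var2), out) to its confidence, keeping pair order.

    `records` is a list of (var1, var2, out) tuples.
    """
    pair_counts = Counter((a, b) for a, b, _ in records)
    rule_counts = Counter(records)
    rules = {}
    for (a, b, out), count in rule_counts.items():
        if count >= min_support:
            # Confidence: how often this ordered pair leads to `out`.
            rules[((a, b), out)] = count / pair_counts[(a, b)]
    return rules

# Hypothetical records; note that (x, y) and (y, x) behave differently.
records = [
    ("x", "y", "hi"),
    ("x", "y", "hi"),
    ("y", "x", "lo"),
    ("y", "x", "lo"),
    ("x", "y", "lo"),
]
rules = ordered_rules(records)
for (pair, out), confidence in rules.items():
    print(pair, "->", out, round(confidence, 3))
```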

Modeling the influence of events order on probability

Luc

May 16, 2022 20:01

The goal is to model whether the sequence of events influences the probability of a binary target variable. We have, for example, five different events that occur over time (events A, B, C, D, E). They can occur in any order from 1 to 5. I would like to check whether the order of their occurrence influences the target variable. My first idea was to convert the time of occurrence into numbers from 1 to 5 and then, for example, use logistic regression. Do you know …

Topic: probability sequence data-mining

Category: Data Science

Increasing minNumObj increases accuracy in decision tree

Gooze_Berry

May 16, 2022 18:02

I have been using a J48 classifier in weka and have noticed that increasing minNumObj (the minimum number of instances per leaf) leads to a small accuracy increase.

-M   Result      Size   Num Leaves
2    73.8281 %   39     20
3    74.2188 %   39     20
4    74.4792 %   37     19
5    74.6094 %   25     13
6    74.2188 %   23     12
7    74.2188 %   23     12
8    74.349 %    23     12
9    75.2604 %   29     15
10   75.5208 %   29     15
11   …

Topic: weka data-mining machine-learning

Category: Data Science

how to align sliding windows to extract features from multimodal time-series data?

Sri Charan

May 15, 2022 20:30

I have two datasets that were collected at the same time but at different frequencies. One is recorded at 128 Hz and the other at 512 Hz. I am trying to extract some features using the moving window technique, but I have some problems. The frequencies of the two datasets are different. The timestamp is in Unix format and changes in nanoseconds; hence there won't be any match at the start and end of each second or minute. One of the datasets …

Topic: multi-instance-learning time-series feature-extraction feature-selection data-mining

Category: Data Science
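For the window alignment question above, one option is to define windows by time interval rather than by sample count, so that both streams share the same window boundaries even though their timestamps never coincide. The sketch below assumes sorted nanosecond timestamps and uses binary search to find the samples in each half-open window. The synthetic streams, the 137 ns offset, and the function name are hypothetical.

```python
from bisect import bisect_left

def window_slices(timestamps_ns, start_ns, end_ns, width_ns, step_ns):
    """Return (lo, hi) index pairs for each window [t, t + width_ns)."""
    slices = []
    t = start_ns
    while t + width_ns <= end_ns:
        lo = bisect_left(timestamps_ns, t)             # first sample >= t
        hi = bisect_left(timestamps_ns, t + width_ns)  # first sample >= window end
        slices.append((lo, hi))
        t += step_ns
    return slices

SECOND = 1_000_000_000

# Synthetic streams: 2 s at 128 Hz, and 2 s at 512 Hz with a small offset.
slow = [i * SECOND // 128 for i in range(256)]
fast = [137 + i * SECOND // 512 for i in range(1024)]

# Use only the time span covered by both streams.
start = max(slow[0], fast[0])
end = min(slow[-1], fast[-1])

slow_windows = window_slices(slow, start, end, SECOND // 2, SECOND // 2)
fast_windows = window_slices(fast, start, end, SECOND // 2, SECOND // 2)
for (a, b), (c, d) in zip(slow_windows, fast_windows):
    print(b - a, "slow samples;", d - c, "fast samples")
```

Features can then be computed per window on `slow[a:b]` and `fast[c:d]` and joined by window index.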
