How to calculate lexical cohesion and semantic informativeness for a given dataset?

The review 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures' states: There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain. However, the review does not include the ways to calculate/derive these measures. …
Category: Data Science
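
The excerpt does not say how either measure is computed. As a minimal sketch, and only as an assumption, pointwise mutual information (PMI) over adjacent word pairs is one common way to score lexical cohesion: it compares how often two words occur together with how often they would co-occur by chance. The function name `bigram_pmi` and the toy sentence are hypothetical, and semantic informativeness is not covered here.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens.

    PMI is used here as an assumed stand-in for 'lexical cohesion';
    the review quoted in the question does not prescribe a formula.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # observed probability of the pair
        p_w1 = unigrams[w1] / n_uni    # probability of the first word
        p_w2 = unigrams[w2] / n_uni    # probability of the second word
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus (hypothetical data, for illustration only).
tokens = "new york is big and new york is busy and the city is new".split()
pmi = bigram_pmi(tokens)
```

On this toy input, `pmi[("new", "york")]` comes out higher than `pmi[("is", "new")]`, which matches the intuition that the first pair behaves more like a unit. PMI is known to overrate rare pairs, so a minimum frequency threshold is usually worth adding on real data.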

Data science internship/career problem

I'm a data science student doing an internship at company X. Since I joined the company, no task has been assigned directly to me. I asked my tutor to give me a task, so he told me to check the current model and see if I can make it better. I did that in ~2 weeks: I read the idea behind the model, read his code, coded my approach and added the evaluation. When I finished doing it, I …
Category: Data Science

Understanding output stepAIC

I am using the stepAIC function in R to do a bi-directional (forward and backward) stepwise regression. I do not understand what each return value from the function means. The output is:

         Df  Sum of Sq     RSS      AIC
<none>                  350.71  -5406.0
- aaa     1      0.283  350.99  -5405.9
- bbb     1      0.339  351.05  -5405.4
- ccc     1      0.982  351.69  -5400.5
- ddd     1      0.989  351.70  -5400.5

Question: Are the values listed under Df, Sum of Sq, RSS, and AIC the values …
Category: Data Science

How would you describe cluster 2 from this output of a run of the EM program?

My description: Cluster 2 consists of 9511 instances; the age is around 42 (ranging between 29.7207 and 54.5257). Considering Age, Cluster 2 is very well separated from Cluster 1, with a distance of 18.9513. On the other hand, Cluster 2 and Cluster 0 are very close: their centroids are within a distance of around 0.8248. What else could be added?
Category: Data Science

Interpreting cluster variables - raw vs scaled

I already referred to these posts here and here. I also posted here, but since there was no response, I am posting here. Currently, I am working on customer segmentation using their purchase data. So, my data has the below info for each customer. Based on the above linked posts, I see that for clustering we have to scale the variables if they are in different units etc. But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information …
Category: Data Science

making logical inferences from simulation-generated data

I have data collected from a computer simulation of football games which seems to have recurring patterns of the following form. If Madrid plays Arsenal and the match ends under 3 goals, then in their next match against each other, Madrid will win. If Madrid happens to lose and then plays against Chelsea next, they will win 90% of the time. How do I find such inferences from simulation-generated data like this? There are other forms of hidden patterns …
Category: Data Science

multi dimensional time series and matrix profile method

I have a time series of the following format:

time  product1  product2  product3  product4
t1          40        50        68        47
t2          55        60        70       100
t3         606        20       500        52
...

Values are sales: for example, how much money customers spent on product1 on day t1. I want to do time series clustering on this dataset. I am trying to use matrix profile distance. However, this method assumes only 1-dimensional data. Is there any way to work around this?
Category: Data Science

How to apply entropy discretization to a dataset

I have a simple dataset that I'd like to apply entropy discretization to. The program needs to discretize an attribute based on the following criteria: when either condition “a” or condition “b” is true for a partition, that partition stops splitting. a- The number of distinct classes within a partition is 1. b- The ratio of the minimum to maximum frequencies among the distinct values for the attribute Class in the partition is <0.5 and the number of …
Category: Data Science
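
A minimal sketch of one step of entropy discretization, under stated assumptions: it picks the single cut point on a numeric attribute that minimizes the weighted class entropy of the two resulting partitions, and it implements only stopping condition “a” (a partition with one distinct class). Condition “b” is cut off in the excerpt, so it is deliberately left out. The names `entropy`, `best_split`, and `is_pure` and the sample data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def is_pure(labels):
    """Stopping condition 'a': only one distinct class in the partition."""
    return len(set(labels)) == 1

def best_split(values, labels):
    """Return (threshold, weighted_entropy) of the best binary cut, or (None, inf)."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between identical attribute values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            # Cut halfway between the two neighbouring attribute values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score

# Hypothetical toy data: the classes separate cleanly around 6.5.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(values, labels)
```

A full discretizer would call `best_split` recursively on each side of the threshold and stop a branch when `is_pure` (or the second, truncated condition) holds.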

Detect unusual slope increase

I have a response variable series whose values are generated randomly in a fixed interval [0-100] every second, and I want to detect the event when newly generated data is significantly greater than the data of the previous second, and send an alarm message to myself. So, I calculate the lag-1 difference of the response variable divided by the difference in time (the slope), then use bootstrapping to construct the 95% confidence interval of the response's 90th percentile, if the new …
Category: Data Science
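
The two steps described in the question can be sketched as follows, assuming NumPy: compute the lag-1 slope, then bootstrap a 95% confidence interval for the 90th percentile of the observed slopes. The helper names, the simulated data, and the number of resamples are assumptions for illustration. The rule for comparing a new observation against the interval is cut off in the excerpt, so it is not shown.

```python
import numpy as np

def slopes(values, times):
    """Lag-1 difference of the response divided by the lag-1 difference in time."""
    return np.diff(values) / np.diff(times)

def bootstrap_percentile_ci(sample, q=90, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the q-th percentile of sample."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    # Each row is one resample drawn with replacement from the original data.
    resamples = rng.choice(sample, size=(n_boot, sample.size), replace=True)
    stats = np.percentile(resamples, q, axis=1)
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Simulated stand-in for the real series: one value in [0, 100] per second.
rng = np.random.default_rng(42)
times = np.arange(300)
values = rng.uniform(0, 100, size=300)

history = slopes(values, times)
lower, upper = bootstrap_percentile_ci(history)
```

Note that consecutive slopes share an observation and are therefore not independent, so a plain bootstrap like this one may understate the uncertainty; a block bootstrap is one possible alternative.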

How can deep learning be applied to association rule mining?

Association rule mining is considered to be an old technique of AI. Rules are mined on statistical support. How can deep learning be applied to this? What are approaches for structured data (in a graph format like XML)? XML documents are structured by tags. My goal is to extract a rule that says that tag x is often combined with tag y and z. Then, I later want to apply these rules and if a tag y and z is …
Category: Data Science

Detecting abundance of a certain periodic pattern in a time series?

I am really stumped at the moment about how to solve a particular problem. I have many time series like this: This represents the number of hours a person spends on a website each day throughout the year. Any days where they are not seen to be using the website have zero values, rather than missing values. What I really want to do is to calculate a metric telling me to what extent there is a consistent "1 hour per …
Category: Data Science

Finding data with transformation applied

Is there a way to find relatedness between data and the data obtained after some transformation is applied to it? I.e., given a dataset, I need to find the most related data (most of the values in that data can be obtained) that can be found by applying some transformation to the original data. I tried but couldn't find a proper answer; most of the discussion that I found is about linear transformation or log transformation, but I want to find a way …
Category: Data Science

How to plot using facet_wrap over multiple pages as a .pdf file in R

I am using ggplot, to compare 114 unique studies for a particular variable I'm interested in. This is what I have used. ggplot(steps, aes(x=factor(edu))) + geom_bar(aes(y = (..count..), group = id_study,)) + facet_wrap(~id_study,) Whilst this works, all 114 studies are plotted on one page and the formatting is all squashed. How do I split this over 4x4 pages ? Many thanks S edit **** As there are 114 unique studies, I have 5 pages in total 1) ggplot(steps, aes(x=factor(edu))) + …
Category: Data Science

How to train a model on a data where there are multiple data inside a data point?

I'm trying to predict the capacity column; however, each data point consists of more data. Each data point represents one cycle's data. Each cycle has a capacity. Each cycle runs for some time duration, and in that duration some data is collected on which the capacity depends. I tried exploding the dataset and copying the capacity values to each row, but that shouldn't be the case because each row will get a different predicted capacity. Is there a way to …
Category: Data Science

Which data mining or machine learning algorithm would be appropriate for learning ordered frequent patterns?

I have a dataset of the form (var1, var2, out), where the ordered pair <var1, var2> gives out. Most frequent pattern mining algorithms, like the Apriori and FP-growth algorithms, do not preserve the order of var1 and var2. What are some of the available pattern mining algorithms (possibly also an NN trick) for finding association rules between the ordered pair <var1, var2> and the output variable out? Thanks.
Category: Data Science

Modeling the influence of events order on probability

The task is to model whether the sequence of events influences the probability of a binary target variable. We have, for example, five different events which occur in time (events: A, B, C, D, E). They can occur in order from 1 to 5. I would like to check if the order of their occurrence influences the target variable. My first idea was to convert the time of occurrence into numbers from 1 to 5 and then, for example, use logistic regression. Do you know …
Category: Data Science

Increasing minNumObj increasing accuracy in decision tree

I have been using a J48 classifier in weka and have noticed that increasing minNumObj (the minimum number of instances per leaf) leads to a small accuracy increase.

-M   Result      Size  Num Leaves
2    73.8281 %   39    20
3    74.2188 %   39    20
4    74.4792 %   37    19
5    74.6094 %   25    13
6    74.2188 %   23    12
7    74.2188 %   23    12
8    74.349 %    23    12
9    75.2604 %   29    15
10   75.5208 %   29    15
11   …
Category: Data Science

how to align sliding window to extract features from multi modal timeseries data?

I have two datasets that were collected at different frequencies at the same time. One is recorded at 128 Hz and the other at 512 Hz. I am trying to extract some features using the moving window technique, but I have some problems. The frequencies of the two datasets are different. The timestamp is in unix format and changes in nanoseconds, hence there won't be any match at the start and end of each second or minute. one of the datasets …
Category: Data Science
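
One possible approach, sketched here under assumptions rather than taken from the question: instead of matching individual timestamps, assign every sample of both streams to a window index measured from a shared origin, then compute features per window. The sketch uses non-overlapping one-second windows (the simplest case of a moving window), the mean as a placeholder feature, and synthetic timestamps; all names and values are hypothetical.

```python
import numpy as np

NS = 1_000_000_000  # nanoseconds per second

def window_index(timestamps_ns, origin_ns, window_ns):
    """Index of the fixed-length window each timestamp falls into."""
    return (np.asarray(timestamps_ns, dtype=np.int64) - origin_ns) // window_ns

def window_means(timestamps_ns, values, origin_ns, window_ns):
    """Mean of values per window, as {window_index: mean}."""
    idx = window_index(timestamps_ns, origin_ns, window_ns)
    values = np.asarray(values, dtype=float)
    return {int(w): float(values[idx == w].mean()) for w in np.unique(idx)}

# Synthetic streams: 2 s at 128 Hz, and 2 s at 512 Hz starting 3 ms later.
start = 1_600_000_000 * NS
t_slow = start + np.arange(256, dtype=np.int64) * NS // 128
t_fast = start + 3_000_000 + np.arange(1024, dtype=np.int64) * NS // 512
v_slow = np.arange(256, dtype=float)
v_fast = np.arange(1024, dtype=float)

origin = min(int(t_slow[0]), int(t_fast[0]))  # shared reference point
slow = window_means(t_slow, v_slow, origin, NS)
fast = window_means(t_fast, v_fast, origin, NS)

# Keep only the windows present in both streams before joining features.
shared = sorted(set(slow) & set(fast))
features = [(w, slow[w], fast[w]) for w in shared]
```

Because both streams are indexed against the same origin and window length, the windows line up even though no two timestamps coincide. Overlapping windows could be handled by repeating the same computation with shifted origins.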
