What kind of regression model should I do?

My research question is to examine the effect of "receiving attention" from other members of an online community on "sustained participation" on the website.

I decided to measure the "sustained participation" of each user by calculating the average time difference between the user's submissions. I calculated it in the following way:

I measured "attention" by calculating the total number of comments each user received across all the submissions he/she has posted. I also want to consider the total number of votes and the total number of views. I am not sure whether it is a good idea to add those to the model as independent variables too.

My problem is with the dependent variable:

Some people participated just twice, on two successive days, so their average time between submissions is 1 day. Others participated 100 times, and their average time between submissions is also 1 day. But it is obvious that the second group, who participated 100 times, showed sustained participation, and the first group did not.

So I need to account for the number of submissions in the model too. Is there a way to do that? How can I handle this problem?

Should I group the users and analyze each group separately? For example, users who have participated fewer than 10 times in one group, users with 10-20 participations in another group, and so on.

I would appreciate it if anyone could help me. My paper is due soon and I need some preliminary results.



One option is to model the target variable of "sustained participation" as an index. An index is a compound measure that aggregates multiple indicators. Examples of indexes are index funds, which aggregate many stocks, and gross domestic product (GDP), which estimates the total market value of the goods and services produced in a country.

In your example, you would create a separate model that estimates the "sustained participation" index. That model could be a collection of handcrafted rules or could use machine learning.

The advantage of creating an index is having a single, continuous-valued number that could then be used as a target in regression.
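As one possible handcrafted rule, the sketch below scores each user by the share of weeks in the observation window that contain at least one submission. The function name `active_week_share`, the window dates, and the two example users are all assumptions made for illustration; other indicators could be combined in the same way.

```python
from datetime import date, timedelta

def active_week_share(posts, start, end):
    """Share of 7-day bins in [start, end) that contain at least one post."""
    n_weeks = max(1, ((end - start).days + 6) // 7)  # round up to whole bins
    active = {(p - start).days // 7 for p in posts if start <= p < end}
    return len(active) / n_weeks

# Hypothetical observation window and users (not taken from the question's data).
start, end = date(2016, 1, 1), date(2017, 1, 1)
two_posts = [date(2016, 3, 1), date(2016, 3, 2)]
hundred_posts = [start + timedelta(days=k) for k in range(100)]

index_two = active_week_share(two_posts, start, end)
index_hundred = active_week_share(hundred_posts, start, end)
print(index_two, index_hundred)
```

Both example users have an average gap of one day between submissions, but the index separates them because it depends on how many weeks they were active.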


From a stats perspective, it sounds like you have a Poisson process, where the events are user submissions. So you might represent your dependent variable as the number of events in a unit of time (say, the number of submissions per week), and set up a Poisson regression or negative binomial regression. For the independent variable, you might try the number of comments received in the previous unit of time, i.e., you might see how well the number of comments received this week predicts the number of submissions next week.
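A minimal sketch of that setup, assuming one user's weekly counts are already available: it fits log E[submissions next week] = b0 + b1 * (comments this week) by Newton-Raphson in plain Python. The function `fit_poisson` and the numbers in `comments` and `submissions` are made up for illustration; in practice a statistics package would also report standard errors.

```python
import math

def fit_poisson(x, y, iters=50):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0 = math.log(max(sum(y) / len(y), 1e-9))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information (negative Hessian), a 2x2 symmetric matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * d = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Hypothetical weekly series for a single user.
comments = [0, 2, 5, 1, 8, 3, 0, 6, 4, 7]
submissions = [1, 0, 1, 3, 1, 5, 2, 0, 4, 2]

# Lag the predictor: comments in week t predict submissions in week t + 1.
x = comments[:-1]
y = submissions[1:]

b0, b1 = fit_poisson(x, y)
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```

The slope is on the log scale, so `math.exp(b1)` is the multiplicative change in expected submissions per extra comment. If the counts vary much more than a Poisson model allows, the negative binomial alternative mentioned above is the usual next step.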

Note that there are likely to be trends over time (e.g., a new user doesn't make many submissions at first, but gradually becomes a regular user) and autocorrelation, especially if your time scale is too small. For example, in your example data, most days have 0 submissions, so autocorrelation will be high if your units are days or smaller. So consider using time series methods or selecting a time scale large enough to get low autocorrelation.
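One way to compare candidate time scales is to aggregate the daily counts into wider bins and compute the lag-1 autocorrelation at each width. The sketch below does this in plain Python; the helper names and the `daily` series are assumptions for illustration, not the question's data.

```python
def lag1_autocorr(series):
    """Lag-1 sample autocorrelation; returns 0.0 for a constant series."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    if denom == 0:
        return 0.0
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return num / denom

def aggregate(daily, width):
    """Sum daily counts into consecutive bins of `width` days (drops a partial tail)."""
    return [sum(daily[i:i + width]) for i in range(0, len(daily) - width + 1, width)]

# Hypothetical daily submission counts for one user (six weeks).
daily = [0, 0, 1, 0, 0, 0, 0,  0, 2, 1, 0, 0, 0, 0,  1, 0, 0, 0, 0, 3, 0,
         0, 0, 0, 0, 1, 1, 0,  0, 0, 0, 2, 0, 0, 0,  1, 0, 0, 0, 0, 0, 1]

for width in (1, 3, 7):
    binned = aggregate(daily, width)
    print(width, len(binned), round(lag1_autocorr(binned), 3))
```

With only a handful of bins the estimate is noisy, so this is better run on the full history of many users than on a short series like the one above.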


I would segment the users on the site by their overall activity up to the point of the test, and train a model using these segments as a categorical variable (or train a different model for each segment).

My reasoning is that two users such as:
a) A very active user who was on a long vacation.
b) A new user who has had only one action (on the sign-up day).
might have the same sustained-participation metric if it is measured as a function of the time passed since the last action.
But we expect the community to react differently to their actions.

A model might look like:
attention = M(segment_type, time_since_last_activity).
segment_type = G(activity_signals_until_now)

Where activity_signals_until_now may consist of:
- total actions
- time since first action
- average time between actions

M can be a simple regressor.
G could be supervised (if you have prior knowledge of which segments exist) or unsupervised, using some kind of clustering algorithm.
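A minimal sketch of the G step with handcrafted rules, assuming each user's action dates are available. The function names, the segment labels, and the thresholds (borrowed from the 10 and 20 cut-offs floated in the question) are all hypothetical; a clustering algorithm could replace `segment` without changing the rest.

```python
from datetime import date, timedelta

def activity_signals(actions, now):
    """Summarise a user's history up to `now` (a list of dates, at least one)."""
    actions = sorted(actions)
    gaps = [(b - a).days for a, b in zip(actions, actions[1:])]
    return {
        "total_actions": len(actions),
        "days_since_first": (now - actions[0]).days,
        "avg_gap_days": sum(gaps) / len(gaps) if gaps else None,
        "days_since_last": (now - actions[-1]).days,
    }

def segment(signals):
    """Rule-based G: hypothetical thresholds on the total number of actions."""
    if signals["total_actions"] < 10:
        return "occasional"
    if signals["total_actions"] < 20:
        return "regular"
    return "heavy"

now = date(2016, 4, 1)
# a) a very active user who then went on a long vacation
veteran = [date(2016, 1, 1) + timedelta(days=k) for k in range(30)]
# b) a new user with a single action
newcomer = [date(2016, 1, 30)]

sig_veteran = activity_signals(veteran, now)
sig_newcomer = activity_signals(newcomer, now)
print(segment(sig_veteran), sig_veteran["days_since_last"])
print(segment(sig_newcomer), sig_newcomer["days_since_last"])
```

Both users have the same time since their last action, but they land in different segments, which is the distinction the segment variable is meant to give the regressor M.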


One thing you can do about your 'participation' variable is to include the beginning and end of your data window. Let's say your data ranges from $start$ = 1/1/2016 to $end$ = 1/1/2017. Instead of only calculating the difference between the second and the first post, you'd calculate the difference between the first post and 1/1/2016 and then the difference between the second and the first post. (If some people joined the platform after 1/1/2016, you'd use the later of the join date and 1/1/2016 as their start.) And you'd also calculate the difference between 1/1/2017 and the last post.

So, if someone had only two posts $p_1$ and $p_2$, you would get the differences $(p_1-start,\ p_2-p_1,\ end-p_2)$. The differences $p_1-start$ and $end-p_2$ would then be larger than the differences $p_1-start$ and $end-p_n$ for someone with $n \gg 2$ posts, so the two-post user ends up with a larger average gap.
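The calculation above can be sketched as follows. The function name `gaps_with_window`, its `joined` parameter, and the two example users are assumptions for illustration.

```python
from datetime import date, timedelta

def gaps_with_window(posts, start, end, joined=None):
    """Day gaps between consecutive posts, padded with the window edges."""
    # If the user joined after the window opened, measure from the join date.
    origin = max(start, joined) if joined else start
    points = [origin] + sorted(posts) + [end]
    return [(b - a).days for a, b in zip(points, points[1:])]

start, end = date(2016, 1, 1), date(2017, 1, 1)

# Hypothetical users: two posts on successive days vs. one post per week.
occasional = [date(2016, 3, 1), date(2016, 3, 2)]
regular = [start + timedelta(days=3 + 7 * k) for k in range(52)]

gaps_occasional = gaps_with_window(occasional, start, end)
gaps_regular = gaps_with_window(regular, start, end)

print(gaps_occasional)  # [60, 1, 305]
print(sum(gaps_occasional) / len(gaps_occasional))
print(sum(gaps_regular) / len(gaps_regular))
```

Because the padded gaps always sum to the window length, the average gap for a user with n posts equals the window length divided by n + 1, so this metric reflects the number of submissions directly.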
