Proximity matrix of a random forest and data leakage
My objective is to train a random forest classifier on a binary-labeled dataset and use the resulting proximity matrix to understand sub-populations in the data. I have read some papers on the subject, but I find it difficult to design a pipeline that is robust and does not leak data. I want to obtain a matrix that is stable over many repetitions, so I can be confident it generalizes. For example, I might do something like this:
all_data = m x n matrix              # m samples, n features
prox_mat = zeros(m, m)               # proximity is sample-by-sample, so m x m (not m x n)
for 100 repetitions:
    train, test = stratified_split_data(all_data, 0.75)
    clf = randomforest(ntrees=1000)  # random forest with bootstrapping
    clf.fit(train)
    ypred = clf.predict(test)        # over the repetitions this gives a distribution of generalization performance on test data
    prox_mat += clf.fit(all_data).prox_mat()  # assume a function that will generate the proximity matrix
avg_prox_mat = prox_mat / 100
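A minimal runnable sketch of this pipeline, assuming scikit-learn and a toy dataset. scikit-learn has no built-in `prox_mat()`, so the `proximity` helper here is an assumption: it implements Breiman's standard definition (the fraction of trees in which two samples land in the same leaf) via `clf.apply()`. The refit on all of the data before computing proximities reproduces the step my question is about:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def proximity(clf, X):
    """Breiman proximity: fraction of trees where samples i and j share a leaf."""
    leaves = clf.apply(X)                       # leaf indices, shape (n_samples, n_trees)
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)               # shape (n_samples, n_samples)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
n_reps = 5                                      # 100 in the question; fewer here for speed
prox_sum = np.zeros((len(X), len(X)))
scores = []
for rep in range(n_reps):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.75, stratify=y, random_state=rep)
    clf = RandomForestClassifier(n_estimators=100, random_state=rep)
    clf.fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))        # distribution of generalization performance
    clf.fit(X, y)                               # refit on ALL data, as in the pseudocode
    prox_sum += proximity(clf, X)               # proximities over the full dataset
avg_prox_mat = prox_sum / n_reps
```

The resulting matrix is symmetric with ones on the diagonal, since every sample trivially shares its own leaf in every tree.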
Is it proper to generate the proximity matrix on all of the data, while having trained the RF on a subset of the data? I can still get a generalized performance metric through the test data, but I wonder if the distances in the proximity matrix are somehow overfit here. Would it be better to do something like this in the inner loop:
prox_mat_temp = clf.prox_mat(test)       # a smaller proximity matrix over test data only, using the already-trained forest
prox_mat[test_indices] += prox_mat_temp  # build up subsets of the full proximity matrix over the repetitions
I think I would then have to adjust the averaging by how often each pair of samples was jointly held out (rather than dividing everything by the number of repetitions), but then every proximity estimate is 'clean', and as long as I run enough repeats I should build up a clean proximity matrix covering all samples.
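A sketch of this test-only variant, under the same assumptions as before (scikit-learn, toy data, and a hypothetical `proximity` helper implementing Breiman's leaf-sharing definition). Proximities are computed only among held-out samples with a forest that never saw them, and each pair's sum is divided by the number of repetitions in which that pair was held out together:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def proximity(clf, X):
    """Breiman proximity: fraction of trees where samples i and j share a leaf."""
    leaves = clf.apply(X)                       # leaf indices, shape (n_samples, n_trees)
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
m = len(X)
prox_sum = np.zeros((m, m))
pair_counts = np.zeros((m, m))                  # times each pair was jointly held out
for rep in range(20):
    idx_tr, idx_te = train_test_split(
        np.arange(m), train_size=0.75, stratify=y, random_state=rep)
    clf = RandomForestClassifier(n_estimators=100, random_state=rep)
    clf.fit(X[idx_tr], y[idx_tr])               # forest never sees held-out samples
    block = np.ix_(idx_te, idx_te)              # index the test-by-test sub-matrix
    prox_sum[block] += proximity(clf, X[idx_te])
    pair_counts[block] += 1
# average each pair by how often it was observed; pairs never jointly held out stay 0
avg_prox_mat = np.divide(prox_sum, pair_counts,
                         out=np.zeros_like(prox_sum), where=pair_counts > 0)
```

With a 25% hold-out, any given pair lands in the test set together in roughly 1/16 of the repetitions, so the repetition count has to be large enough that `pair_counts` is nonzero (and ideally well above 1) for every pair.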
Topic data-leakage random-forest clustering
Category Data Science