Random forest proximity matrices and data leakage

My objective is to train a random forest classifier on a binary classification dataset and use the resulting proximity matrix to understand the sub-populations in the data. I have read some papers on this subject, but I find it difficult to develop a pipeline that is robust and does not leak data. I want to estimate a stable matrix over many iterations so I can be confident it will generalize. For example, I may do something like this:

all_data = m x n matrix (m samples, n features)
prox_mat = zeros(m, m) # proximities are pairwise between samples, so m x m

for 100 repetitions:
    train, test = stratified_split_data(all_data, 0.75)
    clf = randomforest(ntrees=1000) # random forest with bootstrapping
    clf.fit(train)

    ypred = clf.predict(test) # over the iterations I get a distribution of generalized performance on test data

    prox_mat += clf.prox_mat(all_data) # assume a function that computes proximities for all samples using the already-trained forest

avg_prox_mat = prox_mat / 100
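For context, scikit-learn has no built-in proximity method, but the standard trick is to push samples through the trained trees with `apply()` and count how often each pair of samples lands in the same leaf. A minimal sketch (the helper name `proximity_matrix` is my own, not a library function):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def proximity_matrix(clf, X):
    """Fraction of trees in which each pair of rows of X falls in the same leaf."""
    leaves = clf.apply(X)  # shape (n_samples, n_trees): leaf index per sample per tree
    n = X.shape[0]
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        col = leaves[:, t]
        prox += (col[:, None] == col[None, :])  # pairs sharing a leaf in tree t
    return prox / leaves.shape[1]

X, y = make_classification(n_samples=50, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
P = proximity_matrix(clf, X)  # symmetric, values in [0, 1], diagonal = 1
```

Note that this reuses the fitted forest rather than refitting it, which is the point of the question: the training happens once on `train`, and the proximities are then read off for whatever samples you pass in.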

Is it proper to generate the proximity matrix on all of the data when the RF was trained on only a subset of it? I can still get a generalized performance metric through the test data, but I wonder whether the distances in the proximity matrix are somehow overfit here, since the training samples were seen by the trees. Would it be better to do something like this in the inner loop:

prox_mat_temp = clf.prox_mat(test) # a smaller proximity matrix of only test data, using the forest trained on train
prox_mat(test_indices) += prox_mat_temp # we build up subsets of the full proximity matrix over time

I think I would then have to adjust the averaging for the hold-out fraction I used (each pair of points only appears together in a fraction of the test sets, so dividing by the number of repetitions would be wrong), but then every proximity matrix is 'clean', and as long as I run enough repeats I should build up a clean proximity matrix.
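The test-only scheme above can be sketched by tracking, for each pair of samples, both the accumulated proximity and how many times that pair co-occurred in a test set, then normalizing per pair rather than by the repetition count. This is only a sketch of that idea, with my own hypothetical `proximity_matrix` helper and arbitrary split settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit

def proximity_matrix(clf, X):
    """Fraction of trees in which each pair of rows of X falls in the same leaf."""
    leaves = clf.apply(X)
    prox = np.zeros((X.shape[0], X.shape[0]))
    for t in range(leaves.shape[1]):
        col = leaves[:, t]
        prox += (col[:, None] == col[None, :])
    return prox / leaves.shape[1]

X, y = make_classification(n_samples=60, random_state=0)
m = X.shape[0]
prox_sum = np.zeros((m, m))
pair_counts = np.zeros((m, m))  # how often each pair appeared together in a test set

splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X[train_idx], y[train_idx])           # forest never sees the test rows
    P = proximity_matrix(clf, X[test_idx])        # 'clean' proximities on held-out data
    prox_sum[np.ix_(test_idx, test_idx)] += P
    pair_counts[np.ix_(test_idx, test_idx)] += 1

# per-pair normalization; pairs that never co-occurred stay at 0
avg_prox = np.divide(prox_sum, pair_counts,
                     out=np.zeros_like(prox_sum), where=pair_counts > 0)
```

One caveat: with a 25% hold-out, each pair of samples only lands in the same test set about 6% of the time, so the number of repetitions needs to be large enough that `pair_counts` covers the pairs you care about.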

Topic data-leakage random-forest clustering

Category Data Science
