Finding the best "depth" of ICD9 codes with pseudo-hierarchical clustering
Here is a common problem in health care modeling. Did I just invent a new algorithm or has someone already thought of this?
The goal is to find the most homogeneous partition of patients by medical costs using ICD9 codes. There are 13,000 individual codes in the data set, so using the full code results in many only having a few observations.
ICD9 codes are in a nested hierarchical structure. For instance, all infectious diseases are 001-139, one particular disease is Cholera (ICD9 001), and this can have several other suffixes which further specify the illness. This is a "drill down" so to speak. 001.0 and 001.1 are different types of the same disease.
The goal is to find the best level of detail for the data set. For example, say that these codes are binary for simplicity, and are only 3 digits long. Then the only possible codes are 000, 010, 011, 001, 111, 110, 101, 001.
One way of searching through all possible substrings of these would be to use a decision tree. The x matrix would be all substring combinations, and the response would be the average costs for that particular code.
X would have a column for substring 0, 01, 010, 011, etc. Each row of X would have 2^3*3 = 24 columns. Each row would represent a single code.
The top 3 rows of X would look like this. The response Y would be the average dollar amount of healthcare costs for that ICD code.
Once the data is in this format, a decision tree (or other model) could be used in order to determine which substring prefixes should stay in the model. Because different ICD codes have different number of patients, this would be used as a weight vector. The model could be tuned to allow deeper trees (more digits).
Is there a name for this already? Thanks
Topic embeddings hierarchical-data-format clustering
Category Data Science