Finding the best "depth" of ICD9 codes with pseudo-hierarchical clustering

Here is a common problem in health care modeling. Did I just invent a new algorithm or has someone already thought of this?

The goal is to find the most homogeneous partition of patients by medical costs using ICD9 codes. There are 13,000 individual codes in the data set, so using the full code results in many only having a few observations.

ICD9 codes are in a nested hierarchical structure. For instance, all infectious diseases are 001-139, one particular disease is Cholera (ICD9 001), and this can have several other suffixes which further specify the illness. This is a "drill down" so to speak. 001.0 and 001.1 are different types of the same disease.

The goal is to find the best level of detail for the data set. For example, say that these codes are binary for simplicity, and are only 3 digits long. Then the only possible codes are 000, 010, 011, 001, 111, 110, 101, 001.

One way of searching through all possible substrings of these would be to use a decision tree. The x matrix would be all substring combinations, and the response would be the average costs for that particular code.

X would have a column for substring 0, 01, 010, 011, etc. Each row of X would have 2^3*3 = 24 columns. Each row would represent a single code.

The top 3 rows of X would look like this. The response Y would be the average dollar amount of healthcare costs for that ICD code.

Once the data is in this format, a decision tree (or other model) could be used in order to determine which substring prefixes should stay in the model. Because different ICD codes have different number of patients, this would be used as a weight vector. The model could be tuned to allow deeper trees (more digits).

Is there a name for this already? Thanks

Topic embeddings hierarchical-data-format clustering

Category Data Science


I built a solution to this, albeit, manually intensive. List the codes and descriptions in Excel, tab delimited between the code and description. Sort the data based on the ICD code as text, this will address the tree structure of the logic. Add a column to the left of the ICD code then use

=IF(LEN([cell)>3, (LEFT([cell], LEN([cell])-1),"")

to get the first cut a candidate parent. Resort the data set based on the candidate parent. When you have the top layer of parents, link that to the ICD code range you can find at a site like http://www.icd9data.com/2015/Volume1/default.htm. I will revise my response here and post the location of the finished file shortly after I complete my attempt to do this. I will include ICD-10 and a crosswalk of the code.


It's called a prefix tree.

And because the ICD codes are human made, and humans tend to think in categories, there is indeed such a structure in this codes.

It's just that humans first thought of the structure, then made these codes to reflect that structure, not the other way round. What you are now doing is reverse-engineering this...

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.