Spatial index for variable kernel nonparametric density
I'm trying to build a nonparametric density function for a fairly large dataset that can be evaluated efficently, and can be updated efficiently when new points are added. There will only ever be a maximum of 4 independent variables, but we can start off with 2. Lets use a gaussian kernel. Let the result be a probability density function, i.e. its volume will be 1.
In each evaluation, we can omit all points for which the evaluation point is outside a certain ellipsoid corresponding to the minimum gaussian value we care about. We can change this threshold for accuracy or performance, and the maximum number of points inside the threshold will depend on the chosen covariance matrix of the kernel. Then, we can evaluate the distribution approximately using the subset of points.
If we use a fixed kernel, then we can use the eigenvalues and eigenvectors we get from the covariance matrix to transform each point so that the threshold ellipsoid is a fixed circle. We can then shove all the transformed points into a spatial index, and efficiently find all points within the required radius of the evaluation point.
However, we would the kernel to be variable for two reasons. (1) to fit the data better, and (2) because adding or modifying points would require the fixed kernel to be updated, which would mean that the entire data set would need to be reindexed. With a variable kernel, we could make new/updated points only affect the closest points.
Specifically, is there a spatial index that can efficiently find ellipses surrounding a given point from a set of around 10 million ellipses of different shapes and sizes?
More generally though, does my approach look sound? I am open to answers like "give up and precalculate a grid of results". Answers much appreciated!
Topic data-indexing-techniques
Category Data Science