Varying strength of prior for MCMC hierarchical linear model

I am training an MCMC model using PyMC3.

My aim is to build a series of linear regression models that predict the time to unload a truck based on the number of crates to unload.

I have data for 2000 locations and they are divided into 4 categories of location. Locations of the same category tend to have similar unloading times.

For each location I have a series of recorded data points:

  • number of crates on the truck
  • how long it took to unload

I want to train a linear model for each location of this form:

y = α + βx, where y is the unloading time and x is the number of crates.

For some locations I may have only 20 data points; for others I may have 10,000.

Obviously, when I have 10,000 data points there is enough data to train a simple linear regression at that location with no other input.

However, when I have a smaller amount of data, such as 20 points, I am hoping to train an overarching model on all the data points from locations in that category, and use its α and β as priors for that location's α and β.
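
For reference, here is roughly how I build the pre-trained category-level models that appear as category_model_lookup in the code below. This is only a sketch: I am assuming a plain scikit-learn LinearRegression per category, which is where the intercept_ and coef_ attributes used later come from.

from sklearn.linear_model import LinearRegression

# Fit one ordinary least-squares model per category on the pooled data
# from every location in that category
category_model_lookup = {}
df_train["category"] = df_train["location_id"].map(location_category_lookup_dict)
for category_id, drops_in_category in df_train.groupby("category"):
    X = drops_in_category[["crates"]].values
    y = drops_in_category["unloading_time"].values
    category_model_lookup[category_id] = LinearRegression().fit(X, y)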

I have been following the very helpful example at https://github.com/parsing-science/pymc3_quickstart_guide/blob/master/Quickstart_Guide_no_errors.ipynb.

My questions are:

  • how can I vary the strength of the prior so that it is very strong when I have few data points, and its influence weakens to zero for the locations where I have enough data points?
  • what values should I set for sigma_alpha and sigma_beta, the stdevs of the prior distributions of α and β? The PyMC3 examples use a uniform distribution; would the best idea be to work out the approximate distribution by training a linear model at each location and getting the stdev of the parameters (see the sketch after this list)?
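
To make the second question concrete, this is the kind of empirical estimate I have in mind. It is only a rough sketch: the per-location scikit-learn LinearRegression fits and the cutoff of 100 points are my own assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit an independent linear model at each data-rich location, then take
# the spread of the fitted parameters within each category as a rough
# estimate of sigma_alpha and sigma_beta for that category
alphas_by_cat, betas_by_cat = {}, {}
for location_id, drops in df_train.groupby("location_id"):
    if len(drops) < 100:  # arbitrary cutoff: skip data-poor locations
        continue
    m = LinearRegression().fit(drops[["crates"]].values, drops["unloading_time"].values)
    cat = location_category_lookup_dict[location_id]
    alphas_by_cat.setdefault(cat, []).append(m.intercept_)
    betas_by_cat.setdefault(cat, []).append(m.coef_[0])

sigma_alpha_by_cat = {c: np.std(a) for c, a in alphas_by_cat.items()}
sigma_beta_by_cat = {c: np.std(b) for c, b in betas_by_cat.items()}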

Here's my MCMC code:

import theano.tensor as T
import numpy as np
import pymc3 as pm
import pandas as pd

# load pandas dataframe df_train which has columns location_id, crates, unloading_time
# load location_category_lookup_dict which gives the category of each location
# load category_model_lookup which gives the pre-trained linear model for each of the 4 categories

num_features = 1

# One group of recorded drops per location
grouped = df_train.groupby("location_id")

location_no_to_mcmc_models = {}

for location_id, drops_at_this_location in grouped:
    location_category_id = location_category_lookup_dict[location_id]

    # Look up the intercept and coefficient in the previously trained linear model for this category
    model_for_this_locations_category = category_model_lookup[location_category_id]

    alpha_this_category = model_for_this_locations_category.intercept_
    beta_this_category = model_for_this_locations_category.coef_

    # Get the features (number of crates) and regressand (unloading time)
    features = np.asarray([[c] for c in drops_at_this_location["crates"]])
    y = np.asarray(drops_at_this_location["unloading_time"])

    # Set up the PyMC model
    hlm = pm.Model()
    with hlm:

        # Both alpha and beta are drawn from similar distributions
        sigma_alpha = pm.Uniform("sigma_alpha", .0, 20, testval=2.)

        sigma_beta = pm.Uniform("sigma_beta", .0, 4, testval=2.)

        # Simulate the alphas
        alpha = pm.Normal("alpha", alpha_this_category, sigma_alpha, shape=(1))

        # Simulate the betas
        beta = pm.Normal('beta', beta_this_category, sigma_beta, shape=(1, num_features))

        # Simulate the noise
        s = pm.Uniform("s", .01, 10, shape=1)

        # Likelihood of the observed unloading times
        yd = pm.Normal('y', mu=alpha[0] + T.sum(beta[0] * features, 1), tau=s[0] ** -2, observed=y)

        # Run the NUTS sampler inside the model context
        step = pm.NUTS()
        tr = pm.sample(int(1e4), step=step)

    mcmc_summary = pm.summary(tr)

    location_no_to_mcmc_models[location_id] = mcmc_summary

    print(location_id, "intercept:", mcmc_summary["mean"]["alpha__0"], "weight:", mcmc_summary["mean"]["beta__0_0"])
