The Bayesian t-test in Python


BEST

This article demonstrates how to perform the Bayesian equivalent of a t-test in Python. The methodology is inspired by the paper "Bayesian estimation supersedes the t test" (BEST) by Kruschke (2013).

Bayesian estimation is a good alternative to the t-test, addressing several of its limitations: the non-intuitive framing around a null hypothesis, the normality assumption, which is rarely satisfied in practice, and the inability to definitively affirm the null hypothesis.

The Bayesian method is more intuitive: we generate a credible posterior distribution for the parameters of interest from the observations themselves combined with our prior beliefs about those parameters. This is expressed by Bayes' theorem:

$$P(Parameter \mid Data) = \dfrac{P(Data \mid Parameter) \cdot P(Parameter)}{P(Data)}$$

where P(Parameter) is the prior distribution and P(Parameter | Data) is the posterior.
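To make the formula concrete, here is a minimal sketch (not from the original notebook) that approximates the posterior of a single mean on a grid; the data values, grid range, prior, and the assumption of a known standard deviation are all hypothetical choices for illustration:

import numpy as np
from scipy import stats

# Hypothetical observations and a grid of candidate values for the mean
data = np.array([4.9, 5.2, 5.1, 4.8, 5.0])
grid = np.linspace(4.0, 6.0, 201)

# P(Parameter): a broad normal prior belief about the mean
prior = stats.norm.pdf(grid, loc=5.0, scale=1.0)

# P(Data | Parameter): likelihood of the data at each candidate mean,
# assuming a known standard deviation of 0.2
likelihood = np.array([stats.norm.pdf(data, loc=m, scale=0.2).prod()
                       for m in grid])

# P(Parameter | Data): posterior, normalised so it sums to one
# (the normalisation plays the role of P(Data))
posterior = prior * likelihood
posterior /= posterior.sum()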

To build these models and approximate the posterior distribution we will be using the PyMC library (the code below uses the PyMC2 API).

In [2]:
import pymc as pm
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Let's generate some data. We draw both samples from normal distributions here, but any other distribution could be chosen instead.

In [3]:
x1 = np.random.normal(500, 25, 1000)
x2 = np.random.normal(497, 25, 1000)
In [4]:
%matplotlib inline

sns.set(font_scale=1.5)
plt.figure(figsize=(8,6))

# Overlay the empirical densities of the two samples
sns.distplot(x1)
sns.distplot(x2)

plt.xlabel('x')
plt.ylabel('Density')
plt.legend(['x1', 'x2'])

plt.show()

We then proceed to define our prior distributions: a broad normal prior on each group mean, a uniform prior on each variance, and a shifted exponential prior on the Student-t normality parameter ν. The priors and initial parameters follow the ones used in the BEST paper.

In [5]:
def setup_priors(data):
    '''Set up the priors, given the pooled observations from both groups.'''
    priors = dict()

    # Bounds for the uniform priors on the variances, spanning three
    # orders of magnitude either side of the pooled sample variance
    lower = np.var(data) / 1000.0
    upper = np.var(data) * 1000.0

    # Shifted exponential prior on the normality parameter nu (mean ~30),
    # letting the t-distribution range from heavy-tailed to near-normal
    v = pm.Exponential("nu", beta=1.0/29) + 1

    # Broad normal priors on the group means, centred on the pooled mean
    # with very low precision (i.e. a very wide prior)
    tau = 1.0 / np.var(data) / 1000.0
    mu1 = pm.Normal('mu1', mu=np.mean(data), tau=tau)
    var1 = pm.Uniform('var1', lower=lower, upper=upper)

    mu2 = pm.Normal('mu2', mu=np.mean(data), tau=tau)
    var2 = pm.Uniform('var2', lower=lower, upper=upper)

    priors['mu1'] = mu1
    priors['var1'] = var1
    priors['mu2'] = mu2
    priors['var2'] = var2
    priors['v'] = v

    return priors

We then define the models, using Student's t-distribution for the likelihood so that we do not have to assume normality (we happen to know our data are normal in this example, but in real applications we usually do not). Both groups share the same normality parameter ν.

In [6]:
def generate_models(values, values2, priors):
    '''Build the model from the priors and the two observed samples.'''
    v = priors['v']
    mu1 = priors['mu1']
    var1 = priors['var1']
    lam1 = 1.0 / var1  # precision is the reciprocal of the variance

    mu2 = priors['mu2']
    var2 = priors['var2']
    lam2 = 1.0 / var2

    # Likelihoods: Student's t with the observed data attached;
    # both groups share the same normality parameter nu
    t1 = pm.NoncentralT("t1", mu=mu1, lam=lam1, nu=v, value=values, observed=True)
    t2 = pm.NoncentralT("t2", mu=mu2, lam=lam2, nu=v, value=values2, observed=True)

    # Collect the likelihoods and priors into a single model
    model = pm.Model([t1, mu1, var1, t2, mu2, var2, v])

    return model

We can now begin sampling using MCMC. We run the sampler for 40,000 iterations, discard the first 10,000 as burn-in, and keep every second sample (a thinning of 2), leaving 15,000 posterior draws per parameter.

In [7]:
priors = setup_priors(np.concatenate((x1, x2), axis=0))
model = generate_models(x1, x2, priors)
mcmc = pm.MCMC(model)  # Generate our MCMC object
mcmc.sample(40000, 10000, 2)  # Run MCMC sampler
 [-----------------100%-----------------] 40000 of 40000 complete in 40.8 sec
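Before interpreting the results, it is worth checking that the chains have converged. A minimal sketch using PyMC2's built-in diagnostics (the Matplot module and the sampler's stats() method):

# Trace, autocorrelation and histogram plots for each tracked variable
pm.Matplot.plot(mcmc)

# Summary statistics (posterior mean, sd, credible interval) for mu1
print(mcmc.stats()['mu1'])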

Finally, we use the posterior samples to calculate and plot the distribution of the difference of the two group means.

In [9]:
mu1 = mcmc.trace('mu1')[:]  # posterior samples of the group means
mu2 = mcmc.trace('mu2')[:]
diff_mus = mu1 - mu2  # posterior of the difference of means

sns.set(font_scale=1.5)
plt.figure(figsize=(8,6))

sns.kdeplot(diff_mus)
plt.title('Difference of means density plot')
plt.vlines(0,0,0.45, linestyles='--', color='r')
plt.ylabel('Density')
plt.xlabel('X1 - X2')
plt.show()
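With the posterior draws in hand we can also report the summaries emphasised in the BEST paper. A short sketch (not in the original notebook): the posterior probability that the mean of x1 exceeds that of x2, a 95% credible interval for the difference, and the effect size, which the paper defines as $(\mu_1 - \mu_2)/\sqrt{(\sigma_1^2 + \sigma_2^2)/2}$:

# Posterior probability that the mean of x1 exceeds the mean of x2
prob_greater = (diff_mus > 0).mean()

# 95% credible interval for the difference of means
# (equal-tailed, via percentiles; the paper reports the HDI)
ci_low, ci_high = np.percentile(diff_mus, [2.5, 97.5])

# Effect size, following the BEST paper's definition
var1 = mcmc.trace('var1')[:]
var2 = mcmc.trace('var2')[:]
effect_size = diff_mus / np.sqrt((var1 + var2) / 2.0)

print('P(mu1 > mu2) = %.3f' % prob_greater)
print('95%% credible interval: (%.2f, %.2f)' % (ci_low, ci_high))
print('Mean effect size: %.3f' % effect_size.mean())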