This article demonstrates how to perform the Bayesian equivalent of a t-test in Python. The methodology is inspired by the paper Bayesian estimation supersedes the t test (BEST) by John Kruschke.
Bayesian estimation is a good alternative to the t-test that addresses the t-test's main limitations: its non-intuitive reliance on the null hypothesis, its normality assumption, which is rarely satisfied in practice, and its inability to ever affirm the null hypothesis definitively.
The Bayesian method is more intuitive: it generates a credible posterior distribution over the parameters of interest by combining the observations themselves with our prior beliefs about those parameters. This is expressed by Bayes' theorem:
$$P(\text{Parameter} \mid \text{Data}) = \dfrac{P(\text{Data} \mid \text{Parameter}) \, P(\text{Parameter})}{P(\text{Data})}$$
where P(Parameter) is the prior distribution and P(Parameter | Data) is the posterior.
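To make the update rule concrete, here is a minimal toy sketch (not part of the BEST workflow; the candidate values and data are made up for illustration) that applies Bayes' theorem to estimate a coin's bias over a small grid of hypotheses.
import numpy as np

# Hypothetical example: posterior over a coin's bias after 7 heads in 10 flips
biases = np.array([0.3, 0.5, 0.7])  # candidate parameter values
prior = np.ones(3) / 3.0            # uniform prior P(Parameter)
heads, flips = 7, 10                # observed data

# Likelihood P(Data | Parameter) for each candidate bias (binomial kernel)
likelihood = biases**heads * (1 - biases)**(flips - heads)

# Posterior = likelihood * prior, normalised by the evidence P(Data)
posterior = likelihood * prior / np.sum(likelihood * prior)
print(posterior)  # most of the mass shifts toward 0.7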
To build these models and approximate the posterior distribution we will use the PyMC library. Note that the code below uses the classic PyMC 2 API (pm.MCMC, pm.NoncentralT and friends); the interface of current PyMC releases is substantially different.
import pymc as pm
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Let's generate some data. Here we draw two samples of 1,000 points from normal distributions with slightly different means; we could alternatively draw from other distributions.
np.random.seed(0)  # fix the seed so the example is reproducible (any value will do)
x1 = np.random.normal(500, 25, 1000)
x2 = np.random.normal(497, 25, 1000)
%matplotlib inline
sns.set(font_scale=1.5)
plt.figure(figsize=(8,6))
sns.distplot(x1)
sns.distplot(x2)
plt.xlabel('x')
plt.ylabel('Density')
plt.legend(['x1', 'x2'])
plt.show()
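For comparison, this is what the classical frequentist check on the same data would look like, a quick sketch using scipy.stats.ttest_ind (assuming scipy is available). Note that it returns only a test statistic and a p-value, with no distribution over the difference of means.
from scipy import stats

# Classical two-sample t-test on the same data, for comparison
t_stat, p_value = stats.ttest_ind(x1, x2)
print(t_stat, p_value)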
We then proceed to define our prior distributions. Following the BEST paper, we choose a very diffuse normal distribution for the group means, a broad uniform distribution for the variances (the paper places this uniform prior on the standard deviation, but the idea is the same), and a shifted exponential for the normality parameter ν.
def setup_priors(data):
    '''Set up the priors, taking the combined samples as input.'''
    priors = dict()
    # Broad uniform prior on the variances, spanning six orders of magnitude
    lower = np.var(data) / 1000.0
    upper = np.var(data) * 1000.0
    # Normality parameter: nu - 1 ~ Exponential with mean 29, as in the BEST paper
    v = pm.Exponential('nu', beta=1.0 / 29) + 1
    # Very diffuse normal prior on the means, centred on the pooled mean;
    # BEST uses sd = 1000 * sd(data), i.e. precision 1 / (1000^2 * var)
    tau = 1.0 / (np.var(data) * 1000.0**2)
    mu1 = pm.Normal('mu1', mu=np.mean(data), tau=tau)
    var1 = pm.Uniform('var1', lower=lower, upper=upper)
    mu2 = pm.Normal('mu2', mu=np.mean(data), tau=tau)
    var2 = pm.Uniform('var2', lower=lower, upper=upper)
    priors['mu1'] = mu1
    priors['var1'] = var1
    priors['mu2'] = mu2
    priors['var2'] = var2
    priors['v'] = v
    return priors
We then define the models, choosing Student's t-distribution for the likelihood to avoid assuming normality (we know our data is normal in this particular example, but in real applications we usually do not); a brief illustration of why the t-distribution is the safer choice follows the code.
def generate_models(values, values2, priors):
    '''Generate the models from the priors and the two observed samples.'''
    v = priors['v']
    mu1 = priors['mu1']
    var1 = priors['var1']
    lam1 = 1.0 / var1  # PyMC parametrises the t-distribution by precision, not variance
    mu2 = priors['mu2']
    var2 = priors['var2']
    lam2 = 1.0 / var2
    # Include our observed data in the model; both groups share the
    # normality parameter nu, as in the BEST paper
    t1 = pm.NoncentralT('t1', mu=mu1, lam=lam1, nu=v, value=values, observed=True)
    t2 = pm.NoncentralT('t2', mu=mu2, lam=lam2, nu=v, value=values2, observed=True)
    # Push our priors into a model
    model = pm.Model([t1, mu1, var1, t2, mu2, var2, v])
    return model
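As an aside, the heavier tails of Student's t are what make the model robust to outliers. A small sketch with scipy (the choice of ν = 3 is arbitrary, for illustration only) shows how much more probable extreme observations are under the t-distribution than under the normal:
from scipy import stats

# Tail probability of an observation 4 standard units out
print(stats.norm.sf(4))     # standard normal: about 3e-05
print(stats.t.sf(4, df=3))  # Student's t with nu = 3: about 0.014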
We can now begin sampling using MCMC. We run the sampler for 40,000 iterations, discard the first 10,000 as burn-in and thin the chain by keeping every second sample, which leaves 15,000 posterior samples per parameter.
priors = setup_priors(np.concatenate((x1, x2), axis=0))  # priors built from the pooled data
model = generate_models(x1, x2, priors)
mcmc = pm.MCMC(model)  # generate our MCMC object
mcmc.sample(iter=40000, burn=10000, thin=2)  # run the MCMC sampler
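Before reading anything off the posterior it is worth checking that the chains have mixed. The calls below are a sketch of the basic diagnostics PyMC 2 provides (trace plots and numerical summaries):
pm.Matplot.plot(mcmc)       # trace, autocorrelation and histogram per variable
print(mcmc.stats()['mu1'])  # posterior mean, sd and 95% HPD interval for mu1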
Finally, we use the posterior samples to calculate and plot the distribution of the difference between the means of the two samples.
mu1 = mcmc.trace('mu1')[:]  # posterior samples of the two means
mu2 = mcmc.trace('mu2')[:]
diff_mus = mu1 - mu2  # difference of the means, sample by sample
sns.set(font_scale=1.5)
plt.figure(figsize=(8,6))
sns.kdeplot(diff_mus)
plt.title('Difference of means density plot')
plt.vlines(0, 0, 0.45, linestyles='--', color='r')  # reference line at zero difference
plt.ylabel('Density')
plt.xlabel('X1 - X2')
plt.show()
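Because we have the full posterior, we can answer questions directly instead of relying on a p-value. A short sketch, using a simple percentile interval in place of the HDI reported in the BEST paper:
# Posterior probability that the mean of x1 exceeds the mean of x2
print((diff_mus > 0).mean())

# 95% credible interval for the difference of means (percentile-based)
print(np.percentile(diff_mus, [2.5, 97.5]))
If this interval lies clearly away from zero, we have direct evidence that the means differ; if it is narrow and straddles zero, we can, unlike with a t-test, make an affirmative statement that any difference is negligible.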