Session 3

Session 3 - model comparison

Rather than considering models in isolation, how do different models' fits to the same data compare? This session considers ways to quantify the quality of a model fit relative to others.

Key objectives:

Use the Bayesian evidence to perform model comparison
Quantify how often the more complex model could be selected incorrectly
Quantify how often the simpler model could be selected incorrectly

Exercise 3.1 - Bayes factors

Here we use the Bayesian evidence to perform model comparison with Bayes factors.

Copy the model_compare.py script to your working directory to perform Bayesian evidence model comparison

Run the script:
$ python3 model_compare.py outputfiles_basename1/ outputfiles_basename2/ outputfiles_basename3/

Which model is preferred by the Bayes factors? Try changing the limit value in model_compare.py from 30 to 100. Has the preferred model order changed? Why or why not?

Temporarily try including a cabs component in model3:

PyXspec:

model4 = Model("TBabs*zTBabs*cabs(zpowerlw+zgauss)")
model4.TBabs.nH.values = (0.04, -1.) ## Milky Way absorption
model4.zTBabs.Redshift.values = (0.05, -1.)
model4.zTBabs.nH.values = (1., 0.01, 0.01, 0.01, 100., 100.)
model4.cabs.nH.link = "p%d" %(model4.zTBabs.nH.index)
model4.zpowerlw.PhoIndex = (1.8, 0.1, -3., -2., 9., 10.)
model4.zpowerlw.Redshift.link = "p%d" %(model4.zTBabs.Redshift.index)
model4.zpowerlw.norm = (1.e-4, 0.01, 1.e-10, 1.e-10, 1., 1.)
model4.zgauss.LineE.values = (6.4, 0.1, 6., 6., 7.2, 7.2)
model4.zgauss.Sigma.values = (1.e-3, 0.01, 1.e-3, 1.e-3, 1., 1.)
model4.zgauss.Redshift.link = "p%d" %(model4.zTBabs.Redshift.index)
model4.zgauss.norm.values = (1.e-5, 0.01, 1.e-10, 1.e-10, 1., 1.)

How does the Bayesian Evidence from fitting the new model4 to the data compare to the previous models?

Exercise 3.2 - type I errors

How can we know what Bayes factor threshold to use? Simulations! Here we will calculate the false-positive rate associated with our data and models to tell us how often a more complex model is selected when the simpler model was actually correct.

First we will consider the false-positive rate between the simpler model1 and more complex model2 models.

Simulate 100 NuSTAR spectra from the best-fit model1 spectrum using the original spectral files in each simulation and bin the simulated spectra in the same manner as the original data.

Using BXA, fit both model1 and model2 to each of the 100 simulated spectra. Make sure to have unique outputfiles_basename values!

On each iteration, save the logZ values:

results = solver.run(resume=True)
logZ = results["logZ"]

Use your two distributions of logZ (one distribution for fitting model1-simulated data with model1, the other from fitting with model2) to produce a Bayes factor distribution.

What Bayes factor threshold is needed to achieve a false-positive rate of 1% or less?

Perform the above exercises for model2 vs. model3 and model1 vs. model3. Are all the Bayes factor thresholds for a <1% false-positive rate similar between different models? Why or why not?

Exercise 3.3 - type II errors

Here we will calculate the false-negative rate associated with our data and models to tell us how often a less complex model is selected when the more complex model was actually correct.

First we will consider the false-negative rate between the more complex model2 and simpler model1 models.

Simulate 100 NuSTAR spectra from the best-fit model2 spectrum using the original spectral files in each simulation and bin the simulated spectra in the same manner as the original data.

Using BXA, fit both model1 and model2 to each of the 100 simulated spectra. Make sure to have unique outputfiles_basename values!

On each iteration, save the logZ values:

results = solver.run(resume=True)
logZ = results["logZ"]

Use your two distributions of logZ (one distribution for fitting model2-simulated data with model1, the other from fitting with model2) to produce a Bayes factor distribution.

What Bayes factor threshold is needed to achieve a false-negative rate of 1% or less?

Perform the above exercises for model3 vs. model2 and model3 vs. model1. Are all the Bayes factor thresholds for a <1% false-negative rate similar between different models? Why or why not?

Exercise 3.4 - extension

Estimating the false positive and negative rates is not only useful for BXA. The process of null hypothesis significance testing is useful to any form of fitting. Here we will apply the same process from Exercises 3.2 and 3.3 to other model comparison techniques.

From each simulated model best-fit in Exercises 3.2 and 3.3, save the fit statistic (fitstat) on each iteration. Calculate a Δfitstat distribution by subtracting both fitstat values per simulated spectrum. What Δfitstat value is required for type I and II errors of less than 1%?

Compute the Akaike Information Criterion (AIC) (given by AIC = 2×nfree+Cstat ≡ 2×nfree-2×log(Likelihood), where nfree is the number of free fit parameters) for each model in python:

import json
basenames = [outputfiles_basename1, outputfiles_basename2, outputfiles_basename3]
loglikes = dict([(f, json.load(open(f + "/info/results.json"))["maximum_likelihood"]["logl"]) for f in basenames])
nfree = dict([(f, len(json.load(open(f + "/info/results.json"))["paramnames"])) for f in basenames])
aic = dict([(basename, (2. * nfree[basename] - 2. * loglike)) for basename, loglike in loglikes.items()])

What ΔAIC value is required for type I and II errors of less than 1%?

Do the observed distributions for estimating type I and II errors change when considering model2 and model3?

Session 3

Session 3 - model comparison

Exercise 3.1 - Bayes factors

Exercise 3.2 - type I errors

Exercise 3.3 - type II errors

Exercise 3.4 - extension

Previous: Session 2

Next: Session 4

Contact

Raising issues

Email