Rather than considering models in isolation, how do different models' fits to the same data compare? This session considers ways to quantify the quality of a model fit relative to others.
Key objectives:
Key objectives:
 Use the Bayesian evidence to perform model comparison
 Quantify how often the more complex model could be selected incorrectly
 Quantify how often the simpler model could be selected incorrectly
Exercise 3.1  Bayes factors
Here we use the Bayesian evidence to perform model comparison with Bayes factors. Copy the model_compare.py script to your working directory to perform Bayesian evidence model comparison
 Run the script:
$ python3 model_compare.py outputfiles_basename1/ outputfiles_basename2/ outputfiles_basename3/
 Which model is preferred by the Bayes factors? Try changing the
limit
value in model_compare.py from 30 to 100. Has the preferred model order changed? Why or why not?  Temporarily try including a
cabs
component inmodel3
:
 PyXspec:
model4 = Model("TBabs*zTBabs*cabs(zpowerlw+zgauss)") model4.TBabs.nH.values = (0.04, 1.) ## Milky Way absorption model4.zTBabs.Redshift.values = (0.05, 1.) model4.zTBabs.nH.values = (1., 0.01, 0.01, 0.01, 100., 100.) model4.cabs.nH.link = "p%d" %(model4.zTBabs.nH.index) model4.zpowerlw.PhoIndex = (1.8, 0.1, 3., 2., 9., 10.) model4.zpowerlw.Redshift.link = "p%d" %(model4.zTBabs.Redshift.index) model4.zpowerlw.norm = (1.e4, 0.01, 1.e10, 1.e10, 1., 1.) model4.zgauss.LineE.values = (6.4, 0.1, 6., 6., 7.2, 7.2) model4.zgauss.Sigma.values = (1.e3, 0.01, 1.e3, 1.e3, 1., 1.) model4.zgauss.Redshift.link = "p%d" %(model4.zTBabs.Redshift.index) model4.zgauss.norm.values = (1.e5, 0.01, 1.e10, 1.e10, 1., 1.)
 PyXspec:
 How does the Bayesian Evidence from fitting the new
model4
to the data compare to the previous models?
Exercise 3.2  type I errors
How can we know what Bayes factor threshold to use? Simulations! Here we will calculate the falsepositive rate associated with our data and models to tell us how often a more complex model is selected when the simpler model was actually correct.
 First we will consider the falsepositive rate between the simpler
model1
and more complexmodel2
models.  Simulate 100 NuSTAR spectra from the bestfit
model1
spectrum using the original spectral files in each simulation and bin the simulated spectra in the same manner as the original data.  Using BXA, fit both
model1
andmodel2
to each of the 100 simulated spectra. Make sure to have uniqueoutputfiles_basename
values!  On each iteration, save the
logZ
values:
results = solver.run(resume=True) logZ = results["logZ"]
 Use your two distributions of
logZ
(one distribution for fittingmodel1
simulated data withmodel1
, the other from fitting withmodel2
) to produce a Bayes factor distribution.  What Bayes factor threshold is needed to achieve a falsepositive rate of 1% or less?

Perform the above exercises for
model2
vs.model3
andmodel1
vs.model3
. Are all the Bayes factor thresholds for a <1% falsepositive rate similar between different models? Why or why not?
Exercise 3.3  type II errors
Here we will calculate the falsenegative rate associated with our data and models to tell us how often a less complex model is selected when the more complex model was actually correct.
 First we will consider the falsenegative rate between the more complex
model2
and simplermodel1
models.  Simulate 100 NuSTAR spectra from the bestfit
model2
spectrum using the original spectral files in each simulation and bin the simulated spectra in the same manner as the original data.  Using BXA, fit both
model1
andmodel2
to each of the 100 simulated spectra. Make sure to have uniqueoutputfiles_basename
values!  On each iteration, save the
logZ
values:
results = solver.run(resume=True) logZ = results["logZ"]
 Use your two distributions of
logZ
(one distribution for fittingmodel2
simulated data withmodel1
, the other from fitting withmodel2
) to produce a Bayes factor distribution.  What Bayes factor threshold is needed to achieve a falsenegative rate of 1% or less?

Perform the above exercises for
model3
vs.model2
andmodel3
vs.model1
. Are all the Bayes factor thresholds for a <1% falsenegative rate similar between different models? Why or why not?
Estimating the false positive and negative rates is not only useful for BXA. The process of null hypothesis significance testing is useful to any form of fitting. Here we will apply the same process from Exercises 3.2 and 3.3 to other model comparison techniques.
 From each simulated model bestfit in Exercises 3.2 and 3.3, save the fit statistic (
fitstat
) on each iteration. Calculate aΔfitstat
distribution by subtracting bothfitstat
values per simulated spectrum. WhatΔfitstat
value is required for type I and II errors of less than 1%?  Compute the Akaike Information Criterion (AIC) (given by
AIC = 2×nfree+Cstat ≡ 2×nfree2×log(Likelihood)
, wherenfree
is the number of free fit parameters) for each model in python:
import json basenames = [outputfiles_basename1, outputfiles_basename2, outputfiles_basename3] loglikes = dict([(f, json.load(open(f + "/info/results.json"))["maximum_likelihood"]["logl"]) for f in basenames]) nfree = dict([(f, len(json.load(open(f + "/info/results.json"))["paramnames"])) for f in basenames]) aic = dict([(basename, (2. * nfree[basename]  2. * loglike)) for basename, loglike in loglikes.items()])
 What ΔAIC value is required for type I and II errors of less than 1%?
 Do the observed distributions for estimating type I and II errors change when considering
model2
andmodel3
?