## Session 3 - model comparison

Rather than considering models in isolation, how do different models' fits to the same data compare? This session considers ways to quantify the quality of a model fit relative to others.

Key objectives:
• Use the Bayesian evidence to perform model comparison
• Quantify how often the more complex model could be selected incorrectly
• Quantify how often the simpler model could be selected incorrectly

### Exercise 3.1 - Bayes factors

Here we use the Bayesian evidence to perform model comparison with Bayes factors.
• Copy the `model_compare.py` script to your working directory

• Run the script:
`$ python3 model_compare.py outputfiles_basename1/ outputfiles_basename2/ outputfiles_basename3/`

• Which model is preferred by the Bayes factors? Try changing the `limit` value in model_compare.py from 30 to 100. Has the preferred model order changed? Why or why not?
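Recall that the Bayes factor between two models is the ratio of their evidences, so it follows directly from the difference of the reported log-evidence values. A minimal sketch (the `logZ` numbers below are made up for illustration, and we assume natural-log evidences):

```python
import numpy as np

# Hypothetical log-evidence values as reported for each model:
logZ = {"model1": -1520.4, "model2": -1516.9, "model3": -1517.2}

# ln Bayes factor of model2 over model1: ln K = ln Z2 - ln Z1
lnK_21 = logZ["model2"] - logZ["model1"]
K_21 = np.exp(lnK_21)

print(f"ln K(model2/model1) = {lnK_21:.2f}, K = {K_21:.1f}")
```

A larger `ln K` means stronger preference for the model in the numerator; the `limit` parameter in `model_compare.py` sets the threshold on this ratio.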

• Temporarily try including a `cabs` component in `model3`:
• PyXspec:
```
model4 = Model("TBabs*zTBabs*cabs(zpowerlw+zgauss)")
model4.TBabs.nH.values = (0.04, -1.)  ## Milky Way absorption
model4.zTBabs.Redshift.values = (0.05, -1.)
model4.zTBabs.nH.values = (1., 0.01, 0.01, 0.01, 100., 100.)
model4.cabs.nH.link = "p%d" %(model4.zTBabs.nH.index)
model4.zpowerlw.PhoIndex.values = (1.8, 0.1, -3., -2., 9., 10.)
model4.zpowerlw.Redshift.link = "p%d" %(model4.zTBabs.Redshift.index)
model4.zpowerlw.norm.values = (1.e-4, 0.01, 1.e-10, 1.e-10, 1., 1.)
model4.zgauss.LineE.values = (6.4, 0.1, 6., 6., 7.2, 7.2)
model4.zgauss.Sigma.values = (1.e-3, 0.01, 1.e-3, 1.e-3, 1., 1.)
model4.zgauss.Redshift.link = "p%d" %(model4.zTBabs.Redshift.index)
model4.zgauss.norm.values = (1.e-5, 0.01, 1.e-10, 1.e-10, 1., 1.)
```

• How does the Bayesian evidence from fitting the new `model4` to the data compare to that of the previous models?

### Exercise 3.2 - type I errors

How can we know what Bayes factor threshold to use? Simulations! Here we will calculate the false-positive rate associated with our data and models to tell us how often a more complex model is selected when the simpler model was actually correct.

• First we will consider the false-positive rate between the simpler `model1` and more complex `model2` models.

• Simulate 100 NuSTAR spectra from the best-fit `model1` spectrum using the original spectral files in each simulation and bin the simulated spectra in the same manner as the original data.

• Using BXA, fit both `model1` and `model2` to each of the 100 simulated spectra. Make sure to have unique `outputfiles_basename` values!
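Each BXA run writes its output files under `outputfiles_basename`, so reusing a name would overwrite (or resume from) a previous fit. One way to keep the names unique is to encode the simulation index and model name in the basename; the sketch below only constructs the names, with the BXA solver calls left as comments since they depend on your own setup:

```python
# Collect every basename so we can check that none collide.
basenames = []
for i in range(100):
    for model_name in ("model1", "model2"):
        basename = f"sim{i:03d}_{model_name}/"
        basenames.append(basename)
        # solver = BXASolver(..., outputfiles_basename=basename)
        # results = solver.run(resume=True)
```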

• On each iteration, save the `logZ` values:
```
results = solver.run(resume=True)
logZ = results["logZ"]
```

• Use your two distributions of `logZ` (one distribution for fitting `model1`-simulated data with `model1`, the other from fitting with `model2`) to produce a Bayes factor distribution.

• What Bayes factor threshold is needed to achieve a false-positive rate of 1% or less?
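Since `model1` is true in these simulations, the false-positive rate at a given threshold is the fraction of simulations whose ln Bayes factor (favouring `model2`) exceeds it, so the 1% threshold is the 99th percentile of that distribution. A sketch using synthetic numbers as stand-ins for your fitted `logZ` values:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-ins for the fits to the 100 model1-simulated spectra:
logZ_m1 = logZ_fit_with_model1 = rng.normal(-1520.0, 1.0, size=100)
logZ_m2 = logZ_m1 + rng.normal(0.0, 0.8, size=100)  # fit with model2

lnB = logZ_m2 - logZ_m1  # ln Bayes factor, model2 over model1

# Threshold for a <=1% false-positive rate: only 1% of the
# null (model1-true) simulations exceed it.
threshold = np.percentile(lnB, 99.0)
fpr = np.mean(lnB > threshold)
print(f"ln B threshold = {threshold:.2f}, FPR = {fpr:.2%}")
```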

• Perform the above exercises for `model2` vs. `model3` and `model1` vs. `model3`. Are all the Bayes factor thresholds for a <1% false-positive rate similar between different models? Why or why not?

### Exercise 3.3 - type II errors

Here we will calculate the false-negative rate associated with our data and models to tell us how often a less complex model is selected when the more complex model was actually correct.

• First we will consider the false-negative rate between the more complex `model2` and simpler `model1` models.

• Simulate 100 NuSTAR spectra from the best-fit `model2` spectrum using the original spectral files in each simulation and bin the simulated spectra in the same manner as the original data.

• Using BXA, fit both `model1` and `model2` to each of the 100 simulated spectra. Make sure to have unique `outputfiles_basename` values!

• On each iteration, save the `logZ` values:
```
results = solver.run(resume=True)
logZ = results["logZ"]
```

• Use your two distributions of `logZ` (one distribution for fitting `model2`-simulated data with `model1`, the other from fitting with `model2`) to produce a Bayes factor distribution.

• What Bayes factor threshold is needed to achieve a false-negative rate of 1% or less?
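Given a Bayes factor threshold (for example, the one derived in Exercise 3.2), the false-negative rate is the fraction of `model2`-simulated spectra whose ln Bayes factor falls below it, i.e. where the simpler model would be chosen even though the complex one is true. A sketch with synthetic stand-ins for the fitted values (the threshold here is a hypothetical placeholder):

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-ins: when model2 is true, fitting with model2 should gain evidence.
logZ_m1 = rng.normal(-1525.0, 1.0, size=100)        # fit with model1
logZ_m2 = logZ_m1 + rng.normal(5.0, 1.5, size=100)  # fit with model2

lnB = logZ_m2 - logZ_m1  # ln Bayes factor, model2 over model1

threshold = 2.0  # hypothetical value carried over from Exercise 3.2
fnr = np.mean(lnB < threshold)
print(f"false-negative rate at ln B = {threshold}: {fnr:.2%}")
```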

• Perform the above exercises for `model3` vs. `model2` and `model3` vs. `model1`. Are all the Bayes factor thresholds for a <1% false-negative rate similar between different models? Why or why not?

### Exercise 3.4 - extension

Estimating the false-positive and false-negative rates is not only useful for BXA: this process of null hypothesis significance testing applies to any form of fitting. Here we will apply the same process from Exercises 3.2 and 3.3 to other model comparison techniques.

• From each simulated model best-fit in Exercises 3.2 and 3.3, save the fit statistic (`fitstat`) on each iteration. Calculate a `Δfitstat` distribution by subtracting the two models' `fitstat` values for each simulated spectrum. What `Δfitstat` value is required for type I and II errors of less than 1%?
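The same percentile logic from Exercise 3.2 applies to the fit statistic: the type I threshold is the `Δfitstat` exceeded by only 1% of the null (simpler-model-true) simulations. A sketch with synthetic stand-ins for the saved statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for fit statistics from each simulated spectrum:
fitstat_simple = rng.normal(1000.0, 10.0, size=100)
# The complex model fits at least as well, so its statistic is lower;
# under the null its improvement is roughly chi-square distributed.
fitstat_complex = fitstat_simple - rng.chisquare(df=1, size=100)

dstat = fitstat_simple - fitstat_complex  # improvement from extra component

# Type I threshold: Delta-fitstat exceeded by only 1% of null simulations.
threshold = np.percentile(dstat, 99.0)
print(f"Delta-fitstat threshold for <=1% type I error: {threshold:.2f}")
```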

• Compute the Akaike Information Criterion (AIC) (given by `AIC = 2×nfree + Cstat ≡ 2×nfree - 2×log(Likelihood)`, where `nfree` is the number of free fit parameters) for each model in Python:
```
import json

basenames = [outputfiles_basename1, outputfiles_basename2, outputfiles_basename3]
results = {f: json.load(open(f + "/info/results.json")) for f in basenames}
loglikes = {f: results[f]["maximum_likelihood"]["logl"] for f in basenames}
nfree = {f: len(results[f]["paramnames"]) for f in basenames}
aic = {f: 2. * nfree[f] - 2. * loglikes[f] for f in basenames}
```

• What ΔAIC value is required for type I and II errors of less than 1%?

• Do the observed distributions for estimating type I and II errors change when considering `model2` and `model3`?