Charlotte Werndl (LSE) is speaking at Western University on Monday (the talk will be live-streamed) on evidence and climate change modeling. Having recently read her paper (with co-author Katie Steele) entitled “Climate Models, Calibration, and Confirmation” (CMCC) I thought I would post about it. The paper focuses on the use of evidence in confirming climate models with particular attention paid to double counting, which in this context means using the same evidence for two purposes (more on this use of the term later). I believe the paper is an important one, as it nicely separates concerns about double counting from other, related, confirmatory issues, and I think successfully shows where a form of double counting is legitimate. Still, despite being a casual fan of Bayesianism, I wonder if it gives us what we want in this particular circumstance. I can’t cover all the threads of argument made in the paper, so here I’ll simply discuss what double counting is, why we should worry about it, and how Steele and Werndl (S+W) argue that it could be legitimate in some circumstances.
What’s the worry about double counting? Climate change models typically go through a process called tuning (also sometimes called calibration). Tuning sets the values of parameters in the model that represent highly uncertain processes for which there are few empirical observations. The parameters are treated as “free parameters” that can take on a wide range of values. The values that result in the best fit with observations during tuning are the values chosen for the model. For example, if scientists are interested in global mean surface temperature (GMST), they would vary the parameter values of some uncertain processes until the model’s output of GMST closely matched GMST observations. The model, with these parameter values, would then be used to make climate projections.
The worry is that one way climate models are evaluated is by comparing their results to observations of some historical period; if scientists want to know if a model is adequate for the purpose of predicting GMST, they compare the model output to historical GMST observations. This agreement is supposed to build confidence in (confirm) the model’s ability to simulate the desired quantity. It is typically believed that to gain any confidence in the model at all, the simulation output must be compared to a different set of observations than the one that was used for tuning. After all, the observations used for tuning wouldn’t provide any confidence, because the model was designed to agree with them!
To deal with double counting, CMCC adopts an explicitly Bayesian view of confirmation. The Bayesian view adopted is necessarily contrastive and incremental: a model is confirmed only relative to other models, and the result of confirmation is greater confidence in the model for some particular purpose (not a claim that the model is a correct representation or the truth). Confirmation of one model relative to another can be tracked with the likelihood ratio, which is the probability of the evidence conditional on the first model divided by the probability of the evidence conditional on the second model. If the ratio is greater than 1, the first model is confirmed relative to the second.
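For readers who like to see it written out, here is the likelihood ratio in symbols. The notation is mine rather than the paper's: E stands for the evidence, and M_1 and M_2 for the two models being compared.

```latex
\[
\mathrm{LR}(M_1, M_2; E) \;=\; \frac{P(E \mid M_1)}{P(E \mid M_2)},
\qquad
\mathrm{LR} > 1 \;\Longrightarrow\; E \text{ confirms } M_1 \text{ relative to } M_2 .
\]
```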
So here is a simple way in which double counting is legitimate on the Bayesian view presented in CMCC. Imagine tuning some model M whose parameters have not yet been set (S+W call this a base model). In order to tune it, we create several different instances of the base-model, all with different parameter values: M1, M2, and so on. We compare the results of each model instance to observations and select the best fitting instance. This is an example of double counting in the following sense: the same data is used to both confirm and tune the model. This is tuning, because we have selected parameter values by comparing outputs to observations, and it is confirmation, because we have gained greater confidence in one instance over all the other instances in light of the data. S+W call this double-counting 1 and it is fairly uncontroversial.
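To make double-counting 1 concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the one-parameter `toy_model`, the forcing and observation numbers, and the assumption of independent Gaussian errors with a fixed `sigma`. None of it comes from CMCC or from any real climate model. The point is only that a single comparison with the data does both jobs at once.

```python
import math

def toy_model(sensitivity, forcings):
    """Hypothetical base model: simulated anomaly = sensitivity * forcing."""
    return [sensitivity * f for f in forcings]

def log_likelihood(simulated, observed, sigma=0.1):
    """Log of P(evidence | instance), assuming independent Gaussian errors."""
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-0.5 * ((o - s) / sigma) ** 2 - norm
               for s, o in zip(simulated, observed))

# Made-up inputs and "observations" (illustrative only).
forcings = [0.5, 1.0, 1.5, 2.0]
observed = [0.42, 0.79, 1.22, 1.58]

# Instances M1, M2, M3 of the base model differ only in the parameter value.
instances = {"M1": 0.6, "M2": 0.8, "M3": 1.0}
scores = {name: log_likelihood(toy_model(value, forcings), observed)
          for name, value in instances.items()}

# Tuning: pick the instance that fits the observations best.
best = max(scores, key=scores.get)

# Confirmation: likelihood ratio of the best instance against each instance,
# computed from the very same observations.
ratios = {name: math.exp(scores[best] - scores[name]) for name in instances}

print(best, ratios)
```

Every ratio against a rival instance comes out above 1, so the data that picked the parameter value also confirm that instance relative to the others.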
Double-counting 2 seeks to confirm one of two different base-models (M and L, let’s say) relative to the other, but the situation is much the same. The Bayesian apparatus is more complex, and I’ll leave it to my readers to seek out the details in the paper itself. However, the evaluation still deals with likelihood ratios; it is just that the likelihood ratio needs to take into account all the instances of base-models M and L, as well as our prior probabilities regarding them. The likelihood ratio becomes a ratio of two weighted sums, each adding up the probability of the evidence given every instance of one base-model. Double-counting 2 is legitimate in two situations: 1) the average fit with the data for one base-model’s instances is higher than the other model’s (assuming the priors for each model were equal) and/or 2) the base-models have equivalent fit with the observations, but one model had a higher prior probability (was more plausible). An example of (1) would be that base-model M is tuned to the observations, and on average, its instances are closer to the observations than model L’s. This would result in a greater likelihood for M compared to L, and thus confirm M relative to L. Again, even in this situation tuning “can be regarded as the same process as confirmation in the sense that the evidence is used to do both calibration and confirmation simultaneously” (p. 618).
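Here is a rough sketch of how I read the weighted sum, again in Python. This is my simplification, not the paper's formalism: the function name `marginal_likelihood`, the instance likelihoods, and the equal weights are all invented for illustration.

```python
def marginal_likelihood(instance_likelihoods, weights):
    """P(E | base model) as a weighted sum of P(E | instance).

    The weights play the role of prior probabilities over the instances
    of one base model, so they are required to sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, instance_likelihoods))

# Made-up values of P(E | instance) for three instances of each base model.
lik_M = [0.30, 0.20, 0.10]
lik_L = [0.10, 0.05, 0.15]
weights = [1 / 3, 1 / 3, 1 / 3]   # equal weight on every instance (assumption)

p_E_given_M = marginal_likelihood(lik_M, weights)
p_E_given_L = marginal_likelihood(lik_L, weights)

# Situation (1): M's instances fit better on average, so the ratio exceeds 1.
likelihood_ratio = p_E_given_M / p_E_given_L

# Situation (2) works through the priors: with equal fit (ratio of 1),
# unequal prior odds alone would decide which base model ends up more probable.
prior_odds = 1.0                  # equal priors for M and L (assumption)
posterior_odds = prior_odds * likelihood_ratio

print(likelihood_ratio, posterior_odds)
```

With these numbers M's instances fit about twice as well on average, so the evidence confirms M relative to L even though the same evidence would also be used to pick M's best instance.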
S+W do a great job distinguishing the two kinds of double counting and separating them from other concerns about model tuning and climate projections (this work is done in the second half of the paper, not discussed here). They seem right, given the view of confirmation they hold, that confirmation and tuning can be done with the same evidence. After all, double counting in S+W’s sense is a sophisticated way of saying that the model that fits the data best is more likely.
A few issues worth thinking about:
1) Double counting here is a bit of a misnomer.…