Twice Is Nice! Double counting evidence in climate model confirmation

Charlotte Werndl (LSE) is speaking at Western University on Monday (the talk will be live-streamed) on evidence and climate change modeling. Having recently read her paper (with co-author Katie Steele) entitled “Climate Models, Calibration, and Confirmation” (CMCC) I thought I would post about it. The paper focuses on the use of evidence in confirming climate models with particular attention paid to double counting, which in this context means using the same evidence for two purposes (more on this use of the term later). I believe the paper is an important one, as it nicely separates concerns about double counting from other, related, confirmatory issues, and I think successfully shows where a form of double counting is legitimate. Still, despite being a casual fan of Bayesianism, I wonder if it gives us what we want in this particular circumstance. I can’t cover all the threads of argument made in the paper, so here I’ll simply discuss what double counting is, why we should worry about it, and how Steele and Werndl (S+W) argue that it could be legitimate in some circumstances.

What’s the worry about double counting? Climate change models typically go through a process called tuning (also sometimes called calibration). Tuning sets the values of parameters in the model that represent highly uncertain processes for which there are few empirical observations. The parameters are treated as “free parameters” that can take on a wide range of values. The values that result in the best fit with observations during tuning are the values chosen for the model. For example, if scientists are interested in global mean surface temperature (GMST), they would vary the parameter values of some uncertain processes until the model’s output of GMST closely matched GMST observations. The model, with these parameter values, would then be used to make climate projections.
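To make the tuning idea concrete, here is a toy sketch of my own (not from the paper, and not a real climate model): a stand-in "model" with one free parameter is run over a range of candidate values, and the value whose output best fits the observations is kept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observations": a warming trend of 0.02 per year plus noise, over 20 years.
years = np.arange(20)
observations = 0.02 * years + rng.normal(0, 0.05, 20)

def toy_model(sensitivity):
    """A stand-in for a climate model: the simulated trend scales with one free parameter."""
    return sensitivity * 0.01 * years

# Tuning: sweep the free parameter and keep the value that best matches observations.
candidates = np.linspace(0.5, 4.0, 50)
rmse = [np.sqrt(np.mean((toy_model(s) - observations) ** 2)) for s in candidates]
best = candidates[int(np.argmin(rmse))]
print(f"tuned parameter value: {best:.2f}")
```

Since the synthetic trend is 0.02 and the model's output is 0.01 times the parameter, tuning recovers a value near 2 here; a real tuning exercise varies many parameters at once against many observational targets.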

The worry is that one way climate models are evaluated is by comparing their results to observations of some historical period; if scientists want to know if a model is adequate for the purpose of predicting GMST, they compare the model output to historical GMST observations. This agreement is supposed to build confidence in (confirm) the model’s ability to simulate the desired quantity. It is typically believed that to gain any confidence in the model at all, the simulation output must be compared to a different set of observations than the one that was used for tuning. After all, the observations used for tuning wouldn’t provide any confidence, because the model was designed to agree with them!

To deal with double counting, CMCC adopts an explicitly Bayesian view of confirmation. The Bayesian view adopted is necessarily contrastive and incremental: a model is confirmed only relative to other models, and the result of confirmation is greater confidence in the model for some particular purpose (not a claim that the model is a correct representation or the truth). Confirmation of one model relative to another can be tracked with the likelihood ratio, which is the probability of the evidence conditional on the first model divided by the probability of the evidence conditional on the second model. If the ratio is greater than 1, the first model is confirmed relative to the second.
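As a numerical illustration of the likelihood ratio (my own sketch with made-up numbers, not from the paper): if we treat the discrepancies between data and model output as independent Gaussian errors, each model's likelihood is a product of Gaussian densities, and the ratio of the two likelihoods tracks which model the evidence favors.

```python
import math

def gaussian_likelihood(data, predictions, sigma=0.1):
    """P(data | model), assuming independent Gaussian errors around the model's predictions."""
    return math.prod(
        math.exp(-((d - p) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
        for d, p in zip(data, predictions)
    )

data = [0.1, 0.2, 0.3]
model1_predictions = [0.11, 0.19, 0.31]  # close to the data
model2_predictions = [0.30, 0.10, 0.20]  # farther from the data

ratio = gaussian_likelihood(data, model1_predictions) / gaussian_likelihood(data, model2_predictions)
print(ratio > 1)  # the evidence confirms model 1 relative to model 2
```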

So here is a simple way in which double counting is legitimate on the Bayesian view presented in CMCC. Imagine tuning some model M whose parameters have not yet been set (S+W call this a base model). In order to tune it, we create several different instances of the base-model, all with different parameter values: M1, M2, and so on. We compare the results of each model instance to observations and select the best fitting instance. This is an example of double counting in the following sense: the same data is used to both confirm and tune the model. This is tuning, because we have selected parameter values by comparing outputs to observations, and it is confirmation, because we have gained greater confidence in one instance over all the other instances in light of the data. S+W call this double-counting 1 and it is fairly uncontroversial.
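The "tuning is confirmation" point can be put in code (again a toy sketch of my own): scoring each instance of the base model by its likelihood and selecting the best scorer is a single operation, and it is simultaneously tuning (a parameter value is chosen) and confirmation (that instance gains confidence relative to its siblings).

```python
import numpy as np

observations = np.array([0.10, 0.25, 0.33, 0.41])

def instance(theta):
    """One instance of base-model M: predictions determined by the parameter theta."""
    return theta * np.array([0.1, 0.2, 0.3, 0.4])

def log_likelihood(theta, sigma=0.05):
    """Log of P(observations | instance), with independent Gaussian errors."""
    residuals = observations - instance(theta)
    return -np.sum(residuals**2) / (2 * sigma**2)

thetas = [0.5, 1.0, 1.5]  # parameter values defining instances M1, M2, M3
scores = [log_likelihood(t) for t in thetas]

# Double-counting 1: choosing the best-fitting instance (tuning) is the same act
# as confirming that instance relative to the other instances, on the same data.
best = thetas[int(np.argmax(scores))]
print(f"selected instance: theta = {best}")
```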

Double-counting 2 seeks to confirm two different base-models (M and L, let’s say), but the situation is much the same. The Bayesian apparatus is more complex, and I’ll leave it to my readers to seek out the details in the paper itself. However, the evaluation still deals with likelihood ratios; it is just that the likelihood ratio needs to take into account all the instances of base-models M and L, as well as our prior probabilities regarding them. The likelihood ratio becomes a ratio of weighted sums: for each base-model, the probability of the evidence given each of its instances, weighted by the priors over those instances. Double-counting 2 is legitimate in two situations: 1) the average fit with the data for one base-model’s instances is higher than the other model’s (assuming the priors for each model were equal), and/or 2) the base-models have equivalent fit with the observations, but one model had a higher prior probability (was more plausible). An example of (1) would be that base-model M is tuned to the observations and, on average, its instances are closer to the observations than model L’s. This would result in a greater likelihood for M compared to L, and thus confirm M relative to L. Again, even in this situation tuning “can be regarded as the same process as confirmation in the sense that the evidence is used to do both calibration and confirmation simultaneously” (p618).
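The weighted-sum idea can be sketched as follows (my own illustration with hypothetical numbers, not the paper's worked example): each base-model's likelihood is the sum of its instances' likelihoods weighted by the priors over those instances, and the ratio of these sums compares the two base-models.

```python
def base_model_likelihood(instance_likelihoods, instance_priors):
    """P(E | base model) = sum_i P(E | instance_i) * P(instance_i | base model)."""
    return sum(l * p for l, p in zip(instance_likelihoods, instance_priors))

# Hypothetical likelihoods P(E | instance) for three instances each of M and L.
m_likelihoods = [0.40, 0.20, 0.10]
l_likelihoods = [0.15, 0.10, 0.05]
uniform = [1 / 3] * 3  # equal priors over each base-model's instances

ratio = base_model_likelihood(m_likelihoods, uniform) / base_model_likelihood(l_likelihoods, uniform)
print(ratio > 1)  # M's instances fit better on average, so E confirms M relative to L
```

This corresponds to situation (1) above: with equal priors, the base-model whose instances fit better on average comes out with the larger weighted sum.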

Quick Comments

S+W do a great job distinguishing the two kinds of double counting and separating them from other concerns about model tuning and climate projections (this work is done in the second half of the paper, not discussed here). They seem right, given the view of confirmation they hold, that confirmation and tuning can be done with the same evidence. After all, double counting in S+W’s sense is a sophisticated way of saying that the model that fits the data best is more likely.

A few issues worth thinking about:

1) Double counting here is a bit of a misnomer. As S+W make clear, it is not that the data is counted twice; rather, the same data is used for two different purposes.
2) Confirmation for S+W, between two different models, is always confirmation of the model family (all the instances of the base model). It is not clear to me that this is always the desired object of confirmation. Sometimes we might want to confirm one instance of a base model against another instance of a base model (the two best-performing instances, let’s say), and as specified in the paper, confirmation via double counting isn’t set up for that.
3) Climate scientists seem to think of confirmation in absolute terms. In S+W’s scheme, this would be confirming a base-model M relative to its complement (all other models that are not M). This is considered on p628. Double counting doesn’t help us here – in order to confirm in this way, we need to know the prior probabilities for all the non-M models. Since we think there are lots of models that we haven’t considered, and we don’t know how many, this is difficult if not impossible to quantify. Double counting, though legitimate in this area, isn’t a remedy for our ills.
4) How applicable is the Bayesian framework in these instances? I haven’t read all his work, but Joel Katzav argues that it isn’t reliable when it comes to this kind of modeling. One reason (at least I gather this is the argument) is that the conditional probabilities we assign in the Bayesian scheme are conditional on the truth of the model, but we know that the models are not true (because they contain gross idealizations, simplifications, and parameterizations). Thus we can’t, or shouldn’t, make those assignments. Perhaps the comparative nature of S+W’s confirmation can sidestep this? If any readers have insight on how this might work, or on whether it is really a problem, please post in the comments.


  • Charlotte Werndl

    Dear Greg,

    thanks for the nice post!

    One comment about your final question: Suppose you think (as many do) that what the Bayesian machinery tests is whether some hypotheses are true. Then it is important to note that we assume in our paper that there is structural error. That is, the hypothesis here is *not* that a particular climate model is true, but that a particular model *plus* some structural error is true. In this way, what one claims to be true can be quite weak. E.g., you can add error bars that just require that the data points have to fall within [A-E, A+E], where A is the value predicted by the climate model, and E can be as large as you deem to be relevant/interesting (these error bars correspond to what some climate scientists do in their papers).
