cc: Phil Jones, carl mears, Karl Taylor, Tom Wigley, "Thorne, Peter", Steven Sherwood, John Lanzante, Melissa Free, Frank Wentz, Steve Klein, Leopold Haimberger, peter gleckler
date: Thu, 06 Dec 2007 10:48:26 -0800
from: Ben Santer
subject: Re: [Fwd: sorry to take your time up, but really do need a scrub
to: Dian Seidel

Dear Dian,

Thanks very much for your email. I agree that the problem of partitioning a multi-model ensemble into groups that are "more reliable" and "less reliable" for some specific application is a difficult one.

Our recent PNAS water vapor paper provided much of the motivation for our new work on "model culling". The PNAS paper was our first attempt to use data from a large multi-model ensemble in a formal, pattern-based detection and attribution (D&A) study. It was obvious from this initial research that some models systematically underestimated the amplitude of month-to-month and year-to-year variability in <Wo> (the spatial average of total column water vapor over near-global oceans; see Figures 3A and 3B of our PNAS paper). Other models overestimated the amplitude of <Wo> variability by a factor of 2-3. Since model-based noise estimates from long control integrations were an integral part of our D&A study (and of any D&A study), the model variability errors led to persistent questions about the sensitivity of D&A results to "model quality". At almost every presentation I gave on this work, I was asked, "What happens if you repeat the entire D&A analysis with a subset of the 22 original models - a subset which performs better in simulating the observed low-frequency variability of <Wo>? Do you still obtain detection of an anthropogenic fingerprint in the SSM/I data?"

This is a tough question to answer, particularly given the short length (20 years) of the SSM/I water vapor data. However, we do know that there's a fairly tight coupling (at least over fairly large space and time scales) between variability in tropical SSTs and variability in total column water vapor (see Fig. 3C in our PNAS paper). So in our "model culling" work, we've looked at BOTH SST and water vapor data - not at water vapor data only. Given the availability of observational SST records that are significantly longer than the SSM/I water vapor record, we've been able to look at the fidelity with which models simulate the observed decadal-timescale SST variability (both in terms of amplitude and pattern) in a variety of different regions (AMO, PDO, tropical oceans, various NINO regions, etc.). We've also considered how well models simulate the observed climatological annual-mean patterns of SST and water vapor, as well as the spatial patterns of the climatological seasonal cycle.

As you noted, the variability information is not totally orthogonal to information about the amplitude of the response to anthropogenic forcing (see, e.g., an early paper on this subject by Tom Wigley and Sarah Raper in Nature in 1990). But it's clearly important to include some form of variability information in the "model culling" exercise, since we are using model-based natural variability estimates to determine the statistical significance of D&A results. Use of information on the mean state alone would be inadequate: some AR4 models still use flux correction, and so do a relatively good job in capturing the mean state of SST and water vapor, but have significant problems in their representation of variability. What we are finding so far is that there are "horses for courses".
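As an illustration of the variability-amplitude test described above, a minimal Python sketch might look like the following. The function names, the use of plain NumPy arrays of monthly <Wo> anomalies, and the synthetic data at the end are assumptions made purely for illustration; this is not the actual metrics code used for the PNAS study or the culling exercise.

import numpy as np

def variability_ratio(model_anom, obs_anom):
    """Ratio of simulated to observed temporal standard deviation of
    <Wo> anomalies: values near 1 mean realistic variability amplitude,
    values well below/above 1 mean under-/over-estimated variability."""
    return np.std(model_anom, ddof=1) / np.std(obs_anom, ddof=1)

def rank_models(model_anoms, obs_anom):
    """Rank models by how far their variability ratio departs from 1.

    model_anoms : dict mapping model name -> 1-D array of monthly <Wo>
                  anomalies (near-global ocean mean column water vapor)
    obs_anom    : 1-D array of observed (e.g. SSM/I-based) anomalies
    """
    scores = {name: abs(np.log(variability_ratio(anom, obs_anom)))
              for name, anom in model_anoms.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])  # best first

# Synthetic example: one model with too little variability, one with too
# much (roughly the factor-of-2 errors mentioned above).
rng = np.random.default_rng(0)
obs = rng.normal(0.0, 1.0, 240)                 # 20 years of monthly anomalies
models = {"model_A": rng.normal(0.0, 0.5, 240),
          "model_B": rng.normal(0.0, 2.0, 240)}
for name, score in rank_models(models, obs):
    print(name, round(score, 2))

Taking the absolute value of the log of the ratio treats a factor-of-two underestimate and a factor-of-two overestimate of the observed variability as equally bad, which matches the kind of amplitude errors described above.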
As I mentioned yesterday, there is no individual model that does well in all of the SST and water vapor tests we've applied. If we looked at a completely different application, as you suggested (such as a D&A study involving storm tracks or cloud cover), I strongly suspect we'd arrive at a very different model ranking.

With best regards,

Ben

Dian Seidel wrote:
> Hello Ben and Colleagues,
>
> I've been following these exchanges with interest. One particular point in your message below is a little puzzling to me. That's the issue of trying to avoid circularity in the culling of models for any given D&A study.
>
> Two potential problems occur to me. One is that choosing models on the basis of their fidelity to observed regional and short-term variability may not be completely orthogonal to choosing based on long-term trend. That's because those smaller scale changes may contribute to the trends and their patterns. Second, choosing a different set of models for one variable (temperature) than for another (humidity) seems highly problematic. If we are interested in projections of other variables, e.g. storm tracks or cloud cover, for which D&A has not been done, which group of models would we then deem to be most credible? I don't have a good alternative to propose, but, in light of these considerations, maybe one-model-one-vote doesn't appear so unreasonable after all.
>
> With regards,
> Dian
>
> Ben Santer wrote:
>> Dear Phil,
>>
>> Just a quick response to the issue of "model weighting" which you and Carl raised in your emails.
>>
>> We recently published a paper dealing with the identification of an anthropogenic fingerprint in SSM/I-based estimates of total column water vapor changes. This was a true multi-model detection and attribution ("D&A") study, which made use of results from 22 different A/OGCMs for fingerprint and noise estimation. Together with Peter Gleckler and Karl Taylor, I'm now in the process of repeating our water vapor D&A study using a subset of the original 22 models. This subset will comprise 10-12 models which are demonstrably more successful in capturing features of the observed mean state and variability of water vapor and SST - particularly features crucial to the D&A problem (such as the low-frequency variability). We've had fun computing a whole range of metrics that might be used to define such a subset of "better" models. The ultimate goal is to determine the sensitivity of our water vapor D&A results to model quality. I think that this kind of analysis will be unavoidable in the multi-model world in which we now live. Given substantial inter-model differences in simulation quality, "one model, one vote" is probably not the best policy for D&A work!
>>
>> Once we've used Carl's method to calculate synthetic MSU temperatures from the IPCC AR4 20c3m data (as described in my previous email), it should be relatively easy to do a similar "model culling" exercise with MSU T2, T4, and TLT. In fact, this is what we had already planned to do in collaboration with Carl and Frank.
>>
>> One key point in any model weighting or selection strategy is to avoid circularity. In the D&A context, it would be impermissible to include information on trend behavior as a criterion used for selecting "better" models.
>> Likewise, if our interest is in assessing the statistical significance of model-versus-observed trend differences, we can't use model performance in simulating "observed" tropospheric or stratospheric trends (whatever those might be!) as a means of identifying more credible models.
>>
>> A further issue, of course, is that we are relying on results from fully coupled A/OGCMs, and are making trend comparisons over relatively short periods (several decades). On these short timescales, estimates of the "true" trend in response to the applied 20c3m forcings are quite sensitive to natural variability noise (as Peter Thorne's 2007 GRL paper clearly illustrates). Because of such chaotic variability, even a hypothetical model with perfect physics and forcings would yield a distribution of tropospheric temperature trends over 1979 to 1999, some of which would show larger or smaller warming than observed. This is why it's illogical to stratify model results according to correspondence between modeled and observed surface warming - something which John Christy is very fond of doing.
>>
>> What we've done (in the new water vapor work described above) is to evaluate the fidelity with which the AR4 models simulate the observed mean state and variability of precipitable water and SST - not the trends in these quantities. We've looked at model performance in a variety of different regions, and on multiple timescales. The results are fascinating, and show (at least for water vapor and SST) that every model has its own individual strengths and weaknesses. It is difficult to identify a subset of models that CONSISTENTLY does well in many different regions and over a range of different timescales.
>>
>> My guess is that we would obtain somewhat different results for MSU temperatures - particularly for comparisons involving variability. Clearly, the absence of volcanic forcing in roughly half of the 20c3m experiments will have a large impact on the estimated variability of synthetic T4 temperatures (and perhaps even on T2), and hence on model-versus-data variability comparisons. It's also quite possible that the inclusion or absence of volcanic forcing has an impact not only on the amplitude of the variability of global-mean T4 anomalies, but also on the pattern of T4 variability. So model ranking exercises based on performance in simulating the mean state and variability of T4 and T2 may show some connection to the presence or absence of volcanic/ozone forcing.
>>
>> The sad thing is we are being distracted from doing this fun stuff by the need to respond to Douglass et al. That's a real shame.
>>
>> With best regards,
>>
>> Ben
>>
>> Phil Jones wrote:
>>> All,
>>> IJC do have comments but only very rarely. I see little point in doing this, as there is likely to be a word limit, and if the system works properly Douglass et al would get the final say. There is also a large backlog of papers waiting to appear, so even if the comment were accepted it would be some time after Douglass et al that it would appear.
>>> Better would be a submission to another journal (JGR?), which would be quicker. This could go in before Douglass et al appeared in print - it should be in the IJC early online view fairly soon based on recent experiences.
>>> A paper pointing out the issues of trying to weight models in some way would be very beneficial to the community.
>>> AR5 will have to go down this route at some point. How models simulate the recent trends at the surface and in the troposphere/stratosphere, and how they might be ranked, is a possibility. This could bring in the new work Peter alludes to with the sondes.
>>> There are also some aspects of recent surface T changes that could be discussed as well. These relate to the growing dominance of buoy SSTs (now 70% of the total) vs. conventional ships. There is a paper in J. Climate accepted from Smith/Reynolds et al at NCDC, which shows that buoys could conceivably be cooler than ship-based SSTs by about 0.1C - meaning that the last 5-10 years are being gradually underestimated over the oceans. The overlap is still too short to be confident about this, but it highlights a major systematic change occurring in surface ocean measurements. As the buoys are presumably better for absolute SSTs, this means models driven with fixed SSTs should be using fields that are marginally cooler.
>>>
>>> And then there is the continual reference to Kalnay and Cai, when Simmons et al (2004) have shown the problems with NCEP. It is possible to add in the ERA-Interim analyses and operational analyses to bring results from ERA-40 up to date.
>>>
>>> Cheers
>>> Phil
>>>
>>> At 23:40 04/12/2007, carl mears wrote:
>>>> Karl -- thanks for clarifying what I was trying to say
>>>>
>>>> Some further comments.....
>>>>
>>>> At 02:53 PM 12/4/2007, Karl Taylor wrote:
>>>>> Dear all,
>>>>> 2) unforced variability hasn't dominated the observations.
>>>>
>>>> But on this short time scale, we strongly suspect that it has dominated. For example, the 2-sigma error bars from Table 3.4 of the CCSP report for satellite TLT are 0.18 (UAH) or 0.19 (RSS), larger than either group's trends (0.05, 0.15) for 1979-2004. These were calculated using a "goodness of linear fit" criterion, corrected for autocorrelation. This is probably a reasonable estimate of the contribution of unforced variability to trend uncertainty.
>>>>
>>>>> Douglass et al. have *not* shown that every individual model is in fact inconsistent with the observations. If the spread of individual model results is large enough and at least 1 model overlaps the observations, then one cannot claim that all models are wrong, just that the mean is biased.
>>>>
>>>> Given the magnitude of the unforced variability, I would say "the mean *may* be biased." You can't prove this with only one universe, as Tom alluded to. All we can say is that the observed trend cannot be proven to be inconsistent with the model results, since it is inside their range.
>>>>
>>>> It will be interesting to see if we can say anything more when we start culling out the less realistic models, as Ben has suggested.
>>>>
>>>> -Carl
>>>
>>> Prof. Phil Jones
>>> Climatic Research Unit                  Telephone +44 (0) 1603 592090
>>> School of Environmental Sciences        Fax +44 (0) 1603 507784
>>> University of East Anglia
>>> Norwich                                 Email p.jones@uea.ac.uk
>>> NR4 7TJ
>>> UK
>>> ----------------------------------------------------------------------------

--
----------------------------------------------------------------------------
Benjamin D. Santer
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
P.O. Box 808, Mail Stop L-103
Livermore, CA 94550, U.S.A.
Tel: (925) 422-2486
FAX: (925) 422-7675
email: santer1@llnl.gov
----------------------------------------------------------------------------
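For concreteness, the autocorrelation-corrected trend uncertainty Carl refers to above (a least-squares trend whose 2-sigma interval is widened to allow for temporally correlated residuals) can be sketched in Python as follows. The lag-1-autocorrelation / effective-sample-size adjustment shown here is one common way of doing this; the function name, the AR(1) red-noise parameters, and the synthetic 1979-2004 series are illustrative assumptions, not the actual calculation behind the CCSP Table 3.4 values.

import numpy as np

def trend_with_adjusted_ci(y, dt=1.0 / 12.0):
    """Least-squares trend of a (monthly) time series, plus an
    approximate 2-sigma half-width in which the residual degrees of
    freedom are reduced via the lag-1 autocorrelation of the residuals
    (one common way of allowing for temporally correlated noise)."""
    n = y.size
    t = np.arange(n) * dt                          # time in years
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # lag-1 autocorrelation
    n_eff = n * (1.0 - r1) / (1.0 + r1)            # effective sample size
    s2 = np.sum(resid ** 2) / (n_eff - 2.0)        # adjusted residual variance
    se_slope = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    return slope, 2.0 * se_slope                   # trend and ~2-sigma, in K/yr

# Illustrative example: a 0.15 K/decade trend plus strongly autocorrelated
# (AR(1)) red noise over 1979-2004.  All numbers are made up for the sketch.
rng = np.random.default_rng(1)
n_months = 26 * 12
noise = np.zeros(n_months)
for i in range(1, n_months):
    noise[i] = 0.9 * noise[i - 1] + rng.normal(0.0, 0.1)
y = 0.015 * np.arange(n_months) / 12.0 + noise     # 0.015 K/yr = 0.15 K/decade
trend, half_width = trend_with_adjusted_ci(y)
print("trend = %.2f K/decade, 2-sigma ~ %.2f K/decade" % (10 * trend, 10 * half_width))

With strongly autocorrelated noise the effective sample size is far smaller than the number of months, so the 2-sigma interval on a roughly 26-year trend can be comparable to the trend itself - which is the point made above about unforced variability dominating trend comparisons over short records.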