cc: carl mears, Karl Taylor, Tom Wigley, "Thorne, Peter", Steven Sherwood, John Lanzante, "'Dian J. Seidel'", Melissa Free, Frank Wentz, Steve Klein, Leopold Haimberger, peter gleckler
date: Wed, 05 Dec 2007 14:19:17 -0800
from: Ben Santer
subject: Re: [Fwd: sorry to take your time up, but really do need a scrub
to: Phil Jones

Dear Phil,

Just a quick response to the issue of "model weighting" which you and Carl raised in your emails.

We recently published a paper dealing with the identification of an anthropogenic fingerprint in SSM/I-based estimates of total column water vapor changes. This was a true multi-model detection and attribution ("D&A") study, which made use of results from 22 different A/OGCMs for fingerprint and noise estimation. Together with Peter Gleckler and Karl Taylor, I'm now in the process of repeating our water vapor D&A study using a subset of the original 22 models. This subset will comprise 10-12 models which are demonstrably more successful in capturing features of the observed mean state and variability of water vapor and SST - particularly features crucial to the D&A problem (such as the low-frequency variability). We've had fun computing a whole range of metrics that might be used to define such a subset of "better" models. The ultimate goal is to determine the sensitivity of our water vapor D&A results to model quality. I think that this kind of analysis will be unavoidable in the multi-model world in which we now live. Given substantial inter-model differences in simulation quality, "one model, one vote" is probably not the best policy for D&A work!

Once we've used Carl's method to calculate synthetic MSU temperatures from the IPCC AR4 20c3m data (as described in my previous email), it should be relatively easy to do a similar "model culling" exercise with MSU T2, T4, and TLT. In fact, this is what we had already planned to do in collaboration with Carl and Frank.

One key point in any model weighting or selection strategy is to avoid circularity. In the D&A context, it would be impermissible to include information on trend behavior as a criterion for selecting "better" models. Likewise, if our interest is in assessing the statistical significance of model-versus-observed trend differences, we can't use model performance in simulating "observed" tropospheric or stratospheric trends (whatever those might be!) as a means of identifying more credible models.

A further issue, of course, is that we are relying on results from fully coupled A/OGCMs, and are making trend comparisons over relatively short periods (several decades). On these short timescales, estimates of the "true" trend in response to the applied 20c3m forcings are quite sensitive to natural variability noise (as Peter Thorne's 2007 GRL paper clearly illustrates). Because of such chaotic variability, even a hypothetical model with perfect physics and forcings would yield a distribution of tropospheric temperature trends over 1979 to 1999, some larger and some smaller than observed. This is why it's illogical to stratify model results according to the correspondence between modeled and observed surface warming - something which John Christy is very fond of doing. What we've done (in the new water vapor work described above) is to evaluate the fidelity with which the AR4 models simulate the observed mean state and variability of precipitable water and SST - not the trends in these quantities.
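[A minimal numpy sketch of the kind of mean-state/variability ranking metric described above, deliberately using no trend information so that the culling stays non-circular with respect to the D&A test. The function names, the log ratio of standard deviations, the median normalization, and the equal weighting of the two error terms are all illustrative assumptions, not the actual PCMDI metrics.]

import numpy as np

def skill_errors(model_fields, obs_field, area_weights):
    # model_fields: dict of model name -> array (time, lat, lon);
    # obs_field: observations on the same grid; area_weights: (lat, lon),
    # e.g. cos(latitude). Returns per-model (mean-state error, variability
    # error); no trend information enters the calculation.
    w = area_weights / area_weights.sum()
    obs_mean = obs_field.mean(axis=0)
    obs_std = obs_field.std(axis=0)
    errors = {}
    for name, fld in model_fields.items():
        # Mean state: area-weighted RMS error of the time-mean field.
        e_mean = np.sqrt(np.sum(w * (fld.mean(axis=0) - obs_mean) ** 2))
        # Variability: area-weighted RMS of the log ratio of standard
        # deviations, penalizing too-weak and too-strong variability
        # symmetrically.
        e_var = np.sqrt(np.sum(w * np.log(fld.std(axis=0) / obs_std) ** 2))
        errors[name] = (e_mean, e_var)
    return errors

def cull(errors, keep=12):
    # Normalize each error by its inter-model median (so the two terms are
    # comparable), sum them, and keep the `keep` lowest-scoring models.
    m0 = np.median([e[0] for e in errors.values()])
    m1 = np.median([e[1] for e in errors.values()])
    return sorted(errors,
                  key=lambda m: errors[m][0] / m0 + errors[m][1] / m1)[:keep]

[A real study would of course compute many such metrics, in different regions and on different timescales, and test how sensitive the resulting subset is to the choices made.]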
We've looked at model performance in a variety of different regions, and on multiple timescales. The results are fascinating, and show (at least for water vapor and SST) that every model has its own individual strengths and weaknesses. It is difficult to identify a subset of models that CONSISTENTLY does well in many different regions and over a range of different timescales. My guess is that we would obtain somewhat different results for MSU temperatures - particularly for comparisons involving variability. Clearly, the absence of volcanic forcing in roughly half of the 20c3m experiments will have a large impact on the estimated variability of synthetic T4 temperatures (and perhaps even on T2), and hence on model-versus-data variability comparisons. It's also quite possible that the inclusion or absence of volcanic forcing has an impact not only on the amplitude of the variability of global-mean T4 anomalies, but also on the pattern of T4 variability. So model ranking exercises based on performance in simulating the mean state and variability of T4 and T2 may show some connection to the presence or absence of volcanic/ozone forcing.

The sad thing is that we are being distracted from doing this fun stuff by the need to respond to Douglass et al. That's a real shame.

With best regards,

Ben

Phil Jones wrote:
> All,
> IJC do have comments but only very rarely. I see little point in
> doing this, as there is likely to be a word limit, and if the system
> works properly Douglass et al would get the final say. There is also a
> large backlog of papers waiting to appear, so even if the comment were
> accepted it would be some time after Douglass et al that it would appear.
> Better would be a submission to another journal (JGR?), which
> would be quicker. This could go in before Douglass et al appeared in
> print - it should be in the IJC early online view fairly soon based on
> recent experience.
> A paper pointing out the issues of trying to weight models in some way
> would be very beneficial to the community. AR5 will have to go down this
> route at some point. How models simulate the recent trends at the
> surface and in the troposphere/stratosphere, and how they might be
> ranked, is a possibility. This could bring in the new work Peter
> alludes to with the sondes.
> There are also some aspects of recent surface T changes that could be
> discussed as well. These relate to the growing dominance of buoy SSTs
> (now 70% of the total) vs conventional ships. There is a paper from
> Smith/Reynolds et al at NCDC, accepted in J. Climate, which shows that
> buoys could conceivably be cooler than ship-based SSTs by about 0.1C -
> meaning that the last 5-10 years are being gradually underestimated
> over the oceans.
> Overlap is still too short to be confident about this, but it highlights a
> major systematic change occurring in surface ocean measurements. As the
> buoys are presumably better for absolute SSTs, this means models
> driven with fixed SSTs should be using fields that are marginally cooler.
>
> And then there is the continual reference to Kalnay and Cai, when
> Simmons et al (2004) have shown the problems with NCEP. It is possible
> to add in the ERA-Interim analyses and operational analyses to
> bring results from ERA-40 up to date.
>
> Cheers
> Phil
>
>
> At 23:40 04/12/2007, carl mears wrote:
>> Karl -- thanks for clarifying what I was trying to say
>>
>> Some further comments.....
>>
>> At 02:53 PM 12/4/2007, Karl Taylor wrote:
>>> Dear all,
>>> 2) unforced variability hasn't dominated the observations.
>>
>> But on this short time scale, we strongly suspect that it has
>> dominated. For example, the 2-sigma error bars from table 3.4 of the
>> CCSP report for satellite TLT are 0.18 (UAH) or 0.19 (RSS), larger
>> than either group's trends (0.05, 0.15) for 1979-2004. These were
>> calculated using a "goodness of linear fit" criterion, corrected for
>> autocorrelation. This is probably a reasonable estimate of the
>> contribution of unforced variability to trend uncertainty.
>>
>>> Douglass et al. have *not* shown that every individual model is in
>>> fact inconsistent with the observations. If the spread of individual
>>> model results is large enough and at least 1 model overlaps the
>>> observations, then one cannot claim that all models are wrong, just
>>> that the mean is biased.
>>
>> Given the magnitude of the unforced variability, I would say "the mean
>> *may* be biased." You can't prove this with only one universe, as Tom
>> alluded. All we can say is that the observed trend cannot be proven to
>> be inconsistent with the model results, since it is inside their range.
>>
>> It will be interesting to see if we can say anything more, when we
>> start culling out the less realistic models, as Ben has suggested.
>>
>> -Carl
>
> Prof. Phil Jones
> Climatic Research Unit                  Telephone +44 (0) 1603 592090
> School of Environmental Sciences        Fax +44 (0) 1603 507784
> University of East Anglia
> Norwich                                 Email p.jones@uea.ac.uk
> NR4 7TJ
> UK
> ----------------------------------------------------------------------------

----------------------------------------------------------------------------
Benjamin D. Santer
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
P.O. Box 808, Mail Stop L-103
Livermore, CA 94550, U.S.A.
Tel: (925) 422-2486
FAX: (925) 422-7675
email: santer1@llnl.gov
----------------------------------------------------------------------------
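[A minimal sketch of the autocorrelation-corrected trend uncertainty Carl quotes above: fit an OLS trend, then inflate the slope's standard error via an effective sample size based on the lag-1 autocorrelation of the residuals - presumably the adjustment of Santer et al. (2000). The synthetic data and all numbers below are illustrative, not the actual CCSP calculation.]

import numpy as np

def trend_with_2sigma(y, dt=1.0 / 12.0):
    # OLS trend of a monthly time series y (dt in years), with the 2-sigma
    # trend uncertainty inflated for lag-1 autocorrelation of the residuals
    # via an effective sample size n_eff = n * (1 - r1) / (1 + r1).
    n = y.size
    t = np.arange(n) * dt
    b, a = np.polyfit(t, y, 1)                      # slope b, intercept a
    resid = y - (a + b * t)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation
    n_eff = n * (1.0 - r1) / (1.0 + r1)
    s2 = np.sum(resid ** 2) / (n_eff - 2.0)         # adjusted residual variance
    se_b = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    return b, 2.0 * se_b

# Synthetic red-noise example spanning 1979-2004 (312 months):
rng = np.random.default_rng(0)
noise = np.zeros(312)
for i in range(1, 312):
    noise[i] = 0.7 * noise[i - 1] + rng.normal(scale=0.1)
t_years = np.arange(312) / 12.0
y = 0.015 * t_years + noise                         # ~0.15 K/decade trend
b, ci = trend_with_2sigma(y)
print(f"trend = {10 * b:+.2f} K/decade, 2-sigma = {10 * ci:.2f} K/decade")

[With strongly autocorrelated noise, n_eff here is a few dozen rather than 312, so the 2-sigma bar can exceed the fitted trend itself - which is exactly Carl's point about the UAH and RSS TLT trends.]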