Fwd: A Couple of Notes on Info Quality and Data Preservation

Bruce Barkstrom
Mon Dec 10 15:02:30 EST 2012

I sent this note to Brent Maddux Sunday.
I think the concerns here have at least two points
at which they tie to the preservation WG:
- what information should be displayed on the Landing Points?
- how do we develop and then preserve caveats to best
   tie in to user needs?

Bruce B.


Bruce Barkstrom
Date: Sun, Dec 9, 2012 at 11:30 AM
Subject: A Couple of Notes on Info Quality and Data Preservation


As you may recall, Dr. Schloss had sent out a call
for data sources for precipitation data that might be
easier for her to use than the GHCN data from NCDC.
The ESIP community responded with at least a half
dozen different portals for obtaining such data (and
a bunch of other stuff, like temperatures).

From my perspective, the responses had the potential
to create a lot of confusion - depending on what Dr. Schloss
wanted to do.  After trying to clarify that in some e-mails
with her, I think what she wanted was some data that
could be used for a simple plotting exercise of the kind
one might expect for students in a first class in science.
The exact content of the data she received probably
didn't matter too much.

On the other hand, if her interest had been in climate
change, I think the responses could create a lot of
confusion.  Let me just stick to the precipitation record,
where I've done some personal exploration of the GHCN data.

The responses from the ESIP members didn't seem to
recognize the coverage issue and related uncertainty
changes that various data sources have.  The original
GHCN data start with a handful of stations in the center
of the US back about 1835 (if I recall correctly without
checking my notes).  Over the 1800s, the effort spread
around the world.  The number of stations peaked in the
mid-2000s and has since declined.  One of the critical
questions for climatological work is how a researcher should
deal with the changes in the number of stations.

A second point of confusion arises in dealing with the
spatial distribution of data.  The rain gauge data come from
discrete stations - essentially the values arise from
measurements on spatial points (or from areas of about
0.5 m^2 or less).  Thus, data presented as though they
arise from an image or spatial grid must have been interpolated
between the points - even if all of the data from these
points come from exactly the same time interval.

There is a very small number of data stations in
the South Pacific. If the data values retrieved from a
Web site are presented on some kind of uniform spatial
grid, the values over that part of the Earth are probably
more dependent on the interpolation method than on
the data from the surrounding coastlines.  Could there
be spurious or undetected seasonal patterns in such
"image-like" or gridded data?  Almost certainly.
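
As a rough illustration of the interpolation point above, the sketch
below estimates a value at one open-ocean grid point from three
coastal or island stations, using two common methods (nearest neighbor
and inverse-distance weighting).  The station positions, the
precipitation values, and the function names are all made up for
illustration; this is not how any particular portal builds its grids.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km, assuming a spherical Earth (radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def nearest_neighbor(stations, lat, lon):
    """Return the value of the closest station."""
    closest = min(stations, key=lambda s: great_circle_km(s[0], s[1], lat, lon))
    return closest[2]

def inverse_distance(stations, lat, lon, power=2.0):
    """Return the inverse-distance-weighted mean of all station values."""
    num = den = 0.0
    for slat, slon, value in stations:
        d = great_circle_km(slat, slon, lat, lon)
        if d == 0.0:
            return value  # grid point sits exactly on a station
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical stations: (lat, lon, monthly precipitation in mm)
stations = [
    (-33.0, -71.6, 30.0),    # South American coast
    (-41.3, 174.8, 100.0),   # New Zealand
    (-17.5, -149.6, 200.0),  # mid-Pacific island
]

# A grid point in the South Pacific, thousands of km from every station
nn = nearest_neighbor(stations, -40.0, -120.0)
idw = inverse_distance(stations, -40.0, -120.0)
print("nearest neighbor:", nn)
print("inverse distance:", round(idw, 1))
```

With the same three stations, the two methods give quite different
values at the open-ocean point, while both return the station's own
value at a station location.  That is the sense in which gridded
values far from any station reflect the method more than the data.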

Could this difficulty be eased by adding satellite data?
Well, only as long as the climatology starts after 1950 (or,
maybe 1970), when satellite instruments existed.
Even if that is acceptable, measuring precipitation from
satellites is still a bit of a research issue, if I recall
correctly.  The most direct measurements usually
use passive microwave or radar - but those are comparatively
recent.  Furthermore, such measurements have a
very different sampling pattern than do the surface rain gauge
measurements.  The usual Sun-synch satellites may return
to the same area about once every sixteen days or so if
the samples are only along the ground path, while if the
instrument is scanning, it could be a bit more frequent.
So a related issue is "how should we combine the satellite
instrument sampling with the rain gauge measurements?"

One issue this suggests for information quality might
be developing guidelines for intercomparing different
data sets that portal sites might provide to users.
For example, it would seem reasonable to have such
portal sites create pictures of the data coverage in
decadal time periods (e.g. 1850-1860, 1860-1870, ...)
showing where data were collected (maybe as a global
map of spatial points, or of areal coverage where the data
really do cover the areas presented).
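
A minimal sketch of the bookkeeping behind such decadal coverage
pictures might look like the following.  The record layout
(station id, latitude, longitude, year), the function name, and the
sample records are assumptions made for illustration, not the GHCN
file format; a real portal would read its own inventory and pass the
per-decade locations to a mapping tool.

```python
from collections import defaultdict

def decadal_coverage(records):
    """Group reporting stations by decade.

    records: iterable of (station_id, lat, lon, year) tuples
             (hypothetical layout).
    Returns {decade_start_year: {station_id: (lat, lon)}}.
    """
    coverage = defaultdict(dict)
    for station_id, lat, lon, year in records:
        decade = (year // 10) * 10  # e.g. 1857 -> 1850
        coverage[decade][station_id] = (lat, lon)
    return dict(coverage)

# Made-up records, for illustration only
records = [
    ("A", 40.0, -95.0, 1851),
    ("A", 40.0, -95.0, 1859),
    ("B", 38.5, -90.2, 1858),
    ("A", 40.0, -95.0, 1863),
    ("C", 51.5, -0.1, 1866),
]

coverage = decadal_coverage(records)
for decade in sorted(coverage):
    locations = sorted(coverage[decade].values())
    print(decade, "stations:", len(locations), "at", locations)
```

Each decade's set of locations is what would be plotted as points on
a global map, and the station count per decade gives a quick view of
how the network grew and shrank over time.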

There's also a need to be careful in such presentations to
make sure that the quantities being presented are really
measuring the same thing.  For example, when comparing
radiosonde temperature profiles with temperature profiles
created from infrared sounders, it's important for a data
user to know that the volume of air sampled by the thermometer
and hygrometer on the ascending balloon is probably best
visualized as a narrow tube of gas that flows through the
instrument during its ascent.  On the other hand, the volume
of air sampled for each point in an infrared sounder has a
much larger diameter and comes from a layer that's roughly
a km thick (the exact "thickness" depends on how the
radiances are inverted to produce the temperature profile).

Let me know if you'd like this set of comments placed
on the Web site or more widely distributed.

Bruce B.
