[Esip-preserve] Fwd: [CF-metadata] Multiple file datasets (was: Swath observational data)

Fri Nov 20 10:37:44 EST 2009

We could use an empirical discussion of the structure of scientific
(Earth science and Space science) file collections.  The term "data set"
is heavily overloaded - and ambiguous.  That term is used loosely in
OAIS RM, not defined (as best I can recall) in CF, not defined (again,
as best I can recall) in the library community standards (METS,
PREMIS), and appears in Appendix H of ISO 19115.  This term
is important in Provenance discussions.

SAFE appears to follow the OAIS RM closely, as might be expected
from Don Sawyer's heavy involvement in the CCSDS standardization
work, although his work is likely to be tied more closely to
Space science data than to Earth science data.  Space science
data is sometimes similar to Earth science data, but the images are
a bit different, since they tend to contain more discrete objects (stars,
although there are extended objects as well).

Again, what is the appropriate computer science (or mathematical)
data structure - and, if it's a hierarchy, what's the appropriate identifier
schema and naming convention for each collection in the hierarchy?
[I've got a publication if anyone's interested - but I won't send that
out unless there's some interest.]
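
One candidate answer to the data-structure question is a rooted tree in
which each collection's identifier is its parent's identifier plus a
local name.  The sketch below is only an illustration under that
assumption; the class name Collection, the "/" separator and the example
names are hypothetical and are not taken from any archive's actual
convention.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional


@dataclass
class Collection:
    """A node in a hypothetical file-collection hierarchy."""
    name: str
    # Excluded from repr/eq so the parent/child cycle cannot recurse forever.
    parent: Optional["Collection"] = field(default=None, repr=False, compare=False)
    children: Dict[str, "Collection"] = field(default_factory=dict)

    def add(self, name: str) -> "Collection":
        """Create (or return the existing) child with the given local name."""
        if name not in self.children:
            self.children[name] = Collection(name, parent=self)
        return self.children[name]

    @property
    def identifier(self) -> str:
        """Identifier built from the local names on the path from the root."""
        if self.parent is None:
            return self.name
        return f"{self.parent.identifier}/{self.name}"

    def walk(self) -> Iterator["Collection"]:
        """Yield this collection and all of its descendants, depth first."""
        yield self
        for child in self.children.values():
            yield from child.walk()


# Hypothetical example: archive -> instrument -> processing level -> files.
archive = Collection("archive")
level1 = archive.add("instrument-A").add("level-1")
level1.add("granule-0001.nc")
level1.add("granule-0002.nc")

identifiers = [c.identifier for c in archive.walk()]
```

A path-derived identifier like this is only unique within one tree, so it
says nothing about whether a real archive is actually tree-shaped; that
remains the empirical question.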

In the meantime, a key question is "what is the structure of Earth
science file collections?".  It's also important to note that that's a very
different question from "what is the structure (whether queries or
navigation entry points and paths) of user data access searches?"
Is anyone aware of an empirical study of that structural question
that has used real examples from the NASA and NOAA archives?

Bruce B.

At 09:30 AM 11/20/2009, Christopher Lynnes wrote:
>An interesting discussion of SAFE on the cf-metadata mail list...
>
>Begin forwarded message:
>
>>From: Stephen Emsley <SEmsley at argans.co.uk>
>>Date: November 20, 2009 8:12:09 AM EST
>>To: John Caron <caron at unidata.ucar.edu>, "cf-metadata at cgd.ucar.edu"
>><cf-metadata at cgd.ucar.edu>
>>Subject: Re: [CF-metadata] Multiple file datasets (was: Swath
>>observational data)
>>
>>>>Can anyone summarize what SAFE does?
>>
>>I will give it a shot as I brought it up in the first place!
>>
>>The Standard Archive Format for Europe (SAFE) was developed as a
>>common format for archiving to ensure long-term preservation of EO
>>data holdings, both historical and operational. The SAFE website
>>[www.esa.int/safe] is the official ESA-maintained site for the
>>maintenance and distribution of the standard format, specification,
>>XML schemas and tools.
>>
>>SAFE is a specialisation of the XML Formatted Data Unit (XFDU), a
>>CCSDS (Consultative Committee for Space Data Systems) recommended
>>standard for the packaging of data and metadata to facilitate
>>information transfer and archiving. Every SAFE product is an XFDU
>>package: the specialisation defines a restriction of the generic
>>XFDU package. SAFE inherits its main structure from the XFDU
>>packaging format and defines high-level constraints and new rules
>>for Earth Observation ground segment data products.
>>
>>A SAFE product wraps, or references, data and associates that data
>>with metadata, both global and local. SAFE product metadata contains
>>basic information, such as the acquisition period, platform and
>>sensor identification and a processing history to ensure
>>traceability. For each included, or externally referenced, dataset
>>another layer of associated metadata may be attached, providing orbit
>>and geo-location information, quality information and
>>representational information.
>>
>>Basically a SAFE product is a directory. At the top level is a
>>manifest file, written in XML, that provides a map of the
>>contained datasets, defines the relationships between these
>>datasets, and contains global metadata (such as platform name,
>>acquisition period etc.). There is a set of required metadata
>>defined by the SAFE specialisation (e.g. there is an ENVISAT
>>specialisation, further restricted to apply to, say, MERIS, and
>>still further specialised to, say, Level 1 processed products).
>>
>>The contained datasets are collections of records. They are of three
>>types:
>>
>>Measurement Data Sets: These are typically binary format files and,
>>in our case, will be netCDF-CF files. As an example, we will have 46
>>measurement data products and each will be stored as a netCDF file
>>(data record) along with a data record containing associated quality
>>information and another containing status flags.
>>
>>Annotation Data Sets: These contain metadata and common data.
>>Although this is still to be decided, in the case of Sentinel 3
>>Level 2 we are considering storing a common set of coordinate data
>>that is applicable to subsets of the measurement data. The manifest
>>file will provide the association between specific measurement
>>datasets and the associated coordinate data.
>>
>>Representation Data Sets: These are XML Schema descriptions of the
>>measurement and annotation datasets. Firstly, representation
>>information is a key concept for OAIS digital preservation; secondly,
>>third-party applications may use these schemas for displaying /
>>accessing the corresponding measurement data sets. I appreciate that
>>it might seem a little 'belt-and-braces' to have an XML schema for a
>>netCDF file (which is by nature self-describing) but that is how the
>>SAFE people have decided to include netCDF into the convention.
>>
>>There is a further type of data, which can be considered as resources.
>>These may be, for instance, data required for the generation of the
>>end-user data products. For Level 2 data products they would include
>>the Level 1 input products and possibly ECMWF data required for
>>processing (although the latter might equally be an annotation
>>dataset). These resources are not packaged inside a SAFE container
>>but are referenced (in the manifest file) using a URI.
>>
>>All of these taken together are a SAFE package.
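
The layout described above can be pictured as a small data model.  What
follows is a rough sketch only, with hypothetical class and field names
(SafePackage, DataSet, annotations_for) and made-up file names and URI;
it does not reproduce the actual SAFE manifest schema, for which the
SAFE website is the reference.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Kind(Enum):
    MEASUREMENT = "measurement"        # e.g. netCDF-CF data files
    ANNOTATION = "annotation"          # metadata and common data
    REPRESENTATION = "representation"  # XML Schema descriptions


@dataclass
class DataSet:
    name: str
    kind: Kind
    path: str  # location relative to the package directory


@dataclass
class SafePackage:
    """Stand-in for the manifest: a map of datasets plus their relationships."""
    global_metadata: Dict[str, str]
    datasets: Dict[str, DataSet] = field(default_factory=dict)
    # measurement dataset name -> names of associated annotation datasets
    associations: Dict[str, List[str]] = field(default_factory=dict)
    # resources kept outside the package and referenced by URI
    external_resources: List[str] = field(default_factory=list)

    def add(self, dataset: DataSet) -> None:
        self.datasets[dataset.name] = dataset

    def associate(self, measurement: str, annotation: str) -> None:
        """Record that a measurement dataset uses an annotation dataset."""
        if self.datasets[measurement].kind is not Kind.MEASUREMENT:
            raise ValueError(f"{measurement} is not a measurement dataset")
        if self.datasets[annotation].kind is not Kind.ANNOTATION:
            raise ValueError(f"{annotation} is not an annotation dataset")
        self.associations.setdefault(measurement, []).append(annotation)

    def annotations_for(self, measurement: str) -> List[DataSet]:
        return [self.datasets[n] for n in self.associations.get(measurement, [])]


# Hypothetical package with one dataset of each kind and one external resource.
package = SafePackage(global_metadata={"platform": "example-platform"})
package.add(DataSet("radiance", Kind.MEASUREMENT, "radiance.nc"))
package.add(DataSet("coordinates", Kind.ANNOTATION, "coordinates.nc"))
package.add(DataSet("radiance-schema", Kind.REPRESENTATION, "radiance.xsd"))
package.associate("radiance", "coordinates")
package.external_resources.append("https://example.org/level1/input-product")

shared = [d.path for d in package.annotations_for("radiance")]
```

In this sketch the association table plays the role of the manifest
entry that ties a measurement dataset to its shared coordinate data.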
>>
>>I hope that this provides a reasonably informative overview. The
>>SAFE website is the place to go for more detailed info.
>>
>>Steve
>>
>>
>>---
>>Dr Stephen Emsley          Tel: +44 (0)1752 764 289
>>ARGANS Limited             Mobile: +44 (0)7912 515 418
>>
>>
>>-----Original Message-----
>>From: cf-metadata-bounces at cgd.ucar.edu 
>>[mailto:cf-metadata-bounces at cgd.ucar.edu ] On Behalf Of John Caron
>>Sent: 20 November 2009 12:30
>>To: cf-metadata at cgd.ucar.edu
>>Subject: [CF-metadata] Multiple file datasets (was: Swath
>>observational data)
>>
>>This topic deserves its own heading, so here it is.
>>
>>Perhaps we should gather current practices and ideas. I think
>>Balaji's gridspec has a proposal about this. Can anyone summarize
>>what SAFE does?
>>
>>I'm imagining how this is actually used, e.g.:
>>
>>float data(y,x);
>>  data:coordinates = "lat@file1 lon@file2";
>>
>>????
>>
>>
>>
>>John Graybeal wrote:
>>>I like Bryan's recommendation for a UUID or similar.
>>>
>>>Now I'm going to be annoying and suggest the UUID *could* be a URI, or
>>>these days, an IRI (International ..).
>>>
>>>And I think the way of 'locating' the file should be neither in
>>>packaging nor in local resolution; it should be in global namespace
>>>resolution.  This is the way of the future, and is already more
>>>'permanent' than either packaging or local resolution, IMHO.
>>>
>>>There is one form of URI in particular that is already resolvable: a
>>>URL.  OK, that's an old song, but I'm gonna stick to it for a while
>>>longer.  That form meets all the other requirements: it can be
>>>registered in a resolver, it can be guaranteed unique (to the same
>>>authority level as a UUID, anyway), and it is a unique string that
>>>can be used to validate the link.  And it has the obvious benefit of
>>>being resolvable right now, for as long as the domain is held and
>>>properly maintained (Good URLs don't die).
>>>
>>>Since the last paragraph risks starting another unique identifier war,
>>>I promise not to re-engage unless someone asks me to. Meanwhile, I like
>>>
>>>John
>>>
>>>
>>>On Nov 19, 2009, at 22:23, Bryan Lawrence wrote:
>>>
>>>>On Thursday 19 November 2009 19:40:08 Jonathan Gregory wrote:
>>>>>>    ...  In  some cases, referencing attributes such as
>>>>>>     "coordinates" and "ancillary_variables" would, ideally,
>>>>>>     point to a variable in a different dataset.
>>>>>
>>>>>This is a general problem to which CF doesn't have a solution because
>>>>>it was conceived as a convention for single netCDF files. However, we
>>>>>need a solution, as often several files should be treated as a single
>>>>>dataset.
>>>>>
>>>>>If the files don't overlap, i.e. their contents are complementary, I
>>>>>think it should be satisfactory to allow variables in one file to be
>>>>>pointed to by name from another file, with no other mechanism being
>>>>>required within the file. I don't like the idea of naming one file
>>>>>within another file, as that would be very fragile. Instead, I think
>>>>>the file aggregation should be implied by simply defining the group of
>>>>>files which are to be treated as one file, e.g. by putting them in one
>>>>>directory.
>>>>
>>>>It's the old ones that are the best ones :-) :-)  this issue keeps on
>>>>coming back ... :-) :-) and we keep trying to ignore it ...
>>>>
>>>>I think we agree that an actual physical filename including path is
>>>>useless. We need both a relative link, which relies on the
>>>>preservation of a group of files in a particular arrangement, AND
>>>>an internal identifier, so that more robust linking mechanisms can be
>>>>used when (if) the data ends up in a managed environment.
>>>>
>>>>I think it's crucial in this situation to ensure that each file has a
>>>>unique identifier within it (created, for example, with uuid), because
>>>>all solutions which rely on packaging are fragile (SAFE is probably
>>>>better than most). The bottom line is that users move files around
>>>>... and we need some way of ensuring that we/they can validate that
>>>>the links in place are the ones that were originally intended.
>>>>
>>>>So relative links would also include the identifier of the intended
>>>>target as well as the relative path in operating system agnostic
>>>>terms.
>>>>
>>>>That identifier can be used in two ways: a) to validate the link (my
>>>>software can always check that the variable that I just opened
>>>>following a link from another one is the one that was expected by
>>>>checking the container identifier), and b) to produce an identifier
>>>>resolver service for the situation where the packaging has had to be
>>>>broken (which might occur for performance reasons or ...)
>>>>
>>>>CF could recommend something like this ...
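
As a rough illustration of the scheme described above, the sketch below
models a link as a relative path plus the identifier of the intended
target file, checks the identifier after following the path, and falls
back to a resolver when the packaging has been broken.  Everything named
here is an assumption made for the example: the Link class, the
read_file_id callback (standing in for reading an identifier stored
inside the file) and the dictionary used as a resolver.  CF defines none
of these.

```python
import posixpath
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Link:
    relative_path: str  # OS-agnostic, "/"-separated, relative to the linking file
    target_id: str      # identifier expected inside the target file
    variable: str       # name of the variable in the target file


def resolve(link: Link,
            source_path: str,
            read_file_id: Callable[[str], Optional[str]],
            resolver: Dict[str, str]) -> Optional[str]:
    """Return the path of the validated target file, or None if none is found."""
    candidate = posixpath.normpath(
        posixpath.join(posixpath.dirname(source_path), link.relative_path))
    # a) validate the relative link against the identifier inside the file
    if read_file_id(candidate) == link.target_id:
        return candidate
    # b) packaging broken: ask the resolver where the identifier lives now
    moved = resolver.get(link.target_id)
    if moved is not None and read_file_id(moved) == link.target_id:
        return moved
    return None


# In-memory stand-in for files on disk: path -> identifier stored in the file.
coords_id = str(uuid.uuid4())
files = {"pkg/geo/coords.nc": coords_id}


def read_file_id(path: str) -> Optional[str]:
    return files.get(path)


link = Link("../geo/coords.nc", coords_id, "lat")

# Packaging intact: the relative path leads to the expected file.
found = resolve(link, "pkg/data/radiance.nc", read_file_id, resolver={})

# A user moves the file; the resolver maps the identifier to its new home.
files.clear()
files["elsewhere/coords.nc"] = coords_id
relocated = resolve(link, "pkg/data/radiance.nc", read_file_id,
                    resolver={coords_id: "elsewhere/coords.nc"})

# A different file now sits at the old path: the identifier check rejects it.
files.clear()
files["pkg/geo/coords.nc"] = str(uuid.uuid4())
rejected = resolve(link, "pkg/data/radiance.nc", read_file_id, resolver={})
```

The third case is the one that a path-only link cannot catch: the path
still resolves, but to a file other than the one originally intended.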
>>>>
>>>>Bryan
>>>>
>>>>--
>>>>Bryan Lawrence
>>>>Director of Environmental Archival and Associated Research
>>>>(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
>>>>STFC, Rutherford Appleton Laboratory
>>>>Phone +44 1235 445012; Fax ... 5848;
>>>>Web: home.badc.rl.ac.uk/lawrence
>>>>_______________________________________________
>>>>CF-metadata mailing list
>>>>CF-metadata at cgd.ucar.edu
>>>>http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
>>>
>>>
>>>--------------
>>>I have my new work email address: jgraybeal at ucsd.edu
>>>--------------
>>>
>>>John Graybeal   <mailto:jgraybeal at ucsd.edu>
>>>phone: 858-534-2162
>>>Development Manager
>>>Ocean Observatories Initiative Cyberinfrastructure Project:
>>>http://ci.oceanobservatories.org
>>>Marine Metadata Interoperability Project: http://marinemetadata.org
>>>
>>
>
>--
>Christopher Lynnes             NASA/GSFC, Code 610.2
>301-614-5185
>
>_______________________________________________
>Esip-preserve mailing list
>Esip-preserve at lists.esipfed.org
>http://www.lists.esipfed.org/mailman/listinfo/esip-preserve