[Esip-preserve] Fwd: [CF-metadata] Multiple file datasets (was: Swath observational data)

Christopher Lynnes Chris.Lynnes at nasa.gov
Fri Nov 20 09:30:00 EST 2009

An interesting discussion of SAFE on the cf-metadata mail list...

Begin forwarded message:

> From: Stephen Emsley <SEmsley at argans.co.uk>
> Date: November 20, 2009 8:12:09 AM EST
> To: John Caron <caron at unidata.ucar.edu>, "cf-metadata at cgd.ucar.edu" <cf-metadata at cgd.ucar.edu>
> Subject: Re: [CF-metadata] Multiple file datasets (was: Swath observational data)
>>> Can anyone summarize what SAFE does?
> I will give it a shot as I brought it up in the first place!
> The Standard Archive Format for Europe (SAFE) was developed as a
> common format for archiving to ensure long-term preservation of EO
> data holdings, both historical and operational. The SAFE website
> [www.esa.int/safe] is the official ESA-maintained site for the
> maintenance and distribution of the standard format, specification,
> XML schemas and tools.
> SAFE is a specialisation of the XML Formatted Data Unit (XFDU), a
> CCSDS (Consultative Committee for Space Data Systems) recommended
> standard for the packaging of data and metadata to facilitate
> information transfer and archiving. Every SAFE product is an XFDU
> package: SAFE defines a restriction of the generic XFDU package,
> inheriting its main structure from the XFDU packaging format and
> adding high-level constraints and new rules for Earth Observation
> ground segment data products.
> A SAFE product wraps, or references, data and associates that data
> with metadata, both global and local. SAFE product metadata contains
> basic information, such as the acquisition period, platform and
> sensor identification, and a processing history to ensure
> traceability. For each included, or externally referenced, dataset,
> another layer of associated metadata may be attached, providing orbit
> and geo-location information, quality information and
> representation information.
> Basically a SAFE product is a directory. At the top level is a
> manifest file, written in XML, that provides a map of the
> contained datasets, defines the relationships between these
> datasets, and contains global metadata (such as platform name,
> acquisition period etc.). There is a set of required metadata
> defined by the SAFE specialisation (e.g. there is an ENVISAT
> specialisation, further restricted to apply to, say, MERIS, and
> still further specialised to, say, Level 1 processed products).
> The contained datasets are collections of records. They are of three  
> types:
> Measurement Data Sets: These are typically binary format files and,
> in our case, will be netCDF-CF files. As an example we will have 46
> measurement data products and each will be stored as a netCDF file
> (data record) along with a data record containing associated quality
> information and another containing status flags.
> Annotation Data Sets: These contain metadata and common data.
> Although this is still to be decided, for Sentinel 3 Level 2 we are
> considering storing a common set of coordinate data that is
> applicable to subsets of the measurement data. The manifest file
> will provide the association between specific measurement datasets
> and the associated coordinate data.
> Representation Data Sets: These are XML Schema descriptions of the
> measurement and annotation datasets. Firstly, representation
> information is a key concept for OAIS digital preservation; secondly,
> third-party applications may use these schemas for displaying or
> accessing the corresponding measurement datasets. I appreciate that
> it might seem a little 'belt-and-braces' to have an XML schema for a
> netCDF file (which is by nature self-describing) but that is how the
> SAFE people have decided to include netCDF into the convention.
> There is a further type of data, which can be considered as
> resources. These may be, for instance, data required for the
> generation of the end-user data products. For Level 2 data products
> they would include the Level 1 input products and possibly ECMWF
> data required for processing (although the latter might equally be
> an annotation dataset). These resources are not packaged inside a
> SAFE container but are referenced (in the manifest file) using a URI.
> All of these taken together are a SAFE package.
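
The package layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the manifest below is a made-up, heavily simplified stand-in, and its element names, attribute names, file names and the example.org URI are hypothetical. Real SAFE manifests follow the XFDU and SAFE schemas, which are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified manifest. Element and attribute names are
# illustrative only and do NOT follow the real XFDU/SAFE schemas.
MANIFEST = """
<manifest>
  <metadata platform="EXAMPLE-SAT" start="2009-11-20T00:00:00Z"/>
  <dataset id="chl" kind="measurement" href="chl.nc" coordinates="geo"/>
  <dataset id="geo" kind="annotation" href="geo_coordinates.nc"/>
  <dataset id="chl_schema" kind="representation" href="chl.xsd"/>
  <resource id="l1_input" href="http://example.org/l1/product"/>
</manifest>
"""

def read_manifest(text):
    """Return (global metadata, datasets by id, external resources by id)."""
    root = ET.fromstring(text.strip())
    meta = dict(root.find("metadata").attrib)
    datasets = {d.get("id"): dict(d.attrib) for d in root.findall("dataset")}
    resources = {r.get("id"): r.get("href") for r in root.findall("resource")}
    return meta, datasets, resources

def coordinates_for(datasets, dataset_id):
    """Look up the file holding the coordinates associated with a dataset."""
    target = datasets[dataset_id].get("coordinates")
    return datasets[target]["href"] if target else None

meta, datasets, resources = read_manifest(MANIFEST)
print(meta["platform"], coordinates_for(datasets, "chl"), resources["l1_input"])
```

The point of the sketch is that the association between a measurement dataset and its coordinate data lives in the manifest, not in the netCDF files themselves, while resources stay outside the package and are reached through a URI.
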
> I hope that this provides a reasonably informative overview. The  
> SAFE website is the place to go for more detailed info.
> Steve
> ---
> Dr Stephen Emsley
> Tel: +44 (0)1752 764 289
> Limited
> Mobile: +44 (0)7912 515 418
> -----Original Message-----
> From: cf-metadata-bounces at cgd.ucar.edu [mailto:cf-metadata-bounces at cgd.ucar.edu] On Behalf Of John Caron
> Sent: 20 November 2009 12:30
> To: cf-metadata at cgd.ucar.edu
> Subject: [CF-metadata] Multiple file datasets (was: Swath observational data)
> This topic deserves its own heading, so here it is.
> Perhaps we should gather current practices and ideas. I think  
> Balaji's gridspec has a proposal about this. Can anyone summarize  
> what SAFE does?
> I'm imagining how this is actually used, e.g.:
> float data(y,x);
>  data:coordinates = "lat@file1 lon@file2";
> ????
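
To make the question concrete, here is a minimal sketch of how software might read such an attribute, assuming tokens of the form `variable@file`. That token form is hypothetical syntax taken from the example above, not part of the CF conventions, and `parse_coordinates` is an illustrative name.

```python
def parse_coordinates(attr):
    """Split a coordinates attribute into (variable, file) pairs.

    A token without '@' refers to a variable in the same file, so its
    file part is None. The 'variable@file' form is hypothetical.
    """
    refs = []
    for token in attr.split():
        name, sep, target = token.partition("@")
        # Reject tokens such as "@file1" or "lat@" that name nothing.
        if not name or (sep and not target):
            raise ValueError(f"malformed coordinate reference: {token!r}")
        refs.append((name, target or None))
    return refs

print(parse_coordinates("lat@file1 lon@file2"))
print(parse_coordinates("time"))
```

Parsing is the easy part; how `file1` is then located is left open, and that is the question the rest of the thread addresses.
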
> John Graybeal wrote:
>> I like Bryan's recommendation for a UUID or similar.
>> Now I'm going to be annoying and suggest the UUID *could* be a URI, or
>> these days, an IRI (International ..).
>> And I think the way of 'locating' the file should be neither in
>> packaging nor in local resolution; it should be in global namespace
>> resolution.  This is the way of the future, and is already more
>> 'permanent' than either packaging or local resolution, IMHO.
>> There is one form of URI in particular that is already resolvable: a
>> URL.  OK, that's an old song, but I'm gonna stick to it for a while
>> longer.  That form meets all the other requirements: it can be
>> registered in a resolver, it can be guaranteed unique (to the same
>> authority level as a UUID, anyway), and it is a unique string that can
>> be used to validate the link.  And it has the obvious benefit of being
>> resolvable right now, for as long as the domain is held and properly
>> maintained (Good URLs don't die).
>> Since the last paragraph risks starting another unique identifier war, I
>> promise not to re-engage unless someone asks me to. Meanwhile, I like
>> John
>> On Nov 19, 2009, at 22:23, Bryan Lawrence wrote:
>>> On Thursday 19 November 2009 19:40:08 Jonathan Gregory wrote:
>>>>>    ...  In  some cases, referencing attributes such as
>>>>>     "coordinates" and "ancillary_variables" would, ideally, point to a
>>>>>     variable in a different dataset.
>>>> This is a general problem to which CF doesn't have a solution because
>>>> it was conceived as a convention for single netCDF files. However we
>>>> need a solution as often several files should be treated as a single
>>>> dataset.
>>>> If the files don't overlap i.e. their contents are complementary, I
>>>> think it should be satisfactory to allow variables in one file to be
>>>> pointed to by name from another file, with no other mechanism being
>>>> required within the file. I don't like the idea of naming one file
>>>> within another file, as that would be very fragile. Instead, I think
>>>> the file aggregation should be implied by simply defining the group of
>>>> files which are to be treated as one file e.g. by putting them in one
>>>> directory.
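
A rough sketch of this directory-implied aggregation, under the assumption that the files in the group have already been opened and their variable names listed. The function names and the example file names are hypothetical; nothing here is an existing CF or netCDF API.

```python
def build_variable_index(files):
    """Map each variable name to the single file in the group defining it.

    `files` maps a file name to the variable names it contains. In
    practice this would come from listing one directory and opening each
    netCDF file; a plain dict keeps the sketch self-contained.
    """
    index = {}
    for fname, variables in sorted(files.items()):
        for var in variables:
            if var in index:
                # The proposal assumes complementary, non-overlapping files.
                raise ValueError(
                    f"{var!r} is defined in both {index[var]} and {fname}"
                )
            index[var] = fname
    return index

def resolve(index, attr):
    """Resolve a space-separated attribute such as 'lat lon' to files."""
    return {name: index[name] for name in attr.split()}

group = {"sst.nc": ["sst", "sst_quality"], "geo.nc": ["lat", "lon"]}
index = build_variable_index(group)
print(resolve(index, "lat lon"))
```

No file name appears inside any file: the reference is by variable name only, and it works only for as long as the group of files stays together.
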
>>> It's the old ones that are the best ones :-) :-)  this issue keeps on
>>> coming back ... :-) :-) and we keep trying to ignore it ...
>>> I think we agree that an actual physical filename including path is
>>> useless. We need both a relative link which relies on the
>>> preservation of a group of files in a particular arrangement ... AND
>>> an internal identifier so more robust linking mechanisms can be used
>>> when (if) the data ends up in a managed environment.
>>> I think it's crucial in this situation to ensure that each file has a
>>> unique identifier within it (created, for example, with uuid), because
>>> all solutions which rely on packaging are fragile (SAFE is probably
>>> better than most), but the bottom line is that users move files around
>>> ... and we need some way of ensuring that we/they can validate that the
>>> links that are in place are the ones that were originally intended.
>>> So relative links would also include the identifier of the intended
>>> target as well as the relative path in operating system agnostic terms.
>>> That identifier can be used in two ways: a) to validate the link (my
>>> software can always check that the variable that I just opened
>>> following a link from another one is the one that was expected by
>>> checking the container identifier), and b) to produce an identifier
>>> resolver service for the situation where the packaging has had to be
>>> broken (which might occur for performance reasons or ...)
>>> CF could recommend something like this ...
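
The two uses of the identifier can be sketched as follows. This is a minimal illustration under assumptions: the `Link` class, the `follow` function, the `read_file_id` callback and the dictionary standing in for a resolver service are all hypothetical names, and how an identifier would actually be stored inside a netCDF file is not specified here.

```python
import uuid
from dataclasses import dataclass
from pathlib import PurePosixPath

@dataclass(frozen=True)
class Link:
    relative_path: PurePosixPath  # OS-agnostic location relative to the source
    target_id: uuid.UUID          # identifier expected inside the target file
    variable: str                 # variable to open in the target file

def follow(link, base, read_file_id, resolver=None):
    """Return the path of the linked file once its identifier is validated.

    read_file_id(path) is a hypothetical callback returning the UUID
    stored in the file at `path`, or None if there is no such file.
    """
    candidate = base / link.relative_path
    # a) Validate the relative link against the identifier in the target.
    if read_file_id(candidate) == link.target_id:
        return candidate
    # b) Packaging broken: fall back to an identifier resolver, if any.
    if resolver is not None and link.target_id in resolver:
        resolved = resolver[link.target_id]
        if read_file_id(resolved) == link.target_id:
            return resolved
    raise LookupError(f"cannot locate file with identifier {link.target_id}")

geo_id = uuid.uuid4()
link = Link(PurePosixPath("geo.nc"), geo_id, "lat")
base = PurePosixPath("/data/product")

intact = {PurePosixPath("/data/product/geo.nc"): geo_id}
moved = {PurePosixPath("/archive/geo.nc"): geo_id}
resolver = {geo_id: PurePosixPath("/archive/geo.nc")}

print(follow(link, base, intact.get))
print(follow(link, base, moved.get, resolver))
```

A file that sits at the expected relative path but carries a different identifier is rejected rather than silently used, which is the validation being asked for.
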
>>> Bryan
>>> --
>>> Bryan Lawrence
>>> Director of Environmental Archival and Associated Research
>>> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
>>> STFC, Rutherford Appleton Laboratory
>>> Phone +44 1235 445012; Fax ... 5848;
>>> Web: home.badc.rl.ac.uk/lawrence
>>> _______________________________________________
>>> CF-metadata mailing list
>>> CF-metadata at cgd.ucar.edu
>>> http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata
>> --------------
>> I have my new work email address: jgraybeal at ucsd.edu
>> --------------
>> John Graybeal   <mailto:jgraybeal at ucsd.edu>
>> phone: 858-534-2162
>> Development Manager
>> Ocean Observatories Initiative Cyberinfrastructure Project:
>> http://ci.oceanobservatories.org
>> Marine Metadata Interoperability Project: http://marinemetadata.org

Christopher Lynnes             NASA/GSFC, Code 610.2          
