[Esip-preserve] Identifiers

Thu Feb 16 14:53:26 EST 2012

This strikes me as an insightful comment.  The first paragraph, for
example, adds some definitions (or alternate terms) to the list:

Collection:
1.  Collection of data sets [Example: collection from a field campaign
or collection from a particular instrument]
2.  Suite of data sets.

Dataset:
1.  Granule
2.  THREDDS context: approximately "granule"

Curt may want to deal primarily with identifiers.  I think we
need to be able to identify "objects" - meaning collections of
data, metadata, and documentation.  It seems to me inappropriate
to expect an HDF reader to handle a PDF document that comes
as part of a granule containing supporting documentation.

In the next paragraph, on the literary purpose of DOIs,
we need to be careful about the library science definitions
of objects.  The basic item of bibliographic reference is the
"work," which is a text document in some of the material
on library science.  For Earth science, the objects are often
numbers.  The digital form of a number is much more variable
than the characters in a text - and the sequence of numbers
emerging from a serialized file is not unique.  As an example,
you could store numbers in a multi-dimensional C array or
a multi-dimensional FORTRAN array.  The arrays would have
a different sequence of numbers, yet the scientific content
is not affected as long as the measurements are tied properly
to the time and space sampling of the fields being measured.
Thus, it seems to me that we're going to need names that
don't require identical bit arrays if we want to deal with
scientifically equivalent data that has been reformatted
without otherwise changing the values.
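
To make the C versus FORTRAN point concrete, here is a minimal
sketch in Python.  It is only an illustration under stated
assumptions: the function names and the "content digest" scheme
(hashing row, column, value triples in a fixed order) are
hypothetical and are not a proposed identifier standard.

```python
import hashlib
import struct


def serialize_row_major(grid):
    """Flatten a 2-D list in C order (last index varies fastest)."""
    return [v for row in grid for v in row]


def serialize_column_major(grid):
    """Flatten a 2-D list in FORTRAN order (first index varies fastest)."""
    n_rows, n_cols = len(grid), len(grid[0])
    return [grid[i][j] for j in range(n_cols) for i in range(n_rows)]


def byte_digest(values):
    """Hash the serialized bytes as stored; sensitive to storage order."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()


def content_digest(values, shape, order):
    """Hash (row, col, value) triples in one fixed order.

    Hypothetical scheme: each value is tied to its sampling index,
    so the result does not depend on how the file was laid out.
    """
    n_rows, n_cols = shape
    h = hashlib.sha256()
    for i in range(n_rows):
        for j in range(n_cols):
            # Locate element (i, j) inside the flattened sequence.
            k = i * n_cols + j if order == "C" else j * n_rows + i
            h.update(struct.pack("<iid", i, j, values[k]))
    return h.hexdigest()


grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
c_seq = serialize_row_major(grid)
f_seq = serialize_column_major(grid)

same_bytes = byte_digest(c_seq) == byte_digest(f_seq)
same_content = (content_digest(c_seq, (2, 3), "C")
                == content_digest(f_seq, (2, 3), "F"))
print("byte digests match:", same_bytes)
print("content digests match:", same_content)
```

The two serializations hold the same six values in a different
sequence, so a name derived from the raw bytes differs between
them, while a name derived from values tied to their indices
does not.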

Bruce B.

On Thu, Feb 16, 2012 at 11:55 AM, Mark A. Parsons <parsonsm at nsidc.org> wrote:
> I have seen a data set referred to as a collection in some communities. I have also seen collection refer to a collection of data sets, like a collection from a field campaign or a particular instrument. This is also sometimes called a suite of data sets. I have also seen dataset (i.e. misspelled as one word) refer to what we often call granules. Think of datasets in THREDDS. I think these are all legitimate terms that mean things in context. I think the question Curt is really trying to address is what kind of identifiers do we assign to what kind of things, regardless of what we call them.
>
> For DOIs, whose primary purpose is literary citation, I think they should be applied to some sort of logical grouping of data (that logic is defined by the author and stewards and might usually consider what data are typically used in conjunction with each other). These things are typically called data sets and/or collections or maybe AIPs. They can have different levels of hierarchy (or not) that get DOIs such as one for the collection or suite and several for the data sets in that collection. DOIs are not suited for deep hierarchies or detailed identification, though, because of their financial and administrative costs. Perhaps another way of thinking about it is that DOIs should typically point to some sort of landing page, which implies that there is some sort of set or collection underlying that page, not just a single item.
>
> More detailed items in the hierarchy or web of data need different kinds of identifiers. Perhaps also data that are arranged hierarchically may use different identifiers than those that are arranged in more of a linked-graph based collection or distributed "e-science object". I'm sure there are other considerations as well. For example, one also needs to consider whether the identifier needs to be actionable.
>
> While we can say in our community that certain identifiers should be used for "data sets" or "granules" or "collections" or whatever, what we really need to do is define the effective use of recommended identifiers for different types of things (including non-digital things and transformations of things). English is imprecise; one word doesn't always cut it.
>
> Cheers,
>
> -m.
>
>
>
>
> On 16 Feb 2012, at 8:54 AM, Mark A. Parsons wrote:
>
>> Personally, I think defining a data set too precisely is a fool's errand. It is the responsibility of the data authors and stewards to define something that makes sense for their designated community and slap a DOI and a name onto it.
>>
>> To me a data set is simply a logical arrangement of data that has meaning to a designated community.
>>
>> Your definition below, for example, does not work for many, perhaps the majority, of NSIDC data sets.
>>
>> Cheers,
>>
>> -m.
>> On 16 Feb 2012, at 8:06 AM, Curt Tilmes wrote:
>>
>>> On 02/15/2012 03:48 PM, Bruce Barkstrom wrote:
>>>> It would be useful to at least have some clear definitions of
>>>> things.
>>>>
>>>> So - to go back to the "undefined term" "data set" does this term
>>>> refer to
>>>
>>> Yes, we need to define it.  We keep putting it off.  Let's debate this
>>> one now.  We might not come to complete agreement, but perhaps we can
>>> refine this sufficiently to come up with something we can make use of.
>>>
>>>
>>> I use it for something comparable to the EOSDIS Data Model concept of
>>> Earth Science Data Type (ESDT) + Collection.
>>>
>>> So, for example, the { "MODIS/Terra Snow Cover 5-Min L2 Swath 500m"
>>> (MOD10_L2), "Collection 5" } is one dataset.
>>>
>>> { MOD10_L2, Collection 6 } would be a distinct dataset, and need a
>>> distinct identifier (eventually DOI).
>>>
>>>
>>> I'm also not wedded to the term "dataset" for this concept -- if
>>> someone can sell me on an alternative.  I just think we need some term
>>> for this concept we can all live with.  "dataset" is the most natural
>>> I can come up with.
>>>
>>>
>>> A couple notes for people who don't speak "NASA EOSDIS Data Model":
>>>
>>> 1. A dataset is made up of granules.
>>>
>>> 2. Each granule in the dataset was made in a "common" (I won't define
>>> that for now) way.
>>>
>>> 3. Each granule in the dataset has a common format, metadata, filename
>>> convention, etc.  A reader for one granule will also be able to read
>>> another granule from the same dataset.
>>>
>>>
>>> I'll further add these definitions for discussion:
>>>
>>> A "static dataset" doesn't change.  The set of granules and their
>>> particular contents is constant.
>>>
>>> A "dynamic dataset" can change.  For example, the datasets above will
>>> grow every day since they are part of an ongoing NASA mission that
>>> keeps capturing and processing new data.  The granules that were part
>>> of the dataset yesterday and the granules that are part of the dataset
>>> today are different.  (I know this causes some folks heartburn, but it
>>> is a reality we need to accommodate.)
>>>
>>>
>>> Once we get dataset straight, we can talk about subsets/other
>>> aggregations.
>>>
>>>
>>> --
>>> Curt Tilmes
>>> U.S. Global Change Research Program
>>> 1717 Pennsylvania Avenue NW, Suite 250
>>> Washington, D.C. 20006, USA
>>>
>>> +1 202-419-3479 (office)
>>> +1 443-987-6228 (cell)
>>> _______________________________________________
>>> Esip-preserve mailing list
>>> Esip-preserve at lists.esipfed.org
>>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve