[Esip-preserve] Identifiers

Wed Feb 15 16:38:31 EST 2012

As another curmudgeonly note, I think it would be useful to
provide some use cases about the motivation of "linked data
creators".  To be a bit more specific, what is the difference
between a "cut and paste scrapbook" of images and text
from other Web sources and an actual Earth science data
use?

That isn't to say a "scrapbook" doesn't have its uses.
A student report that cuts images and supporting text from
Web pages might resemble a "scrapbook", yet still be useful
for earning the student a high grade.

On the other hand, one would expect a normal scientific
investigation to have to be concerned about the uncertainty
in the data products.  According to the Guide to the
Expression of Uncertainty in Measurement (GUM,
www.bipm.org/en/publications/guides/gum.html), uncertainty
statements require a mathematical model of the measurement
process.  So - when we deal with "linked data products" for
Earth sciences, are we requiring the data producer to create
and document the math model and provide evidence that it
is a reasonable statement of the level of confidence in the
measurements recorded in the data product?
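
To make "math model" a bit more concrete, here is a minimal sketch of
the first-order law of propagation of uncertainty from the GUM,
assuming the input quantities are uncorrelated.  The calibration
model (radiance = gain * (counts - offset)) and all of the numbers
are made up for illustration; they do not come from any real
instrument or data product.

```python
import math
from typing import Callable, Sequence


def combined_standard_uncertainty(
    model: Callable[..., float],
    estimates: Sequence[float],
    uncertainties: Sequence[float],
    rel_step: float = 1e-6,
) -> float:
    """First-order GUM propagation, assuming uncorrelated inputs.

    u_c(y)^2 = sum_i (df/dx_i)^2 * u(x_i)^2, with the sensitivity
    coefficients df/dx_i estimated by central differences.
    """
    total = 0.0
    for i, (x, u) in enumerate(zip(estimates, uncertainties)):
        h = rel_step * (abs(x) if x != 0 else 1.0)
        upper = list(estimates)
        lower = list(estimates)
        upper[i] = x + h
        lower[i] = x - h
        sensitivity = (model(*upper) - model(*lower)) / (2 * h)
        total += (sensitivity * u) ** 2
    return math.sqrt(total)


# Hypothetical measurement model, for illustration only.
def radiance(gain: float, counts: float, offset: float) -> float:
    return gain * (counts - offset)


u_c = combined_standard_uncertainty(
    radiance,
    estimates=[0.5, 1000.0, 20.0],      # gain, counts, offset
    uncertainties=[0.01, 2.0, 1.0],     # standard uncertainties
)
```

The point is only that the producer has to write down something like
radiance() and the input uncertainties before an uncertainty
statement can be computed at all.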

There are lots of variants and levels of fulfillment to the GUM
standard (yes, it's ISO).  What role does the expectation that
"linked data" will deal with scientific investigations play in the
kinds of products this approach encourages?

Bruce b.

On Wed, Feb 15, 2012 at 3:48 PM, Bruce Barkstrom <brbarkstrom at gmail.com> wrote:
> It would be useful to at least have some clear definitions of things.
>
> So - to go back to the "undefined term" "data set", does this term
> refer to:
> - a single data value
> - a collection of data values (as a subset of data values in a file or
> a result set from a query to a database that contains tables of
> data values)
> - a file
> - a database
> - a collection of files containing data values
> - a collection of relational database tables containing data values
> - collections of "stuff" that contain both data values and documentation
> objects
>
> As an extension of this definition, if we're going to deal with subsets
> of collections of things, how do we identify the subset:
> - enumerate the elements
> - invoke a hierarchical description, where each level in the hierarchy
> provides a list of the child elements
> - invoke a mathematical graph, where a "query" produces a "list"
> of elements that depends on where you start and how you choose
> to traverse the graph
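
The three ways of identifying a subset listed above could be sketched
roughly as follows.  This is only an illustration: the granule and
month identifiers are hypothetical, and plain dictionaries stand in
for whatever catalog or database would really hold the structure.

```python
from collections import deque

# 1. Enumeration: the subset is the explicit list of its elements.
enumerated = {"granule-001", "granule-002", "granule-003"}

# 2. Hierarchy: each level lists its child elements.
hierarchy = {
    "collection": ["2012-01", "2012-02"],
    "2012-01": ["granule-001", "granule-002"],
    "2012-02": ["granule-003"],
}


def leaves(tree, node):
    """Return the leaf elements found below `node`."""
    children = tree.get(node)
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result


# 3. Graph: the "list" depends on where the traversal starts and on
#    how the graph is walked (breadth-first here).
graph = {
    "granule-001": ["granule-002"],
    "granule-002": ["granule-003"],
    "granule-003": [],
}


def breadth_first(links, start):
    """Return the nodes reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order
```

Here leaves(hierarchy, "collection") recovers the same three granules
as the enumeration, while breadth_first(graph, "granule-002") never
reaches granule-001, so the same graph yields different subsets
depending on the starting point.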
>
> It would clearly be useful to get persistent names whose references
> can survive in the face of hardware evolution (see Marowka, A., 2011:
> Back to Thin-Core Massively Parallel Processors, Computer, 44,
> No. 12, 49-54), software obsolescence (see Lyon, D., 2012:
> The Java Tree Withers, Computer, 45, No. 1, 83-85), and proprietary
> software (Norman, D. A., 2012: Yet Another Technology Cusp:
> Confusion, Vendor Wars, and Opportunities, CACM, 55, No. 2, 30-32 and
> Olsen, K. A. and Malizia, A., 2012: Interfaces for the Ordinary User,
> Can We Hide Too Much?, CACM, 55, No. 1, 38-40).
>
> Finally, testing for reliable scientific grounding of claims for
> "linked data" is going to be a very messy business.  In much of
> the current data production (whether by small teams doing
> exploratory data analysis - where the number of data sources
> is relatively small, or with the "industrial production" that satellite
> teams and in situ data collection networks use - where the number
> of data sources is fixed by the production design), keeping track
> of sources is doable, but laborious.  In the case of "linked data",
> the variety of sources may be much larger and the links may be
> transient.  These complexities should be addressed as early as
> possible in the design of software that builds such entities.
>
> Bruce B.
>
> On Wed, Feb 15, 2012 at 2:26 PM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:
>>
>> As part of our ongoing Identifiers Activity [1] I'd like to post a few
>> thoughts (repeating/recapping some previous discussion and thinking
>> about where to go next).
>>
>> We've made the case to identify datasets (whatever that means) with
>> DOIs [2].  Now we want to look beyond them.
>>
>>
>> I am a strong advocate for "Linked Data" [3,4] (and in particular
>> "Linked Open Data") which is summarized with these ideas:
>>
>>    1. Use URIs as names for things
>>
>>    2. Use HTTP URIs so that people can look up those names.
>>
>>    3. When someone looks up a URI, provide useful information, using
>>       the standards (RDF*, SPARQL)
>>
>>    4. Include links to other URIs, so that they can discover more
>>       things.
>>
>>
>> I really want to have a nice URI scheme where we can fit all the
>> 'things' we're talking about (see PCCS [5]).  This will have all sorts
>> of benefits I won't list here.  I want to concentrate on how to
>> accomplish it.
>>
>>
>> 1-3 are pretty easy; we just need to assign HTTP URIs and start using
>> them.
>>
>> For 4, we really need persistency and adoption.  Cool URIs don't
>> change [6] and we really want Cool URIs [7].
>>
>>
>> In the data identifiers paper, we discussed the importance of
>> persistency and described several schemes with methods of
>> accomplishing that.  We want to assign permanent URIs to things and
>> put a system and process in place to ensure that the URIs are always
>> resolvable.  Using a hopefully somewhat permanent prefix
>> (e.g. http://globalchange.gov or http://climate.data.gov) with a
>> server that can always resolve all the assigned identifiers, either
>> locally or by delegating to another server, I think we can accomplish
>> persistency.  (At least in the short term ;-))
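
The "resolve locally, or delegate to another server" idea might look
something like the sketch below.  Everything in it is an assumption
made for illustration: the identifier paths, the example.org and
example.net hosts, and the two lookup tables are hypothetical, and a
real resolver would answer with HTTP redirects rather than return
strings.

```python
from typing import Optional

# Hypothetical table of identifiers this server resolves itself.
# The key (the persistent identifier path) never changes; only the
# target on the right is updated when the data moves.
LOCAL_TARGETS = {
    "/dataset/abc": "http://archive.example.org/files/abc-v2",
}

# Hypothetical delegation table: identifier prefix -> other resolver.
DELEGATES = {
    "/partner/": "http://resolver.example.net",
}


def resolve(path: str) -> Optional[str]:
    """Return the current location for an identifier path, or None."""
    if path in LOCAL_TARGETS:
        return LOCAL_TARGETS[path]
    for prefix, server in DELEGATES.items():
        if path.startswith(prefix):
            # Hand the same path on to the delegated resolver.
            return server + path
    return None
```

For example, resolve("/partner/xyz") hands the request on to the
delegated server, while an unassigned path returns None.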
>>
>> Adoption is an interesting question.  These identifiers are most
>> useful when we use the same identifier when referring to the same
>> thing. [8] But, we all want to say different things about things.  We
>> can also do things like assigning separate identifiers to things, but
>> asserting (owl:sameAs) equivalency between one identifier and another.
>> More on this later..
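
Because owl:sameAs is symmetric and transitive, a set of such
assertions partitions the identifiers into groups that name the same
thing.  A minimal sketch of computing those groups with a union-find
structure follows; the identifiers are hypothetical, and this is not
how any particular triple store implements it.

```python
def same_as_classes(identifiers, same_as_pairs):
    """Group identifiers into equivalence classes under sameAs."""
    parent = {i: i for i in identifiers}

    def find(x):
        # Follow parent links to the class representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(a)] = find(b)

    classes = {}
    for i in parent:
        classes.setdefault(find(i), set()).add(i)
    return list(classes.values())


# Hypothetical identifiers assigned by three different parties.
ids = [
    "http://a.example/ds1",
    "http://b.example/ds1",
    "http://c.example/ds1",
    "http://a.example/ds2",
]
pairs = [
    ("http://a.example/ds1", "http://b.example/ds1"),
    ("http://b.example/ds1", "http://c.example/ds1"),
]
groups = same_as_classes(ids, pairs)
```

With these inputs the three ds1 identifiers end up in one group even
though a.example and c.example were never directly asserted to be
the same.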
>>
>> Some of the above referenced documents talk about a lot of guidelines
>> for developing "Cool URIs".
>>
>> The UK government has gone a bit further with their data.gov.uk plan
>> [9].  It seems like *some* of the things we are trying to assign
>> identifiers to could fit nicely into an application of something
>> similar to their scheme.
>>
>> Other thoughts?
>>
>> Curt
>>
>> [1] http://wiki.esipfed.org/index.php/Identifiers_Activity
>> [2] http://dx.doi.org/10.1007/s12145-011-0083-6
>> [3] http://www.w3.org/DesignIssues/LinkedData.html
>> [4] http://linkeddata.org/
>> [5] http://wiki.esipfed.org/index.php/Provenance_and_Context_Content_Standard
>> [6] http://www.w3.org/Provider/Style/URI.html
>> [7] http://www.w3.org/TR/cooluris/
>> [8] http://www.w3.org/TR/webarch/
>> [9] http://www.cabinetoffice.gov.uk/resource-library/designing-uri-sets-uk-public-sector
>> _______________________________________________
>> Esip-preserve mailing list
>> Esip-preserve at lists.esipfed.org
>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve