[Esip-preserve] Identifiers

Bruce Barkstrom brbarkstrom at gmail.com
Wed Feb 15 15:48:29 EST 2012

It would be useful to at least have some clear definitions of things.

So - to go back to the "undefined term" "data set" - does this term
refer to
- a single data value
- a collection of data values (as a subset of data values in a file or
a result set from a query to a database that contains tables of
data values)
- a file
- a database
- a collection of files containing data values
- a collection of relational database tables containing data values
- collections of "stuff" that contain both data values and documentation

As an extension of this definition, if we're going to deal with subsets
of collections of things, how do we identify the subset:
- enumerate the elements
- invoke a hierarchical description, where each level in the hierarchy
provides a list of the child elements
- invoke a mathematical graph, where a "query" produces a "list"
of elements that depends on where you start and how you choose
to traverse the graph
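
To make the three options concrete, here is a rough sketch in Python.
It is only an illustration: the file names, the edge labels
("derived_from", "documented_by"), and the helper functions are all
made up for the example, not taken from any existing system.

```python
from collections import deque

# All names below are hypothetical, chosen only for illustration.

# 1. Enumeration: the subset is an explicit list of its elements.
enumerated = frozenset({"granule_001.hdf", "granule_002.hdf"})

# 2. Hierarchy: each level lists its child elements. A subset is named
#    by a node, and its members are the leaves underneath that node.
hierarchy = {
    "dataset": ["2011", "2012"],
    "2011": ["granule_001.hdf", "granule_002.hdf"],
    "2012": ["granule_003.hdf"],
}

def leaves(tree, node):
    """Return the leaf elements below `node`, in listing order."""
    children = tree.get(node)
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result

# 3. Graph: the subset is whatever a traversal returns, so it depends
#    on the starting node and on which kinds of edges are followed.
graph = {
    "granule_003.hdf": [("derived_from", "granule_001.hdf"),
                        ("documented_by", "readme.txt")],
    "granule_001.hdf": [("derived_from", "raw_001.dat")],
    "raw_001.dat": [],
    "readme.txt": [],
}

def traverse(graph, start, follow):
    """Breadth-first walk from `start`, following only edges whose
    label is in `follow`. Returns nodes in the order first reached."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for label, target in graph.get(node, []):
            if label in follow and target not in seen:
                seen.append(target)
                queue.append(target)
    return seen
```

Note that leaves(hierarchy, "2011") happens to give the same elements
as the enumerated set, while traverse() gives different lists from the
same graph depending on the start node and the edge labels followed.
That dependence is what makes the graph case harder to pin down with
a single persistent name.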

It would clearly be useful to get persistent names whose references
can survive in the face of hardware evolution (see Marowka, A., 2011:
Back to Thin-Core Massively Parallel Processors, Computer, 44,
No. 12, 49-54), software obsolescence (see Lyon, D., 2012:
The Java Tree Withers, Computer, 45, No. 1, 83-85), and proprietary
software (Norman, D. A., 2012: Yet Another Technology Cusp:
Confusion, Vendor Wars, and Opportunities, CACM, 55, No. 2, 30-32 and
Olsen, K. A. and Malizia, A., 2012: Interfaces for the Ordinary User,
Can We Hide Too Much?, CACM, 55, No. 1, 38-40).

Finally, testing for reliable scientific grounding of claims for
"linked data" is going to be a very messy business.  In much of
the current data production (whether by small teams doing
exploratory data analysis - where the number of data sources
is relatively small, or with the "industrial production" that satellite
teams and in situ data collection networks use - where the number
of data sources is fixed by the production design), keeping track
of sources is doable, but laborious.  In the case of "linked data",
the variety of sources may be much larger and the links may be
transient.  These complexities should be incorporated as early as
possible in the design process for the software that builds such
entities.

Bruce B.

On Wed, Feb 15, 2012 at 2:26 PM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:
> As part of our ongoing Identifiers Activity [1] I'd like to post a few
> thoughts (repeating/recapping some previous discussion and thinking
> about where to go next).
> We've made the case to identify datasets (whatever that means) with
> DOIs [2].  Now we want to look beyond them.
> I am a strong advocate for "Linked Data" [3,4] (and in particular
> "Linked Open Data") which is summarized with these ideas:
>    1. Use URIs as names for things
>    2. Use HTTP URIs so that people can look up those names.
>    3. When someone looks up a URI, provide useful information, using
>       the standards (RDF*, SPARQL)
>    4. Include links to other URIs, so that they can discover more
>       things.
> I really want to have a nice URI scheme where we can fit all the
> 'things' we're talking about (see PCCS [5]).  This will have all sorts
> of benefits I won't list here..  I want to concentrate on how to
> accomplish it.
> 1-3 are pretty easy, we just need to assign HTTP URIs and start using
> them.
> For 4, we really need persistency and adoption.  Cool URIs don't
> change [6] and we really want Cool URIs [7].
> In the data identifiers paper, we discussed the importance of
> persistency and described several schemes with methods of
> accomplishing that.  We want to assign permanent URIs to things and
> put a system and process in place to ensure that the URIs are always
> resolvable.  Using a hopefully somewhat permanent prefix
> (e.g. http://globalchange.gov or http://climate.data.gov) with a
> server that can always resolve all the assigned identifiers, either
> locally or by delegating to another server, I think we can accomplish
> persistency.  (At least in the short term ;-))
> Adoption is an interesting question.  These identifiers are most
> useful when we use the same identifier when referring to the same
> thing. [8] But, we all want to say different things about things.  We
> can also do things like assigning separate identifiers to things, but
> asserting (owl:sameAs) equivalency between one identifier and another.
> More on this later..
> Some of the above referenced documents talk about a lot of guidelines
> for developing "Cool URIs".
> The UK government has gone a bit further with their data.gov.uk plan
> [9].  It seems like *some* of the things we are trying to assign
> identifiers to could fit nicely into an application of something
> similar to their scheme.
> Other thoughts?
> Curt
> [1] http://wiki.esipfed.org/index.php/Identifiers_Activity
> [2] http://dx.doi.org/10.1007/s12145-011-0083-6
> [3] http://www.w3.org/DesignIssues/LinkedData.html
> [4] http://linkeddata.org/
> [5] http://wiki.esipfed.org/index.php/Provenance_and_Context_Content_Standard
> [6] http://www.w3.org/Provider/Style/URI.html
> [7] http://www.w3.org/TR/cooluris/
> [8] http://www.w3.org/TR/webarch/
> [9] http://www.cabinetoffice.gov.uk/resource-library/designing-uri-sets-uk-public-sector
