[Esip-preserve] Identifiers

Thu Feb 16 09:42:08 EST 2012

On 02/15/2012 04:38 PM, Bruce Barkstrom wrote:
> As another curmudgeonly note, I think it would be useful to provide
> some use cases about the motivation of "linked data creators".

The W3C Provenance Incubator captured some general use cases
here:

     http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases

The W3C Provenance Standards will include ways to model and represent
those types of information:

http://www.w3.org/blog/SW/2012/01/11/feedback-welcome-an-overview-of-the-provenance-prov-family-of-specs/

Paul also has a nice blog post here illustrating some very simple
statements you can make about provenance:

     http://www.w3.org/blog/SW/2011/10/23/5-simple-provenance-statements/

If you look at those examples, you'll see our Identifiers issue.

We can use, for example, the "prov:wasDerivedFrom" relationship to
describe that a dataset (or a graph, a figure, a paper, etc.)  was
derived from another dataset, but we need a way to refer to the things
that participate in that relationship.

Paul's example looks like this:

----------------------------------------------------------------------
@prefix ex: <http://www.example.org/>.
@prefix prov: <http://www.w3.org/ns/prov-o/>.

ex:post prov:wasDerivedFrom ex:report.
----------------------------------------------------------------------

Applying the prefixes, he is using these identifiers for his example
things:

http://www.example.org/post and http://www.example.org/report
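
As a rough illustration of what "applying the prefixes" means, here is
a minimal Python sketch.  The prefix table is copied from the example
above; the function name expand() is just something made up for
illustration, not part of any library.

```python
# Prefix table copied from the Turtle example above.
PREFIXES = {
    "ex": "http://www.example.org/",
    "prov": "http://www.w3.org/ns/prov-o/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'ex:post' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    # The full identifier is the namespace with the local part appended.
    return prefixes[prefix] + local


if __name__ == "__main__":
    for name in ("ex:post", "prov:wasDerivedFrom", "ex:report"):
        print(expand(name))
```

The point is only that the short form and the long form name the same
thing; the prefix is a typing convenience, not part of the identifier.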

Following his example, if I want to assert that the collection 5
"MODIS/Terra Snow Cover 5-Min L2 Swath 500m" dataset (MOD10_L2.5) was
derived from the "MODIS/Terra Calibrated Radiances 5-Min L1B Swath
500m" dataset (MOD02HKM.5), how do I do that?

I could imagine:

http://globalchange.gov/dataset/MOD10_L2.5 prov:wasDerivedFrom
http://globalchange.gov/dataset/MOD02HKM.5

or more succinctly:

----------------------------------------------------------------------
@prefix gcis: <http://globalchange.gov/>.
@prefix prov: <http://www.w3.org/ns/prov-o/>.

gcis:dataset/MOD10_L2.5 prov:wasDerivedFrom gcis:dataset/MOD02HKM.5.
----------------------------------------------------------------------
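
One caveat on the short form: as far as I know, strict Turtle parsers
do not accept an unescaped "/" in the local part of a prefixed name, so
treat the snippet above as shorthand.  A minimal sketch of one way
around that is to write the statement out with full URIs in N-Triples
style.  The helper ntriple() is a made-up name for illustration, and
the globalchange.gov URIs are still only the imagined ones from above.

```python
GCIS = "http://globalchange.gov/"        # imagined prefix from the example above
PROV = "http://www.w3.org/ns/prov-o/"    # prov prefix as used in the example above


def ntriple(subject, predicate, obj):
    """Format one statement whose three terms are all URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


line = ntriple(
    GCIS + "dataset/MOD10_L2.5",
    PROV + "wasDerivedFrom",
    GCIS + "dataset/MOD02HKM.5",
)
print(line)
```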

Note here I'm expressing the provenance at the dataset level.  Within
the processing systems and DAACs, they can use "wasDerivedFrom" to
express a distinct (but compatible) perspective of the same
relationship at a different level of granularity:

----------------------------------------------------------------------
http://some_nsidc_place/granule/MOD10_L2.A2012046.0125.005.2012046083304.hdf
prov:wasDerivedFrom
http://some_modaps_place/granule/MOD02HKM.A2012046.0125.005.2012046081806.hdf
----------------------------------------------------------------------
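
To show how the two levels could line up, here is a rough Python
sketch that maps a granule file name to the dataset-level identifier
used earlier.  It assumes the granule names follow the layout
ShortName.AYYYYDDD.HHMM.CCC.ProductionTime.hdf, as the two examples
above appear to, and that collection "005" corresponds to ".5" in the
dataset identifier.  The function names and the default base URI are
assumptions for illustration only.

```python
def dataset_id(granule_name):
    """Return the dataset-level id (e.g. 'MOD10_L2.5') for a granule file name.

    Assumes the layout ShortName.AYYYYDDD.HHMM.CCC.ProductionTime.hdf.
    """
    parts = granule_name.split(".")
    if len(parts) != 6:
        raise ValueError(f"unexpected granule name layout: {granule_name!r}")
    short_name, _date, _time, collection = parts[:4]
    # int() drops the zero padding, so collection "005" becomes "5".
    return f"{short_name}.{int(collection)}"


def dataset_uri(granule_uri, base="http://globalchange.gov/dataset/"):
    """Map a granule URI to the (imagined) dataset-level URI."""
    file_name = granule_uri.rsplit("/", 1)[-1]
    return base + dataset_id(file_name)


if __name__ == "__main__":
    print(dataset_id("MOD10_L2.A2012046.0125.005.2012046083304.hdf"))
```

With a mapping like this, a granule-level "wasDerivedFrom" statement
can be rolled up into the dataset-level one without the two
perspectives contradicting each other.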

For now (for the GCIS) I'm going to stick to the dataset level -- not
that the other levels aren't very important -- Ruth, Hook, Chris,
etc. are definitely spending time working on those.

I just think we can more easily find some common ground at a higher
level.  (Once we get this solved, we can do the others later... :-)

I also want to label/identify all the other stuff that I need to
express relationships like these:

----------------------------------------------------------------------
gcis:instrument/MODIS "has name" "Moderate Resolution Imaging 
Spectroradiometer"
gcis:instrument/MODIS "associated with" gcis:organization/NASA

gcis:instrument/MODIS/Terra "flew on" gcis:spacecraft/Terra
gcis:instrument/MODIS/Aqua "flew on" gcis:spacecraft/Aqua

gcis:dataset/MOD10_L2.5 "has name" "MODIS/Terra Snow Cover
5-Min L2 Swath 500m"

gcis:dataset/MOD10_L2.5 "has collection" "5"
----------------------------------------------------------------------

Setting aside the prefix for now, is "instrument/{some controlled
namespace identifier for the instrument}" the right way to construct a
URI for an instrument like MODIS?  Should we use "sensor"?

How about "spacecraft"?  Should that be "platform"? or "observatory"?
or some other more generic term?

Then we get into things like

gcis:project/MODIS "has team leader" gcis:person/Michael_King
gcis:person/Michael_King "has name" "Michael King"
gcis:person/Michael_King "affiliated with"
gcis:organization/UniversityOfColorado

Is "person/{full name}" the right way to make a URI for a person? What
do I do if their name changes?  What about organization?

ESIP also needs identifiers for people for some of their
databases.

If they make, for example, an "esipfed:person/Michael_King", then I
(or they, or both) can add an additional:
gcis:person/Michael_King owl:sameAs esipfed:person/Michael_King

It could be that GCIS assigns enough identifiers for things like
people that ESIP Fed could simply adopt and reuse them, and then we
wouldn't need all the "sameAs" links.
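
For what it's worth, consuming "sameAs" links is mostly bookkeeping: a
client has to group all the identifiers that are declared equivalent.
Here is a minimal Python sketch of that grouping using a simple
union-find.  The function name same_as_groups() is hypothetical, and
the identifiers are kept in their prefixed form because no real
"esipfed:" namespace has been defined.

```python
def same_as_groups(links):
    """Group identifiers connected by owl:sameAs links (simple union-find)."""
    parent = {}

    def find(x):
        # Follow parent pointers to the representative of x's group.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the path as we go
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for ident in parent:
        groups.setdefault(find(ident), set()).add(ident)
    return list(groups.values())


if __name__ == "__main__":
    links = [("gcis:person/Michael_King", "esipfed:person/Michael_King")]
    print(same_as_groups(links))
```

Every extra set of identifiers adds more of this bookkeeping for
everyone downstream, which is the practical argument for reusing one
set where possible.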

> On the other hand, one would expect a normal scientific
> investigation to have to be concerned about the uncertainty
> in the data products.

Uncertainty is one aspect of "Information Quality" being
addressed by the ESIP Information Quality cluster:

     http://wiki.esipfed.org/index.php/Information_Quality

> So - when we deal with "linked data products" for Earth sciences,
> are we requiring the data producer to create and document the math
> model and provide evidence that it is a reasonable statement of the
> level of confidence in the measurements recorded in the data
> product?

We're not requiring anything (ESIP is not in a position to require
anything).

The IQ cluster is working on formats, standards, vocabularies,
ontologies, etc. that producers could use to capture and represent
such information.  They may come up with recommendations for data
producers to create/document such things at some point.

Once that effort gets further along, it may be something we want to
include in things like PCCS and work on identifiers/relationships to
other stuff, etc. but I think it is premature for now.  Again, it
isn't that such things aren't important, we just need to limit our
scope for now.

The GSFC/RPI project "Multi-Sensor Data Synergy Advisor"
(http://tw.rpi.edu/web/project/MDSA) is working in this area too.

> There are lots of variants and levels of fulfillment to the GUM
> standard (yes, it's ISO).  What role does the expectation that
> "linked data" will deal with scientific investigations play in the
> kinds of products this approach encourages?

"linked data" just provides a model for representing that data and
says that people who talk about the same thing should refer to it in a
common, standards-based way.  That really boils down to "use URIs as
identifiers for things", and "use the same URI as everybody else when
talking about the same thing" (for both resources and properties). A
very simple concept.

The hard part is the vocabularies for those properties, and coming up
with good identifiers that people will use so we can associate things
in such a way that everyone knows what we are talking about.  That's
where I think this group can contribute for our domain.

Curt