[Esip-preserve] DataCite to require "landing pages"

Curt Tilmes Curt.Tilmes at nasa.gov
Thu Mar 1 13:30:58 EST 2012


On 03/01/2012 12:16 PM, Greg Janée wrote:
> On Feb 29, 2012, at 2:34 PM, Mark A. Parsons wrote:
>> Landing pages should ultimately be both human and machine readable.
>
> I've always thought this would be the best of both worlds, as both
> humans and programmatic clients can then get representations of
> resources that they can do something with.  But we seem to be
> hampered by the lack of an agreed-upon technical approach.  Is this
> something that ESIP might like to look at?  Two approaches that have
> been suggested:
>
> 1. Content negotiation.  Clients use HTTP's Accept header mechanism to
> request a specific representation of a resource: RDF, OAI-ORE, etc.
> DataCite has put together an alpha version of this concept at
> http://data.datacite.org/ .  I don't think there are (yet) any
> guidelines as to what resource types return what information in
> what ways.

More information here:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec12.html
http://www.w3.org/TR/webarch/#def-coneg
http://www.w3.org/QA/2006/02/content_negotiation.html
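
For anyone who would rather try this from a script than a browser, here
is a minimal Python sketch of a content-negotiated request.  It only
builds the request and prints it; nothing is sent.  The URL is a made-up
placeholder and build_negotiated_request is a hypothetical helper name,
not part of any library.

```python
from urllib.request import Request


def build_negotiated_request(url, media_type):
    """Build (but do not send) a GET request asking for one representation."""
    return Request(url, headers={"Accept": media_type})


# Placeholder URL (an assumption for illustration only).
req = build_negotiated_request("http://example.org/dataset/FOO",
                               "application/rdf+xml")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would actually send it;
whether the server honors the Accept header is entirely up to the server.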

> 2. Identifier "inflections".  This is an idea proposed by John Kunze
> at CDL.  A client can request a specific representation by adding a
> syntactic cue to the identifier.  For example, it's already part of
> the ARK specification that appending a question mark (?) to an
> identifier returns metadata; perhaps appending a slash (/) requests
> a "landing page" or other human-oriented experience as opposed to
> the resource directly.


RFC 5988 (http://tools.ietf.org/html/rfc5988) has a way to add an
extra "Link" HTTP header to describe the relationship between URIs.

We could use some special link between a "URI to a web page about some
data" and "URI to retrieve the data".  (Maybe that is already
defined?)  This seems related to some of the discovery work with their
Atom 'link's.
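
As a rough sketch of what such a header could look like, the Python
below formats and parses one simple Link value.  The relation type
"describedby" is one of the types in RFC 5988's initial registry; using
it for the "page about the data" link is only my assumption, not an
agreed practice.  The URL is a placeholder, and format_link/parse_link
are hypothetical helpers that handle only the single-link, single-rel
form.

```python
import re


def format_link(target, rel):
    """Format one Link header value: <target>; rel="relation"."""
    return '<{}>; rel="{}"'.format(target, rel)


def parse_link(value):
    """Parse a single, simple Link value into (target, rel).

    Handles only the one-link, one-rel form produced by format_link.
    """
    match = re.fullmatch(r'<([^>]*)>\s*;\s*rel="([^"]*)"', value.strip())
    if match is None:
        raise ValueError("not a simple Link value: " + value)
    return match.group(1), match.group(2)


# Placeholder URL for a page describing some data (assumption).
header = format_link("http://somewhere/instrument/FOO.html", "describedby")
print("Link:", header)
print(parse_link(header))
```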


One issue people talk about a lot with identifiers and landing pages
is distinguishing the URI for a thing and the URI for a web page of
information about a thing.

Once you start asserting facts, you have things like:

"FOO Instrument" "created by" "Fred Smith"
"FOO Instrument" "created on" "1999"
"FOO Instrument Web Page" "created by" "Jane Doe"
"FOO Instrument Web Page" "created on" "2012"

so you need two distinct identifiers (URIs) for those two distinct
things.
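
To make that concrete, here is a small Python sketch that stores the
four statements as (subject, predicate, object) tuples.  The two URIs
are placeholders and facts_about is a hypothetical helper; plain tuples
stand in for a real RDF library.

```python
# Placeholder URIs: one for the instrument itself, one for the page about it.
INSTRUMENT = "http://somewhere/instrument/FOO"
WEB_PAGE = "http://somewhere/instrument/FOO.html"

triples = [
    (INSTRUMENT, "created by", "Fred Smith"),
    (INSTRUMENT, "created on", "1999"),
    (WEB_PAGE, "created by", "Jane Doe"),
    (WEB_PAGE, "created on", "2012"),
]


def facts_about(subject):
    """Return {predicate: object} for every triple with this subject."""
    return {p: o for s, p, o in triples if s == subject}


print(facts_about(INSTRUMENT))
print(facts_about(WEB_PAGE))
```

If both things shared one identifier, the "created by" and "created on"
statements would collide and you could no longer tell who made what.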


Suppose I use this identifier for the FOO Instrument:

     http://somewhere/instrument/FOO

and I use this identifier for the web page about FOO:

     http://somewhere/instrument/FOO.html

Of course when I resolve the former, I still want to see the landing
page of information.

Some people simply redirect from http://somewhere/instrument/FOO ->
http://somewhere/instrument/FOO.html
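
A toy version of that redirect, as a Python function rather than a real
web server.  The 303 status code is my assumption (it is a common choice
in linked data setups); nothing above says which redirect code people
actually use.  The resolve function and the page body are made up for
illustration.

```python
def resolve(path):
    """Return (status, headers, body) for a request path.

    The bare identifier redirects; the ".html" URL serves the page itself.
    """
    if path == "/instrument/FOO":
        # 303 See Other is an assumption; the text only says "redirect".
        return 303, {"Location": "http://somewhere/instrument/FOO.html"}, ""
    if path == "/instrument/FOO.html":
        return 200, {"Content-Type": "text/html"}, "<html>About FOO</html>"
    return 404, {}, ""


print(resolve("/instrument/FOO"))
print(resolve("/instrument/FOO.html"))
```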


I like the visual distinction of a redirect using two distinct URLs,
rather than simply returning different information based on the Accept
header (which we could also do).  It's also easier for me to test/try
things out by adding ".html" or ".rdf" into the URL line on my web
browser than it is to play with changing the Accept header.


Ok, suppose I have a dataset (uh, I mean structured collection of
data), identified by

    doi:10.001/FOOL1B.v001

which I can map to a useful URI:

    http://dx.doi.org/10.001/FOOL1B.v001
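
That mapping is just a prefix swap.  A minimal Python sketch, where
doi_to_url is a hypothetical helper and no escaping of unusual
characters in the DOI is attempted:

```python
def doi_to_url(identifier, resolver="http://dx.doi.org/"):
    """Map 'doi:10.001/FOOL1B.v001' to a resolvable HTTP URI."""
    prefix = "doi:"
    if not identifier.lower().startswith(prefix):
        raise ValueError("expected an identifier starting with 'doi:'")
    # Drop the "doi:" scheme and append the rest to the resolver base.
    return resolver + identifier[len(prefix):]


print(doi_to_url("doi:10.001/FOOL1B.v001"))
```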

Whenever someone resolves that URI, I want to give them something
useful.  The question is what do I give them?  If FOOL1B.v001 is a
collection of dozens (hundreds? thousands?) of granules, it doesn't
really make sense to point directly to a single one of them.

DOI allows us to point that DOI to anywhere we like.  Best (uh, good?
recommended?) practice seems to be to point it to some sort of landing
page, perhaps:

     http://some.data.center/FOO/FOOL1B/FOOL1B.v001.html

On that page, you get all the information about the dataset, and perhaps
links to the actual data, or at least an ordering interface.
(e.g. http://nsidc.org/data/mod10_l2v5.html)


Now think about an RDF representation of information about that
dataset.  There is structured information on that page, so we should
be able to express that information using something like RDF (or XML,
JSON, etc.) and get it back directly via content negotiation.


If we resolved the URI directly ourselves, it's pretty easy: you just
redirect to the RDF page about that URI.  But using DOI, it seems like
the redirection happens before it hits our page, right?

(How does the dx.doi.org resolver work?  Is there a way to log
multiple redirects with them?  Or does it happen at a level above the
specific DOI, so my own resolver can get in there?)


Anyway, it seems to me that if you requested

    http://dx.doi.org/10.001/FOOL1B.v001

with

    Accept: application/rdf+xml

doi.org would still redirect to

    http://some.data.center/FOO/FOOL1B/FOOL1B.v001.html

(Is that right?)

Which you would again request, still with the RDF Accept, and the web
server at some.data.center could then redirect you to

    http://some.data.center/FOO/FOOL1B/FOOL1B.v001.rdf

and dump out that structured information in RDF.
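
Here is the chain I have in mind, simulated in plain Python with no
network access.  Everything about the servers' behavior is assumed: that
the DOI resolver ignores Accept and always redirects to the landing
page, that the data center redirects RDF clients, and the particular
status codes.  respond and follow are hypothetical names.

```python
DOI_URL = "http://dx.doi.org/10.001/FOOL1B.v001"
LANDING = "http://some.data.center/FOO/FOOL1B/FOOL1B.v001.html"
RDF_DOC = "http://some.data.center/FOO/FOOL1B/FOOL1B.v001.rdf"


def respond(url, accept):
    """Return (status, location_or_None) for one simulated request."""
    if url == DOI_URL:
        # Assumed: the DOI resolver ignores Accept and always redirects.
        return 302, LANDING
    if url == LANDING and accept == "application/rdf+xml":
        # Assumed: the data center redirects RDF clients to the RDF document.
        return 303, RDF_DOC
    if url in (LANDING, RDF_DOC):
        return 200, None
    return 404, None


def follow(url, accept, max_hops=5):
    """Follow redirects, resending the same Accept header on every hop."""
    chain = [url]
    for _ in range(max_hops):
        status, location = respond(url, accept)
        if location is None:
            return status, chain
        url = location
        chain.append(url)
    raise RuntimeError("too many redirects")


print(follow(DOI_URL, "application/rdf+xml"))
print(follow(DOI_URL, "text/html"))
```

With the RDF Accept the chain ends at the ".rdf" URL; with text/html it
stops at the landing page.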


Now look back at the instrument identifiers above, where

http://somewhere/instrument/FOO is re-directed to the landing page
http://somewhere/instrument/FOO.html

What URL gives me RDF information about FOO, and what gives me RDF
information about the web page of information about FOO?

What comes from URL http://somewhere/instrument/FOO.rdf?


I also like the 'inflections' described above, especially with
the multiple 'layers' of aggregation (like Ruth has been working with)
-- I think we need consistent, algorithmic ways to express those
aggregations clearly.


Sorry about the rambling... there is a lot here we can discuss.  There
are so many different ways to do some of these things, I'm kind of
struggling to figure out the best ways for our domain.  I could
definitely use some guidance and think this is a place ESIP could come
together and capture some guidelines/recommended practices.

Curt


-- 
Curt Tilmes, Ph.D.
U.S. Global Change Research Program
1717 Pennsylvania Avenue NW, Suite 250
Washington, D.C. 20006, USA

+1 202-419-3479 (office)
+1 443-987-6228 (cell)


