[Esip-preserve] Fwd: Call for review of GBIF LSID-GUID task group report

Alice Barkstrom alicebarkstrom at verizon.net
Wed Aug 19 09:36:30 EDT 2009

OK - here are some notes from reading this at my breakfast session with the
Internet:

Note on Identifiers

Character Strings: ? Bar codes on physical objects; Fingerprints/Retinal scans
for people

Persistence: Uniqueness and non-reusability.  This leads to an archive policy
of never deleting items once they're placed in the archive.  However, if an
archive discovers that the original stored object was erroneous, it may be
entirely appropriate to remove that artifact and substitute a correct one.
Likewise, we may want to deal with transformational migration - recognizing
that the transformed object is practically the same under some operations.
Another complication arises with replication.  Is a digital copy of an object
that is stored in a different place going to need a new identifier?
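
To make the replication question concrete, here is a minimal sketch (Python is
my choice, not anything the report prescribes) of one possible answer: derive
the identifier from the bytes themselves, so that location plays no part.  The
record contents, the storage paths, and the function name content_id are all
made up for illustration.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Return a location-independent identifier derived from the bytes themselves."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

original = b"specimen record 42: Quercus velutina"
replica = bytes(original)                      # byte-identical copy held elsewhere
corrected = b"specimen record 42: Quercus alba"

# Hypothetical storage locations; only the content decides the identifier.
holdings = {
    "archive-a/records/42": original,
    "archive-b/mirror/42": replica,
}
ids = {location: content_id(data) for location, data in holdings.items()}

# Both copies share one identifier; the corrected object gets a different one.
same_object = len(set(ids.values())) == 1
was_replaced = content_id(corrected) != content_id(original)
print(same_object, was_replaced)
```

Under this scheme a replica needs no new identifier, but a substituted
(corrected) object always does, and so does a transformationally migrated one,
even when it is "practically the same".  That may or may not be the behavior an
archive wants.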

Model of Information Flow [p. 5ff]  The model suggested identifies the primary
flow as running from producers to users and back, with the assumption being
that the primary mechanism for feedback is e-mail.  To me this sounds like an
approach that assumes that the producers remain an active group.  So - what
happens when we introduce two additional communities:
1.  The archives become active recipients of feedback and adjust their
presentation of material to feedback from communities outside the original
producer.  This is a bit like what happens to art (paintings, primarily), in
which a shift in the art critics' opinion that suggests a particular painting
is fake produces a change in the display.  In the biodiversity community, this
might happen (I suspect) if there is a disputed identification of a particular
2.  There is a new group of external reviewers who become institutionalized
and serve as a reference organization outside of the original loop between the
producer and the archive.  In climate, we seem to have this happening with
IPCC.

It may be useful to think about how the information flow model would work if
the items in the collection undergo transformational migration as the
collection changes.  Also, how would the identifiers work when the community
starts building composite works in which a data producer extracts a subset
from data set A and another subset from (a very different) data set B, applies
some complex process to produce data set C and then places C in an archive.  C
would seem to deserve a new identifier.  However, I've been the recipient (or
victim) of groups that would take data, fill in unobserved parts of the
records and "republish" this as the same data.  Or how about an extension of
the LOCKSS multi-copy replication approach?

Section 5.1  More complex scenarios are possible - particularly if user
behavior is tracked.  It is possible to create an abstract model of user
ordering behavior in which the choices made by a user in selecting a
particular object are equivalent to characters, so that the search path a user
navigates becomes a string.  It would be possible to cluster strings and
correlate them with what users actually order, so as to be able to evolve more
efficient search approaches that actually learn from user behavior.  This is
similar to what Google and other search engines do - although privacy concerns
make this a rather interesting prospect.
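
A minimal sketch of the string idea, in Python.  The alphabet of choices, the
sample sessions, the similarity threshold, and the greedy clustering rule are
all assumptions made up for illustration; a real system would need a much
better clustering method and real logs.

```python
from difflib import SequenceMatcher

# Hypothetical alphabet: one character per kind of choice a user can make.
CHOICES = {"keyword": "k", "map": "m", "date": "d", "taxon": "t", "order": "o"}

def encode(path):
    """Turn a list of navigation choices into a string."""
    return "".join(CHOICES[step] for step in path)

def similarity(a, b):
    """Similarity in [0, 1] based on matching subsequences of characters."""
    return SequenceMatcher(None, a, b).ratio()

def cluster(strings, threshold=0.6):
    """Greedy clustering: join the first cluster whose first member is close enough."""
    clusters = []
    for s in strings:
        for group in clusters:
            if similarity(s, group[0]) >= threshold:
                group.append(s)
                break
        else:
            clusters.append([s])
    return clusters

# Made-up sessions: two keyword/taxon searches and two map/date browses.
sessions = [
    ["keyword", "taxon", "order"],
    ["keyword", "taxon", "date", "order"],
    ["map", "date", "map", "date"],
    ["map", "date", "map"],
]
paths = [encode(s) for s in sessions]
groups = cluster(paths)
print(groups)
```

Correlating each cluster with whether its paths end in an order ("o" here) is
the step that would let a search interface learn which routes actually lead
users to data.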

Section 5.2  In a broader sense, the concerns here are similar to those in the
humanities and lead to fairly complex collections of annotations (this word is
a synonym for that word, but only sounds the same as another).  It might be
interesting to check up on humanities annotation languages (yet another
example of the malleability of XML).

Section 5.3  The diagram appears to leave out the ties of biodiversity,
physical climate, underlying biogeoCHEMICAL information (e.g. soil type), and
evolutionary history.  In addition, we are headed into linkages with economics
and regional economic flows that evolve structurally over time.  The data
sources for weather and oceanography are much larger and change at a much
higher frequency than does the distribution of species and individual plants
and animals in an ecosystem.  Such ties are going to become much more
important as the climate changes rapidly.  An interesting work along this line
is "Vegetation of Wisconsin", which I believe is a classic that used some
historical information (e.g. "this corner of the property is located 20 feet N
by NW of a large black oak") to reconstruct the pre-human ecosystem of the
state.

Always treat assumptions about "economies of scale" with many grains of salt.
How much human effort is it going to take to link bioinformatics with physical
climate data?  Would that human effort be available after the assessment work
(think IPCC) is done?  This assertion could use a real economic model that
could lay out why the authors think that the foreseeable benefits of the
linkages outweigh the costs of developing them.  [Of course, there will always
be a few individuals who will ignore cost/benefit stuff because it's
interesting.  Some of them discover new worlds and become famous - or maybe
rich.  Many fail.  Some guidance on judicious choices in this area would be
useful.]

Section 5.4  The model here is very tied to current technologies.  Having lived
through the outbreak of the Web and the extremely rapid evolution from http
through XML and on to RDF, I wonder if this description would look the same
in ten years.  Of course the current technology needs to support the initial
exploration of ideas to give us a notion of what's needed.  However, it would
be wise to avoid casting the statement of "Requirements" in terms of ideas
that will be outmoded in the foreseeable future.

It would be helpful to see if the "Requirements" could be formulated in a more
abstract form.  For example, RDF is a useful representation of links in a
mathematical graph.  RDF is probably easier to store in Triple Store data
tools than in relational databases - and will work better.  What happens if
Triple Stores never take off and become commercially viable?  Or, what happens
if Object-Relational databases win commercially?  Will those kinds of
commercial product issues break the "Requirements" in Section 5.4?
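
As a rough illustration of what a more abstract formulation might look like,
here is a small Python sketch in which the requirement is simply "a labeled,
directed graph" and the storage technology is interchangeable.  The
identifiers, predicate names, and the table name are made up, and this is only
a toy: it says nothing about how either approach performs at scale.

```python
import sqlite3

# The abstract requirement: a labeled, directed graph, written as
# (subject, predicate, object).  All identifiers here are hypothetical.
triples = {
    ("specimen:1", "identifiedAs", "taxon:Quercus_velutina"),
    ("specimen:1", "collectedAt", "site:7"),
    ("site:7", "hasSoilType", "soil:loam"),
}

def objects_in_memory(subject, predicate):
    """Follow one kind of link using nothing but the set of triples."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

# The same graph held in an ordinary relational table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE link (s TEXT, p TEXT, o TEXT)")
db.executemany("INSERT INTO link VALUES (?, ?, ?)", sorted(triples))

def objects_in_sql(subject, predicate):
    """Follow the same link with a plain SQL query."""
    rows = db.execute(
        "SELECT o FROM link WHERE s = ? AND p = ? ORDER BY o",
        (subject, predicate),
    )
    return [row[0] for row in rows]

print(objects_in_memory("site:7", "hasSoilType"))
print(objects_in_sql("site:7", "hasSoilType"))
```

Both functions answer the same question about the same graph, which is the
sense in which a requirement stated on the graph would survive a change in the
commercial product underneath it.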

Section 6.  Only two kinds of identifiers.  This needs to expand some.
Generically, there appear to be three basic families:
a.  GUIDs and UUIDs - randomly assigned unique identifiers
b.  DOI/OID - hierarchical strings using digits and a separator (usually a
period)
c.  URLs - hierarchical strings that use a wider range of characters and
a separator (including PURLs, ARK, and so on).
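
The three families are easy enough to tell apart mechanically.  Here is a
rough Python sketch; the function name, the labels, and the rules are my own
simplifications for illustration.  In particular it only recognizes the
dotted-digit (OID-style) form for family b and does not try to recognize DOIs,
and it treats only http/https strings as family c.

```python
import re
import uuid
from urllib.parse import urlparse

# Digits separated by periods, e.g. an OID-style string.
DOTTED_DIGITS = re.compile(r"^\d+(\.\d+)+$")

def identifier_family(text: str) -> str:
    """Rough classifier for the three families; the rules are simplifications."""
    try:
        uuid.UUID(text)              # raises ValueError if not a well-formed UUID
        return "a: GUID/UUID"
    except ValueError:
        pass
    if DOTTED_DIGITS.match(text):
        return "b: dotted numeric (OID-style)"
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "c: URL"
    return "unclassified"

# Sample strings chosen only to exercise each branch.
for sample in (
    "123e4567-e89b-12d3-a456-426614174000",
    "1.3.6.1.4.1",
    "http://purl.org/dc/terms/",
    "hello",
):
    print(sample, "->", identifier_family(sample))
```

Being able to sort identifiers into families says nothing, of course, about
which family suits the bioinformatics problems.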

Are any of these inapplicable to the bioinformatics problems?  Are any of them
particularly advantageous?  If they are advantageous, why?

Section 7 [The advertising section]

Recommendation 1 sounds like an appeal to the funders.  At my age, such
appeals for "leadership" are the clubs different gangs take into their street
rumbles.  I think I've heard similar statements from NASA and NOAA about
"climate" for years.  Don't have enough background to know who the
competition is.

On recommendation 2, "branding" is likely to lead to some interesting struggles
between the GBIF community and various agency managers who want to put
their logos on presentations.

On recommendation 3 - clearly the group writing these recommendations believes
RDF is a primary tool.  One issue that arises in the more general community,
where the linkages with physical climate and soils are quite important, is
whether or not the very large volumes of data and metadata are going to
transition to RDF and, if so, how and when.  The NASA EOS repositories, for
example, have metadata designs that have about a decade or two of
institutional inertia.  NOAA's archives and data centers are more fragmented,
but also have a similar time scale of change.  It might be helpful to create a
somewhat more detailed plan to show how this community proposes to interact
with the evolution of these institutional resources.

A key issue in the Recommendation that follows is whether GBIF intends
to act in a centralized or a decentralized manner.  The emphasis on "GBIF"
in this sounds like the authors are visualizing what they're going to do as
creating an organization that will provide services in a centralized manner.
Would the recommendations look different if they were to be created by
a federation of institutions?  The LOCKSS model for information preservation
is predicated on the notion that resources are the most fragile part of any
effort.  Distributed governance helps make the structure less fragile and
often reduces costs.  Recommendation 14 does recognize these influences.
Looks to me like an unresolved issue is present in terms of a funding or
business model.  Recommendation 14 is stated passively, by the way,
indicating no responsible subject for the verb "to be funded".  Interesting

These are not edited for judiciousness.  Take the comments with however
much salt you want.

Bruce B.

At 09:39 PM 8/18/2009, Wilson, Bruce E. wrote:
>For information and comment back to GBIF.
>Bruce E. Wilson (wilsonbe at ornl.gov)
>Environmental Sciences Division
>Oak Ridge National Laboratory
>(office) +1-865-574-6651
>---------- Forwarded message ----------
>From: Louise Scharff (GBIF) <lscharff at gbif.org>
>Date: Tue, Aug 18, 2009 at 3:34 AM
>Subject: Call for review of GBIF LSID-GUID task group report
>To: gb.gbif at ig.circa.gbif.net, 
>Garry.Jolley-Rogers at csiro.au, pdevries at wisc.edu, 
>katharina.schleidt at umweltbundesamt.at,
>wlcw at tfri.gov.tw, rschenk at mbl.edu, 
>palanisamyg at ornl.gov, ahampson at usgs.gov, 
>Orrellt at si.edu, deepreef at bishopmuseum.org, 
>Donald.Hobern at csiro.au, jones at nceas.ucsb.edu, 
>hlapp at nescent.org, n.paskin at doi.org
>Cc: "Eamonn O Tuama (GBIF)" <eotuama at gbif.org>, lgtg at lists.gbif.org
>Dear Participants and Friends of GBIF,
>The LSID-GUID Task Group (LGTG) was convened by GBIF earlier this year
>(http://www.gbif.org/News/NEWS1243443852) to provide advice on persistent
>identifiers for biodiversity informatics. The principal objective of the
>group is to provide recommendations and guidelines on deployment of Life
>Science Identifiers (LSIDs) and other Globally Unique Identifiers (GUIDs) on
>the GBIF network with particular reference to the potential role of GBIF as a
>stable, long term provider of GUID resolution services. GBIF recognises that
>input by Participants is an essential and valuable part of the process both
>in nominating members of the task group and in providing feedback to output
>of the task group.
>The LGTG has now released a draft of their report (attached). You are invited
>to review the draft document and provide feedback to the task group on or
>before the 7th September. Please send your comments/input directly to the
>task group mailing list (lgtg at lists.gbif.org) (note: you do not need to
>subscribe to the list).
>Yours sincerely,
>(on behalf of the LGTG members)
>Éamonn Ó Tuama
>Éamonn Ó Tuama, M.Sc., Ph.D. (eotuama at gbif.org),
>Senior Programme Officer, Inventory, Discovery, Access (IDA),
>Global Biodiversity Information Facility Secretariat,
>Universitetsparken 15, DK-2100 Copenhagen Ø, DENMARK
>Phone:  +45 3532 1494; Fax:  +45 3532 1480
>Louise Scharff
>Programme Assistant
>GBIF Secretariat
>Universitetsparken 15
>DK-2100 Copenhagen
>Tel.: +45 35 32 14 84
>Fax: +45 35 32 14 80
>E-mail: lscharff at gbif.org
>URL: www.gbif.org
>Matthew B. Jones
>Director of Informatics Research and Development 
>National Center for Ecological Analysis and Synthesis (NCEAS) UC Santa Barbara
>jones at nceas.ucsb.edu                       Ph: 1-907-523-1960