[Esip-preserve] Fwd: BiSciCol: Sneak Peeks, BiSciCol Style
Curt Tilmes
Curt.Tilmes at nasa.gov
Wed May 22 08:39:02 EDT 2013
Some interesting work...
-------- Original Message --------
Subject: BiSciCol: Sneak Peeks, BiSciCol Style
Date: Tue, 21 May 2013 13:12:11 -0500
From: Steve Aulenbach <saulenbach at usgcrp.gov>
BiSciCol: Sneak Peeks, BiSciCol Style
http://biscicol.blogspot.com/2013/05/sneak-peeks-biscicol-style.html?m=1
Our blog has been quiet lately, as we coded and tested and waited out
the cold, short days of winter and early Spring. With Spring now firmly
here, we are ready to give you the opportunity to directly test some
fruits of that labor. First, a quick review of where we have been.
BiSciCol, and all those interested in bringing biodiversity data into
the semantic web, have been plagued by a chicken-and-egg problem. In
order for the semantic web to be a sensible solution, there needs to be
a way to associate permanent, resolvable globally unique identifiers to
specimens and their metadata. There ALSO needs to be a
community-agreed semantic framework for expressing concepts and how they
link together. You can't move forward without BOTH pieces and
unfortunately the biodiversity community basically has had neither. So
BiSciCol decided to tackle both problems simultaneously.
The solution we developed leverages one thing that was already in place
--- a community developed and agreed-upon biodiversity metadata standard
called the Darwin Core. We talked about how we have leveraged the
Darwin Core in our last blog post, and how we have formalized Darwin
Core "categories" (or classes), and derived relationships between them.
With this piece of the puzzle complete, we now have a working tool
called the Triplifier. The Triplifier takes a Darwin Core Archive,
which contains some self-describing metadata about the document along
with data, and converts those data to RDF. Darwin Core Archives are
particularly useful because all the data in such archives is already in
a standard form.
Darwin Core Archives are available for download from sources such as the
VertNet IPT (http://ipt.vertnet.org/) or the Canadensys IPT
(http://data.canadensys.net/ipt/). Just download any Darwin Core Archive
you want and load its occurrence.txt file into the Triplifier. We have
not yet deployed it to production, but you can try out the development
server (http://geomuseblade.colorado.edu/triplifier/). Use the
"File Upload:" link, click the "auto-generate project for" link, and
select Darwin Core Archive. Load the file, get information about class
and property structures, and then click "Get Triples" at the very end.
You should then be able to save the RDF.
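To make the idea of "triplifying" concrete, here is a minimal Python sketch that turns rows of a tab-delimited occurrence.txt file into N-Triples. It is not the Triplifier's actual algorithm and ignores the class relationships described above. It assumes, for illustration only, that every column header is a Darwin Core term name, that the file is unquoted and tab-delimited, and that subjects are minted from a hypothetical base URI plus the occurrenceID.

```python
import csv
import io
from urllib.parse import quote

DWC = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core terms namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value: str) -> str:
    """Escape a string so it can be written as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r")
                 .replace("\t", "\\t"))


def occurrence_rows_to_ntriples(tsv_text: str, base_uri: str) -> list:
    """Convert tab-delimited occurrence rows to a list of N-Triples lines.

    Assumption: each column header is treated as a Darwin Core term name.
    Rows without an occurrenceID are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    triples = []
    for row in reader:
        occurrence_id = row.get("occurrenceID")
        if not occurrence_id:
            continue
        # base_uri is a hypothetical prefix, not a BiSciCol convention.
        subject = f"<{base_uri}{quote(occurrence_id, safe='')}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{DWC}Occurrence> .")
        for column, value in row.items():
            # Skip the ID column, empty cells, and overflow cells (key None).
            if column is None or column == "occurrenceID" or not value:
                continue
            triples.append(
                f'{subject} <{DWC}{column}> "{escape_literal(value)}" .')
    return triples


if __name__ == "__main__":
    sample = "occurrenceID\tscientificName\tcountry\nocc-1\tPuma concolor\t\n"
    for line in occurrence_rows_to_ntriples(sample, "http://example.org/occ/"):
        print(line)
```

Empty cells produce no triple, so a sparse record yields only the statements it actually supports.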
So what does this all mean? First, this is a working tool for creating
Darwin Core data in RDF format. It may not be perfect yet, but it's been
stress tested, and it does the job. This is a big step forward in our
opinion. We are currently Triplifying a lot of Darwin Core Archives and
putting all the results into a data store for querying. Next blog post,
we'll explain how valuable this can be, especially when looking for
digital objects linked to specimens, such as published literature, or
gene sequences.
The other part of the chicken-egg problem is this persistent, and
challenging, GUID problem. Here we also have a working prototype of a
service we are calling BCIDs, which are a form of identifier that is
scalable, persistent, and leverages community standards. BCIDs are a
form of EZIDs with a couple of small tweaks to work for our community at
scale. The design represents a lot of hard thinking by John Kunze and John
Deck. Here is the general idea: The BCID Resolution system resolves
BCID identifiers that are passed through the Name-to-thing resolver
(http://n2t.net/). All BCID group identifiers are registered with EZID,
describing related categories of information such as Collecting Event,
Occurrence, or Tissue. EZID then uses its suffix passthrough feature to
pass the suffix back to the BCID resolver. At this point, a series of
decisions <https://code.google.com/p/bcid/wiki/BCIDResolutionSystem> are
made based on the identifier syntax to determine how to display returned
content. Element-level identifiers that have a registered suffix in the
BCID system and a defined target can be resolved to a user-specified
homepage. Identifiers with an unregistered suffix or no defined target,
and any identifier for which machine resolution is specifically
requested, return an HTML rendering of the identifier with embedded
RDF/XML describing it. Machine resolution can be requested for any
identifier by appending a "?" to it. See the diagram below for extra
clarity, and check out the BCID home page <http://biscicol.org/bcid/> and
BCID code page <http://code.google.com/p/bcid/>.
<https://bcid.googlecode.com/svn/trunk/web/documents/BCID_flowChart.jpg>
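The resolution decisions described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this post, not the BCID resolver's code; the names Resolution and resolve, and the "redirect"/"describe" labels, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    action: str                     # "redirect" or "describe" (hypothetical labels)
    location: Optional[str] = None  # target URL when action is "redirect"


def resolve(suffix_registered: bool,
            target: Optional[str],
            machine_requested: bool) -> Resolution:
    """Decide how to answer a resolution request.

    - A machine request (trailing "?") always gets the metadata description.
    - A registered suffix with a defined target redirects to that target.
    - Everything else falls back to the metadata description.
    """
    if machine_requested:
        return Resolution("describe")
    if suffix_registered and target:
        return Resolution("redirect", target)
    return Resolution("describe")


if __name__ == "__main__":
    print(resolve(True, "http://example.org/event/1", False))
    print(resolve(False, None, False))
```

Checking the machine request first means the "?" convention behaves the same way whether or not a target exists.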
How does this all work in practice? Suppose we have group ID =
ark:/21547/Et2 (resource=dwc:Event) and do not register any elements.
Now, suppose someone passes in a resolution request for
ark:/21547/Et2_UUID; the system will still tell you that this is some
event (dwc:Event), the date it was loaded, a title, and whether there is
a DOI/ARK associated with it. Now, suppose we decide to register the
UUIDs associated with ark:/21547/Et2 and also provide web pages with
some HTML content to look at (targets). Then we can show a nicely
formatted, human-readable HTML page for the collecting event itself.
However, what if we're a machine and we don't want to look at all the
style sheets and extraneous, difficult-to-parse text? We just want to
know when this record was loaded and the resourceType (regardless of
whether there is a target). This is where "?" comes in: if a "?" is
appended to the end of the ARK, as in ark:/21547/Et2_UUID?, then we
automatically get RDF/XML. Minimalist but predictable, and a convention
currently in use for EZIDs.
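As a rough sketch of how such a request might be taken apart, the Python below strips a trailing "?" and matches the rest against the registered group identifiers, treating whatever follows the group as the pass-through suffix. The longest-prefix matching, the ParsedId type, and the example suffix are assumptions made for illustration; the post does not specify how the resolver parses identifiers.

```python
from typing import Iterable, NamedTuple, Optional


class ParsedId(NamedTuple):
    group: str               # registered group identifier
    suffix: str              # everything after the group (may be empty)
    machine_requested: bool  # True when the request ended with "?"


def parse_identifier(identifier: str,
                     registered_groups: Iterable[str]) -> Optional[ParsedId]:
    """Split a request into group, suffix, and machine-resolution flag.

    Assumption: the group is the longest registered identifier that is a
    prefix of the request. Returns None when no group matches.
    """
    machine_requested = identifier.endswith("?")
    bare = identifier[:-1] if machine_requested else identifier
    matches = [g for g in registered_groups if bare.startswith(g)]
    if not matches:
        return None
    group = max(matches, key=len)
    return ParsedId(group, bare[len(group):], machine_requested)


# Group from the example above; the resource type is known at group level,
# so it can be reported even when the suffix itself was never registered.
GROUP_TYPES = {"ark:/21547/Et2": "dwc:Event"}

if __name__ == "__main__":
    parsed = parse_identifier("ark:/21547/Et2_abc123?", GROUP_TYPES)
    if parsed is not None:
        print(parsed, GROUP_TYPES[parsed.group])
```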
Soon you will be able to call the BCID service for any dataset, whether
it's in RDF format or not. For datasets, one can register an ARK or DOI
and associated metadata, and for more granular elements, BCIDs will help
assign the pass-through suffixes. We think this represents a very
elegant system for dealing with the very challenging problem of GUIDs in
the biodiversity informatics community. It leverages existing tools and
communities and it creates new ones needed for those involved in
biocollections. If you want to try creating and using BCIDs now, talk
to us and we'll work with you to get this started.
We will be presenting more about BiSciCol in meetings this Summer, at
iEvoBio (http://ievobio.org/) and TDWG
(http://www.tdwg.org/conference2013), showing off what amounts to
solutions that cover those chickens and eggs. In the next post we'll
finally link all of this up and show how it can be used for some neat
discoveries. Before winding down, BiSciCol owes a gigantic thanks to
Brian Stucky who has put in a tremendous amount of effort developing the
Triplifier. He is off in Panama working on his dissertation research,
and will be teaching classes next Fall. We couldn't have come nearly as
far as we have without him.
- Rob Guralnick, Nico Cellinese, Tom Conlin, John Deck, and Brian Stucky