[Esip-preserve] Fwd: BiSciCol: Sneak Peeks, BiSciCol Style

Wed May 22 08:39:02 EDT 2013

Some interesting work...

-------- Original Message --------
Subject: 	BiSciCol: Sneak Peeks, BiSciCol Style
Date: 	Tue, 21 May 2013 13:12:11 -0500
From: 	Steve Aulenbach <saulenbach at usgcrp.gov>

BiSciCol: Sneak Peeks, BiSciCol Style
http://biscicol.blogspot.com/2013/05/sneak-peeks-biscicol-style.html?m=1

  Sneak Peeks, BiSciCol Style

Our blog has been quiet lately, as we coded and tested and waited out 
the cold, short days of winter and early Spring.  With Spring now firmly 
here, we are ready to give you the opportunity to directly test some 
fruits of that labor.  First, a quick review of where we have been.
BiSciCol, along with everyone else interested in bringing biodiversity
data into the semantic web, has been plagued by a chicken-and-egg
problem.  For the semantic web to be a sensible solution, there needs to
be a way to associate permanent, resolvable, globally unique identifiers
with specimens and their metadata.  There ALSO needs to be a
community-agreed semantic framework for expressing concepts and how they
link together.  You can't move forward without BOTH pieces, and
unfortunately the biodiversity community has basically had neither.  So
BiSciCol decided to tackle both problems simultaneously.

The solution we developed leverages one thing that was already in place:
a community-developed and agreed-upon biodiversity metadata standard
called the Darwin Core.  In our last blog post we talked about how we
have leveraged the Darwin Core, formalized Darwin Core "categories" (or
classes), and derived relationships between them.  With this piece of
the puzzle complete, we now have a working tool called the Triplifier.
The Triplifier takes a Darwin Core Archive, which contains the data
along with some self-describing metadata about the document, and
converts those data to RDF.  Darwin Core Archives are particularly
useful because all the data in such archives are already in a standard
form.

Darwin Core Archives are available for download from sources such as the
VertNet IPT (http://ipt.vertnet.org) or the Canadensys IPT
(http://data.canadensys.net/ipt/).  Just download any Darwin Core
Archive you want and load its occurrence.txt file into the Triplifier.
We have yet to deploy the Triplifier to production, but you can try out
the development server here:
http://geomuseblade.colorado.edu/triplifier/
Use the "File Upload:" link, click the "auto-generate project for" link,
and select Darwin Core Archive.  Load the file, get information about
class and property structures, and then click "Get Triples" at the very
end.  You should then be able to save the RDF.
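
To give a feel for what "converting occurrence.txt to RDF" means, here
is a minimal Python sketch.  It is NOT the Triplifier's code or its
mapping.  It assumes a tab-delimited occurrence.txt with an occurrenceID
(or id) column, treats every column name as a Darwin Core term (a
simplification, since real archives also carry terms from other
vocabularies), and uses a made-up base URI (http://example.org/...) for
subjects.  The function names are hypothetical.

```python
import csv
import io
from urllib.parse import quote

DWC = "http://rs.tdwg.org/dwc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape the characters that N-Triples string literals must not contain raw."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def occurrence_rows_to_ntriples(text, base_uri="http://example.org/occurrence/"):
    """Turn tab-delimited occurrence rows into a list of N-Triples lines.

    Assumption: each column header is a Darwin Core term name, and
    base_uri is a placeholder, not a real identifier scheme.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = []
    for row in reader:
        record_id = row.get("occurrenceID") or row.get("id")
        if not record_id:
            continue  # no usable identifier, so no subject to hang triples on
        subject = f"<{base_uri}{quote(record_id, safe='')}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{DWC}Occurrence> .")
        for column, value in row.items():
            # Skip identifier columns, surplus cells (column is None), and blanks.
            if column is None or column in ("id", "occurrenceID") or not value:
                continue
            lines.append(f'{subject} <{DWC}{column}> "{escape_literal(value)}" .')
    return lines


sample = "occurrenceID\tscientificName\tcountry\nocc-1\tPuma concolor\tPanama\n"
for triple in occurrence_rows_to_ntriples(sample):
    print(triple)
```

Each row becomes one subject typed as dwc:Occurrence, plus one literal
triple per non-empty column.  The real tool also handles the class and
property structures mentioned above, which this sketch leaves out.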

So what does this all mean?  First, this is a working tool for creating
Darwin Core data in RDF format.  It may not be perfect yet, but it's been
stress tested, and it does the job.  This is a big step forward in our
opinion.  We are currently Triplifying a lot of Darwin Core Archives and
putting all the results into a data store for querying.  In the next
blog post, we'll explain how valuable this can be, especially when
looking for digital objects linked to specimens, such as published
literature or gene sequences.

The other part of the chicken-and-egg problem is the persistent and
challenging GUID problem.  Here we also have a working prototype of a
service we are calling BCIDs, a form of identifier that is scalable and
persistent and that leverages community standards.  BCIDs are a form of
EZIDs with a couple of small tweaks to work for our community at scale.
The design represents a lot of hard thinking by John Kunze and John
Deck.  Here is the general idea: The BCID Resolution system resolves
BCID identifiers that are passed through the Name-to-Thing resolver
(http://n2t.net/). All BCID group identifiers are registered with EZID,
describing related categories of information such as Collecting Event, 
Occurrence, or Tissue. EZID then uses its suffix passthrough feature to 
pass the suffix back to the BCID resolver. At this point, a series of 
decisions <https://code.google.com/p/bcid/wiki/BCIDResolutionSystem> are 
made based on the identifier syntax to determine how to display returned 
content. Element-level identifiers that have a registered suffix in the
BCID system and a defined target can be resolved to a user-specified
homepage. In all other cases (the suffix is un-registered, no target is
defined for the identifier, or machine resolution is specifically
requested), the resolver returns an HTML rendering of the identifier
with embedded RDF/XML syntax describing the identifier. Machine
resolution can be specifically requested for any identifier by appending
a "?" to the identifier.  See the diagram below for extra clarity.  And
check out the BCID home page <http://biscicol.org/bcid/> and BCID
code page <http://code.google.com/p/bcid/>.
<https://bcid.googlecode.com/svn/trunk/web/documents/BCID_flowChart.jpg>
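
The decision step described above can be sketched in a few lines of
Python.  This is only an illustration of the rules as stated in this
post, not the BCID resolver's implementation; the Group class, the
decide function, and the example.org targets are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Group:
    """A registered group identifier (hypothetical structure)."""
    resource_type: str  # e.g. "dwc:Event"
    # Registered suffix -> target URL, or None if registered without a target.
    targets: Dict[str, Optional[str]] = field(default_factory=dict)


def decide(group: Group, suffix: str, machine_requested: bool) -> Tuple[str, str]:
    """Return ("redirect", url) or ("describe", resource_type).

    Redirect only when the suffix is registered, has a target, and the
    caller did not ask for machine resolution.  Everything else gets the
    description (the HTML rendering with embedded RDF/XML).
    """
    target = group.targets.get(suffix)  # None if unregistered or no target
    if target and not machine_requested:
        return ("redirect", target)
    return ("describe", group.resource_type)


events = Group("dwc:Event", {"abc": "http://example.org/events/abc", "def": None})
print(decide(events, "abc", False))  # registered suffix with a target
print(decide(events, "abc", True))   # machine resolution requested
print(decide(events, "def", False))  # registered, but no target
print(decide(events, "zzz", False))  # un-registered suffix
```

Only the first call redirects; the other three fall through to the
description, matching the three fallback cases listed above.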

How does this all work in practice?  Suppose we have group ID =
ark:/21547/Et2 (resource=dwc:Event) and do not register any elements.
Now, suppose someone passes in a resolution request for
ark:/21547/Et2_UUID; the system will still tell you that this is some
event (dwc:Event), the date it was loaded, a title, and whether there is
a DOI/ARK associated with it.  Now, suppose we decide to register those
UUIDs associated with ark:/21547/Et2 and also provide web pages that
have some HTML content to look at (targets); then we can show a nicely
formatted, human-readable HTML page of the collecting event itself.
However, what if we're a machine and we don't want to look at all the
style sheets and extraneous, difficult-to-parse text?  Rather, we just
want to know when this record was loaded and the resourceType
(regardless of whether there is a target).  This is where "?" comes in:
if the "?" is appended to the end of the ark, like
ark:/21547/Et2_UUID?, then we automatically get RDF/XML.  Minimalist but
predictable, and a convention currently in use for EZIDs.
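
For readers who think in code, here is a small Python sketch of how an
identifier like ark:/21547/Et2_UUID? breaks down into its group, its
suffix, and the machine-resolution flag.  It assumes the "_" in the
example is the separator between group and suffix, which is our reading
of the example rather than a documented rule; parse_bcid and ParsedBcid
are hypothetical names.

```python
from typing import NamedTuple


class ParsedBcid(NamedTuple):
    group: str      # e.g. "ark:/21547/Et2"
    suffix: str     # e.g. "UUID"; empty when only the group is given
    machine: bool   # True when a trailing "?" requests machine resolution


def parse_bcid(identifier: str, separator: str = "_") -> ParsedBcid:
    """Split an ARK-style identifier into group, suffix, and "?" flag.

    Assumption: the first occurrence of `separator` divides the group
    identifier from the element-level suffix.
    """
    machine = identifier.endswith("?")
    core = identifier.rstrip("?") if machine else identifier
    if not core.startswith("ark:/"):
        raise ValueError(f"not an ARK identifier: {identifier!r}")
    group, _, suffix = core.partition(separator)
    return ParsedBcid(group, suffix, machine)


print(parse_bcid("ark:/21547/Et2_UUID"))
print(parse_bcid("ark:/21547/Et2_UUID?"))
print(parse_bcid("ark:/21547/Et2"))
```

The first two calls differ only in the machine flag, which is the whole
point of the "?" convention: same identifier, different rendering.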

Soon you will be able to call the BCID service for any dataset, whether
it's in RDF format or not.  For datasets, one can register an ARK or DOI
and associated metadata, and for more granular elements, BCIDs will help
assign the pass-through suffixes.  We think this represents a very
elegant system for dealing with the very challenging problem of GUIDs in
the biodiversity informatics community.  It leverages existing tools and
communities, and it creates the new ones needed by those involved in
biocollections.  If you want to try creating and using BCIDs now, talk
to us and we'll work with you to get this started.

We will be presenting more about BiSciCol at meetings this summer, at
iEvoBio (http://ievobio.org/) and TDWG
(http://www.tdwg.org/conference2013), showing off what amounts to
solutions that cover those chickens and eggs.  In the next post we'll
finally link all of this up and show how it can be used for some neat 
discoveries.  Before winding down, BiSciCol owes a gigantic thanks to 
Brian Stucky who has put in a tremendous amount of effort developing the 
Triplifier.  He is off in Panama working on his dissertation research, 
and will be teaching classes next Fall.  We couldn't have come nearly as 
far as we have without him.

- Rob Guralnick, Nico Cellinese, Tom Conlin, John Deck, and Brian Stucky
