[Esip-preserve] Some Thoughts on OPM

Bruce Barkstrom brbarkstrom at gmail.com
Fri Dec 10 16:13:56 EST 2010


Your suggestions may be a start.  My OID hierarchy (or whatever
you might want to transmogrify that into in terms of a hierarchical
html naming scheme) would organize the collection in an archive
starting with the archive identifier (which should probably be registered),
followed by the original data producer categories, followed by the
generic collection of files (those within the EOSDIS community
might call this the ESDT level; my publications call it the Data Product
level - so we'll probably need to deal with aliases), followed by the
collections based on data source (which is what I've described as
the Data Set level - I'm not sure what the other names would be),
followed by Data Set Versions, and so on down to individual files.
I think the idea is similar to yours.  It is clear that we need some
way to break down the collection into a hierarchy.
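
As a rough illustration of the ordering described above, here is a minimal
Python sketch that joins the levels into one hierarchical identifier and
splits it back apart.  The level keys, the "/" separator, and every example
value are assumptions made for illustration, not an agreed naming scheme.

```python
# Level order follows the message above: archive, data producer,
# Data Product (ESDT), Data Set, Data Set Version, file.
# The key names and the "/" separator are hypothetical.
LEVELS = ("archive", "producer", "data_product", "data_set", "version", "file")


def make_identifier(*parts: str, sep: str = "/") -> str:
    """Join level values, outermost level first, into one identifier."""
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError(f"expected 1 to {len(LEVELS)} parts, got {len(parts)}")
    for part in parts:
        if not part or sep in part:
            raise ValueError(f"invalid level value: {part!r}")
    return sep.join(parts)


def parse_identifier(identifier: str, sep: str = "/") -> dict:
    """Split an identifier and label each part with its level name."""
    parts = identifier.split(sep)
    if len(parts) > len(LEVELS):
        raise ValueError("identifier has more levels than the hierarchy")
    return dict(zip(LEVELS, parts))


if __name__ == "__main__":
    # All values below are made up.
    ident = make_identifier("archiveA", "producerX", "productP", "setS", "v2")
    print(ident)
    print(parse_identifier(ident))
```

An identifier that stops early (say, at the Data Product level) names the
whole collection below that point, which is one way aliases such as ESDT
and Data Product could map onto the same level.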

I'll also note that the OAIS RM has a collection schema that turns
out to be recursive so that what we want to describe as a collection
hierarchy can also have a context attached - which means that it's
fairly easy to set up a template for what each level will contain.
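
A recursive template of this kind might be sketched as follows.  This is
not the OAIS schema itself; the class name, its fields, and the example
values are all assumptions, meant only to show one node type reused at
every level with a context attached.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Collection:
    """One node of the hierarchy; the same template serves every level."""
    name: str
    level: str                                   # e.g. "archive", "data_set"
    context: Dict[str, str] = field(default_factory=dict)
    members: List["Collection"] = field(default_factory=list)

    def add(self, child: "Collection") -> "Collection":
        """Attach a sub-collection and return it so calls can be chained."""
        self.members.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Collection"]]:
        """Depth-first traversal; visits each node exactly once."""
        yield depth, self
        for member in self.members:
            yield from member.walk(depth + 1)


if __name__ == "__main__":
    # All names and context values below are made up.
    archive = Collection("archiveA", "archive", {"registered": "yes"})
    product = archive.add(Collection("productP", "data_product"))
    product.add(Collection("setS", "data_set", {"source": "instrument-1"}))
    for depth, node in archive.walk():
        print("  " * depth + f"{node.level}: {node.name} {node.context}")
```

Because the traversal touches every node once, its cost grows linearly
with the number of nodes, which is the scaling concern raised in the
quoted message below.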

At this point, we probably need to seek a fairly broad sample of
collection organizations - not just the ones we've identified, but
a real cross-section of cases that probe the full range of collection
behavior.  You might find it interesting to try out the way you'd organize
Ruth's collection of photos, as well as the Hurricane Ike Damage
Assessment aerial survey (a collection of high res digital photos
taken at nearly the same time).  Likewise, there are some things
from NCDC with continuously updated records of temperature and
precipitation - although the file structure is something created in
FORTRAN.  Keeling's CO2 monthly data is a simple text file
(again with a strong flavor of FORTRAN) - no versions, no updates
beyond 2003, and so on.

Let's keep up the discussion.

Bruce B.

On Fri, Dec 10, 2010 at 12:07 PM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> On 12/10/2010 11:51 AM, Bruce Barkstrom wrote:
>
>> Eventually, we're going to have to do some thinking about the
>> scaling that goes with this approach.  As far as I can tell, the
>> scaling for traversing the graph is still linear with the number of
>> nodes.  If all the granules in ESDIS get included, we're going to
>> have several hundred million items, including files and jobs - not
>> to mention the possibility of subsets (fragments) of files.
>>
>
> You're right, of course.  This will be a big challenge.
>
> Sometimes I think it would be great to have a huge triple store that
> just pulls in everything we care about and can query it directly with
> SPARQL, but I think that isn't feasible (or at least won't be for some
> time).
>
> I think we can partition nicely along the
> Dataset = { Collection, ESDT }
> boundaries though.  (Collection still bothers me though -- it isn't as
> concrete as the ArchiveSet model we use internally.)
>
> Each Dataset has a "home" -- an Archive responsible for its curation
> and stewardship.  That Archive could offer a URL to which the persistent
> identifier for that Dataset (DOI) will point, and it could also
> offer (or point to elsewhere) a SPARQL endpoint with the graph of
> related nodes.  When you get to a point where you are referring to
> another dataset owned by another archive, you hop over to that archive's
> SPARQL endpoint and continue the query.
>
> As a single archive grows bigger and bigger, it can just partition
> internally along Dataset boundaries as much as needed, offering
> multiple databases.
>
> Getting back to "Collection", we need the ability to broaden it beyond
> a single archive.  Currently every collection of a specific ESDT is
> always owned by the same archive (if the old ones are even kept at
> all, which is another issue).  For this scheme to be scalable, we need
> the ability for other archives to handle the same types of data,
> whether they must change the ESDT, or have a controlled, extended
> namespace for Collection, or something different.
>
>
> Curt