[Esip-preserve] Some Topics for This Afternoon's Telecon on Collection Structure

Bruce Barkstrom brbarkstrom at gmail.com
Mon Jun 17 08:53:32 EDT 2013


I've attached a plain text file with some thoughts about
distinguishing one collection from another.

Bruce B.
-------------- next part --------------
Discussion Points on (Metadata) Attributes for Identifying Collections

I think it would be helpful to have a discussion about attributes that
distinguish one collection from others in a similar group.  A useful
procedure is to provide equal time to each participant and to ask each
participant to voice an opinion about the topic.  I've used this approach
many times with science teams.  It has been quite effective for surfacing
different positions and reaching a consensus.

For this particular topic, it seems useful to start with the NOAA Emergency 
Response Imagery archive.  The entry point for obtaining data is a Web page 
that lists storm damage collections.  Each storm is identifiable by the storm 
name (Hurricane Katrina; Tuscaloosa Tornado), storm date (2005; 2011), and 
location (New Orleans, LA; Tuscaloosa, AL).  Any of these three attributes 
appears sufficient for identifying an image collection.
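
As a concrete illustration, here is a minimal sketch in Python of a record
carrying those three attributes (the type and field names are my assumptions,
not NOAA's actual schema):

   from dataclasses import dataclass

   @dataclass(frozen=True)
   class StormCollection:
       """One entry on the NOAA Emergency Response Imagery index page."""
       storm_name: str      # e.g., "Hurricane Katrina"
       storm_date: str      # e.g., "2005"
       storm_location: str  # e.g., "New Orleans, LA"

   katrina = StormCollection("Hurricane Katrina", "2005", "New Orleans, LA")
   tuscaloosa = StormCollection("Tuscaloosa Tornado", "2011", "Tuscaloosa, AL")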

The requested "Originator" for citations is
"Department of Commerce (DOC), National Oceanic and Atmospheric Administration (NOAA), 
National Ocean Service (NOS), National Geodetic Survey (NGS), Remote Sensing Division".
This attribution request appears on the Web page with the standard NGDC metadata 
information.  I assume that the term "Originator" is equivalent to "Author" 
or "Responsible Party".  I've not seen a particular individual credited, although 
there is an e-mail address on the Web pages giving the "Webmaster" and the "Responsible
Official".  

From my perspective, all of the images in this imagery collection have the same "Originator".  
It follows that an "Originator" (or "Responsible Party" or even "Author") field cannot 
distinguish one collection from another.  It cannot even distinguish one image from another.  
This term (or an alias) could, however, distinguish a NOAA collection of storm
damage images from one maintained by NASA, ESA, or some other agency; that is,
we could move these collections up one level in the hierarchy and use the
"Originator" field to distinguish NOAA's collections from those of other agencies.

A related issue is how to define categories.  One approach is to develop a
set of controlled vocabulary terms.  If an archive or standards group were
to do that, we should discuss how a search site could inform users of the
vocabulary so that they could use it to advantage in
formulating their queries.  A second approach is to develop a taxonomy
procedure that produces a functional distinction between different categories.
This approach is similar to the one biologists use to classify species of 
plants or animals.
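
A minimal sketch of the first approach, assuming the archive publishes its
term list so that a search site can check user input against it (the
vocabulary and function below are invented for illustration):

   # Hypothetical controlled vocabulary published by the archive.
   STORM_TYPES = {"hurricane", "tornado", "flood"}

   def normalize_term(user_input: str) -> str:
       """Map free-text input onto a vocabulary term, or report a mismatch."""
       term = user_input.strip().lower()
       if term not in STORM_TYPES:
           raise ValueError(f"{user_input!r} is not in the controlled "
                            f"vocabulary {sorted(STORM_TYPES)}")
       return term

   print(normalize_term("  Hurricane "))   # -> "hurricane"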

The taxonomic procedure would ideally be stated in algorithmic terms.
For example, for the storm collection case above, one could describe a
selection procedure in these terms:

Preconditions:
   The system has three testable attributes: Storm_Name, Storm_Date, and
      Storm_Location, which it can use to identify which Storm_Collection
      a particular collection of images belongs to.
   User input selects one or more of the three testable attributes (perhaps
      by a keyword input, by a radio-button form, or by choosing a link to
      a Web page for a selected collection).
Selection Algorithm:
   for Storm_Collection in range of Storm_Collections loop
      if (User_Selected_Storm_Name = Storm_Collection.Storm_Name)
        OR (User_Selected_Storm_Date = Storm_Collection.Storm_Date)
        OR (User_Selected_Storm_Location = Storm_Collection.Storm_Location) then
         Storm_Collection is the selected collection;
         exit from loop;
      end if;
   end loop;
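
For concreteness, here is the same selection logic as a runnable Python sketch
(using the StormCollection records sketched earlier; treating unselected
attributes as None is my assumption, not part of the procedure above):

   def select_collection(collections, name=None, date=None, location=None):
       """Return the first collection matching any user-selected attribute."""
       for c in collections:
           if ((name is not None and name == c.storm_name)
                   or (date is not None and date == c.storm_date)
                   or (location is not None and location == c.storm_location)):
               return c    # the selected collection; exit from loop
       return None         # no collection matched the user's selection

   # Usage: select_collection([katrina, tuscaloosa], date="2011")
   # returns the Tuscaloosa Tornado collection.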

There are a couple of features of the algorithmic approach that seem to me
to be helpful:
a.  It may reduce some of the semantic heterogeneity difficulties (case sensitivity,
    spelling differences, vocabulary differences, etc.) that go with queries
    that use only textual keywords.
b.  It can be applied to distinctions that are not textual, such as probabilistic
    tests of statistical significance between two functions (a sketch follows
    this list).  These distinctions arise when a decision maker wants to know
    whether two data sources have the same error distributions or whether one
    has a lower dispersion of uncertainty.  They are also likely to arise when
    the selection of an object depends on verifying that the object's sampling
    pattern is the same as the pattern for another object.
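
As a concrete (hypothetical) instance of such a non-textual test, a sketch
using numpy and scipy: compare error samples from two data sources with a
two-sample Kolmogorov-Smirnov test, then compare their dispersions (the
synthetic samples here stand in for real error data):

   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(42)
   errors_a = rng.normal(0.0, 1.0, size=500)   # synthetic errors, source A
   errors_b = rng.normal(0.0, 1.5, size=500)   # source B, wider dispersion

   # Is it plausible that both sources share one error distribution?
   result = stats.ks_2samp(errors_a, errors_b)
   same_distribution = result.pvalue > 0.05

   # Which source has the lower dispersion of uncertainty?
   lower = "A" if errors_a.std(ddof=1) < errors_b.std(ddof=1) else "B"
   print(same_distribution, lower)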

There are two issues that I think we should discuss:

1.  Do we need to have metadata terms that cannot help users find particular
collections or objects because the terms do not distinguish between collections
or objects in a useful way?  A particular case is whether the "Originator" (or
"Author" or "Responsible Party") is always useful.  However, the issue is more
general than that one instance.

2.  Should we develop taxonomic procedures that provide functional definitions
for classifications rather than relying on keyword queries?

I'll suggest that our discussion of each of these issues start by having each
participant identify the key advantages and key disadvantages they see in the
different options.  Once those are identified, we can move on to discuss
which way each participant leans on the issues.

