[Esip-preserve] Some Thoughts on OPM

Bruce Barkstrom brbarkstrom at gmail.com
Sat Dec 11 10:53:09 EST 2010


I'll admit to an additional problem - perhaps a defect in my own
understanding.  I've never understood what an ESDT is supposed to be.
When I get a concept that says "Data Type" in its name, I expect to have
a programmable data structure identified.  I've used the term "Data
Product" instead (responses should not come back with "That's wrong!" -
more useful is "What I think when you use that term is ...").

What do I mean by that?

1.  A Data Product should contain an identification of which of a small
number of key parameters the files in a Data Product contain.  My
preliminary suggestion would be to start with one of the lists of
Essential Climate Variables (there are about 40 to 100, depending on the
list) and then make these into an enumerated data type:
   type Essential_Climate_Variable is (Primary_Productivity,
      Surface_Temperature, ... Sea_Ice);
I'll send along a list of these variables next week - there are at least
two: one with American spellings and a second done by the Europeans with
their spelling.  The list of these parameters makes up a controlled
vocabulary.  Alternatively, we could use the IORDS parameter list from
NPOESS - although that gets into a much, much longer list.
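
One convenient side effect of making the controlled vocabulary an
enumerated type is that Ada's 'Value attribute validates strings against
the vocabulary for free - a sketch, assuming the literal list above has
been filled in:
   ECV : Essential_Climate_Variable :=
      Essential_Climate_Variable'Value ("Sea_Ice");
   -- raises Constraint_Error for any string not in the controlled list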

2.  A Data Product should contain a quantification of the time interval
covered by the data contained in the file.  In work I did some time ago,
there were about twenty time intervals in the EOS data.  Again, I'll see
if I can find the list of these intervals next week.  They include 5
minutes, 1 hour, 1 day, 1 month, 16 days, and a few others.
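
These intervals could follow the same enumerated-type pattern - a
sketch, with literal names of my own choosing rather than an agreed
vocabulary:
   type Time_Interval_Type is (Five_Minutes, One_Hour, One_Day,
      Sixteen_Days, One_Month, ...);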

3.  A Data Product should contain a consistent "sampling structure" that
is common to all of the files in the collection.  These structures are
probably best represented by 3D pictures (which can be labeled and
should be categorized as another enumerated data type).  I've got some
pictures I'll send along next week, but there are a few obvious
candidates: an image (although here it is important to note that because
clouds get in the way - among other physics - there is a vertical
structure that is not just driven by topography), stereoscopic movies
(this from MISR, where the viewing geometry of the atmosphere involves
looks from different directions), picket fences (where the scientists
involved with radars and lidars think of the sampling as being made up
of coherent vertical columns - not images turned on their sides), grids
(mostly for general circulation models - since real remote sensing
instruments have fuzzy horizontal and vertical densities of sensitivity
to the properties they want to sense), regular and irregular networks of
in situ sampling, and some others.  This would mean that we could define
a data type
   type Sampling_Structure_Type is (In_Situ_Network, Field_Experiment_Site,
      Image, Picket_Fence, ...);

4.  This suggests that the programmable data structure for a Data
Product is something like the following (in Ada, an array component
whose length varies per object has to be sized by a discriminant):
   type Key_Parameter_List_Type is
      array (Positive range <>) of Key_Parameter;
   type Data_Product (Number_Of_Key_Parameters : Positive) is record
      Key_Parameter_List :
         Key_Parameter_List_Type (1 .. Number_Of_Key_Parameters);
      Time_Interval_Of_Data : Julian_Date;
         -- meaning the number of days of data in a file
      Sampling_Structure : Sampling_Structure_Type;
   end record;

5.  If one were going to do this as an ontology, the hard part would be
coming up with a consensus on the classifications involved in
Key_Parameters, Time_Intervals, and - particularly -
Sampling_Structures.  I had one experience with subsetting when I had an
image of sampling from a CERES instantaneous data product while the
instrument was operating in rotating scan plane mode.  The field in the
file included vertical location of clouds - and the data were not an
image in the sense that the visualized data points were not in an array
of contiguous pixels with adjacent neighbors.  Rather, the pixels have
an intricate and irregular pattern with spaces between them.  A fellow
who had been working on subsetting had never seen a data file that
didn't contain an image and assumed that standard image segmentation
algorithms were all he really needed to deal with.

It is particularly important to remember that there is a whole
collection of instruments (like AIRS) where the scientific data
structure is a vertical column - and some (like SAGE) where the column
of measurements lies along the curved line between the satellite and
the Sun just as the Sun is setting behind the Earth.  There are also
time series, which are particularly important for solar constant
measurements.

I would be interested in knowing if there is a programmable version of
the data type meant by an ESDT.  What I recall was that ESDTs were an
IT translation of how scientists think of their data - and it wasn't
very satisfactory, particularly if you wanted to combine data from
different instruments.

Bruce B.

On Fri, Dec 10, 2010 at 4:13 PM, Bruce Barkstrom <brbarkstrom at gmail.com>wrote:

> Your suggestions may be a start.  My OID hierarchy (or whatever
> you might want to transmogrify that into in terms of a hierarchical
> html naming scheme) would organize the collection in an archive
> starting with the archive identifier (which should probably be registered),
> followed by the original data producer categories, followed by the
> generic collection of files (those within the EOSDIS community
> might call this the ESDT level; my publications call it the Data Product
> level - so we'll probably need to deal with aliases), followed by the
> collections based on data source (which is what I've described as
> the Data Set level - I'm not sure what the other names would be),
> followed by Data Set Versions, and so on down to individual files.
> I think the idea is similar to yours.  It is clear that we need some
> way to break down the collection into a hierarchy.
>
> I'll also note that the OAIS RM has a collection schema that turns
> out to be recursive so that what we want to describe as a collection
> hierarchy can also have a context attached - which means that it's
> fairly easy to set up a template for what each level will contain.
>
> At this point, we probably need to seek a fairly broad sample of
> collection organizations - not just the ones we've identified, but
> a real cross-section of cases that probe the full range of collection
> behavior.  You might find it interesting to try out the way you'd organize
> Ruth's collection of photos, as well as the Hurricane Ike Damage
> Assessment aerial survey (a collection of high res digital photos
> taken at nearly the same time).  Likewise, there are some things
> from NCDC with continuously updated records of temperature and
> precipitation - although the file structure is something created in
> FORTRAN.  Keeling's CO2 monthly data is a simple text file
> (again with a strong flavor of FORTRAN) - no versions, no updates
> beyond 2003, and so on.
>
> Let's keep up the discussion.
>
> Bruce B.
>
>
> On Fri, Dec 10, 2010 at 12:07 PM, Curt Tilmes <Curt.Tilmes at nasa.gov>wrote:
>
>> On 12/10/2010 11:51 AM, Bruce Barkstrom wrote:
>>
>>> Eventually, we're going to have to do some thinking about the
>>> scaling that goes with this approach.  As far as I can tell, the
>>> scaling for traversing the graph is still linear with the number of
>>> nodes.  If all the granules in ESDIS get included, we're going to
>>> have several hundred million items, including files and jobs - not
>>> to mention the possibility of subsets (fragments) of files.
>>>
>>
>> You're right, of course.  This will be a big challenge.
>>
>> Sometimes I think it would be great to have a huge triple store that
>> just pulls in everything we care about and can query it directly with
>> SPARQL, but I think that isn't feasible (or at least won't be for some
>> time).
>>
>> I think we can partition nicely along the
>> Dataset = { Collection, ESDT }
>> boundaries though.  (Collection still bothers me though -- it isn't as
>> concrete as the ArchiveSet model we use internally)
>>
>> Each Dataset has a "home" -- an Archive responsible for its curation
>> and stewardship.  They could offer a URL to which the persistent
>> identifier for that Dataset (DOI) will point, and they could also
>> offer (or point to elsewhere) a SPARQL end point with the graph of
>> related nodes.  When you get to a point where you are referring to
>> another dataset owned by another archive, you hop over to their SPARQL
>> end point and continue the query.
>>
>> As a single archive grows bigger and bigger, it can just partition
>> internally along Dataset boundaries as much as needed, offering
>> multiple databases.
>>
>> Getting back to "Collection", we need the ability to broaden it beyond
>> a single archive.  Currently every collection of a specific ESDT is
>> always owned by the same archive (if the old ones are even kept at
>> all, which is another issue).  For this scheme to be scalable, we need
>> the ability for other archives to handle the same types of data -
>> whether they must change the ESDT, or have a controlled, extended
>> namespace for Collection, or something different.
>>
>>
>> Curt
>> _______________________________________________
>> Esip-preserve mailing list
>> Esip-preserve at lists.esipfed.org
>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>>
>
>

