[Esip-preserve] Some Suggestions on Provenance Work

Moses, John F. (GSFC-5860) john.f.moses at nasa.gov
Mon Sep 20 12:00:13 EDT 2010


Hi all,

Nice breakout of Bruce's stream - a good example of something we might
want to capture. So much so that I'd like to add some additional
thoughts; see below.

-----Original Message-----
From: esip-preserve-bounces at lists.esipfed.org [mailto:esip-preserve-bounces at lists.esipfed.org] On Behalf Of Curt Tilmes
Sent: Monday, September 20, 2010 11:28 AM
To: esip-preserve at lists.esipfed.org
Cc: rramachandran at itsc.uah.edu
Subject: Re: [Esip-preserve] Some Suggestions on Provenance Work

On 09/03/10 11:08, alicebarkstrom at verizon.net wrote:
> I sent the following note to Dr. Ramachandran before the
> teleconference on provenance tracking (or at least the parts
> identified in 1 through 3). At that point, I felt that we did not
> need a new working group and still feel that way. However, here are
> some work items on provenance that I think need to be picked up:

> 1. There are a number of workflow and provenance tracking tools,
> including Earth science workbench (Frew), Sciflow, Kepler, Taverna,
> and others. It might be useful to prepare an intercomparison of
> these tools - particularly whether they are intended primarily for
> ad hoc (or exploratory) data production or whether they might be
> adapted to use on the high-throughput production paradigms, as well
> as what kind of "database" technology they use (relational, XML,
> flat file, RDF, triple store, etc.) You could regard this as a
> preliminary form of marketing analysis, where it would be useful to
> ask how many different kinds of Earth science data have been run
> through the tool, how robust it is, and how much it will cost to
> buy, adapt, and run. I suspect this would be a useful paper if it
> can be done in a reasonable length of time (say less than six months
> to submission).

Hong Hua did a very nice overview of workflow engines in an ESIP
Webinar.  More info, including his slides, is here:

    http://wiki.esipfed.org/index.php/Workflow_Engines:_Why_So_Many%3F

Another pass at this using his work as a base could result in a nice
paper.

A paper like this would help make the case for the $$ resources needed to mature the tools that would be most useful for NASA data centers.
 

> 2. In thinking about when provenance is needed, it isn't clear to me
> whether interoperability is critical.  In the most pressing cases -
> of the type that are pushed by the climate deniers and show up in
> situations like "climategate" - the instances of this need are
> comparatively infrequent. Second, for these pressing cases, I think
> it likely that the production history is potentially massive - and
> perhaps massive enough that XML will make it difficult to process
> the millions of files and jobs that might appear. Here, it would be
> useful to sketch out a use case that could be used to get
> engineering numbers on how many items will be in the provenance list
> and how that would scale in trying to make the record of use.

I think interoperability is becoming more critical.  There are
increasing amounts of "cross-system" and "cross-institution"
processing where A uses data from B who gets data from C, etc.  If A
represents his dependency on B in a different
manner/format/representation/whatever from the way B does, it just
makes it that much harder to track back from A to C.
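
To make that concrete, here is a minimal sketch (Python, with made-up
dataset identifiers and a hypothetical "derivedFrom" field, not any
agreed standard) of how a shared dependency representation lets one
simple traversal walk back from A's product to C's raw data:

    # Hypothetical provenance records published by three institutions (A, B, C).
    # The "derivedFrom" field is an illustration, not an agreed format; the
    # point is only that all three records use the same convention.
    records = {
        "A:product-123": {"derivedFrom": ["B:granule-456"]},
        "B:granule-456": {"derivedFrom": ["C:raw-789"]},
        "C:raw-789": {"derivedFrom": []},
    }

    def trace_back(dataset_id, records):
        """Return every upstream dataset reachable from dataset_id."""
        upstream = []
        for parent in records.get(dataset_id, {}).get("derivedFrom", []):
            upstream.append(parent)
            upstream.extend(trace_back(parent, records))
        return upstream

    print(trace_back("A:product-123", records))   # ['B:granule-456', 'C:raw-789']

If A recorded its dependency on B in some other structure, that one
traversal would stop at A and someone would have to stitch the chain
back together by hand.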

In current EOSDIS practice we archive most ancillary data with the products, because we don't have confidence that the providing agency will archive the same thing.  For the decadal missions now being planned this is already becoming an issue; cross-institution tracking could save $$ for decadal mission data systems.


> 3. In cases where the interoperability need is more directed toward
> understanding what was actually done, there needs to be considerable
> thought directed to "how to understand code".  In the NOAA climate
> data and the NASA EOS record, the code for a single system may be
> 50,000 lines of code - and if the data have been processed through
> several such systems, you can multiply the lines of code by the
> number of systems.  Even in the "ad hoc" approach with a graphical
> workflow engine, understanding what has been done can be decidedly
> non-trivial. I'd normally put the understanding of code bases into
> what the OAIS RM calls "context information", but we need to give
> that problem some systematic thought. While I think this kind of
> thought will need to be picked up by the Preservation WG, it's
> probably a big enough topic that it deserves some systematic
> planning where coordination might be useful.

Yes.  Big problem.  Needs work.

The current plan calls for code to be archived as reference documentation, with no expectation of actually rerunning it.  This leads to big dumpsters, mainly because there isn't time to organize the code as documentation.

> 4. To make progress on the distinctions between provenance as
> production history (and custodianship), I'm convinced we need to
> write down scenarios of data production, validation, and long-term
> data management, identify the digital artifacts produced as a result
> of these activities, and then categorize the artifacts as being used
> directly in production or used primarily by someone needing to
> understand the information.

Yes.  I think this could (should) be codified into a formal Earth
Science Data Processing Semantic Ontology.

A must - sooner rather than later.

> It is no longer helpful to engage in philosophical discussions of
> which verbal category things belong in.

Yeah, but it is fun, isn't it? :-)

> At least from my perspective, if data production is going on at a
> rate of thousands of jobs per day, it seems sensible to identify
> files used in or produced by the jobs as the primary digital
> artifacts we need to worry about in production history
> provenance. This probably means that source code would be included
> in provenance because it is likely to be compiled at the production
> frequency. Reports and ancillary data - such as calibration plans,
> calibration procedures, calibration reports, and calibration data -
> do not appear directly in the production files. I suggest these
> be classified in the OAIS RM category of Context Information.

I think this mostly follows from your description: as an objective
criterion for distinguishing "Provenance" from "Context", how about
"could have a direct effect on the output"?

Provenance: raw data from the spacecraft, calibration data that are
inputs to processing, source code for algorithm, ancillary data, etc.

Context: reports, plans, procedures, validation experiments and data,
scientific papers about the data, etc.

(NB: I put ancillary data in Provenance instead of Context.  I also
didn't address the "audit" type information, like what time I did the
processing, which custodian held the data, etc.)
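
As a minimal sketch of how that criterion could be applied
mechanically - the artifact names and the "affects_output" flag are
made up for illustration:

    # Made-up artifact list for one data product.  The boolean encodes the
    # criterion above: could the artifact have a direct effect on the output?
    artifacts = [
        ("raw instrument data", True),
        ("calibration input file", True),
        ("algorithm source code", True),
        ("ancillary data", True),
        ("calibration plan", False),
        ("validation report", False),
        ("journal paper about the data", False),
    ]

    provenance = [name for name, affects_output in artifacts if affects_output]
    context = [name for name, affects_output in artifacts if not affects_output]

    print("Provenance:", provenance)
    print("Context:", context)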

FWIW, I've taken to using the term "Transparency" as an inclusive
umbrella for "Provenance and Context".  Transparency requires
independent reproducibility and understanding not just what we did
(provenance), but why we did it and how it relates to the rest of
science (context).  It ultimately contributes to credibility.  This is
distinct from Preservation and Stewardship which focus on the long
term reliability of access to the artifacts themselves.

Some things, like good identifiers, are required both for transparency
and for preservation.

At the end of the day, I really don't care much about the categories,
since I intend to do all of the above -- it doesn't really affect me
that much.  I think it is just as important to point to context type
information as it is to point to provenance type information.  I am
more concerned with just getting identifiers and pointers to all the
information.

> In preservation work, both the production history artifacts and
> those in this definition of context information need to be
> formalized in an OAIS RM Representation Network that can then be
> analyzed for vulnerability to loss using something like Stochastic
> Reliability Networks.

Ok.

> I also recognize that not all of the data we create
> and use in Earth sciences are done with the same
> production paradigm. Thus, we need similar empirical
> evidence of how to categorize digital artifacts in such
> cases as
> - glacier photo collections
> - Hurricane Ike Damage Assessment Photo collection
> - in situ data such as the Global Historical Climate Network
> temperature and humidity records
> - solar constant long term composite records looking for
> trends in the solar constant (as done by Frolich and Lean,
> for example)
> Again, we need scenarios of production, validation, and
> preservation as a framework that can be used for
> provenance checking of the categorization.

I think you are dead on here.  Everyone starting from scratch ends up
developing similar production scenarios.  We may take different paths
to get there, but we all end up with systems that are remarkably
similar.  Capturing and consolidating all our various experiences and
narrowing them down into some broad paradigms would give us a nice
structure to hand the next processing system builders.  Common
identifiers and interoperability will help us link our systems
together.

Yes - scenarios that combine real-world context and technologies.

> It would be even more helpful to use UML with use cases and
> synchronization diagrams to help formalize the discussion and help
> us avoid leaving things out.

I prefer RDF/XML/OWL for the formal knowledge representation, but I
like UML too.
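
For example, a minimal sketch using the rdflib library - the
"esip-prov" namespace and the hasProvenance/hasContext properties are
hypothetical placeholders, not terms from any existing ontology - of
recording one product's provenance and context links as RDF:

    # Minimal sketch with rdflib; namespace and property names are invented
    # for illustration only.
    from rdflib import Graph, Namespace, URIRef

    EX = Namespace("http://example.org/esip-prov#")
    g = Graph()
    product = URIRef("http://example.org/products/A-product-123")

    g.add((product, EX.hasProvenance, URIRef("http://example.org/raw/C-raw-789")))
    g.add((product, EX.hasProvenance, URIRef("http://example.org/code/algorithm-v2")))
    g.add((product, EX.hasContext, URIRef("http://example.org/docs/calibration-plan")))
    g.add((product, EX.hasContext, URIRef("http://example.org/papers/validation-study")))

    print(g.serialize(format="turtle"))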

Curt

Any hope for tools to reverse engineer what we have now - derive dependencies and structure of stuff in 'dumpsters'?
-john
_______________________________________________
Esip-preserve mailing list
Esip-preserve at lists.esipfed.org
http://www.lists.esipfed.org/mailman/listinfo/esip-preserve

