[Esip-preserve] Monthly Telecon: Wednesday February 9 [NOTE: New Number!]

Mon Feb 7 15:02:59 EST 2011

I think we need to add another item to the list of new "projects":
a set of fully worked out examples of citation scenarios that include
at least the following:
a.  What kind of "atomic" item is being cited (choosing from a small list
that should probably include at least the following: a data file, a data
element within a file, a relational (or other) database, a job "residue")?
[Note that what is sometimes called a "dataset" is a collection of these
"atomic" items and should not be included in this list.  If there's a need
to get technical, I'll note that the OAIS Reference Model states that an
archive consists of Archival Information Packages that are either Archival
Information Collections (AICs) or Archival Information Units (AIUs).  When
I say "atomic" items, I mean AIUs.  In this language, I'd interpret
"granules" as being Dissemination Information Packages (DIPs) that have
already been pre-packaged by an archive for managerial convenience.  The
DIPs would be made of AIUs.]
b.  How many "atomic" items are in a typical citation for the scenario
being considered?
c.  In order to have the cited items be usable or useful, what other digital
or physical objects need to be available?
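To make items a-c concrete, here is a minimal sketch of how one worked
scenario might be written down as a record.  The class and field names are
my own invention for illustration, not anything the cluster has agreed on,
and the example values are just the rough glacier-photo figures discussed
in this message.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AtomicItemKind(Enum):
    """Hypothetical short list of "atomic" item kinds from item (a)."""
    DATA_FILE = "data file"
    DATA_ELEMENT = "data element within a file"
    DATABASE = "relational (or other) database"
    JOB_RESIDUE = "job residue"


@dataclass
class CitationScenario:
    """One worked citation scenario; all field names are illustrative."""
    name: str
    item_kind: AtomicItemKind        # item (a): what kind of item is cited
    typical_item_count: int          # item (b): items per typical citation
    required_context: List[str] = field(default_factory=list)  # item (c)


# Example values are rough estimates, not checked numbers.
glacier_pilot = CitationScenario(
    name="glacier photo pilot study",
    item_kind=AtomicItemKind.DATA_FILE,
    typical_item_count=10_000,
    required_context=[
        "date and UTC of each photo",
        "camera location and pointing direction",
        "camera and scanner calibration",
        "description of the area algorithm",
    ],
)

print(glacier_pilot.name, glacier_pilot.typical_item_count)
```

Collecting a handful of such records from different communities would let
us compare the item counts and the context lists side by side.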

As a concrete example, I've thought of Ruth's glacier photo collection.
The physical items that might be cited are the developed photographs
done by many different photographers.  Such physical objects are
one kind of AIU.  The "atomic" digital artifacts are the electronic images,
mostly TIFF files, although one might also include JPEGs and a small number
of images in other formats.  In OAIS RM terms, the electronic images are
the AIUs.

As a scenario, suppose I wanted to use the photos in the collection to get
a survey of the change in glacier coverage or glacier size in the last
century.  If my understanding of the spreadsheet that Ruth sent me a year
or so ago is correct, there are 200,000 digital images in the total
collection.  The earliest image in the 10,000-item spreadsheet I received
was digitized from an original photo taken in 1898.  About half (or so) of
the images are connected with named glaciers.  The sampling of any
particular named glacier is temporally intermittent or sporadic.  So a
survey might include up to something like 100,000 images.  If I wanted to
do comparative geographic studies, then I might select one or two smaller
areas for a pilot study of the feasibility of determining how accurately
I could measure changes in glacial area.  In order to report on the pilot
study in a way that would allow another investigator to verify or
replicate my work, I think I'd need to cite which scanned images I had
used in these smaller samples.  That's probably something like 10,000
references.

For understanding the pilot study, if I recall my earlier publications on
radiative transfer in snow, there are a few important facts I'd need to
keep in mind:

a.  Snow reflection is not isotropic (in other words, more light is
reflected in some directions than in others) and depends on both the zenith
angle and azimuth of the Sun, on the cloud coverage (and lots of other
optical stuff).  Thus, if I were going to try to be quantitative, I'd need
the position of the Sun - which could be derived if the date and UTC time
of the photo were recorded along with the camera location.

b.  I'd need to know the direction in which the optical axis of the camera
was looking, as well as the location of the camera, so I could deal with
the geometry of reflection.

c.  I'd need to know how well the camera was calibrated.  (I spent about
four years in graduate school taking telescope images on glass plates and
developing them in the observatory dark room.  I have a fairly pessimistic
view of absolute calibration of photographic images.)  Such a calibration
would usually involve exposing the film (or glass plate) to a calibration
source.  I don't know if this was done on the images in the photo
collection.  I also don't know how the scanning of the original images was
calibrated.

d.  If the pilot investigation used relative brightness and area, then it
would be necessary to say something about the algorithm being used.
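As a rough illustration of item a, the sketch below estimates the solar
zenith angle from a photo's date, UTC time, and camera location.  It is
only a back-of-the-envelope approximation: it uses a simple cosine formula
for the solar declination, ignores the equation of time and atmospheric
refraction, and does not compute the azimuth.  The function name and
arguments are hypothetical; a real analysis would use a proper ephemeris.

```python
import math
from datetime import datetime, timezone


def approx_solar_zenith_deg(when_utc, lat_deg, lon_deg):
    """Rough solar zenith angle in degrees (0 means the Sun is overhead).

    when_utc: datetime of the photo in UTC
    lat_deg:  camera latitude in degrees, north positive
    lon_deg:  camera longitude in degrees, east positive
    """
    day_of_year = when_utc.timetuple().tm_yday

    # Approximate solar declination (simple cosine fit, good to ~1 degree).
    decl = math.radians(-23.44) * math.cos(
        2.0 * math.pi * (day_of_year + 10) / 365.0
    )

    # Approximate local solar time in hours; the equation of time is ignored.
    utc_hours = (when_utc.hour + when_utc.minute / 60.0
                 + when_utc.second / 3600.0)
    solar_hours = utc_hours + lon_deg / 15.0
    hour_angle = math.radians(15.0 * (solar_hours - 12.0))

    lat = math.radians(lat_deg)
    cos_zenith = (math.sin(lat) * math.sin(decl)
                  + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    cos_zenith = max(-1.0, min(1.0, cos_zenith))  # guard against rounding
    return math.degrees(math.acos(cos_zenith))


# Near the March equinox at local noon on the equator, the Sun should be
# close to overhead, so the zenith angle should be close to zero.
noon = datetime(2011, 3, 21, 12, 0, tzinfo=timezone.utc)
print(approx_solar_zenith_deg(noon, 0.0, 0.0))
```

The point is simply that date, UTC, and location are the minimum context
that would have to be preserved with each image for this kind of use.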

Note that the number of references is important to the practicality of
citation, while the context material is needed for enumerating items that
need preservation.  Thus, this kind of material is needed for justifying
preservation of particular items.

While the previous material is an example of what I think we need to
collect, I don't think it is necessarily a representative sample of what's
involved in citations.  Other communities may (and almost certainly do)
have different scenarios.

For example, I think that the bioinformatics community is making heavy use
of genetic sequencing databases.  Without checking, my recollection is that
the individual elements that go into the databases are rather small sets of
character strings and numbers that come from an individual experiment, say
a partial DNA sequence of proteins.  A citation would then refer to results
from some experiments.  This notion is clearly in need of checking, but if
correct, would lead to quite different scaling properties than would the
scenario on the data from the glacier photo collection.

If we were dealing with economics or demographic data, my experience with
the tabulated values for the U.S. population's estimated size or the
input-output matrices is that they are published (in at least one form) by
the U.S. Dept. of Commerce as part of the U.S. Statistical Abstract.  This
used to be published as a paperback volume, with sources for the various
tables listed as footnotes.  Later, the Abstract was available as a CD,
with the individual tables given as Excel spreadsheets.  Currently, I
believe that the individual tables are available online for download
without charge.  The sources of the tables are still included in the Excel
spreadsheets.  In this case, I suspect that the citation habits of the
community would be to refer to the individual table, citing either the DoC
publication or the original sources.  While the tables often change from
one year to another, I think they are stable once published (unless DoC
publishes a collection of errata).  The input-output matrices for the
economy take about five years to put together and do not change between
publications.  This experience suggests that the citations in this kind of
socio-economic community are likely to be to tables in Excel spreadsheets,
with a small number of references.  Again, this assumption needs checking.

Given these three kinds of citation "customs," I'm not sure we should assume
that scaling for citations is the same between different communities.  That
should be checked.

So, I'd like to see this kind of activity added to our discussions on
Wednesday.  It does need to dovetail with the inclination to work toward a
"standard" for "materials that need to be preserved".

Bruce B.

On Mon, Feb 7, 2011 at 8:33 AM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> ESIP Preservation and Stewardship Cluster Monthly Telecon
>
> Wednesday 2011-02-09  (2nd Wednesday of each month)
> 3:00pm Eastern, 1:00pm Mountain
>
> Telecon number: 877-934-1565
> Passcode: 5716503#
>
> Agenda:
>
> * ESIP Winter Meeting review
>
> * ESIP Collaboration Check-in / Feedback
>
> * Activities Status
>  - IDs paper - Ruth
>  - IDs testbed activities - Nancy
>  - Provenance paper - Curt
>  - ESIP Stewardship Principles and Best Practices - What next?
>  - Citations? - Mark?
>  - Provenance and Context Content Standard - Rama
>  - Ontology? - Hook?
>
> * Future Meetings
>  - ESIP Summer Meeting planning
>  - Fall AGU?
>
> Other suggestions?
> _______________________________________________
> Esip-preserve mailing list
> Esip-preserve at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>