[Esip-preserve] Data Citations in the Age of Electronic Publication

Bruce Barkstrom brbarkstrom at gmail.com
Wed Sep 5 09:18:18 EDT 2012

Good suggestions.

The Adam Smith electronic book would fit with the first
paragraph of your suggestion, since the chapters are
already partitioned into sections.

Building an index to the parts for random access is
a more complex issue. I've got an automated
procedure for building an index in my book manuscripts,
but it's a fairly involved one:
   - run the TeX text through a parser that separates the
      material into words
   - make the words all lower case (or upper case)
   - remove uninteresting short words or words that
      shouldn't go into the index (`a', `and', `the', etc.)
   - go back through the text for each of the remaining
      words in the list and identify the context so that
      the program can insert the indexing markup in
      the TeX file
   - run the TeX indexing file to actually create the
      index page references
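
The steps above can be sketched in Python roughly as follows.  This is
only an illustration, not the program itself: the names STOP_WORDS,
candidate_terms, and add_index_markup are made up for the example, the
stop-word list is a tiny sample, and the "context" step is reduced to
marking every whole-word occurrence.  It assumes the standard LaTeX
\index{...} markup, which a tool such as makeindex then turns into page
references.

```python
import re

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def candidate_terms(tex_source, min_length=4):
    """Split TeX text into lower-cased words and drop uninteresting ones."""
    # Remove TeX command names such as \section, keeping their arguments.
    text = re.sub(r"\\[A-Za-z]+", " ", tex_source)
    words = re.findall(r"[A-Za-z]+", text)
    terms = {w.lower() for w in words}
    return sorted(
        t for t in terms if len(t) >= min_length and t not in STOP_WORDS
    )


def add_index_markup(tex_source, terms):
    """Insert \\index{term} after each whole-word occurrence of a term."""
    if not terms:
        return tex_source
    alternatives = "|".join(re.escape(t) for t in terms)
    # The lookbehind skips command names (\hand) and partial words.
    pattern = re.compile(
        r"(?<![\\A-Za-z])(" + alternatives + r")(?![A-Za-z])",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: m.group(0) + r"\index{" + m.group(0).lower() + "}",
        tex_source,
    )


sample = r"\section{Trade} The invisible hand guides the market."
terms = candidate_terms(sample)
marked = add_index_markup(sample, ["hand"])
```

In practice the term list would be reviewed by hand before the markup
pass, since most surviving words do not deserve an index entry.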

Using a hierarchical partitioning scheme also strikes
me as useful - although the diversity of schemas may
make it very difficult to reach a consensus.  Mike Folk had
commented to me about this.  As I recall, he regarded
the variety of partitionings that showed up as a real can
of worms.

The diversity is even more difficult since it's possible to
rearrange tables or multi-dimensional arrays of numerical
values without really losing information.  Thus, one cannot
assume that the sectioning will survive when data producers,
curators, or users who "republish" data decide to replace one
file with an "equivalent" file in a different format.  In the new
"version", the sectioning might be quite different.
[I won't go into an exposition on my experiences with
NCDC precip files, where I rearranged and reformatted
the data, although if asked, I can provide details.]
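
To make the rearrangement point concrete, here is a small Python sketch.
The station names, days, and values are invented for illustration and
have nothing to do with the actual NCDC files.  It shows the same values
stored in two layouts: a positional reference (row, column) picks out
different values in each, while a reference by key finds the same value
in both.

```python
# Hypothetical precipitation values keyed by (station, day).
values = {("A", 1): 0.0, ("A", 2): 1.2, ("B", 1): 0.4, ("B", 2): 0.0}
stations = ["A", "B"]
days = [1, 2]

# Layout 1: one row per station, one column per day.
by_station = [[values[(s, d)] for d in days] for s in stations]

# Layout 2: the "equivalent" file, one row per day, one column per station.
by_day = [[values[(s, d)] for s in stations] for d in days]


def lookup(layout, row_keys, col_keys, row_key, col_key):
    """Find a value by its keys rather than by its position."""
    return layout[row_keys.index(row_key)][col_keys.index(col_key)]


# The same positional reference points at different content.
positional_1 = by_station[0][1]   # station A, day 2
positional_2 = by_day[0][1]       # day 1, station B

# A key-based reference survives the rearrangement.
keyed_1 = lookup(by_station, stations, days, "A", 2)
keyed_2 = lookup(by_day, days, stations, 2, "A")
```

A citation that names the keys (station and day here) rather than the
position in a particular file is the kind of reference that could
survive a reformatting.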

To back up in generality, books use two schemes for
referencing specific items: a hierarchical structure that we
usually see in the Table of Contents and a random access
structure that we see in the Index.  Maybe we need to
consider similar approaches to referencing material.
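
As a rough sketch of the two schemes side by side, the Python below
keeps a nested structure that plays the role of the Table of Contents
and derives from it an index that maps terms to hierarchical paths.  The
passages are placeholders rather than quotations, and the function
names resolve and build_index are invented for this example.

```python
# Nested structure standing in for a Table of Contents.
# The passage strings are placeholders, not quotations from any edition.
book = {
    "Book IV": {
        "Chapter II": "placeholder passage mentioning the invisible hand",
        "Chapter III": "placeholder passage about something else",
    },
}


def resolve(tree, path):
    """Follow a hierarchical reference such as ("Book IV", "Chapter II")."""
    node = tree
    for key in path:
        node = node[key]
    return node


def build_index(tree, terms, path=()):
    """Map each term to the hierarchical paths of passages containing it."""
    index = {}
    for key, child in tree.items():
        here = path + (key,)
        if isinstance(child, dict):
            for term, paths in build_index(child, terms, here).items():
                index.setdefault(term, []).extend(paths)
        else:
            for term in terms:
                if term in child.lower():
                    index.setdefault(term, []).append(here)
    return index


index = build_index(book, ["invisible hand"])
passage = resolve(book, index["invisible hand"][0])
```

The index entries are themselves hierarchical references, so the two
schemes complement each other: the index finds the term, and the path
it returns does not depend on the page numbering of any one printing.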

Perhaps we should consider this to be a research topic.
I suspect we need some scenarios or use cases that show
when this kind of referencing is needed in order to preserve
information or provide appropriate "scholarship."

Again, thanks for the comments.

Bruce B.

On Wed, Sep 5, 2012 at 8:00 AM, Curt Tilmes <Curt.Tilmes at nasa.gov> wrote:

> For the PDF, in addition to the page number I would include
> the section reference "Book IV, Chapter II".  Still a pretty
> blunt tool, but it will survive printings with different
> formats.
> For the satellite example, I think you've got to consider
> both humans and machines trying to figure out certain
> information.  The reference could follow a similar scheme.
> Just as Volume, Book, Chapter, Section, Part, etc. serve
> to narrow down a text reference, we can use things like
> Granule, File, SDS (for HDF and friends), channel, tile
> number, scan number, etc. For XML files, XPath provides
> a very nice, very precise way of navigating the structured
> data file to the fields of interest.
> Curt
> On 9/2/12 9:08 AM, Bruce Barkstrom wrote:
>> For "recreational reading", I decided to peruse Adam Smith's
>> "The Wealth of Nations" to see if I could find some context
>> on his use of the famous term "invisible hand".  I thought this
>> referencing would be trivial and so I dug out my edition of the
>> printed book from Barnes and Noble.  Getting this copy was a
>> mistake - there's no index.
>> I wanted to make some notes for comparison with works by other
>> authors.  With print works, I would usually reference a page number.
>> Not having an index is thus a serious impediment
>> to referencing the term in the original text.  I then tried the Internet
>> and found a free, downloadable version of the text as a pdf file.
>> Luckily, this version of the book's text included page numbers,
>> although I don't think there's a reference to which print edition
>> the copy came from.  The electronic copy is in pdf, so
>> I could use full text search.  Then, I could compare page numbers
>> in the Barnes and Noble edition with the ones in the on-line pdf.
>> The term "invisible hand" appears in the B&N edition on p. 300,
>> while the electronic file has it on p. 364.  The term is apparently
>> used just once.  As a mild added frustration, the pdf file doesn't
>> have an index either.  Could just be that Smith's original work
>> doesn't have an index, since it was published in 1776 and as
>> far as I can recall, adding an index to a book was not standard
>> until much later in the history of publishing.
>> If I extend the issue of specific references to material in Earth
>> science data, it creates some interesting scenarios.  Our usual
>> discussion of citations seems to treat these references as a pretty
>> "blunt" tool.  If I recall correctly, annotation schemes in the
>> humanities have a great many details for making the cited
>> material precise.  I'm not sure our discussion of citations has
>> the same level of precision.
>> If all we care about is giving credit to the "authors" or "editors",
>> the approach we've taken so far is probably adequate.  I think
>> it would let other researchers provide entries for bibliographies
>> or lists of references in published papers.  These would add to
>> professional credit for younger members of the academic
>> communities.
>> However, there are other cases where we need much more careful
>> references.  As a concrete example, consider trying to develop the
>> appropriate citation of the calibration gain used to derive the
>> reflected radiance in a channel of a satellite-borne instrument that
>> was going to be used in determining whether a scene was cloudy.  Changing
>> the gain might change the pixels identified as "cloudy", so the
>> calibration is critical to determining the cloud cover or
>> the vegetation properties of the scene in some interesting part
>> of an image.  To complicate the context, assume that the measurement
>> was being made several years after launch and that the data production
>> source code that produced the data had undergone several revisions.
>> The same is true of the calibration coefficients.  How do we create
>> citations that will reference the proper pixels of interest in the data
>> file, along with the proper version of the source code and the proper
>> version of the calibration coefficients?  Note that neither the data
>> nor the calibration coefficients are necessarily in pdf files for which
>> it would be possible to do a "full-text" search - and it might very
>> well be that there would be so many references to particular numerical
>> values that a human researcher would be overwhelmed.  In addition, while
>> "provenance tracking" is certainly an element in this scenario,
>> it isn't necessarily the only part of the problem.  For "scholarly" work
>> at the resolution of humanities work, we're also going to need to be
>> able to deal with references to subsets of data and context in
>> ways that allow us to find things -- even if the data format in
>> an archive rearranges the order of the data elements - or separates
>> the original content into new containers.
>> Any notion about how to deal with this kind of issue?
>> Bruce B.
>> On Fri, Aug 31, 2012 at 12:50 PM, Mark A. Parsons <parsonsm at nsidc.org> wrote:
>>     Sorry small change to the first one
>>     Federation of Earth Science Information Partners (ESIP). 2012.  Data
>>     Citation Guidelines for Data Providers and Archives. edited by M. A.
>>     Parsons, B. Barkstrom, R. R. Downs, R. Duerr, C. Tilmes  and the
>>     ESIP Data Preservation and Stewardship Committee. ESIP Commons. [DOI
>>     or ARK].
>>     -m.
>>     On 31 Aug 2012, at 10:47 AM, Mark A. Parsons wrote:
>>      > Hi Erin and Commons Committee
>>      >
>>      > The Preservation and Stewardship Committee has agreed on how we
>>     think our two documents should be cited:
>>      >
>>      > Federation of Earth Science Information Partners (ESIP). 2012.
>>       Data Citation Guidelines for Data Providers and Archives. edited
>>     by M. A. Parsons, B. Barkstrom, R. Downs, R. Duerr, C. Tilmes  and
>>     the ESIP Data Preservation and Stewardship Committee. ESIP Commons.
>>     [DOI or ARK].
>>      >
>>      > Federation of Earth Science Information Partners (ESIP). 2012.
>>     Interagency Data Stewardship Guidelines. edited by H. K.
>>     Ramapriyan, R. Duerr, and the ESIP Data Preservation and Stewardship
>>     Committee. 2012. ESIP Commons. [DOI or ARK].
>>      >
>>      > Please advise when identifiers have been assigned.
>>      >
>>      > Cheers,
>>      >
>>      > -m.
>>     _______________________________________________
>>     Esip-preserve mailing list
>>     Esip-preserve at lists.esipfed.org
>>     http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
> --
> Curt Tilmes, Ph.D.
> U.S. Global Change Research Program
> 1717 Pennsylvania Avenue NW, Suite 250
> Washington, D.C. 20006, USA
> +1 202-419-3479 (office)
> +1 443-987-6228 (cell)
> globalchange.gov