[Esip-preserve] [ESIP-all] Please review Draft ESIP Data Citation Guidelines

Fri Aug 26 14:26:30 EDT 2011

Peter,

You highlight how much of this is ultimately a subjective exercise and how the situation may vary among communities, but such is the nature of conventional citation, too. Anyway, I embed some thoughts below.

On 26 Aug 2011, at 12:01 PM, Peter Cornillon wrote:

> Hi Mark,
> 
> Thanks for the response. More in-line below.
> 
> On Aug 26, 2011, at 12:01 PM, Mark A. Parsons wrote:
> 
>> Hi Peter,
>> 
>> Thanks for your interest. Ironically, I was unable to respond right away because I was at a Data Citation meeting hosted by the National Academy.
>> 
>> Anyway, let me see what I can do with your situation. I copy the rest of the cluster, in case someone else wants to chime in.
>> 
>> My first recommendation is that you actually cite the data in the references section, not just in an acknowledgement blurb.
> 
> Yes, if I understand you here, we are doing that. What I am looking for is the verbiage that someone who uses the data that we produce should use when acknowledging the data in a pub.

If someone uses the data you produce, they should *cite* the data set in a manner that you recommend, e.g. Cornillon et al. (2011).
> 
>> In your case, the tricky part, then, is figuring out who the data "author", "publisher", etc. are for the citation.
>> 
>> My first question is what are you doing with the derived data sets? Are you archiving or distributing them anywhere?
> 
> They will be made available on our web site as netCDF 4 files and via OPeNDAP. The netCDF files will conform to CF 1.5.

OK, then the derived data should be cited when used.

> 
>> If so, then I see those as the citable objects. The documentation for those data sets should in turn cite the SST data from NODC. If you are not distributing the derived products, then I think you should cite the original SST data and describe the edge detection process in the methods section of your paper.
> 
> Right, we will do that in our pubs that make use of the data, but, as I said above, we want to distribute the data sets that we produce.

When you distribute the data sets, you should recommend they be formally cited and give an example of how to do it.
> 
>> My second question is how significant do you think the edge detection process was scientifically and intellectually? Have you created a new data set, a new intellectual product, or is it more accurate to say it was a more minor manipulation or "edit" of the original data set?
> 
> In the case of the fronts data, they are a new product. The gradient data are a little less clear in that we apply a Sobel operator to the data which, and I'm guessing here, is not reversible, but even if it were, we median filter the input data, which is irreversible, so I think that we are OK in saying that the edit produces a scientifically new product. I'm assuming that it's reversibility of the data that determines a scientifically new product. Is that right, or is it more subtle than that?

I think you, as the oceanographer, are better positioned to judge what is scientifically new, but irreversibility seems like one suitable criterion. You may also consider the intellectual effort of creating the new product. Were you researching and creating a new algorithm, or were you applying some set of known approaches in a more mechanistic way?
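
To make the irreversibility point concrete, here is a minimal Python sketch. It is not your actual processing chain: the 3-point window, the function name median_filter_3, and the temperature values are all assumptions for illustration. It shows two different input rows collapsing to the same filtered output, which means the input cannot be recovered from the output.

```python
def median_filter_3(values):
    """Apply a 3-point median filter; the two endpoints are left unchanged."""
    out = list(values)
    for i in range(1, len(values) - 1):
        # Middle element of the sorted 3-point window is the median.
        out[i] = sorted(values[i - 1:i + 2])[1]
    return out


# Hypothetical SST rows (degrees C): identical except for a one-pixel spike.
row_a = [15.0, 15.0, 15.0, 15.0, 15.0]
row_b = [15.0, 15.0, 19.9, 15.0, 15.0]

filtered_a = median_filter_3(row_a)
filtered_b = median_filter_3(row_b)

# Different inputs, same output: the filter is many-to-one, hence not invertible.
print(row_a != row_b, filtered_a == filtered_b)
```

Because the spike in row_b is discarded by the filter, nothing in the filtered row tells you which of the two inputs it came from.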

> 
>> In the first case, you or your team would be the "author" of the new data set. In the second case, you might be considered an "editor".
>> 
>> So, given all that, here is how I suggest you cite these data to help ensure both credit and validation, using the following elements:
>> 
>> • Author(s)--the people or organizations responsible for the intellectual work to develop the data set. The data creators.
>> • Release Date--when the particular version of the data set was first made available for use (and potential citation) by others.
>> • Title--the formal title of the data set
>> • Version--the precise version of the data used. Careful version tracking is critical to accurate citation.
>> • Archive and/or Distributor--the organization distributing or caring for the data, ideally over the long term.
>> • Locator/Identifier--this could be a URL, but ideally it should be a persistent service, such as a DOI, Handle or ARK, that resolves to the current location of the data in question.
> 
> We are creating a UUID for each file that we produce. 

That's good, but it doesn't really address the reference or "how to get the data" function of the citation. Lacking a DOI, Handle, or some such, I would just use the URL. The UUIDs could be useful for specifying versions, provenance records, and such, but including them in the citation is probably impractical. Best to have them be part of other metadata.
> 
>> • Access Date and Time--because data can be dynamic and changeable in ways that are not always reflected in release dates and versions, it is important to indicate when on-line data were accessed.
>> 
>> 
>> 
>> For the original SSTs:
>> 
>> Person or team at Miami. Initial release date of the version used. "Cool SST data set, version x.x". National Oceanographic Data Center. Access URL or any kind of persistent location or identifier provided by NODC. Accessed on date.
>> 
>> For the edge detection data:
>> 
>> Cornillon, P. et al. Date made available. "Cool edge data set, version x.x". URhode Island or whoever is distributing the data. Access URL or get a DOI. Date accessed.
>> 
>> or
>> 
>> Miami team. Release date. "Cool edge data set, version x.x", edited by Cornillon et al. URhode Island or whoever is distributing the data. Access URL or get a DOI. Date accessed.
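
The element list and the example citations above can be sketched in Python as a small record that renders a citation string. This is only an illustration under stated assumptions: the class name DataCitation, its field names, the version number, the dates, the distributor name, and the example.org URL are all hypothetical placeholders, not values from either data set.

```python
from dataclasses import dataclass


def _sentence(text: str) -> str:
    """Return text ending in exactly one period."""
    return text.rstrip(".") + "."


@dataclass
class DataCitation:
    # One field per suggested citation element (names are hypothetical).
    authors: str
    release_date: str
    title: str
    version: str
    distributor: str
    locator: str        # URL, or ideally a DOI, Handle or ARK
    access_date: str
    editors: str = ""   # only used for the "edited by" variant

    def render(self) -> str:
        parts = [
            _sentence(self.authors),
            _sentence(self.release_date),
            f'"{self.title}, version {self.version}".',
        ]
        if self.editors:
            parts.append(_sentence(f"Edited by {self.editors}"))
        parts += [
            _sentence(self.distributor),
            _sentence(self.locator),
            _sentence(f"Accessed {self.access_date}"),
        ]
        return " ".join(parts)


# Placeholder values only; substitute the real release date, version and locator.
citation = DataCitation(
    authors="Cornillon, P. et al.",
    release_date="2011",
    title="Cool edge data set",
    version="1.0",
    distributor="University of Rhode Island",
    locator="https://example.org/edge-data",
    access_date="2011-08-26",
)
print(citation.render())
```

Publishing a recommended citation generated this way alongside each data set gives users something they can paste straight into a reference list, and setting editors covers the second ("edited by") form.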
> 
> I guess that the real issue here is what 'editing' someone's data means. If I understand what you are saying, if one is not editing someone else's data set, but deriving new value out of it, then the original data are not acknowledged? This might make sense in that it could just get too complicated if one had to acknowledge the entire parentage of every data set used, but it still bugs me a bit. In this example, I could not have produced my fronts data set without the data from Miami.
> 
> Sorry for being so dense on this. 

No, you are absolutely correct, the credit issue can get complex. If possible, you may want to discuss it with the Miami team. Perhaps you and Miami could be considered co-authors of the fronts data set. One other alternative is to recommend that both data sets be cited: "Cornillon et al. (2011) derived from Miami et al. (2011)".

Regardless of which way you go, the documentation and metadata of the Cornillon data set should directly cite the Miami data set.

Finally, to reiterate, I strongly recommend that these be presented as formal references, not just acknowledgements. This raises the profile of the credit and makes tracking the impact a bit easier.

I hope this helps.

Cheers,

-m.