[Esip-preserve] File formats and mime types

Matt Jones jones at nceas.ucsb.edu
Wed Aug 3 19:51:37 EDT 2022


Hi Matt --

There is a long history of discussions on format vocabs, with vocabulary
services for formats such as PRONOM, GDFR, and UDFR all playing a role in
various communities. At one point, I thought that UDFR out of the
California Digital Library was going to create a comprehensive and
persistent service, but alas it shut down. For DataONE, we ended up
creating an open and extensible format vocabulary service that is community
managed and that we ask all network members to use when classifying file
types. It includes a unique formatId, as well as metadata including the
format name, version, media type, and typical file extensions. We found
that MIME Media types (e.g., 'text/xml') were often not specific enough to
handle our format versioning needs (e.g., for metadata standards, we
recognize multiple different versions of various metadata profiles as
determined by usage by various data centers across the network). We also
have an open, GitHub-based approach to handle additions to the vocabulary.
An example of a format entry, here for GeoPackage:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<d1v2:objectFormat xmlns:d1v2="http://ns.dataone.org/service/types/v2.0">
    <formatId>application/geopackage+sqlite3</formatId>
    <formatName>GeoPackage Encoding Standard (OGC) Format Family</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/geopackage+sqlite3"/>
    <extension>gpkg</extension>
</d1v2:objectFormat>
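
For anyone who wants to read these records programmatically, here is a
minimal Python sketch that pulls the fields out of an entry like the
GeoPackage one above. The function name parse_object_format and the
returned dictionary layout are just illustrative choices, not part of any
DataONE client library, and the sketch assumes the child elements are
unqualified, as they are in this example.

```python
import xml.etree.ElementTree as ET

# The GeoPackage entry from this message, embedded so the sketch runs offline.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<d1v2:objectFormat xmlns:d1v2="http://ns.dataone.org/service/types/v2.0">
    <formatId>application/geopackage+sqlite3</formatId>
    <formatName>GeoPackage Encoding Standard (OGC) Format Family</formatName>
    <formatType>DATA</formatType>
    <mediaType name="application/geopackage+sqlite3"/>
    <extension>gpkg</extension>
</d1v2:objectFormat>"""


def parse_object_format(xml_text):
    """Return the main fields of one objectFormat record as a dict.

    Hypothetical helper: the field names mirror the XML element names.
    """
    # Parse from bytes so the encoding declaration in the prolog is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    media_type = root.find("mediaType")
    return {
        "formatId": root.findtext("formatId"),
        "formatName": root.findtext("formatName"),
        "formatType": root.findtext("formatType"),
        # mediaType carries its value in the "name" attribute, not as text.
        "mediaType": media_type.get("name") if media_type is not None else None,
        # A record may list more than one typical extension.
        "extensions": [e.text for e in root.findall("extension")],
    }


record = parse_object_format(SAMPLE)
print(record["formatId"], record["extensions"])
```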


Some links:

   - DataONE Format Vocabulary: https://github.com/DataONEorg/object-formats
   - Example discussion on adding the GeoPackage format:
     https://github.com/DataONEorg/object-formats/issues/14
   - DataONE Formats Service: https://cn.dataone.org/cn/v2/formats
   - Example GeoPackage format from the service:
     https://cn.dataone.org/cn/v2/formats/application%2Fgeopackage%2Bsqlite3
   - Background article on UDFR:
     https://blogs.loc.gov/thesignal/2011/06/a-meeting-of-the-minds-for-udfr/

My only issue with the DataONE formats service is that we don't explicitly
identify a URI for each format type in the definition, which in retrospect
we could have done. In practice, we use the DataONE format service URI to
represent the format in URI space when we need it for linked data and
similar applications. For example,
https://cn.dataone.org/cn/v2/formats/application%2Fgeopackage%2Bsqlite3 for
GeoPackage.
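
As a rough sketch of how that URI can be derived from a formatId, the
formatId is percent-encoded (including the "/" and "+") and appended to the
formats service base URL. The helper names format_uri and
format_id_from_uri are made up for illustration.

```python
from urllib.parse import quote, unquote

# Base URL of the DataONE formats service, as listed in the links above.
FORMATS_BASE = "https://cn.dataone.org/cn/v2/formats"


def format_uri(format_id):
    """Build the service URI for a formatId (hypothetical helper).

    safe="" forces "/" to be encoded as %2F; "+" becomes %2B by default.
    """
    return f"{FORMATS_BASE}/{quote(format_id, safe='')}"


def format_id_from_uri(uri):
    """Recover the formatId from a URI produced by format_uri."""
    prefix = FORMATS_BASE + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a formats service URI: {uri}")
    return unquote(uri[len(prefix):])


uri = format_uri("application/geopackage+sqlite3")
print(uri)
```

The round trip matters because the encoded path segment is the only place
the formatId appears in the URI, so it has to decode back unchanged.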

For compressed formats and compound types, we've been using the MIME media
type conventions for how to indicate subtypes. For example, for ESRI zipped
Shapefiles we use `application/vnd.shp+zip`, and for SOSO-compatible
JSON-LD we use `science-on-schema.org/Dataset;ld+json` with a media type of
`application/ld+json`. This convention doesn't cover all cases of subtypes,
but it covers some common ones.
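
To make the suffix convention concrete, here is a small Python sketch that
splits a media type into its top-level type, base subtype, and "+suffix"
part, so that a search filter could, for instance, group everything ending
in "+zip". The function split_media_type is a hypothetical helper and only
handles plain media types, not formatIds such as the SOSO one above.

```python
def split_media_type(media_type):
    """Split "type/subtype+suffix" into (type, subtype, suffix or None)."""
    # Drop any parameters such as "; charset=utf-8" and normalize case.
    essence = media_type.split(";", 1)[0].strip().lower()
    top, sep, subtype = essence.partition("/")
    if not sep or not top or not subtype:
        raise ValueError(f"not a media type: {media_type!r}")
    # The suffix follows the last "+" in the subtype, if there is one.
    base, plus, suffix = subtype.rpartition("+")
    if not plus:
        return top, subtype, None
    return top, base, suffix


for mt in ("application/vnd.shp+zip", "application/ld+json", "text/xml"):
    print(mt, "->", split_media_type(mt))
```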

Hope this is all useful to you.
Matt

*Matthew B. Jones*
ORCID: 0000-0003-0077-4738 <https://orcid.org/0000-0003-0077-4738>
Director of Informatics R&D, National Center for Ecological Analysis and
Synthesis <http://www.nceas.ucsb.edu/ecoinfo>
PI, NSF Arctic Data Center <https://arcticdata.io/>
Director, DataONE <https://dataone.org/> program
University of California Santa Barbara


On Wed, Aug 3, 2022 at 12:23 PM Matthew Mayernik via Esip-preserve <
esip-preserve at lists.esipfed.org> wrote:

> Hi all,
> At NCAR we are wanting to be more consistent in how we record data file
> formats in metadata. The goal is to potentially enable people to
> search/filter data sets based on their file formats.
>
> We have a couple of questions about this:
>
> 1. Are there standard vocabularies for file formats that you use?
> 2. Should we be using mime types for this purpose?
> 3. How do you deal with compressed file formats, such as zip or tar, where
> the actual file types of the data require more work to determine?
>
> Thanks for any insight!
>
> Best,
> Matt
>
> Matthew Mayernik, Ph.D.
> Project Scientist & Research Data Services Specialist
> NCAR Library
> National Center for Atmospheric Research (NCAR)
> University Corporation for Atmospheric Research (UCAR)
> Boulder, CO, USA
> mayernik at ucar.edu
>
> _______________________________________________
> Esip-preserve mailing list
> To start a new topic: Esip-preserve at lists.esipfed.org
> To unsubscribe and manage prefs:
> https://lists.esipfed.org/mailman/listinfo/esip-preserve
>