<div dir="ltr"><div>Hi Matt --</div><div><br></div><div>There is a long history of discussions on format vocabularies, with vocabulary services for formats such as PRONOM, GDFR, and UDFR all playing a role in various communities. At one point, I thought that UDFR out of the California Digital Library was going to create a comprehensive and persistent service, but alas it shut down. For DataONE, we ended up creating an open and extensible format vocabulary service that is community managed and that we ask all network members to use when classifying file types. Each entry includes a unique formatId, as well as metadata including the format name, version, media type, and typical file extensions. We found that MIME media types (e.g., 'text/xml') were often not specific enough to handle our format versioning needs (e.g., for metadata standards, we recognize multiple versions of various metadata profiles, based on how data centers across the network use them). We also have an open, GitHub-based approach to handle additions to the vocabulary. Here is an example entry for GeoPackage:</div><div><br></div><div><pre id="gmail-line1"><span class="gmail-pi"><?xml version="1.0" encoding="UTF-8" standalone="yes"?></span><span>
<span id="gmail-line2"></span></span><span><<span class="gmail-start-tag"><span><span class="gmail-attribute-name">d1v2</span></span>:objectFormat</span> <span class="gmail-attribute-name">xmlns:d1v2</span>="<a class="gmail-attribute-value">http://ns.dataone.org/service/types/v2.0</a>"></span><span>
<span id="gmail-line3"></span> </span><span><<span class="gmail-start-tag">formatId</span>></span><span>application/geopackage+sqlite3</span><span></<span class="end-tag">formatId</span>></span><span>
<span id="gmail-line4"></span> </span><span><<span class="gmail-start-tag">formatName</span>></span><span>GeoPackage Encoding Standard (OGC) Format Family</span><span></<span class="end-tag">formatName</span>></span><span>
<span id="gmail-line5"></span> </span><span><<span class="gmail-start-tag">formatType</span>></span><span>DATA</span><span></<span class="end-tag">formatType</span>></span><span>
<span id="gmail-line6"></span> </span><span><<span class="gmail-start-tag">mediaType</span> <span class="gmail-attribute-name">name</span>="<a class="gmail-attribute-value">application/geopackage+sqlite3</a>"<span>/</span>></span><span>
<span id="gmail-line7"></span> </span><span><<span class="gmail-start-tag">extension</span>></span><span>gpkg</span><span></<span class="end-tag">extension</span>></span><span>
<span id="gmail-line8"></span></span><span></<span class="end-tag"><span><span class="gmail-attribute-name">d1v2</span></span>:objectFormat</span>></span></pre></div><div><br></div><div>Some links:</div><div><ul><li>DataONE Format Vocabulary: <a href="https://github.com/DataONEorg/object-formats">https://github.com/DataONEorg/object-formats</a> <br></li><ul><li>Example discussion on adding the GeoPackage format: <a href="https://github.com/DataONEorg/object-formats/issues/14">https://github.com/DataONEorg/object-formats/issues/14</a></li></ul><li>DataONE Formats Service: <a href="https://cn.dataone.org/cn/v2/formats">https://cn.dataone.org/cn/v2/formats</a></li><li>Example GeoPackage format from the service: <a href="https://cn.dataone.org/cn/v2/formats/application%2Fgeopackage%2Bsqlite3">https://cn.dataone.org/cn/v2/formats/application%2Fgeopackage%2Bsqlite3</a></li><li>Background <a href="https://blogs.loc.gov/thesignal/2011/06/a-meeting-of-the-minds-for-udfr/">article on UDFR</a><br></li></ul></div><div>My only issue with the DataONE formats service is that we don't explicitly identify a URI for each format type in the definition, which in retrospect we could have done. In practice, we use the DataONE format service URI to represent the format in URI space when we need it for linked data and similar applications. For example, <a href="https://cn.dataone.org/cn/v2/formats/application%2Fgeopackage%2Bsqlite3">https://cn.dataone.org/cn/v2/formats/application%2Fgeopackage%2Bsqlite3</a> for GeoPackage.</div><div><br></div><div>For compressed formats and compound types, we've been using the MIME media type conventions for indicating subtypes. For example, for ESRI zipped Shapefiles we use `application/vnd.shp+zip`, and for SOSO-compatible JSON-LD we use the formatId `science-on-schema.org/Dataset;ld+json` with a media type of `application/ld+json`. 
It doesn't cover all cases of subtypes, but it covers some common ones.</div><div><br></div><div>Hope this is all useful to you.<br></div><div>Matt<br></div><div><br></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><b>Matthew B. Jones</b></div><div>ORCID: <a href="https://orcid.org/0000-0003-0077-4738" target="_blank">0000-0003-0077-4738</a></div><div>
Director of Informatics R&D, <a href="http://www.nceas.ucsb.edu/ecoinfo" style="color:rgb(17,85,204)" target="_blank">National Center for Ecological Analysis and Synthesis</a></div><div>PI, NSF <a href="https://arcticdata.io/" style="color:rgb(17,85,204)" target="_blank">Arctic Data Center</a></div><div>Director, <a href="https://dataone.org/" style="color:rgb(17,85,204)" target="_blank">DataONE</a> program
</div><div>
University of California Santa Barbara</div></div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 3, 2022 at 12:23 PM Matthew Mayernik via Esip-preserve <<a href="mailto:esip-preserve@lists.esipfed.org">esip-preserve@lists.esipfed.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi all,<div>At NCAR we are wanting to be more consistent in how we record data file formats in metadata. The goal is to potentially enable people to search/filter data sets based on their file formats.</div><div><br></div><div>We have a couple of questions about this:</div><div><br></div><div>1. Are there standard vocabularies for file formats that you use?</div><div>2. Should we be using mime types for this purpose? </div><div>3. How do you deal with compressed file formats, such as zip or tar, where the actual file types of the data require more work to determine?</div><div><br></div><div>Thanks for any insight!</div><div><br></div><div>Best,</div><div>Matt</div><div><br></div><div>Matthew Mayernik, Ph.D.</div><div>Project Scientist & Research Data Services Specialist</div><div>NCAR Library</div><div>National Center for Atmospheric Research (NCAR)</div><div>University Corporation for Atmospheric Research (UCAR)</div><div>Boulder, CO, USA</div><div><a href="mailto:mayernik@ucar.edu" target="_blank">mayernik@ucar.edu</a></div></div>
</blockquote></div>
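<div><br></div><div>As a small illustration of the point in the reply above about using the formats service URL as a stand-in URI for a format: the formatId has to be percent-encoded so that characters such as "/" and "+" fit in a single path segment. The following is only a sketch. The base URL and the two formatIds are taken from the message; the helper names `format_uri` and `format_id_from_uri` are hypothetical and not part of any DataONE library, and the sketch does not check that a given URI actually resolves.</div>

```python
from urllib.parse import quote, unquote

# Base URL of the DataONE formats service, as given in the message above.
FORMATS_BASE = "https://cn.dataone.org/cn/v2/formats"


def format_uri(format_id):
    """Build a formats-service URI for a formatId (hypothetical helper).

    safe="" makes quote() percent-encode every reserved character,
    including "/" and "+", so the whole formatId stays in one path segment.
    """
    return FORMATS_BASE + "/" + quote(format_id, safe="")


def format_id_from_uri(uri):
    """Recover the formatId from a URI built by format_uri (hypothetical helper)."""
    last_segment = uri.rsplit("/", 1)[1]
    return unquote(last_segment)


if __name__ == "__main__":
    for fid in ("application/geopackage+sqlite3",
                "science-on-schema.org/Dataset;ld+json"):
        uri = format_uri(fid)
        print(fid, "maps to", uri)
        # Round trip: decoding the last path segment gives back the formatId.
        assert format_id_from_uri(uri) == fid
```

<div>For the GeoPackage formatId this reproduces the URL quoted in the message, https://cn.dataone.org/cn/v2/formats/application%2Fgeopackage%2Bsqlite3.</div>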