[Esip-discovery] Datacasting custom elements proposal

Fri Mar 18 17:26:56 EDT 2011

On Mar 18, 2011, at 2:04 PM, Mattmann, Chris A (388J) wrote:

> Hi Sean,
> 
>> The primary difference between <datacasting:customElement="maxWindSpeed"> and <datacasting:maxWindSpeed> is that the latter requires datacasting:maxWindSpeed to be declared in an XML schema and the former does not.
> 
> Theoretically yes. Practically, no. 

I think it's best to use elements here and not attributes. My experience is that while an element-centric XML document is harder to design, it is easier to use in practice, especially when software is the consumer.

XML Schema is not that hard and there's no requirement that consumers actually use it (but they can if they want to). If the schemas we use are well written and set elementFormDefault="qualified" then we will be well positioned to transform this XML into RDF, exposing it to semantic searching tools. For the latter, qualified names are important (RDF flattens hierarchies). It's possible to qualify attribute names, but that is rarely done (href in the xlink schema being a notable exception).
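
To make the point about qualified names concrete, here is a minimal sketch (not a proposal for the actual mapping) of how a namespace-qualified element can be flattened into an RDF-style triple. The item fragment, the subject URI, and the convention of joining namespace and local name with "#" are all my own assumptions for illustration; only Python's standard xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET

# Assumption: a made-up item fragment and namespace URI, for illustration only.
ITEM_XML = """<item xmlns:datacasting="http://datacasting.jpl.nasa.gov/foo">
  <datacasting:maxWindSpeed>22</datacasting:maxWindSpeed>
</item>"""


def item_to_triples(xml_text, subject):
    """Flatten the qualified child elements of an item into (s, p, o) triples."""
    triples = []
    for child in ET.fromstring(xml_text):
        # ElementTree spells a qualified name as "{namespace-uri}local-name".
        if not child.tag.startswith("{"):
            continue  # unqualified element: no namespace to build a predicate from
        namespace, local_name = child.tag[1:].split("}", 1)
        # Assumption: namespace + "#" + local name is one simple way to mint a predicate URI.
        predicate = namespace + "#" + local_name
        triples.append((subject, predicate, (child.text or "").strip()))
    return triples


# Hypothetical subject URI for the item being described.
triples = item_to_triples(ITEM_XML, "urn:example:item-1")
for triple in triples:
    print(triple)
```

Because the hierarchy is gone once the data is in triples, the namespace is the only thing that keeps one provider's maxWindSpeed distinct from another's.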

James

> 
> I can easily come up with an example, e.g.,:
> 
> <?xml version="1.0"?>
> 
> <channel
>  xmlns:foo="http://example.com/1.0/some/url/that/doesnt/exist"
>  xmlns:bar="http://example2.com/blah/"
>  xmlns:datacasting="http://datacasting.jpl.nasa.gov/foo">
> <item>
>  ...
>    <foo:elem1>blah</foo:elem1>
>    <bar:elem1>blah2</bar:elem1>
>    <datacasting:elem1>yadda yadda</datacasting:elem1>
> ...
> </item>
> </channel>
> 
> To validate the elements against the above, ideally, yes, the fake URLs I inserted should be real, and should reference some .xsd. In practice, though, they don't need to be real for this to be a valid XML document (and, even in some cases, a valid feed, consumable by downstream OTS aggregators).
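
A quick way to check this is to run such a feed through a non-validating parser. The sketch below uses a variant of the example with the namespace prefixes declared on the channel element and the "..." lines dropped; that feed text is my assumption, not an actual Datacasting feed. Python's standard ElementTree parser never tries to fetch the namespace URIs, so the fact that they don't resolve makes no difference.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed variant of the example feed, with the namespace
# prefixes declared on <channel>. None of the URIs needs to resolve.
FEED = """<?xml version="1.0"?>
<channel
  xmlns:foo="http://example.com/1.0/some/url/that/doesnt/exist"
  xmlns:bar="http://example2.com/blah/"
  xmlns:datacasting="http://datacasting.jpl.nasa.gov/foo">
<item>
  <foo:elem1>blah</foo:elem1>
  <bar:elem1>blah2</bar:elem1>
  <datacasting:elem1>yadda yadda</datacasting:elem1>
</item>
</channel>"""

# No schema is loaded and no validation is performed; this only checks
# that the document is well-formed and namespace-correct.
root = ET.fromstring(FEED)

# The three elem1 elements stay distinct because each tag carries its namespace.
values = {child.tag: child.text for child in root.find("item")}
for tag, text in values.items():
    print(tag, "->", text)
```
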
> 
> 
>> Declaring something in a schema is a fairly heavyweight operation, especially because most schema definition languages can be fairly opaque.
> 
> I'd generally agree with this, but the heavy-weightedness comes on the definition side, not on the consumer side, as validation can be turned on and off by the actual XML parser.
> 
>> Not only that, but you have to make the schema available independently via a long-lived URL.
> 
> I can't think of one off the top of my head, but I think you can also somehow bake in those definitions to the file (or feed in this case) itself. Sure, DTDs are sexy from the perspective that you can inline them, but I'm sure XSDs have the same ability. Or, maybe I'm wrong :)
> 
>> It also means that every client has to read the schema in order to determine simple things like data type.
> 
> How is data type a simple thing, just by referencing it in an attribute?
> 
> Are you proposing that just because you say:
> 
> <datacasting:customElement="maxWindSpeed" type="real">22</datacasting:customElement>
> 
> That that's simpler than:
> 
> <datacasting:windSpeed>22</datacasting:windSpeed>
> 
> And then knowing (via schema) that windSpeed is a real?
> 
>> With the attribute method, quick-and-dirty clients (e.g. python scripts, etc) could ignore validation entirely, just assume the XML is correct, and extract data.
> 
> They could in either sense. You just turn off the schema validation in the quick and dirty client parser.
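
Here is a rough sketch of what a quick-and-dirty client looks like in each style, with validation off in both. Several things are assumptions on my part: the attribute-style element is written with name= and type= attributes (the form quoted above is shorthand and would not parse as written), the CASTS table of type names is invented, and SCHEMA_TYPES is a hand-written stand-in for whatever a real schema would declare.

```python
import xml.etree.ElementTree as ET

NS = "http://datacasting.jpl.nasa.gov/foo"  # placeholder namespace URI

# Assumption: attribute style spelled with explicit name= and type= attributes.
ATTR_STYLE = (
    f'<item xmlns:datacasting="{NS}">'
    '<datacasting:customElement name="maxWindSpeed" type="real">22'
    "</datacasting:customElement></item>"
)
ELEM_STYLE = (
    f'<item xmlns:datacasting="{NS}">'
    "<datacasting:windSpeed>22</datacasting:windSpeed></item>"
)

# Assumption: an invented mapping from type names to Python converters.
CASTS = {"real": float, "integer": int, "string": str}

# Stand-in for type information a client would otherwise read from a schema.
SCHEMA_TYPES = {"windSpeed": float}


def read_attr_style(xml_text):
    """Attribute style: the type travels inside the document itself."""
    out = {}
    for el in ET.fromstring(xml_text).iter(f"{{{NS}}}customElement"):
        out[el.get("name")] = CASTS.get(el.get("type"), str)(el.text)
    return out


def read_elem_style(xml_text, schema_types):
    """Element style: the type comes from outside; fall back to str if unknown."""
    out = {}
    prefix = f"{{{NS}}}"
    for el in ET.fromstring(xml_text):
        if el.tag.startswith(prefix):
            local_name = el.tag[len(prefix):]
            out[local_name] = schema_types.get(local_name, str)(el.text)
    return out


attr_result = read_attr_style(ATTR_STYLE)
elem_result = read_elem_style(ELEM_STYLE, SCHEMA_TYPES)
untyped_result = read_elem_style(ELEM_STYLE, {})
print(attr_result, elem_result, untyped_result)
```

Neither reader validates anything. The only practical difference is where the type comes from, and a script that doesn't care about types can skip the schema and take everything as a string.
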
> 
> In addition, if validation is not a concern, then why do we care about the above?
> 
>> This is not possible using the schema method because data type must be buried in the schema, meaning parsing of it is mandatory to do anything general with the data.  In the keyword case, you can ignore the schema.  While that's not recommended for production clients, we should not ignore the one-off script kinds of uses. 
> 
> I don't think I'm proposing that.
> 
>> 
>> Furthermore, in a lot of situations errors/mismatches in the schema will be reported by some deeply-embedded part of the XML parsing stack which is likely to make it harder for clients to parse & present these errors in an intelligible way to the user(s).
> 
> I agree that the errors may be embedded in the XML schema library, but I'm not sure that it's harder for clients to parse and present the errors. There are tons of schema libraries in just about every language that provide data structures, control flow, etc., that can deal with unmarshalling those errors.
> 
>> Basically this would turn custom element metadata errors into structural errors rather than data errors. This whole thing basically boils down to the question as to whether custom metadata is a "data" issue or a "structural" issue.
> 
> Well, the debate of attributes versus tags has existed in XML since its beginning. In reality you can do just about the same things with both; it's almost a case of style.
> 
>> 
>> One thing worth noting is that, as of this moment, the Datacasting team under Andrew Bingham has the attribute method working and implemented in the Datacasting Feed Reader (http://datacasting.jpl.nasa.gov). What we have discovered through customer use cases is that the custom metadata is of prime importance to users (as is somewhat predictable) but the burden on data providers to create conforming custom metadata is fairly high. We have seen data providers struggle with the implementation of custom metadata even in the "simpler" case of the attribute lists -- and if they are required to auto-generate XML schema and update those schema as custom metadata is added/removed it will likely further encumber the process of custom metadata injection.
> 
> I'll agree with that. 
> 
> In the case of some other projects I've seen (e.g., PDS), some of the data providers are willing to take this on but I can certainly see how it's an issue beyond traditional inlining.
> 
>> 
>> I would contend that by self-describing the metadata within the RSS/Atom channel metadata and utilizing these definitions through the tag attributes, we alleviate all of these problems while maintaining the required functionality. It even fits with established Atom concepts (e.g. the "link" tag). 
> 
> Sure, fair enough. 
> 
>> 
>> Basically, I think schema should be relatively static documents describing the structure of the file, rather than highly dynamic documents that change with every revision of every feed.
> 
> Well that's the thing -- don't count schema out though. You can do some pretty advanced things with schema, even having optional elements, etc. 
> 
> Paul Ramirez and Steve Hughes and the Data Design working group are doing this for PDS right now and they have similar issues.
> 
>> 
>> If we do decide to go with the schema method, there must be a namespace dedicated ONLY to custom elements - nothing else.
> 
> I think there's a different way to do it so that custom/dynamic elements can be leveraged. Maybe we can ask Paul to join in on the conversation.
> 
>> That allows predefined structural things like datacasting:dataSource to be added in the future, which is not possible if you simply reserve some names out of the namespace and then open it up to the world.  By the same token, the schema method really requires each feed or set of related feeds to have its own namespace, to avoid any possibility of collision... because collisions have structural implications (e.g. the data type might be different). Chris has asked to see some data on collision probability and I think given the schema implications that's an excellent suggestion. 
> 
> Thanks!
> 
>> 
>> Furthermore, only one group can "own" a schema at a time, so you cannot have multiple groups sharing one schema as they would have to do a lot of coordination for updates.  With the keyword method, collisions could occur but would not be a problem because the metadata is already isolated within the channel.  
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann at nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> Phone: +1 (818) 354-8810
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> _______________________________________________
> Esip-discovery mailing list
> Esip-discovery at lists.esipfed.org
> http://www.lists.esipfed.org/mailman/listinfo/esip-discovery

--
James Gallagher
jgallagher at opendap.org
406.723.8663