[Esip-discovery] Datacasting custom elements proposal

Mattmann, Chris A (388J) chris.a.mattmann at jpl.nasa.gov
Sat Mar 19 14:35:23 EDT 2011


On Mar 19, 2011, at 1:48 AM, Mccleese, Sean W (388A) wrote:

>> Theoretically yes. Practically, no.
>>
>> I can easily come up with an example, e.g.,:
>>
>> <?xml version="1.0"?>
>>
>> <channel
>>   xmlns:foo="http://example.com/1.0/some/url/that/doesnt/exist"
>>   xmlns:bar="http://example2.com/blah/"
>>   xmlns:datacasting="http://datacasting.jpl.nasa.gov/foo">
>>  <item>
>>   ...
>>     <foo:elem1>blah</foo:elem1>
>>     <bar:elem1>blah2</bar:elem1>
>>     <datacasting:elem1>yadda yadda</datacasting:elem1>
>> ...
>> </item>
>> </channel>
>>
>> To validate the elements against the above, ideally, yes, the fake URLs I inserted should be real and should reference some .xsd. In practice, though, they don't need to be real in order for this to be a valid XML document (and, in some cases, even a valid feed, consumable by downstream OTS aggregators).
>
> Ah, you're making our argument for us.  ;-)

Who's arguing?

> The whole idea is that you
> have to store the metadata about the metadata somewhere.  If not the
> channel metadata, then in the schema.  And if it's possible, and even
> likely, for the schema to not even be there, then we have a problem.

What's the problem? Instead of talking in the abstract, I've tried wherever possible to give actual examples to talk through what I'm describing. My examples above were in response to Sean's comment:

>> The primary difference between <datacasting:customElement name="maxWindSpeed"> and <datacasting:maxWindSpeed> is that the latter requires datacasting:maxWindSpeed to be declared in an XML schema and the former does not.

That's simply not true. Hence my example above, which demonstrates that namespacing can be used for things beyond validation; arguably to do just what the name says, i.e., delineate one family of elements from another, potentially conflicting, family.
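
To make that concrete, here's a quick Python sketch (standard library only, and untested off the top of my head). The fake URIs from my example above are never fetched; the parser just treats them as opaque identifiers that keep the two elem1 families apart:

import xml.etree.ElementTree as ET

FEED = """<channel
    xmlns:foo="http://example.com/1.0/some/url/that/doesnt/exist"
    xmlns:bar="http://example2.com/blah/">
  <item>
    <foo:elem1>blah</foo:elem1>
    <bar:elem1>blah2</bar:elem1>
  </item>
</channel>"""

item = ET.fromstring(FEED).find("item")
# Clark notation, {uri}localname, disambiguates the two element families;
# no network access, no .xsd, and the document is still well-formed XML.
print(item.find("{http://example.com/1.0/some/url/that/doesnt/exist}elem1").text)  # blah
print(item.find("{http://example2.com/blah/}elem1").text)                          # blah2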

>
> Besides, you're encouraging XML that will not validate.

No, I'm not. I'm simply pointing out an example showing that Sean's statement is not true in general.

>  Having a static
> schema means that it WILL validate, without having to worry about the
> data provider maintaining a separate schema file.

What's a static schema? If you're equating a static schema to some home-brew way of codifying schema information in XML tag elements, such that having the home-brew obviates the need for an external schema file, then yes and no.

Yes, having the home-brew means you *do* have a mechanism to validate the m-a-m in a single file. However, XML Schema allows that too; see:

http://msdn.microsoft.com/en-us/library/aa302288.aspx

No, it doesn't remove any worry; that worry is the whole reason factoring pieces of a file out into separate external files came to be in the first place.

You'll note that in either case it's pretty ugly:

1. In the case of the datacasting home-brew, you've got to understand how datacasting codifies its m-a-m in a single file, which, to me, is cluttered with m-a-m-bearing attributes everywhere.
2. In the case of XML Schema (and its inlining option), you've got your schema inlined with your actual met instance. Ugly too; see the sketch below.
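
To show what I mean about option 2, here's a rough lxml sketch. The feedDocument wrapper and the fishing-out step are my own invention -- there's no single standard way to carry an inline schema, which is part of why it's ugly:

from lxml import etree

DOC = b"""<feedDocument>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             targetNamespace="http://datacasting.jpl.nasa.gov/foo"
             elementFormDefault="qualified">
    <xs:element name="maxWindSpeed" type="xs:float"/>
  </xs:schema>
  <dc:maxWindSpeed
      xmlns:dc="http://datacasting.jpl.nasa.gov/foo">22.0</dc:maxWindSpeed>
</feedDocument>"""

root = etree.fromstring(DOC)
# Fish the inlined schema out of the instance document...
xsd = root.find("{http://www.w3.org/2001/XMLSchema}schema")
schema = etree.XMLSchema(etree.ElementTree(xsd))
# ...then validate the met element sitting right next to it.
met = root.find("{http://datacasting.jpl.nasa.gov/foo}maxWindSpeed")
print(schema.validate(met))  # True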

>
> There are really two concepts here that are being conflated.  One is the
> idea of where the metadata-about-the-metadata goes.  The options there
> are in the schema file, or in the channel.

Actually, we're talking about *both* the item *and* the channel here. See:

http://ghrsst.jpl.nasa.gov/datacasting/AMSRE-L2P-gen.xml

> The second concept is
> whether the tags themselves are schema-driven (datacasting:windSpeed) or
> keyword-driven (datacasting:customElement name=windSpeed).

It's beyond tags.

This isn't the forum for a lesson on XML Schema, but XML Schema codifies m-a-m about tags, attributes, structures, entire documents, and sets of documents, and is a W3C standard. Why should the datacasting project invent yet another one?

>
> Taking just the where-does-it-live concept, if the schema language does
> not allow additional metadata-about-the-metadata to be added such as
> conforms-to-standard, processing-method, etc., then you pretty much
> *have* to describe the m-a-t-m in the channel.
> As far as I know, schema
> languages allow you to describe the structure, but not so much arbitrary
> m-a-t-m outside of comment/description fields, which are not
> machine-parseable.

What's stopping the datacasting project from defining an XML Schema with those elements (conforms-to-standard, processing-method, etc.) and then making them optional elements that are declared in XML instance documents? There's nothing about XML Schema that prevents that. In fact, it's precisely what it encourages.
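
Here's a sketch of what that could look like (lxml again; the element names are lifted straight from your list, everything else is made up). The m-a-m fields are declared once in the schema, marked optional with minOccurs="0", so instances can carry them or omit them and still validate:

from lxml import etree

XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://datacasting.jpl.nasa.gov/foo"
    elementFormDefault="qualified">
  <xs:element name="maxWindSpeed">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="conforms-to-standard" type="xs:string" minOccurs="0"/>
        <xs:element name="processing-method" type="xs:string" minOccurs="0"/>
        <xs:element name="value" type="xs:float"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))

# An instance that omits both optional m-a-m elements still validates:
doc = etree.fromstring(
    b'<dc:maxWindSpeed xmlns:dc="http://datacasting.jpl.nasa.gov/foo">'
    b'<dc:value>22</dc:value>'
    b'</dc:maxWindSpeed>')
print(schema.validate(doc))  # True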
>
> Taking just the schema-driven vs. keyword-driven debate, that is an
> endless one.  ;-)

Heh, yes it sure is.

> To some extent it doesn't matter, if you don't care
> about validation.

I disagree. It matters a bunch if you care about standardization, downstream tool support, already-existing validation that you don't have to bake up and write all over again, and a number of other software engineering and software reuse principles.

>  But is it wise to not care?  It seems like we should
> not be advocating a system where it is easy (and practically speaking,
> common) to create unvalidateable XML.

No one proposed building a system around my example; it was simply there to illustrate the point above.

>
>
>>> Declaring something in a schema is a fairly heavyweight operation, especially because most schema definition languages can be fairly opaque.
>>
>> I'd generally agree with this, but the heavy-weightedness comes on the definition side, not on the consumer side, as validation can be turned on and off by the actual XML parser.
>
> Depends on how you define it.  I see data providers as "users" of the
> casting service that we are trying to define.
> Therefore it IS a
> responsibility of the users to worry about the schema.

Who's talking about worrying? I said that the "heavy-weightedness" of declaring "something in schema" comes on the definition side, not the consumer side, because, as you mentioned above, there's no arguing that schema leads to compact XML document "instances"; the consumer side gets the lighter end of it.

I'm not arguing with the fact that the schema itself is a concern from both the provider and consumer side *in general*. +1 to that.

>  It's true that
> the final end-users don't have to, and that data providers should be
> more sophisticated than end users.  But in our limited experience, data
> providers have trouble even following the datacasting spec properly.
> Especially at first, they're likely to see this as a sideshow, they'll
> get it up but aren't that enthusiastic and thus pay more attention to
> the "main" data.

This would seem to suggest using an external XML Schema, maintained by the datacasting team and extended where needed, together with a more compact, easier-for-the-data-provider format based on that schema.

>
> Well unless you're sure, that's not an argument we can hang our hat on.
>  ;-)  Even if so, the schemas I've seen are universally ugly and
> including them inline in the feed just doesn't taste right.

Re-defining, in a home-brew, a method of codifying m-a-m that's already present in a W3C standard that has existed for over 10 years doesn't taste right to me.

>>
>>> It also means that every client has to read the schema in order to determine simple things like data type.
>>
>> How is data type a simple thing, just by referencing it in an attribute?
>>
>> Are you proposing that just because you say:
>>
>> <datacasting:customElement name="maxWindSpeed" type="real">22</datacasting:customElement>
>>
>> That that's simpler than:
>>
>> <datacasting:windSpeed>22</datacasting:windSpeed>
>>
>> And then knowing (via schema) that windSpeed is a real?
>
> You're conflating the two concepts I talked about above.  The type would
> not be in the item, it would be in the item definition in the channel.

My example certainly is conflating them, +1; thanks for pointing that out, Bob.

However, the point of my example was to illustrate that Sean's point about "reading the schema" in order to determine simple things like data type doesn't seem to hold up. *Either way*, you have to read the schema, whether you read it via datacasting's home-brew method of codifying it or via the W3C XML Schema standard.
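
To be concrete, here's what "reading the schema" looks like against the home-brew (the customEltDef element and its attributes are my guesses from the examples in this thread, not the actual spec). The m-a-m lookup table has to be built either way; it just lives in the channel instead of an .xsd:

import xml.etree.ElementTree as ET

CHANNEL = """<channel xmlns:datacasting="http://datacasting.jpl.nasa.gov/foo">
  <datacasting:customEltDef name="maxWindSpeed" type="real"/>
  <item>
    <datacasting:customElement name="maxWindSpeed">22</datacasting:customElement>
  </item>
</channel>"""

DC = "{http://datacasting.jpl.nasa.gov/foo}"
root = ET.fromstring(CHANNEL)
# Building the channel-level type table *is* schema reading, home-brew style.
types = {d.get("name"): d.get("type") for d in root.findall(DC + "customEltDef")}
elem = root.find("item").find(DC + "customElement")
print(elem.get("name"), types[elem.get("name")], elem.text)  # maxWindSpeed real 22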

> And yes, that's far simpler, because you have to parse just one document
> (the feed) and you can completely ignore the schema.
>
> Data type is critical because general-purpose applications will need to
> work with unknown metadata.  That's the whole point.

You'll find no disagreement from me on that!

>
>>
>>> With the attribute method, quick-and-dirty clients (e.g. python scripts, etc) could ignore validation entirely, just assume the XML is correct, and extract data.
>>
>> They could in either sense. You just turn off the schema validation in the quick and dirty client parser.
>
> True.  But you lose data type info that way.

Yes, I know. But you lose it in either example, Sean's or my own. That's not the point.

>
>>
>> In addition, if validation is not a concern, then why do we care about the above?
>
> Why would it NOT be a concern?

Why don't you ask Sean, who suggested it?

>
>>
>>>  This is not possible using the schema method because data type must be buried in the schema, meaning parsing of it is mandatory to do anything general with the data.  In the keyword case, you can ignore the schema.  While that's not recommended for production clients, we should not ignore the one-off script kinds of uses.
>>
>> I don't think I'm proposing that.
>
> Where are you proposing the data type be stored then?

In XML Schema, either inlined in the document (not as good), or factored out into a separate referenced XML Schema (better).
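
And for the factored-out case, the client-side cost is about three lines of lxml (file names hypothetical; the feed name echoes the AMSRE example above). A validating parse rejects a windSpeed that isn't a legal float before any downstream code ever sees it:

from lxml import etree

schema = etree.XMLSchema(etree.parse("datacasting-custom.xsd"))  # hypothetical .xsd
parser = etree.XMLParser(schema=schema)          # validate while parsing
feed = etree.parse("AMSRE-L2P-gen.xml", parser)  # raises XMLSyntaxError if invalid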

>>
>>>
>>> Furthermore, in a lot of situations errors/mismatches in the schema will be reported by some deeply-embedded part of the XML parsing stack which is likely to make it harder for clients to parse & present these errors in an intelligible way to the user(s).
>>
>> I agree that the errors may be embedded in the XML schema library, that's true, but I'm not sure that it's harder for clients to parse and present the errors. There are tons of schema libraries in just about every language that provide data structures, control flow, etc., that can deal with unmarshalling those errors.
>
> It's certainly possible, but harder.  In the keyword-driven case, you
> know that any XML structure error is a Very Bad Thing and you can be
> justified in rejecting the document out of hand.  Errors in metadata
> keywords would show up as lookup failures against the metadata table and
> could be easily ignored, on the theory that the rest of the data is
> probably still good.  In the schema case, you'd have to separate the two
> based on XML library messages.

Huh?

>
>>
>>> Basically this would turn custom element metadata errors into structural errors rather than data errors. This whole thing basically boils down to the question as to whether custom metadata is a "data" issue or a "structural" issue.
>>
>> Well, the debate of attributes versus tags has existed in XML since its beginning. In reality you can do just about the same things with both; it's almost a case of style.
>
> To some extent, but it goes beyond just attributes vs. tags.  I actually
> don't care much if the value is an attribute or the contents of a node,
> except that an attribute allows you to use /> and is thus more compact.
>  But that's not the primary debate.

Agreed on that. Again, I'm just illustrating a counter-example.


>
>>
>>>
>>> One thing worth noting is that, as of this moment, the Datacasting team under Andrew Bingham has the attribute method working and implemented in the Datacasting Feed Reader (http://datacasting.jpl.nasa.gov). What we have discovered through customer use cases is that the custom metadata is of prime importance to users (as is somewhat predictable) but the burden on data providers to create conforming custom metadata is fairly high. We have seen data providers struggle with the implementation of custom metadata even in the "simpler" case of the attribute lists -- and if they are required to auto-generate XML schemas and update those schemas as custom metadata is added/removed it will likely further encumber the process of custom metadata injection.
>>
>> I'll agree with that.
>>
>> In the case of some other projects I've seen (e.g., PDS), some of the data providers are willing to take this on but I can certainly see how it's an issue beyond traditional inlining.
>
> PDS has different requirements, being a long-term archive mechanism.

Agreed, but they don't have different requirements when it comes to XML Schema and where they're going with PDS4. They have the exact same scenarios that datacasting does.

> They do not expect changes, and thus static schemas are not as hard to
> deal with.

That's not true. They expect changes all the time. New missions define new elements and extensions of the existing PDS data model.

> We should expect changes... we should encourage data
> providers to add new and interesting metadata all the time.

Agreed!

>  It also
> takes many work-months to properly prepare data for PDS (I know from
> experience!).

As do I :)

>  That's far too heavyweight for this application.  PDS
> also does not have a convenient "channel" they can put the m-a-t-m in.

They do in PDS4 and are working towards it.

>>>
>>> Basically, I think schema should be relatively static documents describing the structure of the file, rather than highly dynamic documents that change with every revision of every feed.
>>
>> Well that's the thing -- don't count schema out though. You can do some pretty advanced things with schema, even having optional elements, etc.
>>
>> Paul Ramirez and Steve Hughes and the Data Design working group are doing this for PDS right now and they have similar issues.
>
> Yes, but they also have different requirements.  While it would be nice
> to use the same mechanism, I don't think it's practical.

I think it *is* practical, and what's more, they're using a standard rather than a home-brew.

>
>>
>>>
>>> If we do decide to go with the schema method, there must be a namespace dedicated ONLY to custom elements - nothing else.
>>
>> I think there's a different way to do it so that custom/dynamic elements can be leveraged. Maybe we can ask Paul to join in on the conversation.
>
> Care to share?  ;-)

Heh, I haven't had a chance to talk with Paul more about it, but I'll definitely share if I do! :)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann at nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
Phone: +1 (818) 354-8810
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


