[Esip-discovery] Datacasting custom elements proposal
Mccleese, Sean W (388A)
sean.w.mccleese at jpl.nasa.gov
Sat Mar 19 04:48:59 EDT 2011
Bob Deen (a Co-I on the Datacasting project) wanted to weigh in on this subject but he has yet to join this list. Don't worry, I'm trying to pressure him to join, but for now I'm sending his reply for him and CC-ing him on it (as he requested).
His reply is below. Just reply-all until I can convince him to sign up ;)
On 3/18/11 1:17 PM, Mccleese, Sean W (388A) wrote:
> -----Original Message-----
> From: Mattmann, Chris A (388J)
> Sent: Friday, March 18, 2011 1:05 PM
> To: Mccleese, Sean W (388A)
> Cc: Hua, Hook (388C); esip-discovery at lists.esipfed.org
> Subject: Re: [Esip-discovery] Datacasting custom elements proposal
> Hi Sean,
>> The primary difference between <datacasting:customElement name="maxWindSpeed"> and <datacasting:maxWindSpeed> is that the latter requires datacasting:maxWindSpeed to be declared in an XML schema and the former does not.
> Theoretically yes. Practically, no.
> I can easily come up with an example, e.g.:
> <?xml version="1.0"?>
> <datacasting:feed
>     xmlns:datacasting="http://some.fake.url/datacasting"
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>     xsi:schemaLocation="http://some.fake.url/datacasting http://some.fake.url/datacasting.xsd">
>   <datacasting:elem1>yadda yadda</datacasting:elem1>
> </datacasting:feed>
> To validate the elements against the above, ideally, yes, the fake URLs I inserted should be real, and should reference some .xsd. In practice, though, they don't need to be real in order for this to be a valid XML document (and, in some cases, even a valid feed, consumable by downstream OTS aggregators).
Ah, you're making our argument for us. ;-) The whole idea is that you
have to store the metadata about the metadata somewhere. If not in the
channel metadata, then in the schema. And if it's possible, and even
likely, for the schema to not even be there, then we have a problem.
Besides, you're encouraging XML that will not validate. Having a static
schema means that it WILL validate, without having to worry about the
data provider maintaining a separate schema file.
There are really two concepts here that are being conflated. One is the
idea of where the metadata-about-the-metadata goes. The options there
are in the schema file, or in the channel. The second concept is
whether the tags themselves are schema-driven (datacasting:windSpeed) or
keyword-driven (datacasting:customElement name=windSpeed).
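The contrast between the two styles can be sketched in a few lines. (The namespace URI below is illustrative, not the actual Datacasting namespace.)

```python
import xml.etree.ElementTree as ET

NS = "http://datacasting.example.org/ns"  # illustrative namespace URI

# Schema-driven: the tag name itself carries the meaning, so you need
# the schema (or prior knowledge) to know what windSpeed is.
schema_driven = (
    f'<datacasting:windSpeed xmlns:datacasting="{NS}">22</datacasting:windSpeed>'
)

# Keyword-driven: one fixed tag plus a name attribute, discoverable by
# any generic XML parser with no schema at all.
keyword_driven = (
    f'<datacasting:customElement xmlns:datacasting="{NS}" '
    f'name="windSpeed">22</datacasting:customElement>'
)

elem = ET.fromstring(keyword_driven)
print(elem.get("name"), elem.text)  # windSpeed 22
```

Note that the keyword-driven parse recovers both the metadata name and its value without consulting anything outside the document itself.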
Taking just the where-does-it-live concept, if the schema language does
not allow additional metadata-about-the-metadata to be added such as
conforms-to-standard, processing-method, etc., then you pretty much
*have* to describe the m-a-t-m in the channel. As far as I know, schema
languages allow you to describe the structure, but not so much arbitrary
m-a-t-m outside of comment/description fields, which are not
machine-parseable. Given that, it would be far simpler to have *all*
the m-a-t-m (including name and data type) in the channel, rather than
having to look in two places. Plus, the channel is far more convenient
in that it's always there, it's attached to the feed itself, there's no
separate file to load, and thus no ambiguity or possibility for error in
terms of applying the proper m-a-t-m to the metadata.
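To make the everything-in-the-channel idea concrete, here is a hedged sketch; the element names (customEltDef, customElement) and the namespace URI are hypothetical stand-ins, not the actual Datacasting vocabulary:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the channel declares each custom element's m-a-t-m
# (name, type, units), and items carry only name + value.
feed = """\
<rss xmlns:datacasting="http://datacasting.example.org/ns">
  <channel>
    <datacasting:customEltDef name="windSpeed" type="real" units="m/s"/>
    <datacasting:customEltDef name="sensor" type="string"/>
    <item>
      <datacasting:customElement name="windSpeed">22.5</datacasting:customElement>
      <datacasting:customElement name="sensor">ASCAT</datacasting:customElement>
    </item>
  </channel>
</rss>"""

NS = {"datacasting": "http://datacasting.example.org/ns"}
root = ET.fromstring(feed)

# One pass over the channel yields the complete m-a-t-m table...
defs = {d.get("name"): d.attrib
        for d in root.iterfind(".//datacasting:customEltDef", NS)}

# ...which types every item value, with no separate schema file to load.
casts = {"real": float, "int": int, "string": str}
item = root.find(".//item")
values = {e.get("name"): casts[defs[e.get("name")]["type"]](e.text)
          for e in item.iterfind("datacasting:customElement", NS)}
print(values)  # {'windSpeed': 22.5, 'sensor': 'ASCAT'}
```

One parse of one document gives both the definitions and the typed values, which is the "no second place to look" property being argued for.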
Taking just the schema-driven vs. keyword-driven debate, that is an
endless one. ;-) To some extent it doesn't matter, if you don't care
about validation. But is it wise to not care? It seems like we should
not be advocating a system where it is easy (and practically speaking,
common) to create unvalidateable XML. That means we have to insist that
everyone provide a schema in the schema-driven method, which as we all
know is unlikely to happen properly. In the keyword-driven method,
nobody except this group has to worry about creating and managing
schemas; data providers can simply concentrate on their data (and metadata).
It's unquestionable that the schema-driven format is easier for humans
to read, and somewhat more compact. But I think those are second-order
criteria, not drivers here.
>> Declaring something in a schema is a fairly heavyweight operation, especially because most schema definition languages can be fairly opaque.
> I'd generally agree with this, but the heavy-weightedness comes on the definition side, not on the consumer side, as validation can be turned on and off by the actual XML parser.
Depends on how you define it. I see data providers as "users" of the
casting service that we are trying to define. Therefore it IS a
responsibility of the users to worry about the schema. It's true that
the final end-users don't have to, and that data providers should be
more sophisticated than end users. But in our limited experience, data
providers have trouble even following the datacasting spec properly.
Especially at first, they're likely to see this as a sideshow: they'll
get it up without much enthusiasm and pay most of their attention to
the "main" data. So we'll get a lot of "wrong" feeds. That will in
turn cause people to not trust the casting mechanism, which could hurt
our user base. It really does need to be As Simple As Possible for data
providers to encourage adoption.
>> Not only that, but you have to make the schema available independently via a long-lived URL.
> I can't think of one off the top of my head, but I think you can also somehow bake in those definitions to the file (or feed in this case) itself. Sure, DTDs are sexy from the perspective that you can inline them, but I'm sure XSDs have the same ability. Or, maybe I'm wrong :)
Well unless you're sure, that's not an argument we can hang our hat on.
;-) Even if so, the schemas I've seen are universally ugly and
including them inline in the feed just doesn't taste right.
>> It also means that every client has to read the schema in order to determine simple things like data type.
> How is data type a simple thing, just by referencing it in an attribute?
> Are you proposing that just because you say:
> <datacasting:customElement name="maxWindSpeed" type="real">22</datacasting:customElement>
> That that's simpler than:
> <datacasting:maxWindSpeed>22</datacasting:maxWindSpeed>
> And then knowing (via schema) that maxWindSpeed is a real?
You're conflating the two concepts I talked about above. The type would
not be in the item, it would be in the item definition in the channel.
And yes, that's far simpler, because you have to parse just one document
(the feed) and you can completely ignore the schema.
Data type is critical because general-purpose applications will need to
work with unknown metadata. That's the whole point. If we had to
change clients just because someone added windDirection to the existing
windSpeed, we're not going to get anywhere. Clients have to change to
add new data *types* of course, but there are far fewer of those.
Knowing the data type means the generic client can put up appropriate
GUIs and use appropriate comparisons for the metadata.
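The payoff of a declared type fits in a few lines; the values below are arbitrary illustrations:

```python
# Without type info every value is text, and text comparison gets
# numeric metadata wrong; with a declared "real" type the generic
# client can sort and filter correctly.
raw = ["9", "22", "104"]  # windSpeed values as parsed from a feed

lexicographic = sorted(raw)       # treats them as strings
numeric = sorted(raw, key=float)  # uses the declared type

print(lexicographic)  # ['104', '22', '9']  -- wrong order for numbers
print(numeric)        # ['9', '22', '104']  -- what the user expects

# A type-aware client can also offer a numeric range filter instead of
# a substring match:
def over_threshold(values, threshold):
    return [v for v in values if float(v) > threshold]

print(over_threshold(raw, 20))  # ['22', '104']
```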
>> With the attribute method, quick-and-dirty clients (e.g. python scripts, etc) could ignore validation entirely, just assume the XML is correct, and extract data.
> They could in either sense. You just turn off the schema validation in the quick and dirty client parser.
True. But you lose data type info that way.
> In addition, if validation is not a concern, then why do we care about the above?
Why would it NOT be a concern?
>> This is not possible using the schema method because data type must be buried in the schema, meaning parsing of it is mandatory to do anything general with the data. In the keyword case, you can ignore the schema. While that's not recommended for production clients, we should not ignore the one-off script kinds of uses.
> I don't think I'm proposing that.
Where are you proposing the data type be stored then?
>> Furthermore, in a lot of situations errors/mismatches in the schema will be reported by some deeply-embedded part of the XML parsing stack which is likely to make it harder for clients to parse & present these errors in an intelligible way to the user(s).
> I agree that the errors may be embedded in the XML schema library that's true, but I'm not sure that it's harder for clients to parse and present the errors. There are tons of schema libraries in just about every language that provide data structures, control flow, etc., that can deal with unmarshalling those errors.
It's certainly possible, but harder. In the keyword-driven case, you
know that any XML structure error is a Very Bad Thing and you can be
justified in rejecting the document out of hand. Errors in metadata
keywords would show up as lookup failures against the metadata table and
could be easily ignored, on the theory that the rest of the data is
probably still good. In the schema case, you'd have to separate the two
based on XML library messages.
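A sketch of that split, with illustrative element names and a hardcoded keyword table standing in for the channel definitions:

```python
import xml.etree.ElementTree as ET

KNOWN = {"windSpeed": float, "sensor": str}  # stand-in for channel defs

def read_custom_elements(xml_text):
    """Reject structural errors outright; ignore unknown keywords."""
    try:
        root = ET.fromstring(xml_text)   # any structural error...
    except ET.ParseError:
        return None                      # ...rejects the whole document
    out = {}
    for e in root.iter("customElement"):
        name = e.get("name")
        if name not in KNOWN:            # keyword lookup failure:
            continue                     # skip it, keep the rest
        out[name] = KNOWN[name](e.text)
    return out

# Unclosed tag: a Very Bad Thing, document rejected.
print(read_custom_elements("<item><customElement name='windSpeed'>22"))

# Unknown keyword: quietly dropped, the rest of the data survives.
good = ("<item><customElement name='windSpeed'>22</customElement>"
        "<customElement name='bogus'>x</customElement></item>")
print(read_custom_elements(good))  # {'windSpeed': 22.0}
```

The two failure modes never meet: one is an exception from the parser, the other is an ordinary dictionary lookup.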
>> Basically this would turn custom element metadata errors into structural errors rather than data errors. This whole thing basically boils down to the question as to whether custom metadata is a "data" issue or a "structural" issue.
> Well the debate of attributes versus tags has existed in XML since its beginning. In reality you can do just about the same things with both, it's almost a case of style.
To some extent, but it goes beyond just attributes vs. tags. I actually
don't care much if the value is an attribute or the contents of a node,
except that an attribute allows you to use /> and is thus more compact.
But that's not the primary debate.
>> One thing worth noting is that, as of this moment, the Datacasting team under Andrew Bingham has the attribute method working and implemented in the Datacasting Feed Reader (http://datacasting.jpl.nasa.gov). What we have discovered through customer use cases is that the custom metadata is of prime importance to users (as is somewhat predictable) but the burden on data providers to create conforming custom metadata is fairly high. We have seen data providers struggle with the implementation of custom metadata even in the "simpler" case of the attribute lists -- and if they are required to auto-generate XML schema and update those schema as custom metadata is added/removed it will likely further encumber the process of custom metadata injection.
> I'll agree with that.
> In the case of some other projects I've seen (e.g., PDS), some of the data providers are willing to take this on but I can certainly see how it's an issue beyond traditional inlining.
PDS has different requirements, being a long-term archive mechanism.
They do not expect changes, and thus static schemas are not as hard to
deal with. We should expect changes... we should encourage data
providers to add new and interesting metadata all the time. It also
takes many work-months to properly prepare data for PDS (I know from
experience!). That's far too heavyweight for this application. PDS
also does not have a convenient "channel" they can put the m-a-t-m in.
>> I would contend that by self-describing the metadata within the RSS/Atom channel metadata and utilizing these definitions through the tag attributes, we alleviate all of these problems while maintaining the required functionality. It even fits with established Atom concepts (e.g. the "link" tag).
> Sure, fair enough.
>> Basically, I think schema should be relatively static documents describing the structure of the file, rather than highly dynamic documents that change with every revision of every feed.
> Well that's the thing -- don't count schema out though. You can do some pretty advanced things with schema, even having optional elements, etc.
> Paul Ramirez and Steve Hughes and the Data Design working group are doing this for PDS right now and they have similar issues.
Yes, but they also have different requirements. While it would be nice
to use the same mechanism, I don't think it's practical.
>> If we do decide to go with the schema method, there must be a namespace dedicated ONLY to custom elements - nothing else.
> I think there's a different way to do it so that custom/dynamic elements can be leveraged. Maybe we can ask Paul to join in on the conversation.
Care to share? ;-)
>> That allows predefined structural things like datacasting:dataSource to be added in the future, which is not possible if you simply reserve some names out of the namespace and then open it up to the world. By the same token, the schema method really requires each feed or set of related feeds to have its own namespace, to avoid any possibility of collision... because collisions have structural implications (e.g. the data type might be different). Chris has asked to see some data on collision probability and I think given the schema implications that's an excellent suggestion.
>> Furthermore, only one group can "own" a schema at a time, so you cannot have multiple groups sharing one schema as they would have to do a lot of coordination for updates. With the keyword method, collisions could occur but are not a problem because the metadata is already isolated within the channel.
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann at nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> Phone: +1 (818) 354-8810
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA