[Esip-discovery] time for new challenges?

Wed Dec 19 11:57:47 EST 2012

On Dec 19, 2012, at 6:58 AM, Lynnes, Christopher S. (GSFC-6102) wrote:

> Here is one not obviously related to OpenSearch...Faisal Hossain (this year's winner of the Falkenberg!) wrote an editorial in BAMS lamenting the difficulty of discovering useful web applications (e.g., Giovanni) via major search engines by the applications' data content: http://journals.ametsoc.org/doi/full/10.1175/BAMS-D-12-00035.1
> 
> Hook and I have talked with Microsoft, and indirectly with Google, and there appears to be no quick and easy silver bullet.

When you talked to folks about this, did they draw comparisons with other aspects of their search tools that index stuff other than the 'simple' web pages? For example, Google Scholar indexes a fair number of papers that are behind query interfaces.

We (OPeNDAP) experimented with different ways to leverage Google to find our servers, with varying levels of success. Since most DAP servers have some .html pages, they can be crawled by Google (and by other, less well-behaved bots). We've embedded various UUIDs in Hyrax's top-level pages and can search for (crawlable) servers that way. However, because most DAP servers also return automatically generated dataset metadata on pages a crawler can reach, server operators often block bots. This is a consequence of exposing large numbers of datasets (e.g., files) that are expensive to access (e.g., because the files themselves are compressed, as opposed to having the data they contain compressed). If Google alone crawled a site with 10^7 files and decompressed each one, maybe nobody would mind, but when 50 bots do it, people get irked ;-) There are lots of crawling/indexing bots out there…
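
To make the blocking concrete, here is a minimal sketch of the kind of robots.txt rule an operator might use: leave the top-level pages crawlable but keep bots out of the per-dataset pages. The /opendap/data/ path and the example.org host are assumptions for illustration, not Hyrax's actual layout; the check uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a DAP server. The /opendap/data/ prefix is an
# assumed location for the expensive per-dataset pages, not a real Hyrax path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /opendap/data/
"""


def build_parser(robots_txt):
    """Parse robots.txt text (no network access) and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The top-level page stays crawlable, so embedded identifiers can be indexed.
print(parser.can_fetch("anybot", "http://example.org/opendap/"))

# Per-dataset pages are off limits, so no bot triggers a decompression.
print(parser.can_fetch("anybot", "http://example.org/opendap/data/sst.nc.gz.html"))
```

Of course this only helps with bots that honor robots.txt, which is part of the problem.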

I mentioned Google Scholar and its (apparent) indexing of a number of systems that (probably) implement different query interfaces because I wonder if we could use something like OpenSearch, THREDDS catalog, and WCS GetCapabilities responses to make web data access points findable. The crawler(s) would have to be smart enough to not descend into the holdings too deeply, and we might have to modify those catalogs so they provide better information for crawlers.
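
As a rough sketch of what "not descending too deeply" could look like, the function below walks a THREDDS catalog and follows catalogRef links only down to a fixed depth. The function name, the max_depth parameter, and the fetch callable (anything that maps a URL to catalog XML text) are hypothetical names chosen for illustration; the XML namespaces are the usual THREDDS InvCatalog v1.0 and XLink ones.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def crawl_catalog(url, fetch, max_depth=1, _depth=0, _seen=None):
    """Return (catalog_url, dataset_name) pairs, following catalogRef links
    at most max_depth levels below the starting catalog.

    fetch is a caller-supplied callable (hypothetical) that takes a URL and
    returns the catalog XML as a string.
    """
    seen = set() if _seen is None else _seen
    if url in seen:  # avoid loops between catalogs that reference each other
        return []
    seen.add(url)

    root = ET.fromstring(fetch(url))
    # Record only dataset names from the catalog; never open the data itself.
    found = [(url, ds.get("name")) for ds in root.iter(f"{{{THREDDS_NS}}}dataset")]

    if _depth < max_depth:
        for ref in root.iter(f"{{{THREDDS_NS}}}catalogRef"):
            href = ref.get(f"{{{XLINK_NS}}}href")
            if href:
                child = urljoin(url, href)  # resolve relative catalog links
                found += crawl_catalog(child, fetch, max_depth, _depth + 1, seen)
    return found
```

With max_depth=0 a crawler would see only the top-level catalog; raising it trades more complete indexing for more load on the server.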

> 
> This problem appears to be widespread, not just Giovanni. Is this something the Discovery Cluster should tackle?
> --
> Dr. Christopher Lynnes, NASA/GSFC, ph: 301-614-5185

--
James Gallagher
jgallagher at opendap.org
406.723.8663