[Esip-discovery] time for new challenges?

Mattmann, Chris A (388J) chris.a.mattmann at jpl.nasa.gov
Wed Dec 19 22:05:31 EST 2012


Hi Jeff,

On 12/19/12 9:13 AM, "jeff mcwhirter" <jeffmc at unavco.org> wrote:

>On 12/19/12 9:52 AM, Mattmann, Chris A (388J) wrote:
>>
>> SEO is an extremely difficult problem, couched in Information Retrieval
>> Research/theory. Most advances are wholly incremental or point solutions
>> that aren't widespread as of yet.  Most of that has to do with the
>>search
>> engine companies guarding their intimate optimizations and ranking
>>secrets
>> very closely.
>>
>
>I don't think it is really an issue of traditional search engine
>optimization but rather that there often isn't a crawlable site.  It
>seems like most repositories are search oriented, and what pages are
>there don't have much in the way of text corpus to index.   If the pages
>and the text aren't there, Google isn't going to index it.

It depends on what you consider SEO -- I consider it to be a combination
of things like ranking, indexing, and content detection and analysis
strategies, etc. I would also say there is plenty of text corpus to index
-- and it's not just text anymore that's part of SEO. Many other facets of
the documents are in use, and have been for a while (e.g., crawlers
detect headings, lists, etc., as they do with most document types now,
thanks to parsing libraries).
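To make the "facets beyond raw text" point concrete, here is a minimal sketch (my own illustration, not from the paper below) of how a crawler's parsing layer can pull structural features like headings and list items out of a page, using only Python's standard-library HTML parser:

```python
from html.parser import HTMLParser

class StructureExtractor(HTMLParser):
    """Collect headings and list items from an HTML page -- the kind of
    structural facets a crawler can index in addition to the raw text."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self.list_items = []
        self._current = None   # tag whose text we are currently capturing
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "li":
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._current and tag == self._current:
            text = "".join(self._buffer).strip()
            if text:
                target = self.headings if tag in self.HEADINGS else self.list_items
                target.append(text)
            self._current = None

page = "<h1>Data Sets</h1><ul><li>MODIS</li><li>AIRS</li></ul>"
parser = StructureExtractor()
parser.feed(page)
print(parser.headings)    # ['Data Sets']
print(parser.list_items)  # ['MODIS', 'AIRS']
```

A real crawler pipeline (Nutch, for instance) does this through pluggable parser libraries per content type, but the idea is the same: the document's structure becomes indexable signal, not just its words.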

Here's a paper that Paul Ramirez and I wrote and published in IEEE IRI in
2004 about the content and information retrieval portions of this:

http://sunset.usc.edu/~mattmann/pubs/ACE.pdf

In terms of pages being search oriented, most modern crawlers actually do
a great job of using filtering strategies, URL deduplication, etc., to
handle "deep web" content that is reachable only through search.
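As a small illustration of the URL-deduplication piece (a generic sketch, not any particular crawler's implementation), a crawler typically canonicalizes each URL before checking it against the set of pages already seen, so that trivially different forms of the same page collapse to one frontier entry:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    """Canonicalize a URL so superficially different forms dedupe:
    lowercase scheme/host, drop default port and fragment, sort query params."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if parts.scheme.lower() == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]               # :80 is the default for http
    query = urlencode(sorted(parse_qsl(parts.query)))
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, query, ""))

seen = set()
frontier = []
for url in [
    "HTTP://Example.org:80/search?q=ocean&page=2",
    "http://example.org/search?page=2&q=ocean",            # reordered params
    "http://example.org/search?q=ocean&page=2#results",    # fragment only
]:
    canon = normalize(url)
    if canon not in seen:
        seen.add(canon)
        frontier.append(canon)

print(frontier)  # ['http://example.org/search?page=2&q=ocean']
```

Combined with URL filters (robots rules, regex include/exclude lists), this is what keeps a crawler from drowning in the combinatorial space of search-result URLs while still reaching the content behind them.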

Cheers,
Chris
