[Esip-preserve] Folksonomies

Tue May 24 10:42:27 EDT 2011

Here's a bit deeper way of formulating the problem of searching for items.

Suppose we have a collection of N objects - usually meaning files in Earth
science data collections.

If we have a user who wants a particular item and there is no other
information available, then the probability of that user selecting the
desired item on a single random pick is 1/N, so the entropy of the selection
is log_2(N) bits.  After the user finally finds the item, the information he
or she gains is the reduction in entropy, log_2(N).

Suppose the user asks Q questions, each of whose answers reduces the range of
uncertainty.  The average information gain per question is log_2(N)/Q.
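
A minimal sketch of the arithmetic above, in Python.  The function names are
made up for illustration, and min_questions assumes every question has the
same number of possible answers, which is an idealization not stated in the
argument itself.

```python
import math


def selection_entropy_bits(n_items: int) -> float:
    """Entropy, in bits, of picking one of n_items equally likely items."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return math.log2(n_items)


def average_gain_per_question(n_items: int, n_questions: int) -> float:
    """Average information gain per question, log_2(N)/Q."""
    if n_questions < 1:
        raise ValueError("n_questions must be at least 1")
    return selection_entropy_bits(n_items) / n_questions


def min_questions(n_items: int, answers_per_question: int = 2) -> int:
    """Smallest Q that can single out one item when each question has
    answers_per_question possible answers (assumption: same for every
    question).  Uses integer arithmetic to avoid rounding problems."""
    if n_items < 1 or answers_per_question < 2:
        raise ValueError("need n_items >= 1 and answers_per_question >= 2")
    questions = 0
    distinguishable = 1
    while distinguishable < n_items:
        distinguishable *= answers_per_question
        questions += 1
    return questions
```

For a collection of 1024 files, selection_entropy_bits(1024) is 10 bits, so
a user who needs 20 questions averages half a bit per question, while ten
well-chosen yes/no questions would be enough in principle.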

What we're after is some reasonably efficient way of helping the user
structure the questions so that Q is as small as possible.  This means that
the average information gain per question is as large as possible.
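
To make "information gain per question" concrete, here is a rough sketch
under one added assumption: every item is equally likely to be the one the
user wants, and a question's answers split the remaining items into groups.
The expected gain is then the entropy of the group sizes.  The function name
and the example numbers are illustrative only.

```python
import math


def expected_gain_bits(group_sizes):
    """Expected information gain, in bits, of a question whose answers
    split the candidate items into groups of the given sizes.
    Assumes each item is equally likely to be the desired one."""
    total = sum(group_sizes)
    if total <= 0:
        raise ValueError("need at least one item")
    gain = 0.0
    for size in group_sizes:
        if size > 0:
            p = size / total          # chance the answer lands in this group
            gain -= p * math.log2(p)  # contribution to the entropy
    return gain
```

Under that assumption a question that splits 1000 items into 500 and 500
yields one full bit, while a question that splits them into 990 and 10 yields
well under a tenth of a bit on average.  Even splits are what keep Q small.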

There are two pleasantries involved in arriving at a satisfactory solution:

1.  The Web site designer has a vocabulary, V, of terms the designer expects
the user to know.  On the other hand, there are C communities of users, each
having a different mental model - with each mental model having its own
vocabulary, V_C.  It seems reasonably clear that if V does not match V_C, the
user is likely to be confused and have more difficulty finding a desired item.
In information theoretic terms, the information gain per question for the
confused user is lower than the information gain per question if the user had
the same mental model and vocabulary as the designer.  To put it a bit
differently, the user has to do work to understand the search designer's
vocabulary and that additional workload serves as a source of "market
friction" when it comes to having the user want to "buy" the search model
designed into the site.

Note that Web site bookmarks serve as a way to allow users to reduce the
number of pages they have to visit on later visits.  We can think of these as
allowing users to reenter the site and use Q' page visits instead of Q
(assuming that Q' < Q).

An ontology is probably similar - the major difference being that the tool
designer is assuming responsibility for understanding the user vocabulary
well enough to efficiently narrow the choices.
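
One crude way to see how far V is from a given V_C is to count exact matches
after folding case and whitespace.  The sketch below does that; the function
names and the sample terms are hypothetical, and exact matching is only a
lower bound on real agreement since it ignores synonyms.

```python
def normalize(term: str) -> str:
    """Fold case and collapse whitespace so trivial differences don't count."""
    return " ".join(term.casefold().split())


def vocabulary_overlap(v_designer, v_community):
    """Return (shared terms, Jaccard index) for two term lists,
    using exact matches on normalized terms only."""
    a = {normalize(t) for t in v_designer}
    b = {normalize(t) for t in v_community}
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Hypothetical vocabularies, for illustration only.
V = ["Glacier Extent", "Sea Ice Concentration", "Albedo"]
V_C = ["glacier extent", "ice pictures", "albedo ", "melting"]
shared_terms, jaccard_index = vocabulary_overlap(V, V_C)
```

Here two of the five distinct terms match, for a Jaccard index of 0.4.  A
low index for some community is a hint that its members will be paying the
"market friction" described above.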

2.  A Web site may be regarded as a graph.  Each path through the graph will
be associated with a profile of information gains for a particular kind of
user.  You can think of the user's path from one page to the next as selecting
a question and the next page as having an "answer" that provides a smaller
selection of items that might help the user find the desired item.  [This is
probably formally equivalent to a sequence of database queries that become
increasingly refined.]  The question for the designer is how to help the user
select the best entry point to the Web site and to assist the user in finding
the most efficient path for a suitable ensemble of user communities.
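
As a sketch of the graph view, the Python below treats pages as nodes and
links as directed edges, and uses a breadth-first search to count the fewest
page-to-page steps (a stand-in for Q) from an entry point to a target page.
The site layout and page names are invented for illustration; a real site
would supply its own link structure.

```python
from collections import deque


def clicks_to_reach(site, entry, target):
    """Fewest link traversals from entry to target, or None if the
    target cannot be reached.  site maps each page to its outgoing links."""
    seen = {entry}
    queue = deque([(entry, 0)])
    while queue:
        page, depth = queue.popleft()
        if page == target:
            return depth
        for next_page in site.get(page, ()):
            if next_page not in seen:
                seen.add(next_page)
                queue.append((next_page, depth + 1))
    return None


# Hypothetical site: one branch for students, one for researchers.
SITE = {
    "home": ["education", "data_catalog"],
    "education": ["glacier_gallery"],
    "data_catalog": ["cryosphere", "atmosphere"],
    "cryosphere": ["glacier_area_products"],
    "glacier_gallery": [],
    "glacier_area_products": [],
}
```

In this toy layout a researcher entering at "home" needs three steps to reach
"glacier_area_products", but only one when entering at "cryosphere", which is
the kind of saving a bookmark or a community-specific entry point provides.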

In dealing with this approach to designing a web site, it seems reasonable to
assume that a high school student looking for, say, pictures of glaciers that
illustrate changes for a class report will have a very different vocabulary
than a scientific research team who want to quantify changes in glacier area
on the basis of photos.  It seems to me to be appropriate to ask whether the
high school students and the researchers should enter the web site at the same
point (start on the same page, if you will) and whether they should be
expected to traverse the same path to obtain their desired information.

To put it a bit differently, do we have any empirical information about the
paths different kinds of users take to get their information - and does that
information (if it is available) suggest that one collection of vocabulary
terms and links saves the users more time than some other selection of terms?
The literature on folksonomies appears to have concentrated on asking the
question of whether the tags selected settle into stable patterns - and the
answer appears to be that they do for most choices.  However, my impression is
that this literature hasn't asked about the efficiency of search for a diverse
collection of user communities.

I'll grant this answer is a bit formal.  If it isn't clear, let me know and
I'll see if I can provide a clarification.

Bruce B.

On Tue, May 24, 2011 at 8:43 AM, Bruce Barkstrom <brbarkstrom at gmail.com> wrote:

> I had a chance to skim the paper you sent.  I think my skepticism about
> ontology exercises goes much deeper.  Clay Shirky covers the argument fairly
> well in the attached web page; the other paper does a very nice job of
> providing a concise description of the library sciences approach - as well
> as some empirical evidence regarding data user tagging efforts.  There is
> also the work of Furnas and colleagues at Xerox PARC which has always struck
> me as saying that professional catalogers don't capture the mental models of
> data users.
>
> Certainly medical terminologies are highly developed and used by experts in
> the disciplines.  Indeed, they are one of the key tools used in diagnosing
> diseases.  The question is whether that community's experience applies to
> Earth science, where much of the scientific work does not involve nearly as
> much classification effort - except, perhaps, in areas of biodiversity.
>
> I should also note that I've got samples of vocabularies used by NASA's
> Global Change Master Directory and the Climate and Forecasting Conventions
> developed by UCAR, as well as the WMO's nomenclature.  Even after removing
> case sensitivity, the number of exact matches on two sets of about 1000
> terms each is about five.  There are two sets of Essential Climate Variables
> (about 100 terms each) that show a similarly low degree of overlap.  Putting
> these lists together was not an encouraging exercise - at least personally.
>
> Bruce B.
>
>  On Mon, May 23, 2011 at 3:53 PM, Tom Moritz <tom.moritz at gmail.com> wrote:
>
>> Hi Bruce --
>> Been considering your recent messages and thought I'd send along a draft
>>  article (we never submitted it for publ...) from a few years ago when
>> I was still at AMNH in NY...
>>
>> We were grappling with the problem of how to render specialist
>> vocabularies interoperable...  So (attached) for what it's worth...
>> Perhaps food for thought at least...?
>>
>> UMLS (NLM) is still to my understanding one of the most highly developed
>> and refined ontology projects...
>>
>> (In the mid 90's, I compiled the IUCN "Conservation Thesaurus" and sought
>> to integrate it -- at a high level -- with a series of other thesauruses
>> -- including UNBIS (the United Nations), the OECD Macrothesaurus and
>> INFOTERRA (UNEP) -- this was just an early experimental effort...)
>>
>> Tom
>>
>> Tom Moritz
>> 1968 1/2 South Shenandoah Street,
>> Los Angeles, California 90034-1208  USA
>> +1 310 963 0199 (cell) [GMT -8]
>> tommoritz (Skype)
>> http://www.linkedin.com/in/tmoritz
>>
>> “Πάντα ῥεῖ καὶ οὐδὲν μένει” (Everything flows, nothing stands still.)
>> -- Heraclitus
>> "It is . . . easy to be certain. One has only to be sufficiently vague."
>> -- C.S. Peirce
>> "Il faut imaginer Sisyphe heureux."  ("One must imagine Sisyphus
>> happy.") -- Camus
>>
>>   On Wed, May 18, 2011 at 10:41 AM, Bruce Barkstrom <brbarkstrom at gmail.com> wrote:
>>
>>>  The new issue of Computer has an article by Sen, S. and Riedl, J.
>>> on Folksonomy Formation that is rather interesting.  They mention
>>> a web blog, http://www.shirky.com/writings/ontology_overrated.html,
>>> that's an opinion piece on ontologies versus user-based search
>>> mechanisms.  Has anybody else picked up on this thread of
>>> conversation?
>>>
>>> Bruce B.
>>> _______________________________________________
>>> Esip-preserve mailing list
>>> Esip-preserve at lists.esipfed.org
>>> http://www.lists.esipfed.org/mailman/listinfo/esip-preserve
>>>
>>
>>
>