
SUO: An article on the pitfalls of metadata




At 08:07 2003-08-23 +0100, West, Matthew R SITI-ITPSIE wrote:
> 
> Dear Rich,
> 
> Well the article makes me realise that I am involved 
> in one of these initiatives in EPISTLE where we have
> a data model and some 50,000 items of reference data.
> So some comments from the coal face.

Matthew-

I agree with your points and I'd like to add some others.  As you know, I'm involved in the ISO metadata standards committee (ISO/IEC JTC1 SC32 WG2).  So here is my two cents' worth.

In the referenced article:

        http://www.well.com/~doctorow/metacrap.htm

I agree with *some* portions of what Cory Doctorow said, but I believe he said it imprecisely.  Here is my summary of "what is wrong" (i.e., the "crappy" things ...):

- Everyone has a "search engine" mentality when they think about things on the web.  All their illustrations are from search engines.  A typical comment is "when I look for something I want to get 10 hits rather than 10,000".  The web isn't just about searching.

- Adding "descriptive data" (i.e., "metadata") has some cost associated with it.  Based on that cost and the quality one desires, "descriptive data" is added at the appropriate "quality level" (the notion of "appropriate" depends upon one's notion and expectation of "quality").  Unfortunately, there is no agreement on cost and quality, so from everyone's perspective, the "descriptive data" is inconsistent.

- For a given object, there isn't a singular set of "descriptive data".  It should come as no surprise that "metadata is in the eye of the beholder".

- Most web search engines are terrible ***from the perspective of a librarian or someone creating catalog entries***.  For example, let's say I have a document that has the following unordered attributes:

        "attribute1=value1, attribute2=value2, attribute3=value3"

There is no reliable way to get this descriptive information into the web content so that a user can say (with a different ordering of the attributes):

        "please find attribute3=value3, attribute2=value2, attribute1=value1"

Search engines ignore "META" tags because people lie.  So even if we had high quality metadata, it would be useless for web searches.
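
To make the order-independence point concrete, here is a minimal sketch in Python of the kind of match a cataloguer would expect.  The function names (parse_attributes, matches) are my own for illustration, and the sketch assumes every item has the simple "name=value" form used above; it is not how any particular search engine works.

```python
def parse_attributes(text):
    """Parse 'a=1, b=2' into a dict, so the order of the pairs carries no meaning."""
    pairs = (item.split("=", 1) for item in text.split(",") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}


def matches(document_attrs, query_attrs):
    """True when every attribute=value pair in the query is present in the document."""
    return all(document_attrs.get(k) == v for k, v in query_attrs.items())


# The document's descriptive data and the user's query, in different orders.
document = parse_attributes("attribute1=value1, attribute2=value2, attribute3=value3")
query = parse_attributes("attribute3=value3, attribute2=value2, attribute1=value1")

print(matches(document, query))  # True
```

The comparison is on the set of attribute/value pairs, so the ordering in the query is irrelevant.  The hard part is not this matching; it is getting trustworthy pairs attached to the content in the first place.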

----

One of the points in the paper is: "metadata is data about data".  This is wrong -- pretty much all data is about data, so that definition doesn't reveal the delimiting (or essential) characteristics of "metadata".  Although it sounds catchy, it is an imprecise definition.  A better definition is:

        metadata: data that is used for description

So an *essential characteristic* of "metadata" is that it is descriptive.  This also means that one cannot look at something *in isolation* and know that it is "metadata" -- "metadata" only exists in relation to something else (the object of description).

So nothing is inherently "metadata".
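
One way to picture this is a small sketch in Python, where a datum only counts as "metadata" relative to a recorded descriptive relationship.  The class Description, the function is_metadata_for, and the sample values are hypothetical names chosen for illustration; they are not taken from ISO/IEC 11179 or from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """A descriptive relationship: some data is used to describe some object."""
    subject: str  # the object of description
    data: str     # the data used for description


def is_metadata_for(datum, subject, descriptions):
    """A datum is metadata for a subject only if a descriptive relationship links them."""
    return any(d.data == datum and d.subject == subject for d in descriptions)


# Hypothetical example: the same string is plain data until it describes something.
datum = "2003-08-23"
descriptions = [Description(subject="this message", data=datum)]

print(is_metadata_for(datum, "this message", descriptions))        # True
print(is_metadata_for(datum, "some other document", descriptions)) # False
```

Looking at the string "2003-08-23" in isolation tells you nothing; it is the relationship to the object of description that makes it metadata.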

Here are the seven "insurmountable obstacles" of "meta-utopia":

        1 People lie
        2 People are lazy
        3 People are stupid
        4 Mission: Impossible -- know thyself
        5 Schemas aren't neutral
        6 Metrics influence results
        7 There's more than one way to describe something

If one considers the creation of "descriptive data" as a data collection process, then the usual features of measurement, observation, error, etc. apply (think of a survey for data collection).  So problems #1, #3, and #4 are simply errors in the data (Cory's illustration of the Nielsen television ratings demonstrates this point).  A statistician might characterize the kinds of errors differently.  Statistical methodology is a prime tool for improving metadata quality.

Statistical methodology might help #6 and #7: the results of a survey depend upon how you ask the question(s).  However, aside from the error estimation of (descriptive) data collection, most systems that support the collection of metadata also support the collection of more than one taxon, characteristic, property, etc.

Even if the hierarchies are different, if the "terminology" is the same (or a "population" and its "characteristic" are the same; or the "data element" definition is the same), then it is possible to have meaningful data interchange.  To use the example from the paper (problem #5), one vendor organizes data with the following hierarchy:

        Energy consumption:
                Water consumption:
                        Size:
                                Capacity:
                                        Reliability

while another organizes it with the following hierarchy:

        Color:
                Size:
                        Programmability:
                                Reliability

So as long as there is common agreement on what "reliability" is and how it is measured (e.g., a standard), then the incompatible hierarchies are not a problem -- and it is possible that the properties (values) associated with the characteristic "reliability" are compatible.
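
As a rough sketch of that interchange (in Python), the two hierarchies above can be held as nested dictionaries and the shared characteristic looked up by name, wherever it happens to sit.  The function find_characteristic and the leaf values are placeholders I made up for illustration; real values would be whatever the agreed standard for "reliability" specifies.

```python
def find_characteristic(tree, name):
    """Depth-first search for a characteristic by name, ignoring its place in the hierarchy."""
    if not isinstance(tree, dict):
        return None
    if name in tree:
        return tree[name]
    for subtree in tree.values():
        found = find_characteristic(subtree, name)
        if found is not None:
            return found
    return None


# The two vendor hierarchies from the paper; the leaf values are placeholders.
vendor_a = {"Energy consumption": {"Water consumption": {"Size": {"Capacity": {
    "Reliability": "placeholder rating from vendor A"}}}}}
vendor_b = {"Color": {"Size": {"Programmability": {
    "Reliability": "placeholder rating from vendor B"}}}}

for vendor in (vendor_a, vendor_b):
    print(find_characteristic(vendor, "Reliability"))
```

The lookup succeeds for both vendors even though the paths differ.  Whether the two values are actually comparable still depends on the common definition and measurement of "reliability", which no amount of code supplies.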

In conclusion, nothing is inherently "metadata"; it is merely data.  Data is only metadata when there exists a descriptive relationship to some object.  Since the establishment of a (descriptive) relationship to an object is not in question in Cory's paper, the only thing that is left is just the (descriptive) data.  Thus, it is not surprising that techniques for improving data quality are similar to techniques for improving metadata quality: e.g., we need precise definitions of the meaning of the data (such as data semantics in ISO/IEC 11179) and we need a precise understanding of its measurement and error estimate.  Both of these apply to data quality and to metadata quality alike.

I don't see these problems as insolvable -- I see them as manageable, like all other aspects of data, measurement, data collection, and data quality.  I don't expect perfection.  I don't write "perfect" code (whatever that is), I write "well-engineered" code.  Likewise, I don't expect "perfect metadata", but I believe that "well-engineered metadata" is possible using the techniques of subject-matter methodologies, engineering methodologies, and engineering management.

Not to mention, there are some ISO/IEC standards (11179) and technical reports (20943) on this topic of "metadata" that were developed by ISO/IEC JTC1 SC32.

-FF

______________________________________________________________________
Frank Farance, Farance Inc.    T: +1 212 486 4700   F: +1 212 759 1605
mailto:frank@farance.com       http://farance.com
Standards/Products/Services for Information/Communication Technologies