RE: SUO: RE: An article on the pitfalls of metadata
John Bateman wrote:
> >I have the database
> >files for WordNet 1.7.1 and WordNet 2.0, but the ontology
> >isn't made explicit there. Do you know of a reference URL
> >that describes it more explicitly?
> >
>
> I said this a year or two ago but this list is pretty cycle-tolerant
> it seems. The upper levels of WordNet were explicitly examined in a
> range of papers in the late 90s and were replaced generally in the
> EuroWordNet project because of their now generally quite well-known
> weaknesses. Why not look at the EWN hierarchies instead? Has the Wn2
> hierarchy moved in this direction or away?
I'm looking for ways to automatically extract the ontology
of a WordNet release because WordNet is periodically updated
with more refined material. WordNet 2.0 was recently released,
so the extra material can be run through the analyzer again,
resulting in updated ontology information.
So adding EuroWordNet to the mix would leave me with two conflicting
sources of data, which I want to avoid within the dictionary.
What appeals to me about WordNet is the idea of a single
English-based ontology word set. I don't want a multitude of
definitions just yet, because I want to trawl corpora for more
specific information about how words are used in ordinary writing.
> The idea of creating upper level categories automatically (a) by
> looking at WordNet synsets, or (b) looking at corpus data for
> how words are used is an interesting suggestion. How about
> taking the entire discussion of that offline and reporting back when
> it has been done....
Presently, I have a program that reads the WordNet database
and builds a SQL Server database of the relations. That will
let me explore this concept a little more before committing
a lot of time to it. Progress will be reported if people on
the list are interested.
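
A rough sketch of that kind of loader, for anyone who wants to try the
idea. This is not the actual program: it is Python, SQLite stands in
for SQL Server, and the table names, function names, and the three
sample lines (offsets and glosses included) are made up for
illustration. The only thing taken from WordNet itself is the layout
of a data.<pos> line as documented in wndb: offset, lexicographer file
number, synset type, a hexadecimal word count, word/lex_id pairs, a
decimal pointer count, then four fields per pointer, with "@" marking
a hypernym. Treating noun synsets that have no "@" pointer as the top
level is an assumption about what "the ontology" means here.

```python
import sqlite3

def parse_data_line(line):
    """Split one synset line of a data.<pos> file into its parts."""
    fields = line.split("|", 1)[0].split()       # drop the gloss after '|'
    synset_offset, ss_type = fields[0], fields[2]
    w_cnt = int(fields[3], 16)                    # word count is hexadecimal
    words = [fields[4 + 2 * i] for i in range(w_cnt)]  # skip each lex_id
    p = 4 + 2 * w_cnt
    p_cnt = int(fields[p])                        # pointer count is decimal
    pointers = []
    for i in range(p_cnt):
        base = p + 1 + 4 * i                      # 4 fields per pointer
        symbol, target, target_pos = fields[base:base + 3]
        pointers.append((symbol, target, target_pos))
    return synset_offset, ss_type, words, pointers

def load(lines, conn):
    """Create two hypothetical tables and fill them from data-file lines."""
    conn.execute("CREATE TABLE synset (synset_offset TEXT, pos TEXT, words TEXT)")
    conn.execute("CREATE TABLE relation "
                 "(src TEXT, src_pos TEXT, symbol TEXT, dst TEXT, dst_pos TEXT)")
    for line in lines:
        if line.startswith("  ") or not line.strip():
            continue                              # license header or blank line
        off, pos, words, pointers = parse_data_line(line)
        conn.execute("INSERT INTO synset VALUES (?, ?, ?)",
                     (off, pos, ", ".join(words)))
        conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?, ?)",
                         [(off, pos, s, d, dp) for s, d, dp in pointers])

def top_level(conn):
    """Noun synsets with no hypernym ('@') pointer, in offset order."""
    rows = conn.execute(
        "SELECT s.words FROM synset s WHERE s.pos = 'n' AND NOT EXISTS ("
        " SELECT 1 FROM relation r WHERE r.src = s.synset_offset"
        " AND r.src_pos = s.pos AND r.symbol = '@')"
        " ORDER BY s.synset_offset")
    return [r[0] for r in rows]

# Made-up sample lines in the data.noun layout (not real WordNet content).
SAMPLE = [
    "  1 license header lines start with two spaces",
    "00000001 03 n 01 entity 0 001 ~ 00000002 n 0000 | made-up gloss",
    "00000002 03 n 02 object 0 physical_object 0 002 @ 00000001 n 0000 "
    "~ 00000003 n 0000 | made-up gloss",
    "00000003 03 n 01 artifact 0 001 @ 00000002 n 0000 | made-up gloss",
]

conn = sqlite3.connect(":memory:")
load(SAMPLE, conn)
print(top_level(conn))
```

On the sample, only the first synset lacks a hypernym pointer, so it is
the only one reported as top-level. Rerunning the same load against the
data files of a newer release would give a second relation table to
compare against the first.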
> How you might produce an FCA-derived lattice
> of the lot is pretty obvious (assuming you have a large enough
> machine sitting in the corner that you are not going to be wanting
> to do much with for a while...). What that lattice will look
> like, however, will itself be sufficiently in need of analysis that
> a few top-level hypotheses might well do a lot of good. But
> I am willing to wait for the outcome (i.e., this machine-derived
> lattice) rather than wasting too much time wondering
> about what it might be and whether it should be done. But
> if no one is going to do it, then why bother talking about it?
>
> (b) is pretty much what has been done, and is still being done, for
> many years by the COBUILD group in Birmingham, England. Their
> semantically-organized grammar is based entirely on this procedure,
> but without the machine-learning component. It hadn't occurred
> to me to classify this as an *ontology* activity however. Any
> corpus linguists on this list who might know what the task
> involves?
British English and American English are distinct dialects,
so even the British National Corpus is operating from a
different cultural basis than American corpora.
> (a) does this have any possible advantage over (b) apart from
> the data being ready to process in various ways?
That seems to be a significant advantage, and I don't see (b) as
similar to (a). (a) is the "authoritative" dictionary of the
moment, while (b) is the way people actually write, turning
nouns into verbs, and making verbs into phrases, and so on. The
"AsIs" comes from corpora while the "ToBe" is the dynamic
authority's statements about the words.
In Jon Awbrey's terms of Anticipatory Systems, (a) is the model
that can be used for prediction, while (b) is the experiential
database of what people actually do.
> John B.
Rich