Comments from reviewers for Kuroda, Kazama and Torisawa (2010), NLPIX 2010 paper

Review 1 (seems like Adam Kilgarriff)

This is a very interesting experiment. It is great to see people 'looking inside' the data of a large-scale work like this.

Some comments on your sampling: Section 2.2 is not clear: first you say you had 33m nouns, then you say you selected the most frequent 100m.

You then say you had 430m dependency triples - which works out at an average of 8.6 triples per noun, based on 100m nouns, or 25, based on 33m nouns. Either way, it is not much data to base the clustering on, especially since the distribution will be Zipfian so most of the tokens will be accounted for by a small number of types.
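
The Zipfian point can be illustrated with a minimal sketch. It assumes an idealized Zipf distribution with exponent 1 and a hypothetical vocabulary of one million types; neither figure comes from the paper or the review, and the function name is made up for illustration.

```python
def zipf_token_share(n_types: int, top: int, exponent: float = 1.0) -> float:
    """Fraction of all tokens covered by the `top` most frequent types,
    assuming the type at rank r has frequency proportional to 1 / r**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_types + 1)]
    return sum(weights[:top]) / sum(weights)


if __name__ == "__main__":
    n_types = 1_000_000  # hypothetical vocabulary size, not from the paper
    for top in (1_000, 10_000, 100_000):
        share = zipf_token_share(n_types, top)
        print(f"top {top:>7,} of {n_types:,} types cover {share:.1%} of tokens")
```

Under these assumptions a small fraction of the types covers well over half of the tokens, so an average number of triples per noun says little about how much evidence the rarer nouns have.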

I always find it useful to take a stratified sample, with some data from each of high, medium, and low frequency items: they typically have quite different characteristics. I suspect that most of the 'bad' data (categories e, f, m, u, x, y) is based either on duplicates in the corpus, or extremely little evidence, in terms of numbers of dependency triples that each word participates in. How many triples was the lowest-frequency item in the 150,000 sample based on?
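
A stratified sample of this kind could be drawn roughly as follows. This is only a sketch: the band thresholds `low_max` and `mid_max`, the per-band sample size, and the toy frequencies are all hypothetical and would have to be chosen from the real frequency distribution.

```python
import random
from collections import defaultdict


def frequency_band(freq: int, low_max: int, mid_max: int) -> str:
    """Assign a frequency to one of three bands (thresholds are hypothetical)."""
    if freq <= low_max:
        return "low"
    if freq <= mid_max:
        return "medium"
    return "high"


def stratified_sample(freqs: dict, low_max: int, mid_max: int,
                      per_band: int, seed: int = 0) -> dict:
    """Draw up to `per_band` items at random from each frequency band."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    bands = defaultdict(list)
    for item, freq in freqs.items():
        bands[frequency_band(freq, low_max, mid_max)].append(item)
    return {
        band: rng.sample(sorted(items), min(per_band, len(items)))
        for band, items in bands.items()
    }


if __name__ == "__main__":
    # Toy frequencies, invented for illustration only.
    toy_freqs = {"piano": 9000, "violin": 7000, "car": 800, "truck": 500,
                 "lute": 12, "sackbut": 3, "theorbo": 2}
    sample = stratified_sample(toy_freqs, low_max=100, mid_max=1000, per_band=2)
    for band in ("high", "medium", "low"):
        print(band, sample.get(band, []))
```

Reporting the category counts separately for each band would show directly whether the 'bad' categories concentrate in the low-frequency items.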

I do appreciate the vast amount of work in classifying 150,000 pairs! How many person-weeks was it? For the next round, it will probably be a good idea to look more closely at a carefully sampled set.

It is helpful to be able to see the sentences that give rise to similarity and dissimilarity: then, most instances of similarity can promptly be understood. There is a similar distributional thesaurus for Japanese in the Sketch Engine (see which will make an interesting point of comparison, and where the underlying sentences can immediately be checked.

The statement that 'this confirms the distributional hypothesis' is rather blunt!

The paper needs a careful edit; see e.g. the typo 'Toyofvta'. Section 2.3.3, end of first para - you have two negatives, delete 'no' or change 'unavailable' to 'available'. You use the single-letter classes frequently in the discussion (and only introduce k** quite late). I had to keep returning to the catalogue. Longer mnemonic class names might be a good idea.

Review 2

The paper classifies pairs of terms extracted by distributional similarity into 18 categories. This category hierarchy is very interesting. The hierarchy and its counts are worth seeing for many researchers in the field, but there are many unclear parts in the paper. In general, the paper has to be revised with careful attention, and I recommend submitting it to another conference.

First, the data preparation is explained in (1) in section 2.1. In particular, the description in b is essential, but I can't understand it.

"Construct a set P(k) of pairs pi,j = (ti, tj) so that ti and tj are two of the k-closest terms in T."

T is the set of terms, and you select ti and tj which are k-closest (later, you set k=2, so 2-closest). So you will have two pairs of terms, say (t1, t2) and (t3, t4), which are the closest in T. (I guess I don't really understand what k-closest means. Closest to something specific, or closest to each other?) Also, the set P(k) is singular (as it has "a"), so the result is just a single set, right? Then the beginning of the next page says k=2, but that you choose 150,000 terms. How can you choose them out of 4 terms (t1, t2, t3, and t4)? I think this is only a problem of description, but it has to be much clearer.
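
One possible reading of the quoted definition is that, for every term ti in T, the k terms most similar to ti are found and each resulting (ti, tj) is added to P(k). Whether the authors meant this is exactly what is unclear; the sketch below only makes that one reading concrete. The cosine measure, the toy feature counts, and the function names are assumptions for illustration, not the paper's method.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(count * v.get(feature, 0) for feature, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_closest_pairs(vectors: dict, k: int) -> set:
    """P(k) under the 'k nearest neighbours of each term' reading (an assumption)."""
    pairs = set()
    for t_i, v_i in vectors.items():
        others = [t for t in vectors if t != t_i]
        others.sort(key=lambda t_j: cosine(v_i, vectors[t_j]), reverse=True)
        for t_j in others[:k]:
            pairs.add((t_i, t_j))  # ordered pair: t_j is among the k closest to t_i
    return pairs


if __name__ == "__main__":
    # Toy dependency-feature counts, invented for illustration only.
    toy = {
        "piano": {"play": 5, "tune": 2},
        "violin": {"play": 4, "tune": 3},
        "car": {"drive": 6, "park": 2},
        "truck": {"drive": 5, "park": 1, "load": 2},
    }
    for pair in sorted(k_closest_pairs(toy, k=1)):
        print(pair)
```

Under this reading P(k) is a single set, but it contains up to k pairs per term, so with k=2 it grows with the size of T and is not limited to four terms.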

Related to this, I don't understand the description of (2) and (3). What do you mean by 16 terms of piano *at rank 1070*? Is there more than one example at a single rank?

You mention that there are 33M nouns, but in the next sentence you say that you chose the most frequent 100M. Why is the latter bigger?

On page 3, "the behavior of these terms is more or less predictable...". What do you mean by "behavior"?

In section 2.3.1, d "there is a certain more or less concrete enough class". This expression is not scientific at all. You use "more or less" many times; it should not be used so often in a paper.

In section 2.3.3, you give unacceptable excuses for not measuring inter-annotator agreement. You annotated 5,000 or more pairs, so you should be able to do it. This is very important.
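
The review does not name a particular agreement statistic. One common choice for two annotators assigning categorical labels is Cohen's kappa, sketched below as an assumption about what such a check could look like. The label lists are toy data; "h" and "p" echo the paper's single-letter classes, but the values are invented.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy labels for six pairs, invented for illustration only.
    annotator_1 = ["s", "s", "h", "h", "p", "p"]
    annotator_2 = ["s", "s", "h", "p", "p", "p"]
    print(f"kappa = {cohen_kappa(annotator_1, annotator_2):.2f}")
```

A second annotator would only need to label a random subset of the pairs for a figure like this to be reported.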

The last paragraph of 2.3.3 is not understandable.

In 2.3.4, you say "the distinction is sometimes obscure between h and p" because "Japanese has no singular-prulal distinction". I don't understand how the two are related.

At the end of 3.2, "the ratio of allography is worse in predicate". What do you mean by "worse"? Do you mean "smaller"?

In 4.1, "Admittedly, 6.31% is not a large number". But you have just explained that "6.92%" is very large.

In Table 1, the total count is more than 250K. This is very big. Did you really do the analysis by hand? Did you count frequencies based on the same word pair (i.e. if you find that (t1, t2) is category x, do you then assign the corpus frequency of the (t1, t2) pair to category x)? If so, it is not a good statistic. At the beginning of 2.3.2, you mention that you analyzed 5,000.

"Discussion" is not really a discussion. It (again) gives excuses for things you should have done but did not do. As you have so much data, you should be able to do such an analysis on it. (4.1 and 4.2)

You said over 90% are "semantically similar". How do you get this number? As far as I can see, the categories "o", "u", "e", "m", "x" and "y" are NOT semantically similar. Also, at the beginning you filtered out some pairs (page 3, top left, a) and b)); maybe b) can be excluded, but I am not sure about a).

Again, I like the classification hierarchy, but the paper has to be clearer. I recommend that you do not publish this time, but rewrite the paper (and do some analyses) and resubmit it to another conference.

Review 3

This paper presents a study of distributionally similar words. It classifies the distributionally similar words into several categories. I am not aware of previous studies on the detailed statistics of different types of semantic (or non-semantic) relations between distributionally similar words.

The results seem to indicate that most of the distributionally similar words are semantically related in some way. However, the statistics in the study use only k=2, i.e. pairs that are highly related. It is not clear at all whether the conclusions still hold when k is even slightly larger.