<< Back

# Cotterell et al. on morphological complexity

The first paper I want to talk about is ‘On the Complexity and Typology of Inflectional Morphological Systems’ by Ryan Cotterell, Christo Kirov, Mans Hulden, and Jason Eisner (see the pdf on arxiv.org for the figures). First, I will try to express my understanding of the argument in language as non-technical as possible, and then I will rant about its perceived shortcomings. It is always hard for me to return to this at an appropriate moment, so I will say upfront that this is a great paper, which I am very excited about. It does a good job of reasoning about complex things in a clever and mostly clear way and is a joy to read. With this taken care of,

## The gist

### The baseline

The general aim of the paper is to ‘quantify the linguistic complexity of different languages’ morphological systems’. Of course, this is far from the first attempt at an analysis of this kind, and the paper is actually more narrowly focused: it builds on the work of Ackerman and Malouf (2013)¹ by using largely the same conceptual framework and testing the same hypothesis. However, this is presumably done now in a better way, where ‘better’ means more mathematically rigorous, more widely applicable, and with fewer assumptions about language baked in.

Ackerman and Malouf’s approach is based on the opposition between two types of morphological complexity: e-complexity (from ‘enumerative’; this is simply the number of slots in verbal, nominal, etc. paradigms) and i-complexity (from ‘integrative’, a.k.a. ‘system complexity’; this is a measure of how predictable/regular those paradigms are).

Like basically everybody else, Ackerman and Malouf base their definition of regularity on the information-theoretic notion of entropy introduced by Shannon. Shannon’s entropy is based on probability distributions: given what you know about a part of some system or about previous outcomes of a process, the system/process is predictable (has low entropy) if there is a sharp difference between the probabilities of the options for the remainder. For example, if your grandma is a wonderful cook, dinners at her place constitute a low-entropy process from the point of view of her guests’ satisfaction. If, on the other hand, you repeatedly toss a fair coin, no matter how many times you observe the previous results, you won’t be able to predict the next outcome with a greater than 50% probability. A fair coin is a maximum-entropy process with two outcomes.
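Shannon entropy is easy to compute once the probabilities are fixed. A tiny sketch of the two examples above (the dinner probabilities are, of course, invented):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is a maximum-entropy process with two outcomes: 1 bit.
fair_coin = entropy([0.5, 0.5])

# Grandma's dinners: almost always wonderful, hence low entropy
# (the 0.95/0.05 split is a made-up illustration).
dinners = entropy([0.95, 0.05])   # ≈ 0.29 bits
```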

The crux of Cotterell et al.’s argument is that Ackerman and Malouf did a poor job of precisely defining the i-complexity of a language’s morphology. However, the reader will have to wait for a rather long time before they find out how A&M actually did this and what went wrong because the NLP community has a weird, to an outsider, custom of discussing related work right before the conclusions, as a kind of afterthought (in this case, a compromise was reached so the scholars first present their own model, then lambast A&M, and then report their experimental findings, which are consistent with A&M’s predictions). I am used to starting with discussing previous work and then improving upon it, so I’ll fast-forward to Section 5, ‘A Methodological comparison’.

In order to define i-complexity, Ackerman and Malouf consider conditional probability distributions of forms in paradigms depending on other elements of the same paradigms. For example, given an English verbal paradigm ‘kill, kills, killed, killing’ they will estimate the probability of the event that this verb has the past form killed given that we know that its present 3sg form is kills, the probability that its basic form is kill given that its gerund is killing, etc. The other options here are other possible inflections for the same form. This rarely happens in English, but killed could’ve theoretically had the form killt in the past tense, so, given some probability measure, the probability of killed given kills ($p(killed|kills)$ in mathematical notation) will be high, but not exactly 1.

As a semi-aside, one can say that from the point of view of classical frequentist statistics this is, of course, nonsense. We know all the forms, so p(killed|kills) is not defined because no repeated experiment is possible, except maybe in the case of a handful of variable verbs. What we can do, however, is pretend that all the words we want to measure the paradigmatic complexity of are new and have just entered the lexicon. How will their paradigms be structured? Again, for English and many other languages, we actually know what’ll happen: new words are often obligatorily put in some regular inflectional category. But what if we didn’t know this and only had the extant lexicon to go on? (We could also imagine a person learning the language and not knowing about the regularity constraints; a learner’s language model and a speaker community’s language model are not necessarily the same thing, but whatever.) Then we would be able to estimate conditional probabilities based on raw frequencies and other properties of words. As often happens in linguistics, in order to be mathematically rigorous one usually has to ignore a huge chunk of actual linguistic knowledge.

Okay, so we measured the probabilities. Now we can compute conditional entropies for pairs of paradigm forms, which is a rather simple calculation, and then we can divide the sum of these pairwise entropies by the number of form pairs in a paradigm, and there you go, you’ve measured the complexity of this paradigm. Then you can measure complexities of a number of paradigms, average the result, and you have an idea about your language of interest’s i-complexity. Add average e-complexity of the same sample of paradigms and serve as a scatterplot. Ackerman and Malouf predict that no language will have both high e-complexity and high i-complexity; in other words, your scatterplot should have an empty upper right corner. Cotterell et al. agree.
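A toy sketch in the spirit of A&M’s recipe (the lexicon and the exponents are invented, and this is not their actual implementation): estimate the distribution over past-tense exponents from raw counts and take its entropy. Averaging such pairwise conditional entropies over all cell pairs gives the i-complexity of a paradigm.

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Entropy (bits) of a distribution estimated from raw counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Invented mini-lexicon: which past-tense exponent each verb takes.
past_exponents = ["-ed", "-ed", "-ed", "-t"]   # kill, walk, talk, dream

# If every verb marks 3sg present with '-s', H(past | present) collapses
# to the plain entropy of the past-exponent distribution: ≈ 0.81 bits.
h_past_given_present = entropy_from_counts(Counter(past_exponents))
```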

So, what did A&M do wrong? In Cotterell et al.’s book, several things.

1. They based their estimation of conditional probabilities on suffix substitutions (what is the probability of replacing the zero suffix of kill with ‘-ed’ when going from the base form to the past-tense form, etc.). It is impossible to model many types of paradigmatic changes in this way (cf. go → went), so Cotterell et al. argue that any decent probability model of morphology should be able to assign probabilities to any sequences. This sounds like overkill, but the good old Gaussian distribution assigns non-zero probabilities to all segments of the real number line, and we think of it as normal, so nothing atrocious here.

2. A&M’s approach mandates division of words into morphemes, which is not theory neutral: different analyses will give different entropies. Cotterell et al. argue that we should strive for a more general amorphous-morphology-driven approach that will paint human judgement out of the picture.

3. Average conditional entropy is a mathematically meaningless construct. Most importantly, it is impossible to model the probability of encountering any given full paradigm (i.e. to compute the joint distribution of all its forms).

4. Finally, deriving all forms from all other forms leads to inflated estimates of i-complexity. Some languages have layered paradigms: an oblique form is generated from the base form, and then all other derived forms are constructed based on this oblique form. Constructing a complex derived form straight from the base form doesn’t make much sense, and computing the probability of this transition doesn’t make much sense either.

### The contribution

How can we do better? One thing to do is to find a better probability measure, i.e. one which will allow any slot in the paradigm to take on potentially any form and make it possible to compute the joint probability of all the forms in the paradigm. Also it should make some linguistic sense.

The requirement that absolutely any sequence can serve as a slot in a paradigm with some (infinitesimal) probability of course makes it impossible to compute the entropy of any form directly: in order to compute entropy we need to know the probabilities of all possible outcomes, and here the outcomes form an infinite set of character strings (killed, killt, killings, likt, xxx, dfasfasd, whereami, …). People don’t resort to primitive enumeration of suffix substitutions for no good reason.

If you cannot compute something directly, put a bound on it. Cotterell et al. look for an upper bound on the paradigm entropy (not, mind you, the e/i trade-off bound they are trying to prove exists, only a bound on i-complexity itself, so technically no circular reasoning here). Okay, where do we find this bound?

#### An upper bound

We use cross entropy, i.e. a measure of how well we can predict outcomes of a process (or elements of a system) using our knowledge about some other process/system which has the same elements/outcomes but with different probabilities (e.g., we can try predicting outcomes of tosses of a loaded coin based on our knowledge of how a fair coin operates). The best we can do here is use exactly the same process as our reference point. In this case we will have to deal only with this process’ unpredictability/entropy. Any other process will do worse. Consequently, if we find a better-understood process/system and compute its cross entropy with the process/system of interest we will have an upper bound on the process of interest’s entropy. Way cool, but we still need to know all the probabilities for the base process…
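The coin example in code: predicting a loaded coin with the fair-coin model costs strictly more bits than the loaded coin’s own entropy (the 0.8/0.2 load is arbitrary):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): the expected cost of coding
    outcomes of process p with a code built for process q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

loaded = [0.8, 0.2]   # the process we actually care about
fair   = [0.5, 0.5]   # the better-understood stand-in

h_true  = cross_entropy(loaded, loaded)   # entropy itself, ≈ 0.72 bits
h_bound = cross_entropy(loaded, fair)     # 1.0 bit: an upper bound
```

Using the process itself as the reference point recovers its entropy; any other q can only do worse, which is exactly what makes cross-entropy an upper bound.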

So we approximate. Actually, we approximate twice.

First we approximate to find a probabilistic model of the second system we want to use because it too must allow for an infinite number of possible forms for each element of the paradigm. How do we do this? By taking a sample of paradigms and choosing some model that will (i) best fit these forms and (ii) will describe them probabilistically. Easier said than done: here Cotterell et al. get the most creative and the most technical, and we’ll give their approach the attention it is due in a sec, but first we must approximate the calculation of the cross entropy itself.

Remember that we somehow have just chosen a probabilistic model of the morphology of a language based on a sample. Then we can take another sample, pretend these new words have just entered the lexicon, and try to measure the cross-entropy of our newly-minted description with the ‘true model’ of our process. Here we will need a formula for cross entropy, which is as follows

$-\sum p(m_1, \dots, m_n) \log{q(m_1, \dots, m_n)}$

The m’s in parentheses are different word-forms taken together as a paradigm, p’s are ‘true’ probabilities of full paradigms, whatever that is, q’s are our approximated probabilities for the same paradigms, and we ideally should try all possible ways to fill these paradigms. However, if we fit our approximate model on some sample of real paradigms and then try to predict only other real paradigms we will still have a decent approximation of the entropy because un-real paradigms should ideally have very low probabilities and will make a paltry contribution to the cross-entropy.

Moreover, if the second sample is large enough, what these $p(m_1, \dots, m_n)$ factors will do is essentially average the values of the $\log{q(m_1, \dots, m_n)}$ factors. Therefore, we can say that we will have a decent approximation of the cross-entropy by simply computing

$- \frac{1}{d} \sum \log{q(m_1, \dots, m_n)}$

for the sets of forms $m_1, \dots, m_n$ from the test sample. You see? No p’s any more! We not only successfully approximated away the infinity because the contribution of most of the string-sequence space to the end result is presumably negligible compared to the stuff we can actually observe, but also got rid of ‘true’ probabilities we don’t know because they will essentially be averaging things out and we know how to compute an average straight away.

#### An approximate probability measure I: paradigms as trees

Okay, so all we have to do now is to find q—the way to assign probabilities to the paradigms from the test sample based on what we saw in the training sample.

We found that modelling paradigms as a collection of peer forms doesn’t make much sense. What is the alternative? Make them look like trees. Start from the basic form and then either derive everything straight from it (kill → killed, kills, killing) or use some intermediate form (Hand → Hände → Händen). The only restriction is that any form should have only one parent-form; the starting form doesn’t have a parent. Sometimes it makes sense to describe actual verbal/nominal forms as a kind of cut-and-paste bricolage of other forms, but this is rarely necessary, and the tree approach seems very intuitive.

It is also very practical as it allows us to use the wonderful theory of Bayesian networks, which tells us that if different interdependent things form a tree-like structure there are relatively simple ways to reason about them. A slight complication is that trees have different shapes, and there is no way to know in advance what kind of tree is best suited for a particular paradigm. Fortunately, there is an algorithm for that, which can, given a weighted directed graph (a collection of nodes and edges between them with some weights attached to the edges) and a root node, choose an optimal (in this case most probable) tree with that root. All there is to do is to apply our probability measure to all pairs of forms in a paradigm (in order to estimate how likely we are to derive killing from kill and not from killed, etc.) and to all forms individually (in order to estimate how well they fare ‘on their own’, i.e., as roots; this is a rather counterintuitive measure, but the trees must begin somewhere).

When we find a good tree for a paradigm, we can estimate the joint probability of this paradigm by applying the Bayesian-network method of computing the joint probability based on the transition probabilities, which we already know: they are the same edge weights we used to construct the tree in the first place. Whence the weights then?
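Once the tree is fixed, the Bayesian-network factorisation is just a product: the root’s probability times the probability of every edge. A sketch with invented numbers for the Hand paradigm:

```python
import math

# Invented probabilities for the tree Hand -> Hände -> Händen.
p_root = 1e-6                  # p(Hand | empty string): a lexical prior
p_edges = {
    ("Hand", "Hände"): 0.4,    # p(Hände | Hand)
    ("Hände", "Händen"): 0.9,  # p(Händen | Hände)
}

# Joint probability of the whole paradigm, in log space for safety.
log_joint = math.log(p_root) + sum(math.log(p) for p in p_edges.values())
joint = math.exp(log_joint)    # = 1e-6 * 0.4 * 0.9
```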

#### An approximate probability measure II: neural network

Contemporary NLP is in some cases nearly synonymous with ‘sequence-to-sequence’ neural-network-based modelling: given a sentence in English choose the best corresponding German sentence, given a misspelt word choose the most likely correct word, given a sequence of letters or phonetic symbols choose the most likely pronunciation, etc. There is a fine line here between choosing and generating: sequence-to-sequence neural networks do not really choose from an infinite variety of possible translations for a given English sentence; they choose the most likely starting word given the input, then choose the most likely next word given the previously generated one and the input, and continue in that manner until they decide to choose the token corresponding to the end of a sequence. In the process, they assign probabilities to all the options at each position: when translating ‘I like apples’ into German word by word, a well-trained neural network will start with Ich because this is the most likely opener given the input, but it will also think about the suitability of every other German word it knows in this position and will assign probabilities to all of them.

Consequently, we can, given some English input, give a translator neural network a random sequence of German words of any length and ask it how likely it is, in the network’s opinion, that this sequence is a correct translation. If instead of training a network word by word we train it character by character, we can give it any alphabetic sequence of any length, and it will assign a probability to it, which is what we needed all along.
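The ‘any sequence gets a probability’ property can be illustrated with the simplest possible character-level model, a smoothed bigram model (the paper of course uses a neural sequence-to-sequence model; this sketch only shows the infinite-support idea):

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz#"   # '#' ends a word

def char_prob(prev, ch, bigrams, alpha=0.01):
    """Smoothed bigram probability: the additive smoothing gives every
    character some mass, so ANY string scores above zero."""
    counts = bigrams.get(prev, {})
    total = sum(counts.values())
    return (counts.get(ch, 0) + alpha) / (total + alpha * len(ALPHABET))

def log_prob(word, bigrams):
    """Log-probability of a word, generated character by character."""
    lp, prev = 0.0, "^"                    # '^' marks the word start
    for ch in word + "#":
        lp += math.log(char_prob(prev, ch, bigrams))
        prev = ch
    return lp

# Train on a single word; the model still scores arbitrary strings.
bigrams = {}
prev = "^"
for ch in "killed" + "#":
    bigrams.setdefault(prev, {})
    bigrams[prev][ch] = bigrams[prev].get(ch, 0) + 1
    prev = ch

# The attested form outscores keyboard mash, but both get a probability.
gap = log_prob("killed", bigrams) - log_prob("dfasfasd", bigrams)
```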

But why should we trust neural networks (basically collections of matrices plus some code to multiply the matrices together) to provide correct answers to our questions? Well, mostly because they have proved themselves capable of more or less correctly answering some of our questions. Also we know that they are theoretically capable of answering any questions of a reasonable mathematical nature, but there is no guarantee that a combination of an actual network and a training data-set is fit for a particular task.

In this case, Cotterell et al. rely on the fact that a particular type of neural network, devised by Katharina Kann and Hinrich Schütze,² can look at paradigms of some words in a language and then make mostly correct predictions about paradigms of other words. More precisely, it takes a word-form together with a set of grammatical markers and another set of grammatical markers describing the desired result (along the lines of ‘Hand NOM SG → NOM PL’) and produces another word-form (hopefully ‘Hände’ in this case).

What about the roots of paradigms? Well, these guys are the most probable word-forms derived from empty strings. Again, pretty counterintuitive, but we must start somewhere.

So this is how it all hangs together.

First we train the morphological neural network on pairs of verbal and nominal forms from a huge sample of paradigms; Cotterell et al. took theirs from Wiktionary. We do this separately for each language, of which Cotterell et al. had 31 for nouns and 23 for verbs.

Then we take another sample of paradigms from these languages, convert each paradigm into a tree, compute joint probabilities of all the trees (taken individually), take the logarithms of these probabilities, sum them, multiply by –1, and divide by the number of paradigms in the sample. And we’re done, this is the i-complexity measure. Again, add the e-complexity established based on the number of forms nouns and verbs may take in a given language and serve as a scatter plot.

Cotterell et al. do this for nominal and verbal paradigms in different languages and observe that the top right corner in both scatterplots is empty. As we are doing linguistic typology and not some downstream-task-oriented NLP, we need a p-value. Cotterell et al. use a permutation test: verbal and nominal morphology of each language is represented as two numbers (i-complexity on the y-axis and e-complexity on the x-axis): if we randomly permute second elements of these pairs, we will have a sample of languages with random but kind of realistic i/e-complexity pairs, and we want to know how likely we are to see as empty an upper right corner as we do in our plots. The scholars use two different training schemes for neural networks and report p-values of 0.017 and 0.042 for verbs under these two schemes and p-values of 0.045 and 0.034 for nouns. Not stellar, but the universally coveted and hated 0.05 significance level is here. Case closed.
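The permutation-test logic fits in a few lines. The language sample and the ‘empty corner’ statistic below are both invented stand-ins (the paper’s actual statistic differs), but the mechanics are the same: shuffle one coordinate and see how often chance produces a corner at least as empty as the observed one.

```python
import random

random.seed(0)

# Invented (e-complexity, i-complexity) pairs for five 'languages'.
pairs = [(2, 5.0), (6, 3.1), (12, 1.4), (30, 0.9), (45, 0.5)]

def corner(pairs):
    """Stand-in statistic: the largest e*i product. Small values mean
    the upper-right corner of the scatterplot is empty."""
    return max(e * i for e, i in pairs)

observed = corner(pairs)
es = [e for e, _ in pairs]
is_ = [i for _, i in pairs]

# Shuffle i-complexities against e-complexities: how often does a
# random pairing leave the corner at least as empty as observed?
trials, hits = 10_000, 0
for _ in range(trials):
    random.shuffle(is_)
    if corner(zip(es, is_)) <= observed:
        hits += 1
p_value = hits / trials   # small => the empty corner is unlikely by chance
```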

## The rant

Beautiful, is it not? A reasonably approximated reasonable definition of a morphology’s entropy, automated paradigm-structure modelling based on logical assumptions, an efficient probability measure with an infinite support, a small, but proud language sample, even p-values.

Generally, I am not in a position to judge the technical side of the paper, but it definitely looks very impressive, and the paper’s reviewers seem to be of the same opinion. There are several things that slightly bug me, however.

One of them is that the ‘variational upper bound on entropy’ of a language’s morphology proposed in the paper looks very much like the negative log-likelihood of the data from the test sample given the model fitted on the training sample. When you fit a model, you often do this by maximising its likelihood (or the logarithm of its likelihood), i.e., the probability of observing the data from the training set given the model. The paper makes this explicit in equation 5. So you optimise the model with respect to some widely used measure, then compute the same measure on the test set, and use this as an indicator of how chaotic your data are. Makes sense. And this garden-variety measure also happens to be a clever variational upper bound on the entropy of your dataset. Serendipity.

Also the p-values. There are two of them, and that is usually not ideal. There are things in statistics that people repeat again and again, and they never lose their charm. So, p-values are uniformly distributed under the null hypothesis. Which means that if you performed an experiment and obtained a p-value of 0.05, under the null hypothesis there is a 1/20 probability that whatever you observed is due to chance alone. And if you then went and performed another experiment testing essentially the same thing and again obtained a p-value of 0.05? You had a 1/20 chance of obtaining an interesting result by chance in the first case and then a 1/20 chance in the second case, so what is the probability that at least one of your results is due to chance? If the two experiments you conducted are totally independent, that’s bad for you because your p-values are diluted by a factor of 2 and are in reality 1/10. Multiplying p-values by the number of experiments (a.k.a. Bonferroni correction) is an iron-clad way to make sure you don’t fall prey to multiple comparisons. But what if the experiments were not independent? (Imagine just calculating the p-value again: it will be a kind of a repeated experiment but with a totally deterministic result. No need to perform any correction in this case.) With nouns and verbs used separately in order to test for an upper bound on morphological complexity, there seems to be a multiple-comparison issue, and given that nominal and verbal paradigmatic complexity are not evidently correlated (r = 0.026, another trade-off here?), some correction was probably in order. And there are some pretty large p-values under both training schemes.
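The arithmetic of the complaint, as a sketch (the two p-values fed in are the paper’s reported verb results; the correction applied is the plain Bonferroni one):

```python
# Chance of at least one 'significant' result among k independent
# tests run at level alpha.
def family_wise_error(alpha, k):
    return 1 - (1 - alpha) ** k

fwe = family_wise_error(0.05, 2)   # ≈ 0.0975: roughly 1/10, not 1/20

# Bonferroni: multiply each p-value by the number of tests.
def bonferroni(p_values):
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# The paper's verb p-values under the two training schemes:
corrected = bonferroni([0.017, 0.042])   # 0.084 no longer clears 0.05
```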

Also the paradigm roots derived from empty strings. We are given a nice way of approximating paradigms’ likelihoods given a trained probability measurer, where we choose an optimal rooted tree by multiplying the probability of the root being derived from an empty string by the probabilities of all edges in the tree. Now, the probabilities of edges are well defined: they correspond to implicit derivational/reinflectional rules, i.e., the morphology. But deriving roots from nothing is not morphology, it is just a weird prior distribution over the vocabulary (corresponding, perhaps, to phonotactics: how likely it is that a word of this shape exists in the lexicon).

Finally, I find it stylistically odd to consider a black-box neural-network-based probability guesser as a foundation for ‘a clean mathematical formulation of enumerative and integrative complexity of inflectional systems’, as it is stated in the conclusion. A clean mathematical formulation sounds like something based on a formula, but here it is more of a training algorithm that eventually converges to some formula, which is then impossible to reason about because it has more parameters than data-points in the data-set.

Now let’s switch to linguistics and talk about assumptions. Cotterell et al. proclaim that ‘the only true assumption we make of morphology is mild: we assume it is Turing-computable; that language is Turing-computable is a fundamental tenet of cognitive science’. I wonder, however, how Turing computability can be the only assumption if Cotterell et al.’s algorithm imposes a tree-like structure on paradigms. Remember my saying that sometimes it makes sense to model, say, an Ablative-Plural form as the stem of Genitive-Singular + the suffix of Nominative-Plural or whatever? Nobody says that this is the way to go, but the tree-based approach prohibits this (probably because inference on more loosely-structured graphs is much more difficult).

While we’re at it, Cotterell et al. also assume that paradigms are finite. This is not a given: there are languages with highly compositional inflection whose ‘paradigms’ comprise millions or even a potentially unbounded number of forms when morphology follows some kind of word-internal syntax.

Not that this invalidates anything, though. Large/unbounded paradigms definitely must be regular in order to be understandable. Also, if there is a more clever way to structure a given paradigm and make it more regular, this will surely not invalidate the upper bound on the complexity.

The complexity, which is what, exactly? We started out trying to ‘quantify the linguistic complexity of different languages’ morphological systems’. Cotterell et al. note that, ‘While the regular English verb paradigm has three slots in our annotation, the Archi verb will have thousands (Kibrik, 1998)’, and then provocatively ask, ‘However, does this make the Archi system more complex?’

First, like, seriously? If you want to propose a unified measure for morphological complexity, please be sure to make it treat Archi as more complex than English. The comparison between German and Hungarian at the very end of the paper seems to be more fair.

Secondly, I would still like to see the actual answer to this question. Which I don’t see because the paper is not about quantifying the linguistic complexity. It is about splitting morphological complexity into two orthogonal measures and positing that English and Archi are at the same time incomparable (the set of (x,y) points on the Cartesian plane is unordered) and that both fall under the Pareto curve bounding the maximum value on the y-axis given a value on the x-axis. Does this imply that Archi therefore is not more morphologically complex than English because they both belong to the same complexity class ‘human language’ or something? Well, we now know or seem to believe that human languages are vaguely optimal; Simon Kirby, Florian Jaeger, and many other people have been talking about this for ages. However, languages, when left alone, can sometimes become pretty intense.

Finally, coming back to the assumptions. In practice, it is good to assume nothing: you do not overfit. Theoretically, however, the less you assume about language, the less you can say about it. To put it bluntly, we have just measured entropies of several collections of rewrite rules hidden inside a huge neural network and applied in some order to tree-structured collections of strings. If we do assume that speakers/learners actually do something of this kind (in using rewrite rules [quite likely] and assigning probabilities to their outcomes [not likely, if you ask me, although this is a basic tenet of cognitive science]), we assume much more than Turing computability. If we don’t, we’re not exactly doing what I would call linguistics.

Many thanks to Eitan Grossman for reading a draft version of this post and to the members of the HUJI CS dept NLP lab reading group for an enlightening discussion. The responsibility for errors of judgement, logic, and fact rests with this guy.

1. Farrell Ackerman and Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language, 89(3):429–464.

2. Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 555–560, Berlin, Germany, August. Association for Computational Linguistics.