Edited by: Antonio Benítez-Burraco, Sevilla University, Spain

Reviewed by: Caleb Everett, University of Miami, United States; Steven Moran, University of Neuchâtel, Switzerland

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Causal processes can give rise to distinctive distributions in the linguistic variables that they affect. Consequently, a secure understanding of a variable's distribution can hold a key to understanding the forces that have causally shaped it. A storied distribution in linguistics has been Zipf's law, a kind of power law. In the wake of a major debate in the sciences around power-law hypotheses and the unreliability of earlier methods of evaluating them, here we re-evaluate the distributions claimed to characterize phoneme frequencies. We infer the fit of power laws and three alternative distributions to 166 Australian languages, using a maximum likelihood framework. We find evidence supporting earlier results, but also nuancing them and increasing our understanding of them. Most notably, phonemic inventories appear to have a Zipfian-like frequency structure among their most-frequent members (though perhaps also a lognormal structure) but a geometric (or exponential) structure among the least-frequent. We compare these new insights with the kinds of causal processes that affect the evolution of phonemic inventories over time, and identify a potential account for why, despite there being an important role for phonetic substance in phonemic change, we could still expect inventories with highly diverse phonetic content to share similar distributions of phoneme frequencies. We conclude with priorities for future work in this promising program of research.

Linguistic theorists seek to reveal causal mechanisms which explain the observable diversity of human language. Good causal hypotheses are often suggested by the mathematical distribution that describes a linguistic variable, since the distribution can be understood as an emergent outcome of some underlying causal process, and a given mathematical distribution will be consistent with only certain mathematical kinds of underlying processes. Consequently, it is important for the development of theory that proposed claims about distributions be as sound as possible. For instance, one of the most famous distributions in linguistics is the Zipfian distribution, which technically speaking is a kind of power law. Recently, however, the evaluation of putative power laws across the sciences has come under intense scrutiny and often been found wanting. In response, methodologists have developed more rigorous and secure methods for diagnosing power laws and for distinguishing them from similar but significantly different distributions. For linguists, this creates an opportunity to re-examine our own putative power law distributions, and by doing so to improve the pathway to sound explanatory theorizing.

Here, we re-evaluate the status of mathematical distributions for characterizing phoneme frequencies. Previous studies have proposed that phoneme frequencies follow a particular member of the power law family, the Yule-Simon distribution (Martindale et al.,

Our focus here is on distributions and their potential for shedding light on language. Distributions are properties of variables. A variable can be defined as the set of values that characterize something, be it a

With these motivations in mind, how can an investigator decide that a certain distribution characterizes a variable satisfactorily? One method is visual inspection. This typically involves observing a close match between two histogram plots: one of the data and one of the candidate distribution, or plotting a regression of the data against the candidate distribution. In work on phoneme frequencies, visual inspection has been the primary method of assessing candidate distributions. A more rigorous alternative is to apply quantitative, statistical tests to evaluate how well a particular distribution model (such as the normal distribution) fits a data sample. The purpose of these tests is not to prove definitively that some variable follows a particular distribution, but rather to quantify the degree to which a sample's distribution is consistent with its having been drawn from a population of a particular distribution^{1}
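For instance, the degree of consistency between a sample and a candidate distribution can be quantified with a Kolmogorov–Smirnov (KS) statistic, the measure used later in this study. The following minimal sketch (in Python, with invented toy data, and not the study's own code) computes the KS distance between a small sample and a hypothetical geometric model:

```python
from collections import Counter

def empirical_cdf(sample):
    """Return a function giving the empirical CDF of an integer-valued sample."""
    n = len(sample)
    counts = Counter(sample)
    xs = sorted(counts)
    cum, cdf = 0, {}
    for x in xs:
        cum += counts[x]
        cdf[x] = cum / n
    def F(x):
        best = 0.0
        for v in xs:          # xs is sorted, so walk up to x
            if v <= x:
                best = cdf[v]
            else:
                break
        return best
    return F

def ks_statistic(sample, model_cdf):
    """Largest absolute gap between empirical and model CDFs (the KS distance)."""
    F = empirical_cdf(sample)
    return max(abs(F(x) - model_cdf(x)) for x in sorted(set(sample)))

# Toy comparison: an invented sample against a geometric model with decay 0.5.
sample = [1, 1, 1, 2, 2, 3, 4, 7]
geom_cdf = lambda k, p=0.5: 1 - (1 - p) ** k   # P(X <= k) for k = 1, 2, ...
print(round(ks_statistic(sample, geom_cdf), 3))  # → 0.125
```

A small KS distance indicates only that the sample is *consistent* with the model, in line with the point above that such tests quantify plausibility rather than prove a distribution.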

A strong motivation for testing the consistency of observed data with a particular distribution, is that many distributions can be described as the

Since Yule's first demonstration of the link between power law distributions and preferential attachment processes, power laws have been used to characterize the distributions of a diverse array of phenomena in the natural and physical world and in human society (Clauset et al.,

Given the apparent pervasiveness of power laws in a diverse range of unrelated contexts, it is little surprise that there is a rich vein of literature dedicated to evaluating power laws (Clauset et al., ^{2}

The traditional approach to power law validation, following Pareto's (^{2} statistic could give an indication of the goodness of fit of the model. However, it has since been demonstrated that this traditional approach can be systematically unreliable (Clauset et al., ^{2} statistic commonly used to validate the presence of a power law (including by Tambovtsev and Martindale, ^{2} is reduced (Clauset et al.,

To remedy the shortcomings of earlier methods, Clauset et al. (

Some brief mathematical preliminaries will be useful here. When we refer to distributions, we are referring to mathematically defined functions. In a power law, one quantity varies as a fixed power of another, for example x^{3}, or x^{−2} (which can also be written 1/x^{2}). For present purposes, where we will not be concerned with negative quantities or zeros, we will use a narrower definition by Clauset et al.^{3}

In practice, Clauset et al. propose estimating a lower bound on power-law behavior, x_{min}. For example, in frequency data, it may be that only items whose frequencies meet or exceed a lower threshold will follow a power law. More generally, power law distributions come in a variety of specific forms, with different numbers of free parameters. We detail some of these in greater depth in the Discussion section below.

A distinction can be made between power laws that apply to continuous variables and those that apply to discrete ones. Frequency data, including the phoneme frequencies used in this study, are typically discrete. Zipf's Law (2) applies to a discrete set of frequencies ranked in descending order, f_{1} ≥ f_{2} ≥ … ≥ f_{n}, where f_{k} is the frequency of the item ranked kth and f(k) is the relative frequency of the kth-ranked item^{4}
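A minimal numerical illustration of this discrete rank–frequency form (toy code, not drawn from the study itself): under Zipf's law, the relative frequency of the rank-k item is proportional to 1/k^α.

```python
def zipf_relative_frequencies(n, alpha=1.0):
    """Expected relative frequency of the rank-k item under Zipf's law:
    f(k) proportional to 1 / k**alpha, normalized to sum to 1 over n ranks."""
    weights = [1 / k ** alpha for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

freqs = zipf_relative_frequencies(5, alpha=1.0)
# With alpha = 1, unnormalized weights decay as 1, 1/2, 1/3, ...,
# so the top-ranked item is exactly twice as frequent as the second.
print([round(f, 3) for f in freqs])  # → [0.438, 0.219, 0.146, 0.109, 0.088]
```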

There is a long history of studying power laws in linguistics; however, the evaluation of statistical support for a power law relationship is far from straightforward and remains topical across a wide range of scientific fields (Stumpf and Porter), and the traditional method of evaluation (via the R^{2} statistic) has been shown to be systematically unreliable (Clauset et al.,

As our data, we take phoneme frequencies in the lexicons of 166 language varieties of Australia, covering the 19 of Australia's language families for which phonemic lexical data is available, including all major branches of the Pama-Nyungan family which dominates the continent^{5}

The phonemic frequencies in this study are extracted from wordlists. A consequence of this is that in our data, all lexemes are counted just once and so are weighted equally^{6}

Our data comes from the Ausphon-Lexicon database, under development by the second author (Round,

To illustrate an example of the phoneme frequencies in our sample,

Frequency of phonemes in Walmajarri lexicon (Hudson and Richards,

Phonological frequency data differs in some respects from the data types most commonly encountered in scientific power law studies, such as word frequencies or city populations. Typically, in order to understand a population (and some property of it), such as the cities in the United States (and their sizes), or the words of English (and their frequencies), it is impractical to examine every last member of the population, and so the study will examine a sample. Ensuring that a sample is of sufficient size is an important consideration, firstly in order to adequately represent the population and additionally because a sufficiently large sample size is an important requirement in maximum likelihood estimation (Barndorff-Nielsen and Cox, ^{7}

Given a sample of phonemes, we require an estimate, or measurement, of their frequencies. Measurement error is a potential concern in this study. Our segment frequencies are calculated from documented wordlists, which necessarily are limited representations of the complete vocabulary of the languages that the wordlists represent. One concern is that the particular morphology of a language's citation forms may cause certain segments in the language to be overrepresented in a wordlist which contains only citation forms. This would represent a bias, that is, a factor that pushes observations in a certain direction. We have attempted to control for this, by removing identifiable citation-form tense morphology from verbal words and noun-class prefixes from nominals. Another source of concern is that wordlists with a smaller number of words will necessarily entail a greater level of uncertainty in the observed segment frequencies. This will be a source of noise in the data. It does not push observed frequencies in any particular direction, but makes them generally less accurate. To address this, in our study, we restrict the language sample to language varieties with a minimum wordlist size of 250 lexical items. We selected 250 lexical items as a cut-off on the basis of Dockum and Bowern (
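The frequency counts themselves are straightforward to derive from a segmented wordlist. The following schematic sketch (with an invented three-word list, not data from Ausphon-Lexicon) counts each lexeme exactly once, so that lexemes are weighted equally as described above:

```python
from collections import Counter

# Hypothetical wordlist: each lexeme appears once and is already segmented
# into phonemes, so every lexeme contributes equal weight to the counts.
wordlist = [
    ["p", "a", "m", "a"],
    ["k", "u", "t", "u"],
    ["m", "a", "r", "a"],
]

counts = Counter(seg for word in wordlist for seg in word)
total = sum(counts.values())

# Rank phonemes from most to least frequent, with relative frequencies.
ranked = counts.most_common()
for rank, (segment, count) in enumerate(ranked, start=1):
    print(rank, segment, count, round(count / total, 3))
```

In the real data, the same procedure is applied after stripping identifiable citation-form morphology, and only to wordlists of at least 250 lexemes.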

We test for the presence or absence of a power law in the distributions of phonological segments following the maximum likelihood framework described by Clauset et al. (

Estimate the parameters x_{min} and α of the power law model using the maximum likelihood method (Barndorff-Nielsen and Cox, ^{8}

Calculate the goodness-of-fit between the data and the power law using the Kolmogorov–Smirnov (KS) statistic, where a larger value corresponds to a worse fit. Using a Monte Carlo procedure, a bootstrapped p value is then generated^{9}^{10}

Compare the power law model with a set of models representing alternative hypotheses. For each alternative model, a bootstrapped
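The three steps above can be sketched in simplified form. The code below is an illustrative approximation, not the study's implementation: it uses the continuous-power-law MLE and inverse-CDF sampling on synthetic data, and it omits the search over x_{min}, which in the full procedure is chosen to minimize the KS statistic.

```python
import math, random

def fit_alpha(data, xmin):
    """Continuous maximum likelihood estimator for the power-law exponent."""
    tail = [x for x in data if x >= xmin]
    return 1 + len(tail) / sum(math.log(x / xmin) for x in tail)

def power_law_cdf(x, alpha, xmin):
    return 1 - (x / xmin) ** (1 - alpha)

def ks(data, alpha, xmin):
    """KS distance between the empirical CDF of the tail and the fitted model."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    return max(abs((i + 1) / n - power_law_cdf(x, alpha, xmin))
               for i, x in enumerate(tail))

def draw_power_law(n, alpha, xmin, rng):
    """Inverse-CDF sampling from a continuous power law."""
    return [xmin * (1 - rng.random()) ** (-1 / (alpha - 1)) for _ in range(n)]

def bootstrap_p(data, xmin, n_sims=200, seed=1):
    """Fraction of synthetic power-law samples fitting *worse* than the data:
    a small p value makes the power law implausible for these data."""
    rng = random.Random(seed)
    alpha = fit_alpha(data, xmin)
    observed = ks(data, alpha, xmin)
    n = len([x for x in data if x >= xmin])
    worse = sum(1 for _ in range(n_sims)
                if ks(sim := draw_power_law(n, alpha, xmin, rng),
                      fit_alpha(sim, xmin), xmin) >= observed)
    return worse / n_sims

rng = random.Random(0)
data = draw_power_law(500, alpha=2.5, xmin=1.0, rng=rng)
print(round(fit_alpha(data, 1.0), 2), bootstrap_p(data, 1.0))
```

Because the synthetic data really are power-law distributed, the estimated α should land near 2.5 and the bootstrapped p value should not be extreme.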

We use the ^{11}

As a brief point of comparison to prior work, we return to the Walmajarri example and plot the linear relationship between phoneme frequencies and rank on a log-log plot, as in Tambovtsev and Martindale. The linear fit improves visibly once the lowest-frequency segments are excluded, consistent with an x_{min} parameter, below which the power law ceases to apply. Visual inspection of other languages in the dataset indicates that Walmajarri's pattern of phoneme frequencies is common, although there is a good deal of variation (and, consequently, variation in the fit of a linear model). However, given the known limitations of applying a linear model to a log-log plot, we now turn to more reliable methods for validating the presence of a power law, using the maximum likelihood method outlined above.

Log-log plot of frequencies vs. frequency ranks in Walmajarri. When a linear model is fitted to the full distribution (dashed black), high and low frequency segments are overestimated and mid-rank segments are underestimated. When lowest-frequency segments are removed from the model (solid blue), the model appears to fit well.
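For concreteness, the traditional (and, as noted, unreliable) regression approach amounts to taking the least-squares slope of log frequency against log rank. A minimal sketch with invented, exactly Zipfian frequencies, for which the slope is −1:

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank): the
    'traditional' power-law exponent estimate, known to be unreliable."""
    freqs = sorted(freqs, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Exactly Zipfian toy frequencies f(k) = 1000 / k give a slope of -1.
freqs = [1000 / k for k in range(1, 21)]
print(round(loglog_slope(freqs), 3))  # → -1.0
```

Real phoneme frequencies are not exactly Zipfian, and as the figure shows, a single line systematically misfits the high, mid, and low ranks at once.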

We firstly infer the fit of a power law to the full distribution of phoneme frequencies for each language, without estimating an x_{min} parameter. Results are summarized in the following table.

Power law (without _{min}).

α | 1.38 | 0.17 | 1.16 | 2.18 |

Goodness-of-fit | 0.35 | 0.07 | 0.15 | 0.53 |

0.01 | 0.03 | 0.00 | 0.27 |

Mean α is 1.38 (SD 0.17). As discussed earlier, the standard range of α is 2 < α < 3 (Clauset et al.,

Throughout this study, the possibility of type I error (false positives) must be taken into consideration. By setting our implausibility range at p < 0.1, we can expect, across 166 tests, to reject H_{0} incorrectly in around 17 (10%) of those tests. In this instance though, we have rejected H_{0} as implausible in 99% of the language sample. Thus, it is clear that the power law distribution is being deemed implausible not merely by chance. It is genuinely a poor fit for the vast majority of languages. This result accords well with earlier work which has found that a simple, one-parameter form of the power law distribution poorly characterizes phoneme frequencies (Sigurd,
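The expected number of false rejections is simple to compute, and a binomial tail bound (a toy calculation, not from the study) shows why rejection in some 164 of 166 tests cannot plausibly be type I error alone:

```python
from math import comb

n_tests, threshold = 166, 0.1
# Under H0 in every test, the expected number of false rejections:
expected_false_rejections = n_tests * threshold   # ≈ 16.6

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more rejections
    arising purely by chance when every null hypothesis is true."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

print(round(expected_false_rejections, 1))
# The probability of 164+ chance rejections out of 166 is astronomically small.
print(binom_tail(166, 164, 0.1) < 1e-100)  # → True
```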

As discussed earlier, our dataset of phoneme frequencies is very likely to contain the complete population of phonemes in each language. At the same time, the number of observations per language is low—ranging from 16 to 37 segments in our language sample (mean 24.5, SD 3.8). Such a small set of observations can be a barrier to highly accurate maximum likelihood estimation. Clauset et al. (

If the power law distribution, as inferred above, is inadequate for characterizing phoneme frequencies, then what other options are there? There are a couple of approaches to this question. One is to add an additional parameter to improve the fit of the power law; the other is to consider alternative distribution types. In this and the following sections we explore both approaches.

Here, we infer the fit of a power law distribution with an additional x_{min} parameter, whose effect is to remove some of the least-frequent observations from the sample which is being fitted. As above, we use maximum likelihood to infer the best-fitting x_{min} threshold for each language. Results are summarized in

Power law distribution (with x_{min}).

| | Mean | SD | Min | Max |
|---|---|---|---|---|
| α | 2.75 | 0.65 | 1.51 | 6.14 |
| Goodness-of-fit | 0.14 | 0.03 | 0.08 | 0.22 |
| Bootstrapped p | 0.62 | 0.26 | 0.01 | 1.00 |

After inferring an x_{min} parameter, the power law distribution is fitted to an average of only 13.9 segments, though there is a wide degree of variation (SD 4.1). In percentage terms, the power law distribution is fitted to an average of 57% of a language's segmental inventory (SD 16%). 117 languages (70%) fall within the normal 2–3 range for α. Having only a small number of included observations above the x_{min} threshold can drive unreasonably high estimates of the α scaling parameter, and a sizeable portion of our sample (43 languages, 26%) falls in this high range, with α above 3. At the other extreme, 6 languages (4%) have an unusually low α, under 2. Mean α is 2.75 (SD 0.65).

When x_{min} is included, the power law hypothesis is accepted as plausible (though, to emphasize, not necessarily correct) in the 158 of 166 language varieties for which we cannot reject H_{0}. For the remaining languages, the bootstrapped p values put the chance of falsely rejecting H_{0} at around 1 in 100. The likelihood of a 1-in-100 event is high in a set of 166 tests. Overall, since the number of rejections is within the range expected from type I error alone, we do not rule out the power law (with x_{min}) for any specific language.

Although we have failed to rule out the power law distribution as implausible for any specific language, this still does not mean that the power law distribution is the optimal one for our data, and there are some important caveats to our results so far.

A distribution with an additional free parameter will always fit a set of data at least as well as the same distribution with one fewer parameter. Thus, the observation that the power law distribution fits better when x_{min} is added requires some interpretation. Of greatest interest in this respect is the striking degree of improvement in fit, such that the power law distribution shifts from a largely implausible fit against full phoneme inventories, to a largely plausible fit after we exclude the least-frequent observations from samples. This raises the obvious question of why this might be so. We consider this in our Discussion, after we have also examined distributional alternatives to power laws.

The inclusion of an x_{min} parameter when fitting power laws is common practice, but its use is most obviously motivated in contexts where there are very many possible observations. For example, in Clauset et al.'s analysis of word frequencies, the estimated x_{min} is 7 (±2). Words occurring fewer than 7 times can be disregarded and this still leaves nearly 3,000 unique words to which the power law distribution can be fitted. In contrast to this typical use case, where a large number of observations remain in play and do fit the power law, our use of x_{min} with phoneme datasets results in the exclusion of data points from an already small sample, leaving an even smaller set of data being fitted. As a general fact, it is inherently difficult to identify the most appropriate distribution for a small collection of observations. Correspondingly, it is not automatically an insightful finding that a power law can be plausibly fitted to such small datasets. However, as mentioned just above, it is noteworthy that the same power law did not fit well to the slightly larger datasets that were being used without the x_{min} parameter. This suggests that it is not merely the small size of the dataset which is causing the good plausibility of the fit.

Small samples of observations can inflate bootstrapped p values, making it easier for any candidate distribution to appear plausible.

In addition to considering the merits of adding extra parameters to a distribution, we must also consider whether a completely different distribution would provide an equally good or better fit to the data. We consider three alternative distributions, which are not part of the power law family and may suggest different underlying generative processes. These are the lognormal, exponential and Poisson distributions. Like the power law distribution, the shape of these distributions can have a sharp initial peak and a rapidly decaying tail, as illustrated in

Four distribution types: power law, exponential, lognormal and Poisson, each illustrated with four parameterizations.

The lognormal distribution is one where the data form a normal distribution when transformed onto a log scale. Once again, we use maximum likelihood to fit the distribution to each language's phoneme frequencies, without an x_{min} parameter at this stage (though see below). The lognormal distribution narrowly construed is a continuous distribution, however the
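To illustrate the lognormal's defining property, maximum likelihood fitting of the continuous form reduces to taking the mean and standard deviation of the logged data. A sketch with invented values (not the study's fitting code, which uses a discretized form):

```python
import math, statistics

def fit_lognormal(data):
    """MLE for a continuous lognormal: mean and SD of log(x)."""
    logs = [math.log(x) for x in data]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)   # population SD is the MLE
    return mu, sigma

def lognormal_cdf(x, mu, sigma):
    """P(X <= x) when log(X) is Normal(mu, sigma)."""
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

# Toy data whose logs are symmetric around log(10): mu = log(10), sigma = sqrt(2),
# and the CDF at 10 (the median) is exactly 0.5.
data = [10 * math.exp(z) for z in (-2, -1, 0, 1, 2)]
mu, sigma = fit_lognormal(data)
print(round(mu, 3), round(sigma, 3), round(lognormal_cdf(10, mu, sigma), 3))
```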

As for the power laws above, we calculate bootstrapped p values. The lognormal distribution with no x_{min} proves plausible for far more languages than the power law distribution with no x_{min}, which we ruled out for up to 99% of languages in the sample. One caveat to keep in mind is that the lognormal distribution is minimally defined by two parameters rather than one, which potentially puts it at an advantage compared to the single-parameter power law distribution.

An exponential distribution, and its discrete analog, a geometric distribution, is one in which frequencies decay at a constant proportional rate. Thus, for the frequencies of any two successively-ranked phonemes, f_{k} and f_{k+1}, the distribution is characterized by a rate parameter, λ, such that f_{k+1} = λ·f_{k}. Here, as above, we use maximum likelihood to estimate the parameter λ and the bootstrapping procedure to obtain a p value^{12}
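This constant-ratio property makes the geometric distribution simple to work with. A toy sketch (not the study's code) of its probability function and the maximum likelihood estimate of λ:

```python
def geometric_pmf(k, lam):
    """P(K = k) under constant proportional decay by lam:
    p(k) = (1 - lam) * lam**(k - 1), for k = 1, 2, ..."""
    return (1 - lam) * lam ** (k - 1)

def fit_lambda(data):
    """MLE for the decay rate on support 1, 2, ...: lambda-hat = 1 - 1/mean."""
    mean = sum(data) / len(data)
    return 1 - 1 / mean

probs = [geometric_pmf(k, 0.5) for k in range(1, 6)]
# Each probability is exactly half the previous one: constant proportional decay.
print([round(p, 4) for p in probs])
print(fit_lambda([1, 1, 2, 4]))  # mean is 2, so lambda-hat = 0.5
```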

Bootstrapped

The final distribution we consider is the Poisson distribution, which is related to the exponential distribution. The Poisson distribution is typically used to model the frequency of an event within some interval of time or space. Our case is a little different, since we are modeling the relationship between the frequency of many different events (different phonological segments) and their frequency rank in a language's phonological inventory. As with the exponential distribution, we use maximum likelihood to estimate a single parameter, λ, and use bootstrapping to obtain a p value.

The Poisson distribution is totally implausible for all languages in our language sample. Goodness-of-fit statistics range from 0.43 to 0.75 (mean 0.59, SD 0.07). We find

In the results summary table we summarize outcomes when x_{min} is included, just as we did for the power law distribution. (Note: for one language, the bootstrapping procedure failed to return a p value for the lognormal distribution with x_{min}. This is the only distribution we tested which has three free parameters, and in this instance, the algorithmic procedure struggles to differentiate solutions with very similar likelihoods.) Perhaps most noteworthy among these results is the greatly increased inconclusiveness of the method when applied to the reduced set of data points lying above the x_{min} threshold. When the fitting task is restricted to a subset of only the most frequent segments in a language, it is possible to plausibly fit all but the Poisson distribution to any language, after type I error is factored in. One difference, which we nuance further in the next section, is that power law distributions with x_{min} are fitted on average to only 57% of a language's phonemes, whereas the lognormal and exponential distributions are fitted to closer to 80%.

Results summary.

| Distribution | Plausible without x_{min} | Plausible with x_{min} | Mean % of inventory fitted (with x_{min}) |
|---|---|---|---|
| Power law | 2 (1%) | 158 (95%) | 56% |
| Lognormal | 93 (56%) | 155 (93%) | 78% |
| Exponential | 147 (89%) | 146 (88%) | 84% |
| Poisson | 0 (0%) | 43 (26%) | 17% |

The third and final step in Clauset et al.'s (

Vuong's likelihood ratio test compares the fit of two candidate models, which need not be nested, by asking whether one model's likelihood is significantly greater than the other's.
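In outline, the test sums the pointwise log-likelihood differences between the two models and normalizes by their standard deviation, yielding an approximately standard-normal statistic. A schematic implementation on invented data (our sketch, not the study's code):

```python
import math, statistics

def vuong(data, logpmf_a, logpmf_b):
    """Vuong's test for non-nested models: the sign of R says which model the
    likelihood favors; the p value says whether the difference is significant."""
    d = [logpmf_a(x) - logpmf_b(x) for x in data]
    n = len(d)
    R = sum(d)                           # total log-likelihood advantage of A
    sigma = statistics.pstdev(d)         # spread of pointwise differences
    z = R / (sigma * math.sqrt(n))
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return R, p

# Toy comparison: a geometric model (lam = 0.5) vs. a uniform model over 1..8.
geom = lambda k: math.log(0.5 * 0.5 ** (k - 1))
unif = lambda k: math.log(1 / 8)
data = [1, 1, 1, 2, 2, 3, 4, 8]
R, p = vuong(data, geom, unif)
print(round(R, 2), round(p, 3))
```

Here R > 0 (the likelihood favors the geometric model), but with only eight observations the difference is far from significant, mirroring the prevalence of inconclusive comparisons reported below.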

We begin by comparing distributions without the x_{min} parameter. As summarized above, the power law and Poisson distributions (without x_{min}) have already been rejected as implausible for all or nearly all languages. Accordingly, we conduct just one likelihood ratio test per language, comparing the fit of the exponential vs. lognormal distributions. Overall, we find that Vuong's likelihood ratio test somewhat favors the exponential distribution. Likelihood ratios favor the exponential distribution for 122 languages, and the lognormal distribution for 44 languages. However, after Bonferroni correction, the difference in the likelihood of exponential and lognormal models is statistically significant for only two languages, Thaynakwithi and Dalabon, both favoring the exponential distribution.

Turning to distributions with the x_{min} parameter, since we have already rejected the Poisson distribution, we conduct likelihood ratio tests pairwise among the remaining three distributions. In order to compare distributions with x_{min} parameters, it is necessary to set x_{min} to the same value in both distributions (Gillespie). To do this, we take the x_{min} value from distribution A and, using it, re-estimate the other parameters of distribution B, and conduct one likelihood ratio test. Then we take x_{min} from B, use it to re-estimate the other parameters of distribution A, and conduct a second likelihood ratio test, giving two results for each pair of distributions.

Comparing the exponential and lognormal distributions, the likelihood ratios favor the lognormal distribution (139 languages to 27) using x_{min} from the lognormal fit, and favor the exponential distribution (103 languages to 63) using x_{min} from the exponential fit; however, none of these comparisons reaches significance after Bonferroni correction.

Comparing the power law and lognormal distributions, likelihood ratios favor the lognormal distribution (144 languages to 22) using x_{min} from the power law fit, and for all languages when using x_{min} from the lognormal fit; however, only two of these comparisons reach significance after Bonferroni correction: Yir Yoront favors the power law when using x_{min} from the power law fit, and Malyangapa favors the lognormal distribution using x_{min} from the lognormal fit.

Comparing the power law and exponential distributions, the likelihood ratios favor the power law (133 languages to 33) when taking x_{min} from the power law fit, though no comparison reaches significance. They favor the exponential distribution 164 languages to 2 when x_{min} is taken from the exponential fit; thirteen of those comparisons reach significance.

In sum, we found earlier that when parameterized without x_{min}, only the exponential and lognormal distributions were broadly plausible. Vuong's likelihood ratio test marginally favors the exponential distribution over the lognormal when fitted against entire phonemic inventories, but the difference is at most slight. When parameterized with x_{min}, the power law distribution is fitted to around 60% of languages' phonemes on average, while the exponential and lognormal are fitted to around 80%. Vuong tests, transferring the x_{min} parameter from one distribution to the other, provide slender evidence of the following. Even when fitted against the small phonemic subsets favored by the power law, the lognormal distribution may weakly outperform the power law, but the exponential distribution does not. Fitted against the larger subsets favored by the exponential and lognormal distributions, the power law is outperformed by both. The performance of the latter two distributions is indistinguishable.

Here we contextualize the current study more fully in the history of research into power laws and phoneme frequencies, before drawing implications and conclusions from the new findings we have obtained.

Investigation of power laws in the linguistic sphere has a long history. One of the oldest and best-known examples of a power law in any discipline is the distribution of word frequencies in text corpora, first noted by Estoup (

Power laws have also been proposed to describe the distribution of phoneme frequencies. The use of Zipf's Law to model the frequencies of phonological segments initially appears to be an attractive prospect (Witten and Bell,

Sigurd proposed instead that phoneme frequencies decay at a constant proportional rate, f_{k+1} = λ·f_{k} for some rate parameter λ, giving the discrete distribution:

Good (

Which is to say:

In support of this model, Good (

As

Gusein-Zade (^{13}

Using the R^{2} statistic in a linear regression, Martindale et al. compare the fit of the parameter-free negative logarithmic equation of Borodovsky and Gusein-Zade (

The Yule-Simon equation in (7) is the product of the power law in (2) and the geometric equation in (3). Because of the differing rates at which the two parts of the equation decay as

The Yule-Simon equation in (7) has not just one free parameter but two, the exponent α and the rate λ, and the Zipfian and Sigurd equations are effectively special cases of it, each with one parameter fewer. The Zipfian distribution is equivalent to (7) with λ set to 1 (so that λ^{k} = 1), while the geometric equation is equivalent to (7) with α set to zero (so that 1/k^{α} = 1). This is important, since as a general fact, if distribution A is a special case of distribution B, with fewer free parameters than it, then B will always perform at least as well as A when fitting the same set of data. Thus, the Yule-Simon distribution will necessarily fit the same set of data at least as well as either the Zipfian distribution or Sigurd's geometric distribution.
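This nesting relationship is easy to verify numerically. A toy sketch of the unnormalized Yule-Simon weight λ^{k}/k^{α} and its two special cases:

```python
def yule_simon(k, alpha, lam):
    """Unnormalized Yule-Simon weight: a power-law factor times a geometric one."""
    return lam ** k / k ** alpha

# With lam = 1 the geometric factor is constant and we recover the Zipfian
# 1/k**alpha; with alpha = 0 the power-law factor is constant and we recover
# the geometric lam**k.
ks = range(1, 6)
zipfian   = [yule_simon(k, alpha=1.5, lam=1.0) for k in ks]
geometric = [yule_simon(k, alpha=0.0, lam=0.5) for k in ks]
print([round(v, 3) for v in zipfian])
print([round(v, 3) for v in geometric])
```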

Martindale et al. also test whether the difference in R^{2} between different distributions is significant for each language group. Again, they find that the Yule-Simon distribution fits best overall^{14}

Obtaining a better fit by using a distribution with an additional parameter may be relatively trivial mathematically speaking, but this does not mean it is uninteresting. The extra parameter may work to capture a significant real-world nuance in an underlying causal process or describe the effect of one or more secondary processes. A compelling causal explanation of a complex distribution might therefore be formulated by identifying some real-world factor and explaining how its mathematical effect on the distribution is expected to match what we find. It is also important to consider the possibility of equifinality—the fact that multiple, different real-world phenomena may have equivalent mathematical effects. Tests of goodness-of-fit examine only the mathematical aspect, and cannot distinguish between different phenomena whose detectable mathematical contribution is equivalent.

Martindale et al. offer no causal interpretation of the geometric factor λ^{k} that differentiates the Yule-Simon distribution (7) from the Zipfian (2)^{15}

Power laws have attracted wide and sustained scientific interest. Recent debates on their validity have prompted the development and widespread adoption of novel evaluation methods that are more reliable than those used in the past (Clauset et al.,

Using a more reliable evaluation procedure than previous investigations, we have confirmed the finding that a basic power law distribution, with a single free parameter, is generally insufficient for characterizing phoneme frequencies. Additionally, we reconfirm a result going back to Sigurd (

A second novel contribution was to consider the addition of an x_{min} parameter, a practice which is now common in power law research. Notably, while power laws are largely implausible fits for entire phoneme inventories, their plausibility is improved strikingly once a subset of the least-frequent phonemes is removed from the sample. This is despite the fact that the full inventories and the reduced ones share the property of comprising notably small samples. The subset removed in order to achieve maximum likelihood is on average large, at 43%. This result indicates that power laws constitute a plausible characterization for the more-frequent portion of phonemic inventories, and explains why the upper end of a Yule-Simon distribution, which most closely approximates a power law, should be a reasonable fit. We note, however, that the lognormal distribution also performs well in this same, high-frequency region of phonemic inventories. Exponential (or geometric) distributions do not fit the higher-frequency portion of inventories as well as the power law or lognormal do, but they are good fits for entire inventories, suggesting that they fit particularly well in lower-frequency portions. This would explain why the lower end of a Yule-Simon distribution, which most closely approximates a geometric distribution, should be a reasonable fit.

Using an evaluation procedure which has since been shown to be unreliable, Martindale et al. concluded that phoneme frequencies follow a Yule-Simon distribution. Here, by fitting x_{min} parameters, we have provided some estimates of where power-law-like behavior starts to cut out within a phonemic inventory. To understand what these results entail for theory, we return to the question of causal processes.

Ultimately, linguistic theory seeks to explain the patterns that can be observed in human language. This endeavor will be aided by a sound knowledge of which mathematical distributions plausibly characterize a given variable

It can be emphasized that in this model of progress, a good understanding of families of stochastic processes will play an important, enabling role. Accordingly, it will be useful to harness advances that have already been made elsewhere. For instance, many discrete systems can be profitably conceptualized in terms of urn processes with characteristically associated distributions. As Kuba and Panholzer (
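As a concrete illustration of how an urn process can generate a skewed frequency distribution, the following sketch (with invented parameters, and not a model drawn from the literature cited) runs a simple "rich get richer" reinforcement scheme of the preferential-attachment type over a set of urns standing in for phoneme categories:

```python
import random

def simulate_urn(n_draws, n_types, seed=0):
    """Preferential attachment in urn form: on each draw, an urn receives a
    new ball with probability proportional to its current contents."""
    rng = random.Random(seed)
    urns = [1] * n_types            # every phoneme type starts with one ball
    for _ in range(n_draws):
        total = sum(urns)
        r = rng.uniform(0, total)
        cum = 0
        for i, balls in enumerate(urns):
            cum += balls
            if r <= cum:
                urns[i] += 1        # "rich get richer" reinforcement
                break
    return sorted(urns, reverse=True)

counts = simulate_urn(n_draws=2000, n_types=25, seed=42)
print(counts[:5], counts[-5:])      # heavily skewed: a few big urns, a long tail
```

Varying the reinforcement rule (for example, adding a constant chance of reinforcing any urn) changes the limiting distribution, which is precisely why matching observed phoneme frequency distributions to families of urn processes is informative.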

Naturally, any observation-matching families of stochastic processes will still need to be related to linguistically plausible causal processes (Cysouw,

Phoneme frequencies undergo constant modification due to changes occurring at multiple levels of linguistic structure, including the phonology proper, morphology, syntax, and the grammar of discourse. Within the phonology itself, fundamental changes include deletions, insertions and changes of phonemic category, all of which will impact phoneme frequency distributions. In an urn model, historical deletions will result in balls being removed from a phoneme's corresponding urn while insertions will add balls. Adding a level of complexity, many phonological changes affect not just one phoneme category, but natural classes of sounds, thus deletions and insertions that apply to natural classes will remove or add balls simultaneously from a non-random, “natural” set of urns.

Phonemic mergers (Hoenigswald,

A phonemic category will map onto multiple actual speech sounds, or phones, and the phonemic categorization of a phone frequently depends on its contextual environment. Consequently, it is not uncommon for multiple types of phonemic changes to occur at once, owing to the fact that when one sound changes, so too does the contextual environment of its neighbors. Under these circumstances, the ubiquity of non-uniform distributions of sounds in various contexts will lead to non-accidental, correlational relationships among these coupled changes.

At the morphological level, changes in the frequency of certain formatives will cause concerted frequency changes in the set of phonemes comprising the formative, though in this case, there is no expectation that the set involved will form a natural class. If the object of study is the discourse frequencies of phonemes, then similar effects will arise when the usage frequencies of words undergo change. Furthermore, lexicons are not closed systems. Words can be borrowed from other languages. The effects of borrowing on phoneme frequencies in the recipient language are a complex matter (Boretzky,

There are implications to be drawn from this, for the explanation of distributions of phoneme frequencies. We turn to these implications below.

In most of the historical changes noted above, it matters not only whether, and how many, balls are moving from one urn to another, but exactly which real-world phonemes the urns themselves represent. Phonemes form natural classes, which may change in tandem. Phonotactic arrangements, which, notwithstanding variation, exhibit strong similarities cross-linguistically, will impact how changes mediated by contexts are correlated. And borrowing, too, operates in reference to phonemes' actual substance. This importance of phonemic substance raises a host of questions, which can be addressed in new and potentially revealing ways within a stochastic-modeling research program. For instance, the frequencies with which various kinds of phonemes are present or absent across the world's languages vary greatly (Maddieson,

Even if we abstract away from issues related to phoneme substance, a successful stochastic model of phoneme frequency evolution may still require a mixture of sub-models, obeying different principles, given the existence of the multiple types of historical processes that affect phoneme frequency: deletion, insertion, splits, mergers, borrowings. Though it might be clear in outline what (some of) these sub-models must be, it is far less clear, empirically speaking, what their quantitative parameterizations are. In what precise mix do these sub-processes occur? How often do events occur independently or in concert? What empirical grounds do we have for estimating these parameters, and what are our levels of uncertainty? In these respects, there exists another interesting role to be played by stochastic modeling studies. We currently lack empirical answers to many of these quantitative questions, but can we reveal limits to what is plausibly compatible with observed phoneme distributions? A related open question, connected to issues raised in the previous paragraph, is to what extent a successful stochastic model will need to reflect empirical complexities, like phonemic substance, natural classes, phonotactics and concerted changes, as points of non-independence within and between sub-models, and to what extent some of these can be ignored with little impact on the outcomes predicted by the model. Whatever the answers may be, we see a productive and interesting field of inquiry lying directly ahead.

We conclude this section by returning to our empirical findings and situating them within the causal reasoning just outlined. Our empirical finding was that the more frequent phonemes in a language's inventory tend to be distributed more in line with a power law, and the less frequent more in line with an exponential distribution. As we mentioned earlier, power law distributions can be stochastically generated by preferential attachment processes. In addition to this, exponential distributions can be stochastically generated by a so-called birth-death process, in which entities arise at some characteristic birth rate λ, and disappear at a characteristic death rate μ. It may be—and here we are speculating, but in a reasoned manner—that as tokens of phones shift between phonemic categories during the kinds of sound change processes we mentioned above, different phonemic categories have different probabilistic propensities to be source categories (i.e., the erstwhile category of the changing phone) and destination categories (its new category after the change). This alone would constitute a birth-death regime (Cysouw,
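The contrast between the two regimes can be made concrete with a minimal simulation sketch (all parameter values, seeds, and category counts below are invented for illustration, not estimates from our data): a rich-get-richer process concentrates probability mass in a few categories, whereas a size-independent birth-death regime keeps category sizes hovering near their mean, yielding a much lighter tail.

```python
import random

def preferential_attachment(n_tokens, n_initial=5, new_cat_p=0.05, seed=1):
    """Each incoming token joins an existing category with probability
    proportional to that category's current size (rich-get-richer), or
    founds a new category with probability new_cat_p."""
    rng = random.Random(seed)
    counts = [1] * n_initial
    for _ in range(n_tokens):
        if rng.random() < new_cat_p:
            counts.append(1)
        else:
            # choices() weights by current counts: the attachment step.
            i = rng.choices(range(len(counts)), weights=counts)[0]
            counts[i] += 1
    return sorted(counts, reverse=True)

def uniform_birth_death(n_steps, n_cats=20, seed=1):
    """Each category gains (birth) or loses (death) single tokens at
    rates independent of its current size, so sizes random-walk around
    their starting value and the tail stays light."""
    rng = random.Random(seed)
    counts = [10] * n_cats
    for _ in range(n_steps):
        i = rng.randrange(n_cats)
        counts[i] = max(1, counts[i] + rng.choice([1, -1]))
    return sorted(counts, reverse=True)

pa = preferential_attachment(5000)
bd = uniform_birth_death(5000)
# The rich-get-richer regime concentrates far more mass in its top category.
print(pa[0], bd[0])
```

Even this toy version reproduces the qualitative signature at issue: the preferential-attachment run ends with one very large category and many small ones, while the size-independent run never strays far from its starting sizes.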

There are likely to be additional valuable, and possibly more tractable, variables to study beyond phoneme frequency distributions alone. In this paper we have focused on phoneme frequencies because they have occupied such a prominent place in the history of investigations of distributions. However, by doing so we do not wish to suggest that phonemes ought to continue to occupy such a prominent place. Firstly, we agree with Piantadosi (

There are many branching paths of research ahead. Nevertheless, some basic principles will remain constant, and these have been the core focus of our contribution here. In this paper we have demonstrated and outlined a template for future work on distributions. Ideally, such work should begin with critical assessment of the links that can be made from existing or new causal hypotheses, including diachronic processes, via stochastic models, to particular distributional outcomes. Subsequently, the fit of the hypothesized distributions to real-world data should be evaluated rigorously using robust statistical methods. Lastly, an attempt must be made to rule out competing distribution types and alternative generative mechanisms. As our investigation demonstrates, this may be challenging, given the inherent limitations of working with small sets of observations. Maintaining a clear-eyed view of these limitations, and using advances already made in allied fields, will help spur this field of inquiry to new, robust insights into the dynamics of phoneme inventories.

The datasets presented in this study can be found in an online repository. The name of the repository and citation details can be found at:

JM-C conceived and designed the study with guidance from ER. ER compiled the study's data. Together, JM-C and ER performed the analysis and interpreted the results. JM-C composed an initial draft of the manuscript and compiled the

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We wish to thank Caleb Everett and Steven Moran for their insightful comments on earlier drafts of the manuscript. Their feedback and guidance helped us to improve the paper considerably, and we appreciate the generosity of their time and effort.

The Supplementary Material for this article can be found online at:

^{1}Within frequentist statistics, there are tests of the null hypothesis that data are drawn from a particular distribution (the Shapiro-Wilk, Kolmogorov-Smirnov, and Pearson's chi-squared tests, to name a few). Within a Bayesian framework (Spiegelhalter,

^{2}The question of whether certain phenomena are characterized best by a power law model or some other distribution can be contentious. See, for example, the debate between Eeckhout (

^{3}For the area under the distribution curve to integrate properly to 1, the power function 1/x^{α} must be multiplied by a normalization constant. The normalization constant will be calculated differently depending on the value of α and whether
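For concreteness, the standard normalizations take the following forms (a hedged reconstruction of the usual continuous- and discrete-case treatment; the symbols C, ζ, and x_min are our notation):

```latex
% Continuous power law, defined for x >= x_min, with alpha > 1:
p(x) = C\,x^{-\alpha}, \qquad C = (\alpha - 1)\,x_{\min}^{\alpha - 1}

% Discrete power law, defined for integers x >= x_min,
% normalized by the Hurwitz zeta function:
p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}, \qquad
\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}
```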

^{4}This is equivalent to the probability that a token selected at random belongs to the

^{5}The dataset closely approximates an exhaustive sample, containing around 80% of Australian languages for which phonemic lexical data is known to exist, and on the order of 40% of the varieties that were spoken at the onset of European colonization.

^{6}There is a significant body of research suggesting that frequencies defined in this manner are implicitly accessible to speakers and thus psychologically real (for example, Coleman and Pierrehumbert,

^{7}The probability that we have failed to observe some phoneme that exists in a language is small, though non-zero. In

^{8}Maximum likelihood estimation (MLE) is a method for estimating the parameters in a statistical model, given some set of observations, by finding the set of parameter values that maximizes the likelihood of those observations, e.g., the exponent α and lower bound x_{min} in a power law model.
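In the continuous case, for example, the exponent estimate has a well-known closed form (a standard result; notation ours):

```latex
\hat{\alpha} = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1}
```

for observations x_i ≥ x_min; for discrete data, a common approximation substitutes x_min − 1/2 for x_min.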

^{9}This is a well-established statistical technique. A large number of simulated datasets are created, with data points drawn from the model power law distribution hypothesized in step 1. Each is then fitted to its own power law model and a KS statistic is calculated for the simulated dataset, relative to this model. The
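The procedure described in this footnote can be sketched as follows (a simplified, continuous-case implementation of this semi-parametric bootstrap; the parameter values, seed, and simplifications are ours for illustration, not those used in the analysis):

```python
import math
import random

def sample_power_law(n, alpha, x_min, rng):
    # Inverse-transform sampling for a continuous power law.
    return [x_min * (1 - rng.random()) ** (-1 / (alpha - 1)) for _ in range(n)]

def fit_alpha(xs, x_min):
    # Continuous-case maximum likelihood estimate of the exponent.
    return 1 + len(xs) / sum(math.log(x / x_min) for x in xs)

def ks_stat(xs, alpha, x_min):
    # Kolmogorov-Smirnov distance between the empirical CDF and the
    # fitted power-law CDF, F(x) = 1 - (x / x_min) ** -(alpha - 1).
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        model = 1 - (x / x_min) ** -(alpha - 1)
        d = max(d, abs((i + 1) / n - model), abs(i / n - model))
    return d

def bootstrap_p(xs, x_min, n_sims=200, seed=0):
    """Share of simulated power-law datasets whose (re-fitted) KS
    statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    alpha_hat = fit_alpha(xs, x_min)
    d_obs = ks_stat(xs, alpha_hat, x_min)
    worse = 0
    for _ in range(n_sims):
        sim = sample_power_law(len(xs), alpha_hat, x_min, rng)
        worse += ks_stat(sim, fit_alpha(sim, x_min), x_min) >= d_obs
    return worse / n_sims

rng = random.Random(42)
data = sample_power_law(500, alpha=2.5, x_min=1.0, rng=rng)
p_value = bootstrap_p(data, x_min=1.0)
```

The key design point is that each simulated dataset is fitted to its own power law before its KS statistic is computed, so the resulting p-value accounts for the flexibility introduced by estimating the exponent from data.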

^{10}Here we follow the method of Clauset et al. (

^{11}We find that 10,000 iterations are sufficient to obtain stable parameter estimates. Beyond 10,000 iterations, estimates will continue to fluctuate, but within a narrowly circumscribed range. Plots of all bootstrapping runs can be viewed in the interactive visualization app provided in

^{12}This generates a relationship where

^{13}Of course, the statistics of graphemes are different from the statistics of phonological segments. As Bloomfield (

^{14}Note, however, that in their statistical tests they do not adjust their significance levels to correct for multiple hypothesis testing.

^{15}The Yule-Simon equation, which Martindale et al. (