
Edited by: Claudia Wagner, Leibniz Institut für Sozialwissenschaften (GESIS), Germany

Reviewed by: Bosiljka Tadic, Jožef Stefan Institute (IJS), Slovenia; Haroldo Valentin Ribeiro, Universidade Estadual de Maringá, Brazil

This article was submitted to Interdisciplinary Physics, a section of the journal Frontiers in Physics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore whether word use is similar across languages, and if so, whether these generic features appear at different scales of language structure. Here we use the Google Books

The recent availability of large datasets on language, music, and other cultural constructs has allowed the study of human culture at a level never possible before, opening the data-driven field of

We have previously studied the temporal evolution of word usage (1-grams) for six Indo-European languages: English, Spanish, French, Russian, German, and Italian, between 1800 and 2009 [

To characterize this generic feature of rank dynamics, we have proposed the

In this work, we extend our previous analysis of rank dynamics to

Figure

Rank evolution of

Rank diversity for different languages and _{10}

As Figure

where μ is the mean and σ the standard deviation of the sigmoid, both dependent on language and
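As a concrete illustration, the sigmoid with parameters μ and σ can be modeled as the cumulative of a Gaussian in log_{10}(rank) and fitted by non-linear least squares, as described in section 4. The functional form below is an assumption about Eq. (1), and the data is synthetic, not from the Google Books corpus:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf

def sigmoid(log_rank, mu, sigma):
    """Cumulative Gaussian in log10(rank); assumed form of Eq. (1)."""
    return 0.5 * (1.0 + erf((log_rank - mu) / (sigma * np.sqrt(2.0))))

# Synthetic rank-diversity data for illustration only
rng = np.random.default_rng(42)
log_k = np.linspace(0.0, 4.0, 80)                        # log10 of rank
d_obs = sigmoid(log_k, 2.26, 0.62) + rng.normal(0.0, 0.01, log_k.size)

# Levenberg-Marquardt least-squares fit, as in section 4.3
(mu_fit, sigma_fit), _ = curve_fit(sigmoid, log_k, d_obs, p0=(2.0, 1.0))
```

The recovered (mu_fit, sigma_fit) should lie close to the generating values (2.26, 0.62), mirroring how the tabulated μ and σ are obtained per language and n-gram size.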

Fit parameters for rank diversity for different languages,

Language | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

English | 2.259 | 0.622 | 0.02 | 2.13 | 0.72 | 0.016 | 1.834 | 0.816 | 0.014 | 1.748 | 0.781 | 0.012 | 1.546 | 0.817 | 0.01 | 2.605 | 0.598 | 0.024 |

French | 2.254 | 0.637 | 0.021 | 2.178 | 0.693 | 0.017 | 1.796 | 0.828 | 0.013 | 1.629 | 0.825 | 0.011 | 1.348 | 0.862 | 0.01 | 2.684 | 0.598 | 0.022 |

German | 2.231 | 0.598 | 0.018 | 2.127 | 0.695 | 0.015 | 1.695 | 0.831 | 0.012 | 1.483 | 0.8 | 0.01 | 0.999 | 0.923 | 0.007 | 2.509 | 0.636 | 0.02 |

Italian | 2.197 | 0.636 | 0.018 | 2.016 | 0.726 | 0.014 | 1.63 | 0.836 | 0.011 | 1.23 | 0.944 | 0.009 | 0.945 | 0.954 | 0.007 | 2.53 | 0.627 | 0.019 |

Russian | 2.063 | 0.603 | 0.015 | 1.814 | 0.766 | 0.011 | 1.549 | 0.776 | 0.009 | 1.411 | 0.718 | 0.008 | 1.252 | 0.709 | 0.006 | 2.228 | 0.628 | 0.017 |

Spanish | 2.115 | 0.7 | 0.018 | 2.061 | 0.681 | 0.018 | 1.683 | 0.85 | 0.012 | 1.376 | 0.898 | 0.01 | 1.053 | 0.938 | 0.008 | 2.573 | 0.551 | 0.024 |

In Cocho et al. [

In Figure

Fitted parameters for rank diversity. Parameters μ and σ for the sigmoid fit of the rank diversity

In order to understand the dependence of language use — as measured by

As can be seen in Figure

Rank diversity in the null model. Rank diversity

The amount of structure each language exhibits can be quantified by the

The 2-grams with the highest

^{5} 2-grams in 2008. Each point represents a 2-gram found in the empirical data. The blue dots show the median

After normalizing the results to account for varying total word frequencies between different language datasets, we see that all languages exhibit a similar tendency for the ^{−1/2} (see section 4.2).

Motivated by the observation that some words appear alongside a diverse range of other words, whereas others appear more consistently with the same small set of words, we examine the distribution of next-word entropies. Specifically, we define the

Next-word entropy for different languages and null model. The next-word entropy is calculated for each word using Equation (15). The plotted values are the normalized frequencies (i.e., probabilities) of words whose next-word entropy falls within each bin; the first bin contains values greater than or equal to 0 and less than 1/2, the second contains values greater than or equal to 1/2 and less than 1, and so on.
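A minimal sketch of this computation from 2-gram counts follows; the base-2 logarithm is an assumption (the paper's Eq. 15 may use a different base), and the input dictionary is illustrative:

```python
import math
from collections import defaultdict

def next_word_entropies(bigram_counts):
    """Entropy (in bits; base is an assumption) of the distribution of
    words that immediately follow each word."""
    totals = defaultdict(float)
    for (w1, _w2), f in bigram_counts.items():
        totals[w1] += f
    H = defaultdict(float)
    for (w1, _w2), f in bigram_counts.items():
        p = f / totals[w1]
        H[w1] -= p * math.log2(p)
    return dict(H)
```

A word always followed by the same word has entropy 0, while a word followed uniformly by many different words has high entropy, matching the binned distributions in the figure.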

To complement the analysis of rank diversity, we propose a related measure: the change probability
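One minimal estimator consistent with this idea (a sketch, not necessarily the paper's exact implementation) counts how often the word occupying rank k is replaced between consecutive time steps:

```python
def change_probability(rank_series):
    """For each rank k, the fraction of consecutive time steps at which
    the word occupying rank k changes. `rank_series` is a list over time
    of word lists ordered by rank."""
    T = len(rank_series)
    K = min(len(snapshot) for snapshot in rank_series)
    return [
        sum(rank_series[t][k] != rank_series[t + 1][k] for t in range(T - 1)) / (T - 1)
        for k in range(K)
    ]
```

Stable high-frequency words yield values near 0, while the rapidly churning tail yields values near 1.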

Figure

Change probability for different languages and

Fit parameters for change probability for different languages.

Language | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

English | 1.488 | 0.553 | 0.009 | 1.3 | 0.536 | 0.009 | 0.868 | 0.655 | 0.006 | 0.869 | 0.598 | 0.005 | 0.677 | 0.609 | 0.004 |

French | 1.626 | 0.401 | 0.009 | 1.303 | 0.571 | 0.008 | 0.792 | 0.664 | 0.005 | 0.793 | 0.563 | 0.004 | 0.738 | 0.429 | 0.004 |

German | 1.472 | 0.543 | 0.009 | 1.249 | 0.561 | 0.007 | 0.535 | 0.826 | 0.004 | 0.657 | 0.587 | 0.004 | 0.186 | 0.691 | 0.003 |

Italian | 1.439 | 0.436 | 0.008 | 1.035 | 0.631 | 0.006 | 0.564 | 0.67 | 0.004 | 0.362 | 0.669 | 0.003 | 0.086 | 0.704 | 0.003 |

Russian | 1.204 | 0.574 | 0.006 | 0.774 | 0.714 | 0.005 | 0.772 | 0.559 | 0.004 | 0.692 | 0.491 | 0.004 | 0.518 | 0.516 | 0.003 |

Spanish | 1.48 | 0.355 | 0.009 | 1.283 | 0.558 | 0.009 | 0.532 | 0.761 | 0.005 | 0.398 | 0.777 | 0.003 | 0.062 | 0.826 | 0.003 |

Fitted parameters for change probability. Parameters μ and σ for the sigmoid fit of the change probability

We can define another related measure: the rank entropy
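A sketch of one way to compute such a rank entropy (the normalization by the maximum value log_{2}(T), so that values fall in [0, 1], is an assumption):

```python
import math
from collections import Counter

def rank_entropy(rank_series, k):
    """Entropy of the set of words occupying rank k over time,
    normalized by log2(T) (normalization is an assumption)."""
    words = [snapshot[k] for snapshot in rank_series]
    T = len(words)
    H = -sum((c / T) * math.log2(c / T) for c in Counter(words).values())
    return H / math.log2(T)
```

A rank held by a single word across all time steps has entropy 0; a rank occupied by a different word at every step has entropy 1.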

Rank entropy for different languages and

Fit parameters for rank entropy for different languages.

Language | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} | μ | σ | e^{2} |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

English | 0.741 | 0.892 | 0.01 | 0.619 | 0.913 | 0.009 | −0.288 | 1.294 | 0.003 | −0.454 | 1.332 | 0.003 | −0.276 | 1.169 | 0.004 |

French | 0.863 | 0.848 | 0.012 | 0.398 | 1.077 | 0.01 | −0.521 | 1.395 | 0.002 | −0.464 | 1.302 | 0.002 | −0.494 | 1.207 | 0.002 |

German | 0.799 | 0.859 | 0.01 | 0.176 | 1.182 | 0.007 | −0.609 | 1.403 | 0.002 | −0.405 | 1.195 | 0.002 | −0.434 | 1.052 | 0.001 |

Italian | 0.783 | 0.855 | 0.011 | −0.273 | 1.349 | 0.004 | −0.427 | 1.281 | 0.002 | −0.184 | 1.032 | 0.003 | −0.717 | 1.184 | 0.001 |

Russian | 0.459 | 0.958 | 0.009 | −0.419 | 1.321 | 0.003 | −0.19 | 1.097 | 0.002 | 0.091 | 0.87 | 0.003 | 0.052 | 0.822 | 0.002 |

Spanish | 0.61 | 0.977 | 0.012 | 0.598 | 0.901 | 0.008 | −0.721 | 1.443 | 0.002 | −0.503 | 1.259 | 0.002 | −0.404 | 1.089 | 0.002 |

The μ and σ values are compared in Figure

Fitted parameters for rank entropy. Parameters μ and σ for the sigmoid fit of rank entropy

It should be noted that the original datasets for tetragrams and pentagrams are much smaller than those for digrams and trigrams. Whether this is related to the change of behavior in σ between

Finally, we define the rank complexity

This measure of complexity represents a balance between stability (low entropy) and change (high entropy) [^{2} for ^{3} for
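A function that vanishes at both entropy extremes and peaks at intermediate values captures this balance; a standard choice (assumed here, following common emergence/complexity measures) is C = 4E(1 − E) for normalized entropy E:

```python
def rank_complexity(E):
    """Balance of stability and change: zero at E = 0 (frozen ranks) and
    E = 1 (fully random ranks), maximal at E = 0.5. The form 4*E*(1-E)
    is an assumption, not taken verbatim from the paper."""
    return 4.0 * E * (1.0 - E)
```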

Rank complexity for different languages and

Our statistical analysis suggests that human language is an example of a cultural construct where macroscopic statistics (usage frequencies of

While the alphabet, the grammar, and the subject matter of a text can vary greatly among languages, unifying statistical patterns do exist, and they allow us to study language as a social and cultural phenomenon without limiting our conclusions to one specific language. We have shown that despite many clear differences between the six languages we have studied, each language balances a versatile but stable core of words with less frequent but adaptable (and more content-specific) words in a very similar way. This leads to linguistic structures that deviate far from what would be expected in a random “language” of shuffled 1-grams. In particular, it causes the most commonly used word combinations to deviate further from random than those at the other end of the usage scale.

If we are to assume that all languages have converged on the same pattern because it is in some way “optimal,” then it is perhaps this statistical property that allows word combinations to carry more information than the sum of their parts; to allow words to combine in the most efficient way possible in order to convey a concept that cannot be conveyed through a sequence of disconnected words. The question of whether or not the results we report here are consistent with theories of language evolution [

It should be noted that our statistical analyses conform to a coarse-grained description of language change, which can certainly be performed at a much finer scale in particular contexts [

Apart from studying rank diversity, in this work we have introduced measures of change probability, rank entropy, and rank complexity. Analytically, the change probability is simpler to treat than rank diversity, as the latter varies with the number of time intervals considered (

In Cocho et al. [_{10}

Language core parameters. Upper bound rank log_{10}
Language | log_{10} bound | core words | log_{10} bound | core words | log_{10} bound | core words | log_{10} bound | core words | log_{10} bound | core words | log_{10} bound | core words |
---|---|---|---|---|---|---|---|---|---|---|---|---|

English | 3.503 | 3,182 | 3.57 | 3,716 | 3.465 | 2,918 | 3.311 | 2,047 | 3.18 | 1,514 | 3.801 | 6,322 |

French | 3.528 | 3,371 | 3.563 | 3,657 | 3.452 | 2,829 | 3.279 | 1,899 | 3.071 | 1,178 | 3.881 | 7,601 |

German | 3.426 | 2,668 | 3.517 | 3,288 | 3.358 | 2,279 | 3.083 | 1,212 | 2.844 | 699 | 3.78 | 6,032 |

Italian | 3.47 | 2,952 | 3.468 | 2,936 | 3.302 | 2,006 | 3.117 | 1,308 | 2.853 | 713 | 3.784 | 6,078 |

Russian | 3.269 | 1,858 | 3.346 | 2,218 | 3.101 | 1,261 | 2.848 | 705 | 2.67 | 467 | 3.483 | 3,042 |

Spanish | 3.515 | 3,275 | 3.424 | 2,656 | 3.382 | 2,410 | 3.172 | 1,487 | 2.929 | 850 | 3.675 | 4,728 |
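The tabulated bounds appear to follow directly from the sigmoid fit parameters: taking the upper bound of the core at rank 10^{μ+2σ} reproduces the values above up to rounding (e.g., the first pair for English, μ = 2.259 and σ = 0.622, gives 2.259 + 2·0.622 = 3.503, i.e., roughly 3,180 words). A small sketch under that assumption:

```python
def core_bound(mu, sigma):
    """Upper bound of the language core, assumed at rank 10**(mu + 2*sigma);
    this relation is inferred from the tabulated values, not quoted."""
    log_bound = mu + 2.0 * sigma
    return log_bound, round(10.0 ** log_bound)

# English, first fitted (mu, sigma) pair from the rank-diversity table
log_b, n_words = core_bound(2.259, 0.622)
```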

Our results may have implications for next-word prediction algorithms used in modern typing interfaces like smartphones. Lower ranked

Beyond the previous considerations, perhaps the most relevant aspect of our results is that the rank dynamics of language use is generic not only for all six languages, but for all five scales studied. Whether the generic properties of rank diversity and related measures are universal still remains to be explored. Yet, we expect this and other research questions to be answered in the coming years as more data on language use and human culture becomes available.

Data was obtained from the Google Books ^{1}

where |

where δ(

where

so as to normalize

We first describe a shuffling process that eliminates any structure found within the 2-gram data, while preserving the frequency of individual words. Consider a sequence consisting of the most frequent word repeated a number of times equal to its frequency, followed by the second most frequent word repeated a number of times equal to its frequency, and so on all the way up to the 10,913
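The shuffling procedure can be sketched as follows; the vocabulary and frequencies are illustrative, and the fixed seed is only for reproducibility:

```python
import random
from collections import Counter

def shuffled_bigrams(unigram_freqs, seed=0):
    """Destroy 2-gram structure while preserving 1-gram frequencies."""
    # Lay out each word as many times as its frequency, then shuffle
    tokens = [w for w, f in unigram_freqs.items() for _ in range(f)]
    random.Random(seed).shuffle(tokens)
    # Read off consecutive pairs as the null-model 2-grams
    return Counter(zip(tokens, tokens[1:]))

null_counts = shuffled_bigrams({"the": 4, "of": 2, "cat": 1})
```

Comparing empirical 2-gram counts against counts from such shuffled sequences isolates the structure that word order contributes beyond raw word frequencies.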

We now derive an expression for the probability that a 2-gram will have a given frequency after shuffling has been performed. Let f_{i} denote the number of times the word w_{i} appears, and f_{ij} the number of times the 2-gram w_{i}w_{j} appears. We seek the probability P(f_{ij}) that w_{i}w_{j} appears exactly f_{ij} times in the table. We can think of P(f_{ij}) as the probability that exactly f_{ij} occurrences of w_{i} are immediately followed by w_{j}. Assuming f_{i} < f_{j}, f_{ij} is determined by f_{i} independent Bernoulli trials with the probability of success equal to the probability that the next word will be w_{j}, namely f_{j}/N.

This distribution meets the condition that allows it to be approximated by a Poisson distribution, namely that the number of trials f_{i} is large while the success probability f_{j}/N is small,

where

λ_{ij} = f_{i}f_{j}/N

is the mean, and also the variance, of the distribution of values of f_{ij}.
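Under this Poisson approximation, standardizing an observed 2-gram count against the null model reduces to a few lines (a sketch with illustrative numbers; symbol names follow the derivation above):

```python
import math

def bigram_zscore(f_ij, f_i, f_j, N):
    """z-score of an observed 2-gram count against the shuffled null model."""
    lam = f_i * f_j / N          # Poisson mean (and variance) after shuffling
    return (f_ij - lam) / math.sqrt(lam)

# A 2-gram observed exactly as often as the null model predicts scores 0
z0 = bigram_zscore(10, 100, 100, 1000)   # lam = 100*100/1000 = 10
```

Positive scores flag word pairs that co-occur far more often than chance, which is where linguistic structure concentrates.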

For each 2-gram we calculate the z-score z_{ij} by subtracting the mean of the null distribution and dividing by the standard deviation,

In other words, the

To compare _{i} = _{j} = _{ij} =

so an upper bound exists at _{i} = _{j} ≈ _{ij} =

We thus define the normalized

To understand how the _{i, j} = _{j}, so Equation (13) reduces to

Now consider only the subset of 2-grams that start with _{i} is constant within this subset, we have _{j} is the rank of

Unlike in other parts of this study, the shuffling analysis is applied to the 10^{5} lowest ranked 2-grams.

The relationship between rank and

The curve fitting for rank diversity, change probability, and rank entropy was performed with the SciPy and NumPy packages using the non-linear least squares method (Levenberg-Marquardt algorithm). For rank entropy, we average data over every ten ranks, _{10}(_{i}). As with rank entropy, a sigmoid (Eq. 1) is fitted for log_{10}(_{10}(

where _{i} and _{i} is the real value of _{i}). For

All authors contributed to the conception of the paper. JM, EC, and SS processed and analyzed the data. EC and GI devised the null model. CP and EC made the figures. EC, GI, JF, and CG wrote sections of the paper. All authors contributed to manuscript revision, read and approved the final version of the article.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We appreciate useful comments from the reviewers which improved the presentation of the results.

The Supplementary Material for this article can be found online at:

^{1}