first draft: January 26, 2008.
revised:     January 30, 2008.



First and last words - on word lengths.

Someone [who?] noticed that the first words on every line in the Voynich Manuscript (VMs) are on average longer than the words in other positions. That looks strange, because the line is an arbitrary part of a subparagraph. Here is a table of the average wordlengths of the words in a line from the VMs. The first column gives the wordlengths by position counted from the start of the line; the second column gives them counted backwards from the last word in the line. Only the first and last 5 positions are given; the remaining positions show no surprises. The transcription used is the EVA majority vote; I will deal with other transcriptions later.

Table 1. Average wordlengths in lines from VMs (EVA)
         overall average wordlength 4.98
====================================================
position  1: 5.50  - position from end  1: 4.49  
position  2: 4.94  - position from end  2: 4.89  
position  3: 5.16  - position from end  3: 4.88  
position  4: 5.07  - position from end  4: 4.98  
position  5: 5.05  - position from end  5: 5.07  
etc...
=====================================================

Indeed the first word is on average half a letter longer!  
One explanation could be word wrap: nearing the end of a line longer words are less likely to fit into the remaining space. So longer words are more likely moved to start the next line. That is if they are not broken off.

From the table another thing strikes me as strange: the last words are on average almost half a letter shorter! Would the same line of reasoning hold, in that shorter words are more likely to fit in the remaining space at the end of a line? Or are they parts of words broken off at the end of a line? I will get to that later.

Suppose that the first word effect is caused by word wrap. Then the words at the start and end of a paragraph should be excluded, since they are not affected by it. The next table shows the corrected result:

Table 2. Average wordlengths in lines from VMs (EVA)
         exclude first and last words of paragraphs.
         Average wordlength 4.99
====================================================
position  1: 5.28  - position from end  1: 4.44  
position  2: 4.98  - position from end  2: 4.94  
position  3: 5.21  - position from end  3: 4.93  
position  4: 5.11  - position from end  4: 5.03  
position  5: 5.07  - position from end  5: 5.10  
etc...
====================================================

Still the first word is on average longer, but now only by 0.3 letter. The reduction is due to the fact that the first words of paragraphs, now excluded, are on average 1.2 letters longer. The last word is now shorter by about 0.5 letter.

To see what the cause of the effect might be I performed the following test. Put the whole manuscript text into one long line, skipping labels and circular text. Then write that line in an 80-column-wide document, starting a new line whenever a word does not fit in the remaining line space (word wrap). In this way the VMs text is regrouped into lines where the first words are likely not systematically different from the others. The resulting first and last word average lengths are found in the following table:

Table 3. Average wordlengths in lines from regrouped VMs  
         Average wordlength 4.96
         (each entry: position, number of words counted, average wordlength)
=========================================================
position  1 2454  5.41  - from end  1: 2454  5.07  
position  2 2454  5.04  - from end  2: 2454  5.05  
position  3 2454  4.97  - from end  3: 2454  4.96  
position  4 2454  5.03  - from end  4: 2454  4.91  
position  5 2454  4.95  - from end  5: 2454  5.03  
etc...
==========================================================

The first words are still longer! It is clear that this effect is due to the applied word wrap. Also clear is that the last word effect completely disappeared. One can see that intuitively: whenever a word does not fit, the word before becomes the last word. Assuming their lengths to be independent, the word before follows the overall length distribution.  
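
The regrouping test described above can be sketched in Python. This is only a minimal sketch and not the program used for Table 3: the names wrap_words and position_means are made up for this illustration, the toy sentence stands in for the transcription, and I assume that a position is averaged over all lines that are long enough to have that position.

```python
from statistics import mean

def wrap_words(words, width=80):
    """Greedy word wrap: a word that does not fit starts the next line."""
    lines, current, used = [], [], 0
    for word in words:
        # Columns needed if the word is appended to the current line
        # (one separating space, except at the start of a line).
        needed = used + 1 + len(word) if current else len(word)
        if current and needed > width:
            lines.append(current)
            current, needed = [], len(word)
        current.append(word)
        used = needed
    if current:
        lines.append(current)
    return lines

def position_means(lines, n=5):
    """Average wordlength at positions 1..n from the start and from the end."""
    from_start = [mean(len(ln[i]) for ln in lines if len(ln) > i) for i in range(n)]
    from_end = [mean(len(ln[-1 - i]) for ln in lines if len(ln) > i) for i in range(n)]
    return from_start, from_end

# Toy input standing in for the transcription (an assumption of this sketch).
words = "the quick brown fox jumps over the lazy dog".split()
lines = wrap_words(words, width=15)
print(position_means(lines, n=3))
```

For the real test the list of words would be the whole transcription in reading order, with width=80.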

A first conclusion:  

Words at the start of a line in the VMs are about 0.3 letter longer, but that is to be expected from starting new lines.  

Words at the end of a line in the VMs are about 0.5 letter shorter than expected. There are at least two possibilities: these words are broken off, or these words were shortened.  


But let us first see if the effects can be found in other texts as well. A test on Tom Sawyer (Mark Twain), skipping the conversations, gave this result:

Table 4: Tom Sawyer (Mark Twain)
         # words: 64249  # lines: 4917  
         Total average wordlength:  4.238
==================================================
position  1: 4.92  - position from end  1: 4.15  
position  2: 4.21  - position from end  2: 4.27  
position  3: 4.25  - position from end  3: 4.33  
position  4: 4.32  - position from end  4: 4.25  
position  5: 4.28  - position from end  5: 4.24  
etc...
===================================================

Indeed it shows the first word effect. It is even bigger than in the VMs: 0.7 letter.
However, there is almost no last word effect! In that text there were no words broken off at the end of the lines.


To stick to the same word length distribution as the VMs, I generated a random text about the same size as the VMs. From Stolfi's web page(1) I estimate that a Poisson distribution with lambda=4, shifted +1 is a good approximation. The resulting first and last word averages are given in the next table:

Table 5. Average wordlengths in lines from "Poisson" text.  
         Average word size 5.0,  
         number of words: 40000, width 80 columns.
=========================================================
position  1: 5.72  - position from end  1: 4.95  
position  2: 5.03  - position from end  2: 5.04  
position  3: 5.07  - position from end  3: 5.01  
position  4: 5.04  - position from end  4: 5.00  
position  5: 5.03  - position from end  5: 5.03  
==========================================================

The first word effect is as pronounced as in Tom Sawyer. I did the same test with a completely different word length distribution (triangular) and got almost the same result. This suggests that the longer first words are a "law of nature".

By the way, with a few simple assumptions one can show that the actual average first word length for this Poisson distribution is about 5.67, which is quite close. If you are interested, you can find the derivation in the mathematical discussion below.
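
A simulation of this kind can be sketched as follows. It is a minimal sketch under my own assumptions, not the program behind Table 5: the function names and the seed are invented for the illustration, the Poisson tail is cut off at 40, and only the first and last word of each completed line are recorded.

```python
import math
import random
from statistics import mean

LAM, SHIFT, WIDTH, N_WORDS = 4, 1, 80, 40000

def shifted_poisson_lengths(n, lam=LAM, shift=SHIFT, seed=1):
    """Draw n wordlengths from Poisson(lam) + shift (tail cut off at 40)."""
    ks = range(41)
    weights = [math.exp(-lam) * lam**k / math.factorial(k) for k in ks]
    rng = random.Random(seed)
    return [k + shift for k in rng.choices(ks, weights=weights, k=n)]

def first_and_last_lengths(lengths, width=WIDTH):
    """Word-wrap a stream of wordlengths; return first and last length per line."""
    firsts, lasts = [], []
    used, prev = 0, None
    for length in lengths:
        if prev is None or used + 1 + length > width:
            if prev is not None:
                lasts.append(prev)   # the word before the wrapped one ends a line
            firsts.append(length)
            used = length
        else:
            used += 1 + length
        prev = length
    # Drop the very first word of the text: it was not moved by word wrap.
    # The unfinished final line contributes no last word.
    return firsts[1:], lasts

lengths = shifted_poisson_lengths(N_WORDS)
firsts, lasts = first_and_last_lengths(lengths)
print(round(mean(lengths), 2), round(mean(firsts), 2), round(mean(lasts), 2))
```

The three printed numbers are the overall, first word and last word averages; with a different seed they will vary slightly.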



A few things remain. First of all, does this effect in the VMs depend on the transcription chosen? I don't think so. It is a matter of space. Think of the words in centimeters. Irrespective of the transcription, a word that is long in centimeters is more likely to have to go to the next line than a word that is short in centimeters. Of course the average wordlength would be different, but still the first words would be about 10% longer. Also the width of spaces relative to the character width might change. The final part of the mathematical analysis below could easily be adapted to incorporate this.


Then there are the often very regular right margins in the VMs. I must say they are far more regular than is to be expected from whole words moving to the next line. One explanation is the following: the text was first written as a draft with irregular right margins. Then the copyist stretched the words a bit when nearing the end of a line, where needed. There are more signs that the text was copied from a draft: around plants it all fits very well, and the circular texts fit perfectly.



There remains the effect of the last words being half a letter shorter!

At first I thought that could come from the last word being broken off (in two). But that does not seem likely: for each broken word part on the current line, the remaining (short) word part goes to the next line. The part on the current line would indeed reduce the last word length. The part on the next line would do the same, and reduce the first word length by as much.  

From Table 3 (the regrouped VMs) we find that word wrap alone would make the first word 0.45 longer. From Table 2, if the last words are indeed broken off, the breaking takes about 0.5 off the average last word (4.99 against 4.44). If words are on average broken in the middle, that should also take about 0.5 off the first word length, leaving practically no first word effect! The effect is however 0.3 (Table 2). So breaking off seems unlikely.


Another possibility is: at the end of a line, when a word would not fit, it is shortened instead of going to the next line. That new line would now start with a word of "merely" average length. To see if the statistics allow for this, an example:

Suppose one out of four of the end words are shortened by two letters, that is 0.5 letter on average, then:  
One out of four: the first word next line is of average size.  
Three out of four: word wrap is applied, and the first word is 5.45 on average (5.00 plus the 0.45 word wrap effect from Table 3).  
The first word effect in this example is the weighted average of 5.00 and 5.45, being 5.34.  
This shows that the 5.28 found in the VMs (Table 2) is possible with shortened last words, albeit with numbers different from those assumed in this example.
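
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the weighted average used above; the names are invented for the illustration, and the 5.00 and 5.45 are the rounded values assumed in the example, not new measurements.

```python
AVERAGE = 5.00        # overall average wordlength (rounded)
WRAPPED_FIRST = 5.45  # first word average under pure word wrap, as in the example
OBSERVED = 5.28       # first word average found in Table 2

def first_word_average(shortened_fraction):
    """Weighted mean of line starts after a shortened word and after a wrap."""
    return shortened_fraction * AVERAGE + (1 - shortened_fraction) * WRAPPED_FIRST

example = first_word_average(0.25)

# Fraction of shortened end words that would reproduce OBSERVED exactly.
needed_fraction = (WRAPPED_FIRST - OBSERVED) / (WRAPPED_FIRST - AVERAGE)

print(f"{example:.4f}", f"{needed_fraction:.2f}")
```

Under these assumptions the example gives 5.3375, which rounds to 5.34, and a shortened fraction of about 0.38 would give exactly 5.28. That fraction is only an illustration of the model, not an estimate of what happened in the VMs.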

The idea of shortened or broken-off words in the VMs is strengthened by the observation that the last words of paragraphs are on average not shorter. They are not affected by word wrap.

To do: A simulated break-off and shorten test might give more insight into the statistical possibilities.



A third cause has to be considered: long words are more likely to be preceded by short words, e.g. articles and prepositions. Whenever a long word has to go to the next line, a shorter (than average) word remains at the end of the current line. That looks very promising, but the test on the regrouped VMs (Table 3) puts an end to that possibility. In regrouping I left the word order the same, so the effect would have to show up in that test as well. Since it did not, we have an indication that longer words are not preceded by on average shorter words. A quick test on all words, calculating the average length of the words preceding a word of a certain length, confirms this:

Table 6. Average length of words  
         preceding a word of given length
==========================================
length: 1 - previous word  4.21 -   867
length: 2 - previous word  4.57 -  2573
length: 3 - previous word  4.77 -  4130
length: 4 - previous word  4.81 -  7868
length: 5 - previous word  4.99 - 10737
length: 6 - previous word  5.20 -  8099
length: 7 - previous word  5.34 -  4821
length: 8 - previous word  5.34 -  1846
length: 9 - previous word  5.28 -   628
length:10 - previous word  5.53 -   163
===========================================

The last column is the number of words in that length class.
The table shows rather the opposite. There is a very strong positive correlation between the lengths of two successive words. This is another striking feature of the VMs, since languages like Dutch and English show a small negative correlation.
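
A count like the one in Table 6 can be sketched as follows. This is a minimal sketch, not the program used for the table: the function name is made up, the toy word list stands in for the transcription, and I assume that pairs of successive words are counted across line ends as well.

```python
from collections import defaultdict
from statistics import mean

def previous_length_by_length(words):
    """Map each wordlength to (average length of the preceding word, count)."""
    groups = defaultdict(list)
    for prev, word in zip(words, words[1:]):
        groups[len(word)].append(len(prev))
    return {k: (mean(v), len(v)) for k, v in sorted(groups.items())}

# Toy input standing in for the transcription (an assumption of this sketch).
table = previous_length_by_length("a bb ccc bb a".split())
for length, (avg_prev, count) in table.items():
    print(f"length:{length:2d} - previous word {avg_prev:5.2f} - {count:5d}")
```

If the averages rise with the length class, as they do in Table 6, successive wordlengths are positively correlated.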



Conclusion.  

The first words on each line in the VMs are on average 0.3 EVA letter longer than words in other positions. From a test on regrouped VMs words, actual languages and simulated texts this can be fully explained as something that happens to every text, where words are moved to start the next line, whenever they don't fit at the end (word wrap). Longer words are more likely moved that way.  

The last word on each line in the VMs is on average 0.5 EVA letter shorter than words in other positions. Tests show no explanation from just the "physical" ordering of the words. Statistics show that it is unlikely the result of breaking the words off (in two parts) at the end of a line, where the second part starts the new line. Statistics allow that shortened last words are the cause.

The result supports the idea that the VMs is not a hoax, in the sense of a hoax whose "words" are meaningless, just a row of characters. If a hoaxer had produced a text and written it using word wrap, of course he would get the higher first word average. But the conclusion above indicates that words at the end of a line were often shortened. If the shortened words can be recognized for what they stand for, shortening makes sense, but for a meaningless row of characters it does not.  

The result allows for the VMs to be a transcribed language (be it artificial or not). It allows for it to be a coded language, but then likely coding word by word, leaving the relative word lengths untouched, including the shortened words. This to explain the effects found.

A test on the regrouped VMs suggests that longer words are not preceded by on average shorter words, e.g. prepositions and articles. On the contrary, a quick count shows a positive correlation between successive lengths.

Ger Hungerink


(1) Stolfi:
http://www.dcc.unicamp.br/~stolfi/voynich/00-12-21-word-length-distr/


====================================================================

Mathematical discussion:

Solving the problem "forwardly" is extremely complicated. That is: take all distributions into account that force a word wrap at the end of a line of e.g. 80 characters. Calculate the expected length of the word that does not fit at the end.

I have attacked the problem "from behind". Look at a very long line of text. Choose a point at random in that text to be the margin. It is clear that  
1) longer words are more likely "hit" (proportional to their size plus one space), and that  
2) more frequent words are more likely hit (proportional to their frequency).

This is an approximation if we apply it to finite lines. But for lines of sufficient length (more than a few words) the approximation is extremely close, as the simulations show.
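
The "from behind" idea can be tried out directly: lay the words on one long line, drop margins at random columns and record which word each margin hits. The following is a minimal sketch under my own assumptions (invented function name, arbitrary seed, Poisson tail cut off at 40, one space after every word), not a program used for the results in this text.

```python
import bisect
import math
import random
from itertools import accumulate
from statistics import mean

def hit_word_lengths(lengths, n_margins, rng):
    """Drop random margins on one long line; return the length of each word hit."""
    # Word i occupies its letters plus one space; ends[i] is where it stops.
    ends = list(accumulate(length + 1 for length in lengths))
    hits = []
    for _ in range(n_margins):
        column = rng.randrange(ends[-1])  # random character column
        hits.append(lengths[bisect.bisect_right(ends, column)])
    return hits

rng = random.Random(2008)
ks = range(41)  # the Poisson tail beyond 40 is negligible for lambda=4
weights = [math.exp(-4) * 4**k / math.factorial(k) for k in ks]
lengths = [k + 1 for k in rng.choices(ks, weights=weights, k=40000)]
hits = hit_word_lengths(lengths, 20000, rng)
print(round(mean(lengths), 2), round(mean(hits), 2))
```

The first number is the plain average wordlength, the second the average length of the words hit by a margin. The second should come out clearly larger, because long and frequent lengths are hit more often.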



The mathematics.

Assume the wordlengths k to follow a Poisson distribution P(k) with parameter lambda (denoted x), shifted by +1.  
Derivatives are denoted by ()' ()"

Poisson:   P(k)= exp(-x).x^k/k!  k=0,1,...
Poisson+1: P1(k+1)= exp(-x).x^k/k!  k=0,1,...

Sum of P(k): SUM exp(-x).x^k/k!  
           = exp(-x).SUM x^k/k!  
           = exp(-x).exp(x)     = 1

Expectation P(k)= SUM k.exp(-x).x^k/k!  
          = exp(-x).SUM k.x^k/k!        
          = exp(-x).SUM x.(x^k/k!)'    
          = exp(-x).x.exp(x)           = x

Expectation P1(k)= SUM (k+1).exp(-x).x^k/k!            
                 = exp(-x).SUM (k+1).x^k/k!            
                 = exp(-x).[SUM k.x^k/k! +SUM x^k/k!] = x+1




A word will have to go to the next line, whenever the right margin "happens" to cross that word.

Assume the right margin to fall randomly into a length k such that
1) the probability is proportional to k.
2) the frequency of length k is proportional to P(k).

Each chance is proportional to  k.P(k).
SUM k.P(k) is the total of proportions k=0,1,... and equals x.

The probability for the margin M(k) to fall into length k equals:
M(k)= k.P(k)/[SUM k.P(k)] = k.P(k)/x

The average size of the length the margin will fall in, equals:
Expectation M(k) = SUM k.k.P(k)/x

Calculate SUM k^2.P(k)
        = SUM k^2.exp(-x).x^k/k!  
        = exp(-x).SUM k^2.x^k/k!

using     SUM k^2.x^k/k! - SUM k.x^k/k!  
       =  SUM k.(k-1).x^k/k!             
       =  x^2.SUM (x^k/k!)"           = x^2.exp(x)

you will find: SUM k^2.P(k)
             = exp(-x).[x^2+x].exp(x) = x^2+x

Expectation M(k) = SUM k.k.P(k)/x  
                 = [x^2+x]/x          = x+1




Nearly there...  

This is the expected length k the right margin will fall into, if k has a Poisson distribution P(k).

We want the lengths to follow a shifted Poisson distribution P1(k).
On top of that, each word is separated by a space of length 1 from the previous word. The margin will have to fall into a length consisting of the wordlength k+1 plus space, total length k+2, such that:

1) the probability is proportional to k+2 (word +shift +space).
2) the frequency of length k+2 is proportional to P(k), k=0,1,....

Each chance is proportional to  (k+2).P(k).
SUM (k+2).P(k) is the total of proportions k=0,1,... and equals x+2.

The probability for the margin M(k) to fall into length k plus space equals:
M(k)= (k+2).P(k)/[SUM (k+2).P(k)] = (k+2).P(k)/(x+2)

The average size of the length plus space the margin will fall in, equals:
Expectation M(k) = SUM (k+2).(k+2).P(k)/(x+2)
                 = SUM (k^2 +4k +4).P(k)/(x+2)
                 = [SUM k^2.P(k) +SUM 4k.P(k) +SUM 4.P(k)]/(x+2)
                 = [ x^2+x +4.x +4 ]/(x+2)
                 = [x^2 +5x +4]/(x+2)
                 = x+2 + x/(x+2)

Taking off the space again, leaves for the average first word length in word wrap:

                   x+1 + x/(x+2)


We're there.  

Now let's see what this means for our Voynich simulation with lambda (i.e. x) =4 and shift +1.  
The word length will on average be 4+1 = 5
The margin will fall on average in lengths 4+2 +4/6 = 6.67
Whenever this happens that word goes to the next line as initial word.
Taking off the space that leaves 6.67-1 or 5.67 as the average first word length.
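
The closed form x+1 + x/(x+2) can be checked numerically by summing the series directly. This is a small sketch for verification only; the function name and the cut-off kmax are my own choices for the illustration.

```python
import math

def size_biased_first_word(lam, shift=1, space=1, kmax=60):
    """Mean wordlength when the chance of being hit is proportional to
    (wordlength + space), for wordlengths following Poisson(lam) + shift."""
    pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax)]
    total = sum((k + shift + space) * p for k, p in enumerate(pmf))
    weighted = sum((k + shift) * (k + shift + space) * p for k, p in enumerate(pmf))
    return weighted / total

x = 4
print(size_biased_first_word(x), x + 1 + x / (x + 2))
```

Both numbers should agree to many decimals, and for x=4 they are about 5.67.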

Ger Hungerink
January 26, 2008.