Brian Wichmann is the Editor of *Voting matters* and a visiting
professor at The Open University.

An obvious question to raise is whether the information provided in a ballot can somehow be simplified to its essential content. In this paper, a simple model is proposed which appears to capture the essential information in a preferential ballot.

Hence we consider the case with four candidates: Albert, Bernard, Clare and Diana, with the votes cast as follows:

    20 AB
    15 CDA
     4 ADC
     1 B

From this data, we compute the number of occurrences of each pair of successive preferences, adding both a starting position (s) and a terminating position (e). For instance, the number of times a preference for A is followed by B is 20, and the number of times the starting position is 'followed by' A is 20+4=24. The complete table is therefore:

         A    B    C    D    e
    s   24    1   15    0    -
    A    -   20    0    4   15
    B    0    -    0    0   21
    C    0    0    -   15    4
    D   15    0    4    -    0

Obviously, a preference for X cannot be followed by X, resulting in the diagonal of dashes. The entry under e in the row for s is also a dash, since a ballot paper cannot be empty.
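The construction of this table can be sketched in a few lines of Python (the example votes are hard-coded; 's' and 'e' denote the starting and terminating positions):

```python
from collections import Counter

# The example votes: 20 papers marked AB, 15 CDA, 4 ADC and 1 B.
ballots = [("AB", 20), ("CDA", 15), ("ADC", 4), ("B", 1)]

pairs = Counter()
for paper, count in ballots:
    chain = "s" + paper + "e"           # add the start and end markers
    for x, y in zip(chain, chain[1:]):  # successive pairs of preferences
        pairs[(x, y)] += count
```

For instance, `pairs[('s', 'A')]` gives 24 and `pairs[('B', 'e')]` gives 21, in agreement with the table.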

Having now computed this table, we can use it to characterise voting
behaviour. For instance, 24 out of 40, or 60%, of voters gave A as their
first preference. More than this, we can use the table to compute ballot
papers having the same statistical properties. For example, if the first
preference was A, then the row for A shows that the subsequent
preference should be B, D or `e` in the proportions 20:4:15. Thanks
to the fortunately large number of zeros in the table, we can easily compute
the distribution of all the possible ballot papers which can be constructed
in this way. Putting these in decreasing frequency of occurrence we have:

    AB    30.8%  (50.0%)
    A     23.1%
    CDAB  16.9%
    CDA   12.7%  (37.5%)
    C      7.9%
    ADC    6.1%  (10.0%)
    B      2.5%  ( 2.5%)

The figures in brackets are the frequencies from the original data, which can be seen to be quite different.
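The computation of this distribution can be sketched in Python; the pair-count table above is hard-coded (zero entries omitted), and at each step the entries for candidates already used are discarded and the remainder renormalised. The sketch reproduces the percentages above.

```python
from collections import defaultdict

# The pair-count table computed above (zero entries omitted);
# 's' is the starting position and 'e' the terminating position.
table = {
    "s": {"A": 24, "B": 1, "C": 15},
    "A": {"B": 20, "D": 4, "e": 15},
    "B": {"e": 21},
    "C": {"D": 15, "e": 4},
    "D": {"A": 15, "C": 4},
}

dist = defaultdict(float)

def walk(prefix, prob):
    """Enumerate every constructible ballot paper and its probability."""
    last = prefix[-1] if prefix else "s"
    # Discard candidates already used, then renormalise what remains.
    opts = {k: v for k, v in table[last].items() if k not in prefix}
    total = sum(opts.values())
    for nxt, weight in opts.items():
        if nxt == "e":
            dist[prefix] += prob * weight / total
        else:
            walk(prefix + nxt, prob * weight / total)

walk("", 1.0)
# dist["AB"] is 0.6 * 20/39, i.e. 30.8%, and so on down the table above.
```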

A number of points arise from this example:

- The computation of the frequencies needs to take into account which preferences are still valid. For instance, the frequency of the ballots starting AD is 0.6×4/39 = 6.1%; the next preference can then only be C, since the only other option in the table is A, which is invalid because A has already been used.
- The large percentage that plump for A is due to the combination of the large percentage having A as the first preference and the large percentage having A as the last preference, even though plumping for A does not occur on the original ballot papers. One would not expect this to occur in practice.
- In this example, the table seems to contain more information than the original ballot papers. Exactly the opposite would occur in real elections with hundreds or more ballot papers.
- Note that the number of occurrences of A on the ballot papers is both the sum of column A and the sum of row A (which are therefore equal).
- It is clear that the ballot papers constructed in this way do not have the same distribution of the number of preferences as the original data, although the mean number of preferences is similar, but smaller (2.19 for the computed data, 2.45 for the original). Clearly, even when all the original ballot papers give a complete set of preferences, the computed data will occasionally give plumping.
- If the voters voted strictly according to sex (for A and B only, or for C and D only), then this characteristic would be preserved by the model. Similarly, the model does characterise party voting patterns.
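Two of the points above (the equal row and column sums, and the mean number of preferences on the original papers) can be checked mechanically on the worked example; a short Python sketch:

```python
from collections import Counter

ballots = [("AB", 20), ("CDA", 15), ("ADC", 4), ("B", 1)]

# Recompute the pair counts, with start marker 's' and end marker 'e'.
pairs = Counter()
for paper, count in ballots:
    chain = "s" + paper + "e"
    for x, y in zip(chain, chain[1:]):
        pairs[(x, y)] += count

# For each candidate, the row sum and the column sum both equal the
# number of papers on which that candidate appears.
for cand in "ABCD":
    row = sum(v for (x, _), v in pairs.items() if x == cand)
    col = sum(v for (_, y), v in pairs.items() if y == cand)
    appearances = sum(n for paper, n in ballots if cand in paper)
    assert row == col == appearances

# Mean number of preferences on the original papers: 98/40 = 2.45.
mean_original = sum(len(paper) * n for paper, n in ballots) / 40
```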

The conclusion so far is that the model characterises some aspects of voter behaviour, but does not mirror others. However, from the point of view of preferential voting systems, we need to know whether the characterisation influences the results obtained by a variety of STV algorithms. This can be checked by comparing sets of ballot papers constructed by the above process against sets produced by random selection of ballot papers from the original data.

We take the ballot papers from a real election which was to select 7 candidates from 14, namely election R33 from the STV database. From this data, which consists of 194 ballot papers, we construct 100 elections of 25 votes each by (a) producing random subsets of the actual ballots, and a further 100 by (b) the process described above.
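Method (a) is a straightforward random subset; a hypothetical sketch (with a toy stand-in list, since the 194 R33 papers are not reproduced here):

```python
import random

# Stand-in list of ballot papers; the real data are the 194 papers of R33.
papers = ["AB"] * 20 + ["CDA"] * 15 + ["ADC"] * 4 + ["B"]

random.seed(0)  # fixed seed so the sketch is reproducible
# 100 test elections, each a random subset of 25 of the papers.
elections = [random.sample(papers, 25) for _ in range(100)]
```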

For each of the 200 elections we determine four properties as follows:

- Determine if the Condorcet top tier consists solely of the candidate G. This was a property of the actual election.
- Determine if the Meek algorithm elects candidate C. This was a property of the actual election.
- Determine if the ERS hand-counting rules elect candidate N. This was a property of the actual election.
- Determine if Tideman's algorithm elects candidate E. This was not a property of the actual election. Unfortunately, computing the result from this algorithm can be very slow, and hence the result was determined for 50 elections rather than the 100 for the other three cases.

                   Subset  Process  Number
    Condorcet (G)    75      67      100
    Meek (C)         42      34      100
    ERS (N)          56      47      100
    Tideman (E)      14      20       50

I believe that the four properties above are sufficiently independent, and the elections themselves independent enough, to undertake the chi-squared test to see whether the two sets of elections could be regarded as having come from the same population. Passing this test would indicate that the statistical construction process is effective in providing 'election' data for research purposes.

The statistical testing is best done as a separate 2 × 2 table test of each line. The first line, for example, gives the table

             Condorcet (G)   other
    Subset        75           25    100
    Process       67           33    100
                 ---           --    ---
                 142           58    200

The four tables give P = 0.28, 0.31, 0.26 and 0.29 respectively, using a two-tailed test. So far as this test goes, then, there is no significant difference between the two methods.
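A sketch of the first of these tests, assuming the chi-squared statistic with Yates's continuity correction (which reproduces the quoted P = 0.28 for this table); for one degree of freedom the two-tailed P value is erfc(√(χ²/2)):

```python
import math

# The first 2x2 table: Subset and Process elections, classified by
# whether the Condorcet top tier consists solely of G.
a, b = 75, 25   # Subset
c, d = 67, 33   # Process
n = a + b + c + d

# Chi-squared with Yates's continuity correction for a 2x2 table.
chi2 = n * (abs(a * d - b * c) - n / 2) ** 2 / (
    (a + b) * (c + d) * (a + c) * (b + d))

# Two-tailed P value for one degree of freedom.
p = math.erfc(math.sqrt(chi2 / 2))
```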