A computer program is used to simulate transmission of an ancient book through manual
copying. ^{}[1] Processes to mimic import, correction, loss, and recovery of instances of the
text are incorporated, with import and correction processes allowing textual mixture to occur.
Program behaviour is governed by configuration parameters which control the processes and
describe population centres that generate demand for the book and produce virtual copies. A
*trend* parameter sets the degree to which each centre adheres to its own
textual preferences. Simulation results produced using one set of plausible
parameters are compared with actual data from the Gospel of Mark. A relatively large value of
the trend parameter, corresponding to a strong parochialism within
copying centres, is required to produce results which resemble the actual data. This finding
is not inconsistent with a theory of local texts where each major centre of text production
had a strong tendency to propagate its own flavour of the text.

The following terms are used in relation to the virtual texts produced by the simulation:^{}[2]

- case
An instance of the text represented as a set of characters; also known as a witness.

- character
A place where the text varies; also known as a variation site.

- state
One of the possible states of a character; also known as a reading. In a real textual tradition, states are alternative words or phrases; in this simulation they are numerals.

The program uses stochastic processes to produce virtual copies of an initial text through a series of annual cycles. Each copy of the text is represented by a string of numerals. Every time a copy is made, there is a chance that elements of the text will change state. As a result, copies diverge from the initial text, which is a string composed entirely of ones.
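The copying-with-mutation idea can be sketched in a few lines. This is a Python illustration rather than the program's own R code; the function name `make_copy` and the flat choice among states are assumptions made for this sketch only:

```python
import random

def make_copy(exemplar, n_states, p_edit, rng):
    """Copy a text (a list of numeral states); each character has a
    chance p_edit of being edited to one of its possible states."""
    copy = []
    for state, max_state in zip(exemplar, n_states):
        if rng.random() <= p_edit:
            state = rng.randint(1, max_state)   # flat choice for this sketch
        copy.append(state)
    return copy

rng = random.Random(42)          # same seed, same sequence of copies
initial = [1] * 10               # the initial text: all ones
n_states = [2] * 10              # two possible states per character
copy = make_copy(initial, n_states, p_edit=0.1, rng=rng)
```

Repeated application of such a function to its own output is what makes copies drift away from the initial text over successive generations.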

The annual increment in the population of copied texts is governed by the logistic growth equation:

$$\Delta demand = growth \times demand \times \left(1 - \frac{demand}{K}\right)$$

Here, *growth* is the growth rate, *demand* is the
number of copies required in a place, and *K* is the maximum demand. This
equation initially produces exponential growth which declines to stasis (zero growth) as
demand approaches its maximum.
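As a sketch (in Python rather than the program's R), the annual increment can be computed and iterated like this; the starting demand and parameter values are arbitrary illustrations:

```python
def annual_increment(demand, growth, K):
    """Logistic increment: near-exponential when demand << K, zero at K."""
    return growth * demand * (1 - demand / K)

# Demand approaches the maximum K over successive years.
demand = 10.0
for year in range(200):
    demand += annual_increment(demand, growth=0.1, K=300.0)
```

After enough years demand sits just below *K*, reproducing the stasis described above.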

The program imitates stochastic processes by generating pseudo-random numbers (random
variates) that conform to particular probability distributions. Although the numbers seem to
be random, they are actually generated by a deterministic process which depends on an initial
*seed* value. Consequently, the same sequence of numbers will be
generated every time a particular seed is used.

The program generates two kinds of random variate: uniform and Zipfian. The first kind conforms to the uniform distribution and is therefore equally likely to be any number within a certain range (typically zero to one). A number of operations in the program use uniform variates to randomly select items from a set. Given a probability of selection p, a uniform variate between zero and one is generated for each item. If the variate is less than or equal to p then the corresponding item is selected.
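A minimal sketch of this selection rule (Python; the helper name `select_items` is an assumption):

```python
import random

def select_items(items, p, rng):
    """Select each item independently: keep it when a uniform variate
    in [0, 1) is less than or equal to the probability p."""
    return [item for item in items if rng.random() <= p]

rng = random.Random(1)
chosen = select_items(list(range(1000)), p=0.2, rng=rng)   # expect about 200
```

Because each item is tested independently, the number selected varies from run to run about the expected value `p * len(items)`.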

The second kind of random variate conforms to the Zipf distribution. A Zipfian variate has an integral value somewhere in the range of one to a maximum number m (i.e. 1, 2, 3, ..., m). The probability that one of these integers will be selected is inversely related to its rank (i.e. position in the list):

$$p(r) = \frac{k}{r^{n}}$$

Here, *k* is a normalization constant chosen so that the sum of
probabilities from one to m equals one, *r* is the item's rank, and
*n* is a positive exponent. The larger the value of the exponent, the
more pronounced is the tendency to select integers near the head (i.e. beginning) of the list.
To illustrate, the following table shows the results of generating one hundred Zipfian
variates with seven possible states using various exponent values:

**Table 1. Zipfian variates (100 trials, 7 states, various exponents)**

| Rank | Frequency (exp. = 0) | Frequency (exp. = 1) | Frequency (exp. = 2.5) |
|---|---|---|---|
| 1 | 13 | 44 | 80 |
| 2 | 9 | 23 | 12 |
| 3 | 13 | 13 | 3 |
| 4 | 14 | 7 | 3 |
| 5 | 17 | 6 | 1 |
| 6 | 18 | 5 | 0 |
| 7 | 16 | 2 | 1 |

If the exponent is zero then there is no tendency to select items near the head of the list; that is, the variates conform to the uniform distribution, each integer being equally likely to be chosen. The generated variates of the zero exponent column illustrate an important point: even though there is a particular (in this case, equal) probability of selection for each item in the list, the actual number generated varies from trial to trial because of the stochastic nature of the generating process. In the limit of a large number of trials, the average number of times each list item is chosen will tend toward the expected value, which in this case is 100/7 (about 14).

Distributions produced by exponent values near one match various real world phenomena such as word frequencies in a natural language corpus or book borrowing frequencies in a library: the most popular item is chosen about twice as often as the second, about three times as often as the third, and so on. Larger exponent values increase the tendency to select items near the head of the list. Regardless of the exponent, generated variate frequencies will vary about expected values due to the stochastic nature of the generation process.

To choose an item from a list so that items near the head of the list are preferred, a Zipfian variate is generated whose maximum possible value equals the length of the list. The variate's value (which is always an integer) is then used to select the corresponding item in the list.
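This head-preferring selection can be sketched as follows (a Python illustration using inverse-CDF sampling over explicitly normalized weights; the generator in `sim.r` may be implemented differently):

```python
import random

def zipf_variate(m, exponent, rng):
    """Integer in 1..m chosen with probability k / rank**exponent
    (k normalizes the probabilities to sum to one)."""
    weights = [1.0 / r ** exponent for r in range(1, m + 1)]
    u = rng.random() * sum(weights)     # inverse-CDF sampling
    cumulative = 0.0
    for rank, w in enumerate(weights, start=1):
        cumulative += w
        if u <= cumulative:
            return rank
    return m                            # guard against rounding

rng = random.Random(7)
# Choose an item from a list, preferring items near the head.
places = ["Syria", "Egypt", "Asia Minor", "West"]
chosen = places[zipf_variate(len(places), exponent=1.0, rng=rng) - 1]
```

With an exponent of zero every weight is one, so the variate degenerates to the uniform case described above.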

The program comprises a number of components: a *Book* represents
a case; a *Conf* carries configuration data; a *Domain*
represents a collection of places (e.g. the Roman Empire); and a *Place*
represents a region within the domain (e.g. a Roman province such as Syria).^{}[3] Each *Place* has two collections: *current*,
corresponding to accessible cases; and *lost*, corresponding to cases which
have become inaccessible.

A single *Domain* is created which contains a number of
*Places*. The *Domain* then goes through the following
phases:

- prime
Grow demand for copies from the start to publication dates using the logistic growth equation to increment demand each year.

- publish
Introduce the initial text at a specified place and date. The initial text is assigned an identifier which indicates the place of origin and publication date (e.g. `Iy.65`). Its characters all have an initial state of `1`.

- propagate
Propagate copies from the publication to end dates, again using logistic growth to increment demand each year. The annual propagation cycle is described in more detail below.

- recover
Recover a proportion of copies.

- write
Write recovered copies to an output file. The output is formatted as a data matrix which gives the identifier and set of character states for each case.

For each year between publication and end dates, the program executes a propagation cycle
for every *Place*. The propagation cycle consists of a sequence of steps
comprising zero or more of the following operations:

- import
Select and import a case from another place. One effect of the import operation is to allow the text to migrate from the place of publication to other places included in the domain.

- copy
Choose a pattern case (i.e. exemplar) from this place's current collection, copy it, edit selected characters of the copy, then add the copy to the same collection. Copies are made until the supply of current cases matches the demand.

- correct
Choose a subject and reference case from the current collection then make selected characters of the subject conform to corresponding characters of the reference.

- lose
Transfer a copy from this place's current collection to its lost collection.

At the end of each place's annual propagation cycle, demand for copies is incremented according to the logistic growth equation.
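The cycle and its closing demand increment can be sketched like this (a much-simplified Python illustration: editing during copying and the import and correct operations are omitted, and the growth parameters are arbitrary):

```python
import random

def propagation_cycle(place, rng, p_loss, growth, K):
    """One simplified year at one place: copy until supply meets demand,
    perhaps lose a case, then increment demand by logistic growth."""
    while len(place["current"]) < int(place["demand"]):
        exemplar = rng.choice(place["current"])
        place["current"].append(list(exemplar))   # copy (editing omitted here)
    if place["current"] and rng.random() <= p_loss:
        victim = rng.randrange(len(place["current"]))
        place["lost"].append(place["current"].pop(victim))
    place["demand"] += growth * place["demand"] * (1 - place["demand"] / K)

rng = random.Random(3)
place = {"current": [[1, 1, 1]], "lost": [], "demand": 2.0}
for year in range(5):
    propagation_cycle(place, rng, p_loss=0.05, growth=0.1, K=300.0)
```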

The uniform distribution is used to select cases to copy, correct, and lose. Each
*copy* operation randomly selects a pattern case from the current
collection then randomly selects a number of the copy's characters to be edited. Each
*correct* operation randomly selects a subject and reference case from
the current collection. The program generates a proportion between 0 and 1 then randomly
selects a corresponding number of characters for correction against the reference. To
illustrate, if there are 50 characters and the program generates a proportion of 0.2 then ten
(i.e. 0.2 * 50) of the subject case's characters will be randomly selected to be assigned the
same state as the corresponding character of the reference case. Each
*lose* operation transfers a randomly selected case from the current to
lost collection.
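A sketch of the correction operation described above (Python; the helper name `correct` is an assumption):

```python
import random

def correct(subject, reference, rng):
    """Make a randomly chosen proportion of the subject's characters
    conform to the corresponding characters of the reference."""
    proportion = rng.random()
    n = int(proportion * len(subject))
    for i in rng.sample(range(len(subject)), n):
        subject[i] = reference[i]   # mixture: subject now agrees here
    return subject

rng = random.Random(9)
subject = [1] * 50
reference = [2] * 50
corrected = correct(subject, reference, rng)
# With a proportion of 0.2, int(0.2 * 50) = 10 characters would be conformed.
```

Correction is one of the two routes (with import) by which textual mixture enters the simulated tradition.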

An identifier comprising three parts separated by periods is assigned to each copy. The three parts are generated as follows:

- a contraction of the place where the copy is made
- the production year
- the number of the copy produced in that year.

For example, the second copy produced in Italy in 259 would be assigned
identifier `Iy.259.2`.
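The identifier format can be expressed as a one-line helper (Python sketch; the function name is an assumption):

```python
def identifier(place_contraction, year, copy_number):
    """Join the three identifier parts with periods."""
    return f"{place_contraction}.{year}.{copy_number}"
```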

The process of choosing a state for each character selected to be edited depends on a
*trend* setting:

- zero
The same set of lists of preferred states per character is used for each *Place*. Each state of a character is equally likely to be selected.

- greater than zero
Each *Place* has its own combination of lists of preferred states per character. When a character is selected to be edited, states near the head of its list of preferred states are more likely to be chosen, to a degree determined by the magnitude of the *trend* parameter. The *trend* parameter is used as the exponent of the Zipfian variate generator used to select an item from the character's list of preferred states.

During initialization, the program assigns each character its own number of states using a
Zipfian variate generator with a specified maximum possible number of states and an exponent
of 1.8.^{}[4] A list of preferred states is then generated for each character. If the
*trend* parameter is zero then lists for corresponding characters are the
same across *Places*, being the numerals from one to the assigned number of
states (e.g. `1 2` for two states; `1 2 3 4` for four). A different
method is used if the *trend* parameter exceeds zero: each character of
each *Place* is assigned a list which is a permutation of the numerals from
one to the assigned number of states (e.g. `2 1` if two states; `3 2 4 1`
if four). That is, each *Place* has its own combination of lists
of preferred states unless the *trend* parameter is set to zero.
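The two methods of building preference lists can be sketched as follows (Python; the real program may generate its permutations differently):

```python
import random

def preferred_states(n_states, trend, rng):
    """Build one character's list of preferred states for one place:
    the canonical order 1..n when trend is zero, otherwise a
    place-specific permutation of 1..n."""
    states = list(range(1, n_states + 1))
    if trend > 0:
        rng.shuffle(states)
    return states

rng = random.Random(11)
shared = preferred_states(4, trend=0, rng=rng)     # same across places
local = preferred_states(4, trend=2.5, rng=rng)    # place-specific ordering
```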

Import operations also rely on Zipfian variates. Exporting places are ranked by distance from the importing place using great circle distances calculated from specified latitude and longitude coordinates. A Zipfian probability profile (generated using an exponent of one) is associated with the list of nearest places then used to select a place from which to import a copy.
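The distance-ranking step can be sketched with the haversine formula (Python; the coordinates below are rough illustrative values, not values taken from the configuration files):

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius=6371.0):
    """Haversine great-circle distance (km) between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

# Rank exporting places by distance from an importing place (Rome).
importer = ("Rome", 41.9, 12.5)
others = [("Alexandria", 31.2, 29.9), ("Antioch", 36.2, 36.2), ("Ephesus", 37.9, 27.3)]
ranked = sorted(others, key=lambda p: great_circle_km(importer[1], importer[2], p[1], p[2]))
# A Zipfian variate over ranks 1..len(ranked) would then pick the import source.
```

Because the Zipfian profile favours the head of the ranked list, nearby places are the most likely sources of imported copies.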

Once the program reaches the end of the propagation phase (i.e. the end date), it steps
through the *recover* and *write* phases. Recovery
consists of random selection from the current and lost collections of each
*Place* according to two parameters:

- ratio of current to lost cases, and
- recovery proportion.

The first allows various ratios of current to lost cases to be recovered, with
a ratio of one giving approximately equal numbers of both types. *The initial text is
always included among recovered cases.*
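One plausible reading of the recovery step is sketched below (Python). The weighting rule, drawing each case from the current collection with probability ratio/(ratio + 1), is an assumption consistent with a ratio of one yielding roughly equal numbers of both types; the actual program may weight its draws differently:

```python
import random

def recover(current, lost, ratio, proportion, rng):
    """Recover round(proportion * total) cases; each draw comes from the
    current collection with probability ratio / (ratio + 1), else lost."""
    n = round(proportion * (len(current) + len(lost)))
    recovered = []
    for _ in range(n):
        use_current = current and (not lost or rng.random() <= ratio / (ratio + 1))
        pool = current if use_current else lost
        if not pool:
            break
        recovered.append(pool.pop(rng.randrange(len(pool))))
    return recovered

rng = random.Random(5)
recovered = recover(list("abcde"), list("fghij"), ratio=1.0, proportion=0.4, rng=rng)
```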

The *write* phase outputs a data matrix whose rows give the identifier
and character states of each recovered case. The data matrix is written to a
comma-separated values (CSV) file. This file can then be opened for inspection, processing, and
analysis by *R* or some other application such as a spreadsheet
program.

Operation of the program is governed by an XML configuration file (e.g. `conf.xml`) which specifies a number of parameters:

- characters
Number of characters used to represent a case. Too small a number will produce quantization of distances between cases; the larger the number, the slower the simulation will run.

- growth
Initial rate of annual growth for demand of the book. In this simulation, the same rate applies across places in the domain.

- origin
Place where the initial text is introduced.

- ratio
Ratio of current to lost cases recovered. Values larger than one favour current cases while values between zero and one favour lost copies.

- seed
Random number generator seed. Any number will do. Other things being equal, using the same seed will produce the same simulation result.

- states
Maximum possible number of states for a character.^{}[5]
- trend
Degree to which each place favours its own combination of lists of preferred character states. The larger the value, the more parochial each place is with respect to its flavour of the text.

Probabilities are specified as follows:

- corr
Probability (per year) that a case will be corrected.

- edit
Probability (per character copied) that a character will be edited.

- import
Probability (per case required) that a case will be imported from another place.

- loss
Probability (per year) that a case will be lost.

Any number of places may be specified to constitute the domain. Each place has the following components:

- name
Name of the place.

- capacity
Maximum demand for copies in this place.

- lat
Latitude of some point in the place.

- long
Longitude of some point in the place.

- rec
Proportion of cases to recover from this place.

Finally, three dates are specified:

- start
First year of simulation.

- publish
Year when initial copy is introduced.

- end
Last year of simulation.
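Assembled from the parameter descriptions above, a configuration file might look something like the following sketch. The element names, nesting, and every value shown are assumptions for illustration only; consult the distributed `conf.xml` for the authoritative format:

```xml
<conf>
  <characters>120</characters>
  <states>8</states>
  <trend>2.5</trend>
  <growth>0.1</growth>
  <origin>Italy</origin>
  <ratio>1</ratio>
  <seed>123</seed>
  <corr>0.05</corr>
  <edit>0.5</edit>
  <import>0.1</import>
  <loss>0.02</loss>
  <place>
    <name>Italy</name>
    <capacity>300</capacity>
    <lat>41.9</lat>
    <long>12.5</long>
    <rec>0.05</rec>
  </place>
  <start>50</start>
  <publish>65</publish>
  <end>325</end>
</conf>
```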

*R* must be installed on the user's computer before the simulation
can run. Instructions on how to download and install *R* are provided at
the project's website. Much useful
information is available at this site, including platform-specific instructions on a number of
topics such as how to set the working directory.

Subsequent steps will be facilitated by creating a suitable directory (i.e. folder)
structure to hold the simulation program and related files. A parent directory (say,
`Simulation`) should be created with subdirectories named `scripts` and `data`.
The program (`sim.r`) and a configuration file (say, `conf.xml`) would then be
downloaded to the `scripts` and `data` directories, respectively. If desired,
other *R* scripts can be downloaded to the `scripts` directory from here.
Corresponding `dist`, `cmds`, `dc`, and `nj` subdirectories could then be created
(at the same level as the `scripts` and `data` subdirectories) to capture output from
`dist.r`, `cmds.r`, `dc.r`, and `nj.r`, respectively. Finally, the `scripts`
directory should be selected as the working directory for *R*.

A text editor can be used to edit the configuration file as desired. Once the
*R* environment is launched and the working directory changed to
`scripts`, the simulation program is run by typing `source("sim.r")`
at the *R* command prompt.

Input and output parameters are specified by editing the following line found towards the end of the simulation program's source file:

`c = Conf$new(conf="conf.xml", dir="../data", rem=2, write=TRUE)`

The values of `conf`, `dir`, `rem`, and `write` can be edited to change the
configuration file name, data directory path, level of remarks (0 = minimal;
1 = main phases; 2 = years; 3 = all events), and whether or not to write the
output data matrix to the data directory (TRUE = yes; FALSE = no).

Only a small part of the data generated by the simulation is output to the data matrix.
However, all generated data is accessible via the *Domain* object, `d`.
Once the simulation has run, features of interest can be inspected by
addressing specific components and subcomponents. To illustrate, the components of
`d` are listed with this command, typed at the *R* command line:

`ls(d)`

The following would print the entire contents of the first *Place* in
*Domain* `d`, a potentially large amount of data:

`d$place[[1]]`

This would print the first recovered case:

`d$recovered[[1]]`

The ancestry and history of this case would be printed as follows:

`d$recovered[[1]]$ancestry`

`d$recovered[[1]]$events`

The ancestry field lists the identifiers of all ancestors back to the initial text. The
list also includes the identifier of each reference case used to perform a correction
operation somewhere in a case's past. These reference case identifiers are placed in
parentheses (e.g. `(Iy.259.2)`) to distinguish them from genetic ancestor
identifiers. The events field lists all operations (i.e. import, copy, correct, lose) that
have occurred in a case's history.

The model data set, *Mark-UBS4*, derives from a data matrix constructed
by Richard Mallett using the critical apparatus of Mark's Gospel found in the United Bible
Societies *Greek New Testament* [UBS4]. The *Mark-UBS4* data matrix was produced by
passing Mr Mallett's data matrix through the `clean.r` script to remove witnesses
with large numbers of undefined readings.

The simulator-generated data sets are based on three configuration files:
`sim-a.xml`, `sim-b.xml`, and `sim-c.xml`. These are identical in all respects
except for their *trend* parameters, which have values of `0`, `1`, and `2.5`,
respectively. The data set generated with a setting of `0` has no
location-specific tendencies with respect to combinations of preferred character
states, while the other two have mild (setting = `1`) or strong (setting = `2.5`)
tendencies.

Other parameter settings, which are common to all three configuration files, require some
explanation. Start and end dates bracket the period of interest, while the seed allows the
user to obtain repeatable results. The probabilities of correction, import, and loss, along
with the growth rate, place of origin, ratio of lost to current copies recovered, maximum
demand for copies in each place, and publication date, are not much more than educated guesses.
The number of characters and maximum number of states are based on the model data set. The
recovery proportion for each place is set so that the number of cases recovered is
approximately equal to the number of witnesses in the model data set. The proportion for Asia
Minor is three times that of the other places to produce one textual variety with about three
times as many members as the others. This reflects the disproportionate number of
*Byzantine* witnesses in the model data set.

The number of places in the domain is based on analysis results for the model data set which indicate that the tradition has four major streams. The assumption expressed in the configuration files that these locations are the West, Asia Minor, Syria, and Egypt (centred at Rome, Ephesus, Antioch, and Alexandria, respectively) is speculative. Nevertheless, the number and identity of these places seem reasonable choices given what is known of the spread of Christianity in the first few centuries after Christ's birth, and it is hard to think of any other centres that would have generated greater demand for copies of Christian books during this era. Whereas a configuration with more places would be more realistic, the one given seems reasonable as a first approximation.

Each location has been assigned the same maximum demand for copies, namely three hundred. The respective demands would doubtless have differed, though to an unknown extent. The assigned figure is a very rough estimate based on these assumptions:

- maximum demand for copies occurred about the time of the first Council of Nicaea (325)
- maximum demand across Asia Minor, Syria, and Egypt was about nine hundred (based on the figure of about one thousand Eastern bishops supposed to have been invited to the Council)
- there is a one-to-one correspondence between copies and bishops.

These assumptions are probably too conservative, so the figure of three hundred copies per location probably underestimates the actual maximum demand.

A process of trial and error was used to find the combination of edit probability and
trend value used in the third configuration file (i.e. edit probability of `0.5`
and trend value of `2.5`). Analysis results for the generated and model data sets
match relatively well for this combination of values. Users are encouraged to experiment with
their own combinations of parameter values in order to find a better match. Changes to some
parameters will have little effect while changing others will have a major impact on the
generated data.

The following table presents the model and generated data sets along with analysis results
derived from them. Distance matrices are derived from the corresponding data matrices using
the script `dist.r`. Classical
multidimensional scaling (CMDS), divisive clustering (DC), and neighbour-joining (NJ) results
were obtained with scripts `cmds.r`, `dc.r`, and `nj.r`.^{}[6]
*Mean chars* gives the average number of characters (or variation sites)
where states (or readings) are defined. *Mean dist.* and *std.
dev.* (standard deviation) figures relate to text-to-text distances recorded in
the distance matrix.^{}[7] The *prop. var.* (proportion of variance) figure indicates how
much of the original distance data is captured by the CMDS result, and the *div.
coeff.* (divisive coefficient) figure indicates degree of clustering in the DC
result.

**Table 2. Model and generated data sets**

| Name | Data matrix | Distance matrix | CMDS | DC | NJ | Mean chars | Mean dist. | Std. dev. | Prop. var. | Div. coeff. | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mark-UBS4 | → | → | → | → | → | 123 | 0.471 | 0.170 | 0.51 | 0.74 | Model data. |
| sim-a | → | → | → | → | → | 120 | 0.593 | 0.045 | 0.15 | 0.32 | Generated data, trend = `0`. |
| sim-b | → | → | → | → | → | 120 | 0.569 | 0.057 | 0.21 | 0.41 | Generated data, trend = `1`. |
| sim-c | → | → | → | → | → | 120 | 0.491 | 0.147 | 0.55 | 0.63 | Generated data, trend = `2.5`. |

The *trend* parameter must be set to a relatively high value (e.g. `2.5`) for
analysis results from model and simulated data sets to be broadly
consistent. Such a high value corresponds to a strong tendency for each place to use its own
combination of preferred character states when editing copies. If the simulator successfully
mimics the actual transmission process then the need to use such a high trend value suggests
that ancient centres of Christianity strongly preferred their own combinations of readings. It
would seem that a theory of local texts deserves renewed attention when seeking to understand
the early history of New Testament textual transmission.

^{[1] }The program, `sim.r`, is a script
written in the [R] statistical computing language. See the R project web site for more information on
the language.

^{[2] }[SWH2004] introduces *character* and
*state* nomenclature. In multivariate analysis, a general term such
as *case* or *object* is often used for an item
composed of multiple characters.

^{[3] }For *R* programmers, these components are reference
classes.

^{[4] }This value of the exponent has been found to produce a distribution of numbers of
states not unlike that found in the critical apparatus of the [UBS4]
*Greek New Testament*.

^{[5] }A suitable value would be the maximum number of alternative readings found at a
variation site of the data being modelled.

^{[6] }See [CC1980] chapters 10 and 11 for an introduction to
multidimensional scaling and cluster analysis; [SM1987] introduces
neighbour-joining.

^{[7] }The simple matching distance, which is the ratio of disagreements to places compared,
measures the distance between two texts. Being a pure number, it does not have a
unit.