1 Abstract

The Greek New Testament was copied by hand for almost fifteen centuries until the advent of mechanized printing provided an alternative means of propagation. Translations into other languages were produced as well. Some of these — such as the Latin, Coptic, and Syriac versions — appeared early and thus preserve ancient states of the text. Patristic citations form another class of evidence that allows varieties of the text to be associated with particular localities and epochs. Analysis of textual variation allows relationships between these three classes of witnesses — manuscripts, versions, and patristic citations — to be explored. This article applies various kinds of analysis to textual variation data collected from a variety of sources. The analysis results offer complementary views of the textual space occupied by these venerable witnesses to the New Testament text.

2 Introduction

As with every widely read work from antiquity, the New Testament exhibits textual variation introduced by scribes and correctors. Sites where textual variation occurs are identified by comparing extant witnesses of the text.1 Differences may be classified as orthographic or substantive. Orthographic variations are often ignored as they only affect the surface form of a text and not its meaning. Substantive variations do affect meaning: they are often called readings or variants. The list of witnesses that support a particular reading at a particular variation site is known as the reading’s attestation. A list of all readings at a variation site along with the attestation of each reading is called a variation unit. Critical editions often present variation units in an apparatus. The present study is based on analysis of data sets extracted from critical editions, monographs, journal articles, dissertations, or online databases.

There is an ongoing effort to establish the initial text which stands behind the range of texts found among surviving witnesses of the New Testament.2 The most important witnesses for establishing the initial text fall into these categories:

  • Greek manuscripts

  • ancient versions

  • patristic citations.

Greek manuscripts are the primary witnesses to the text of the New Testament. Ancient versions are early translations of the Greek text into languages such as Latin, Coptic, Syriac, and Armenian. It is often possible to establish which Greek reading a version supports by translating its text at a variation site back into Greek. Patristic citations are quotations of the scripture by Church Fathers. Which reading was in a Church Father’s copy of the text at a particular variation site can often be discerned if that part of the text is covered by one of his quotations.

A large proportion of the textual evidence disappeared long ago. Even a comprehensive data set that includes all readings of all extant witnesses is still a mere sample of what once existed. In general, the older the copy, the more likely it is to have been lost. This loss of data presents a fundamental problem: if extant texts do not represent the oldest copies then the survivors will give a skewed impression of the initial text. Happily, there is a way forward: if like texts are grouped and more or less accurate representations of the archetypal texts that gave rise to the groups can be reconstructed then these archetypes have a claim to being more representative of the ancient text. In addition, the extant copies provide data for estimating how accurately the copyists practiced their art. Armed with a knowledge of the number of generations of copies and typical rates of substituting one reading for another per generation, it is possible to say how much the original text is likely to differ from the initial text recoverable from extant copies. While raw data that could be used for the purpose is presented below, I do not propose to reconstruct hypothetical archetypes or estimate rates of change here. Instead, this article will focus on presentation and analysis of available data to explore grouping among the extant texts.3

Even though much is lost, a stupendous amount of evidence remains. There are many thousands of manuscripts in Greek, Latin, Syriac, Armenian, and other languages. Patristic citations are also very numerous. Given such a great cloud of witnesses, it can be difficult to see where each one stands in relation to the others. Fortunately, various methods of statistical analysis can be applied to data sets which relate to textual variation in order to explore relationships among the witnesses.

Analysis might begin from a number of starting points. One suitable place to begin is a data set derived from a critical apparatus which gives attestations (i.e. lists of witnesses) in support of readings found at variation sites. In nearly all cases, practical considerations restrict an apparatus to presenting a sample of extant texts. Results obtained by analysis of these data sets are therefore provisional because it is always possible that including further data would produce different results. However, it is reasonable to expect that analysis results will approximate those that would be obtained if a more comprehensive data set were analysed provided that the sample is sufficiently large and has been selected without systematic bias.

The information contained in an apparatus must first be encoded. The encoding is illustrated here by reference to this entry from the fourth edition of the United Bible Societies’ Greek New Testament (UBS4):

Figure 1. Apparatus entry (Mark 1.1, UBS4)

The data sets presented in this article use a number of encoding conventions. Exotic characters and superscripts can cause problems when plotting analysis results so witness identifiers (i.e. sigla) are Romanized and superscripts are replaced by hyphenated sequences of characters. Apart from these changes, the method of identifying witnesses used by the source of a data set is usually retained. Be warned, dear reader: this approach is liable to cause confusion when two sources use different identifiers for the same witness. For example, Codex Sinaiticus may be identified as Aleph or 01. Also, the critically established text used in the INTF’s Editio Critica Maior may be referred to as A (for Ausgangstext), making it easy to confuse with the A often used to represent Codex Alexandrinus.

When it comes to encoding apparatus entries, the textual states found among the witnesses can be represented by numerals, letters, or other symbols. In the present example, the first reading is encoded as 1, the second as 2, and so on. The state of a witness is classified as undefined and encoded as NA (for not available) when it is not clear which reading the witness supports. For manuscripts this may be due to physical damage or because the manuscript does not include the section of text being examined; for versions, it may not be clear which state of the Greek text is supported by a back-translation of the version; for patristic citations, the reading of a Church Father’s text may be unclear if the quotations are not exact (e.g. adaptations, allusions, or quotations from memory) or if different witnesses of the Church Father’s text have different readings. In the present example, a number of versions (Latin, Syriac, Coptic) and patristic citations (e.g. those of Irenaeus, Ambrose, Chromatius, Jerome, and Augustine) are treated as undefined because it is not clear which readings they support at this variation site.4

Table 1. Codes for readings (Mark 1.1, UBS4)

| Code | Variant | Attestation |
|---|---|---|
| 1 | Χριστου υιου θεου | UBS Aleph-1 B D L W 2427 |
| 2 | Χριστου υιου του θεου | A Delta f-1 f-13 33 180 205 565 579 597 700 892 1006 1010 1071 1243 1292 1342 1424 1505 Byz E F G-supp H Sigma Lect eth geo-2 slav |
| 3 | Χριστου υιου του κυριου | 1241 |
| 4 | Χριστου | Aleph Theta 28-c syr-pal arm geo-1 Origen Asterius Serapion Cyril-Jerusalem Severian Hesychius Victorinus-Pettau |
| 5 | omit | 28 Epiphanius |
| NA | undefined | it-a it-aur it-b it-c it-d it-f it-ff-2 it-l it-q it-r-1 vg syr-p syr-h cop-sa cop-bo Irenaeus Ambrose Chromatius Jerome Augustine Faustus-Milevis |

Encoded readings are entered into a data matrix which has a row for every witness and a column for every variation site. The appropriate code is entered at the cell corresponding to a particular witness and variation site, namely that cell located at the intersection of the witness row and variation site column. Manuscript correctors are treated as separate witnesses, as are supplements.
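The encoding step can be sketched in a few lines of code. The fragment below is written in Python for illustration only (the article's own scripts are in R) and turns an abbreviated version of the Mark 1.1 attestation from Table 1 into one column of a data matrix. The site label `Mk 1.1`, the function names, and the use of `None` for NA are assumptions made for this sketch, and only a handful of witnesses are listed per reading.

```python
# Illustrative sketch, not the author's R code.
NA = None  # undefined textual state

# Abbreviated attestation for Mark 1.1 (see Table 1); witness lists are shortened.
attestation = {
    1: ["UBS", "Aleph-1", "B", "D", "L", "W", "2427"],
    2: ["A", "Delta", "f-1", "f-13", "33"],
    3: ["1241"],
    4: ["Aleph", "Theta", "28-c", "Origen"],
    5: ["28", "Epiphanius"],
    NA: ["it-a", "vg", "syr-p", "cop-sa", "Irenaeus"],
}


def column_from_attestation(attestation):
    """Map each witness to the code of the reading it supports at one site."""
    column = {}
    for code, witnesses in attestation.items():
        for witness in witnesses:
            column[witness] = code
    return column


def build_data_matrix(columns):
    """Build rows (witnesses) by columns (variation sites).

    columns: dict mapping a site label to {witness: code}.
    A witness not mentioned at a site is treated as undefined (an assumption).
    """
    witnesses = sorted({w for col in columns.values() for w in col})
    return {
        w: {site: col.get(w, NA) for site, col in columns.items()}
        for w in witnesses
    }


# "Mk 1.1" is a hypothetical site label chosen for this example.
matrix = build_data_matrix({"Mk 1.1": column_from_attestation(attestation)})
```

Note that 28 and its corrector 28-c occupy separate rows, in keeping with the treatment of correctors as separate witnesses.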

Figure 2. Part of a data matrix (Mark, UBS4)

The next step is to construct a distance matrix which tabulates the simple matching distance between each pair of witnesses sufficiently represented in the data set. The simple matching distance between two witnesses is the proportion of disagreements between them at those variation sites where the textual states of both are defined. Being a ratio of two pure numbers, this quantity is dimensionless (i.e. has no unit). It varies from a value of zero for complete agreement to a value of one for no agreement.5 A witness only qualifies for inclusion in a distance matrix if all distances for that witness are calculated from at least a minimum number of variation sites. This constraint is intended to reduce sampling error to a tolerable level. It is enforced by a vetting algorithm that progressively drops the witnesses with the fewest defined variation sites until every distance in the distance matrix is guaranteed to have been calculated from a minimum acceptable number of sites. The minimum acceptable number for the distance matrices of this study is nearly always set at fifteen.6
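The definition of simple matching distance can be made concrete with a short example. The sketch below is in Python for illustration (it is not the author's dist.r script); the function name and the convention of representing NA by `None` are assumptions.

```python
def simple_matching_distance(a, b):
    """Return (distance, n) for two witnesses given as equal-length code lists.

    n is the number of variation sites where both states are defined.
    The distance is the proportion of those n sites where the codes differ,
    or None if the witnesses share no defined sites.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return None, 0
    disagreements = sum(1 for x, y in pairs if x != y)
    return disagreements / n, n


# Two made-up witnesses over five variation sites (None = undefined).
w1 = [1, 2, None, 1, 3]
w2 = [1, 1, 2, None, 3]

# Both are defined at three sites and disagree at one of them.
d, n = simple_matching_distance(w1, w2)
```

Returning the count n alongside the distance makes it easy to check whether a distance rests on at least the minimum acceptable number of sites.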

Figure 3. Part of a distance matrix (Mark, UBS4)

Computing Environment

Various analytical methods can be applied to a data set derived from a critical apparatus to explore relationships between witnesses. All of the results presented in this article are obtained using a statistical computing language called R. The analysis is performed by means of R scripts written by the author which are available here. The R program and additional packages (e.g. cluster, rgl, ape) required to run the scripts can be installed using instructions provided at the R web site.

Readers are encouraged to use the scripts. There are various ways to run a script once the R environment is installed. For users who prefer a command line interface, typing R into a terminal window provides an R prompt. (It helps to change to the directory which holds the scripts before launching R.) A command can then be entered in order to run a script. As an example, the command source("dist.r") typed at the R prompt causes the dist.r script to construct a distance matrix from the specified data matrix. Parameters such as paths to input and output files are specified in the scripts, which users are free to edit.

Data Sets

The data sets analysed in this article derive from various sources. Each source is assigned an identifier based on the author or party who produced it. A source is often used to produce data sets for a number of New Testament sections such as individual gospels and letters. Each analysis result is keyed to the relevant section and source identifier so that its underlying data set can be identified.

The data sets generally retain the symbols used by their associated sources to represent New Testament witnesses. Some represent manuscripts by Gregory-Aland numbers (e.g. 01, 02, 03, 044) while others use letters or latinized forms (e.g. Aleph, A, B, Psi). These symbols carry through to the analysis results. In INTF data, ECM or A (for Ausgangstext or initial text) represents the text of the Editio Critica Maior. The A for Ausgangstext in INTF data sets should not be confused with the A for Codex Alexandrinus in other data sets. Also beware of confusing texts when the same letter (e.g. D, E, F, G, H, K, L, P) refers to different manuscripts in different parts of the New Testament. Abbreviations UBS, WH, and TR stand for the texts of the United Bible Societies’ Greek New Testament, Westcott and Hort’s New Testament in the Original Greek, and the Textus Receptus, respectively. Maj, Byz, and Lect stand for majority, Byzantine, and lectionary texts, respectively. The relevant printed editions should be consulted for explanations of what these group symbols represent.

A source may be in the form of apparatus entries, tables of percentage agreement, or lists of pairwise proportional agreement. If the source is an apparatus then it is used to construct one data matrix per desired section. Each data matrix includes those witnesses and variation sites covered by the apparatus, using symbols such as numerals or letters to encode reported textual states (i.e. readings). A distance matrix is then constructed from the data matrix. If the source only reports percentage or proportional agreement between witnesses then a distance matrix is constructed directly from the agreement data and no data matrix is produced. Distances are usually specified to three decimal places regardless of whether this level of precision is warranted.
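When only percentage agreement is reported, the conversion to distance is a one-line calculation, since simple matching distance is the proportion of disagreements. The Python sketch below is illustrative rather than the author's R code; the function names and the sample percentages are made up for the example.

```python
def agreement_to_distance(percent_agreement):
    """Convert percentage agreement to a distance, to three decimal places."""
    return round(1 - percent_agreement / 100, 3)


def distance_matrix(witnesses, agreements):
    """Build a symmetric distance matrix from pairwise percentage agreements.

    agreements: dict mapping a (witness, witness) tuple to percentage agreement.
    A pair missing from the table yields None, i.e. an empty cell.
    """
    matrix = {}
    for a in witnesses:
        matrix[a] = {}
        for b in witnesses:
            if a == b:
                matrix[a][b] = 0.0
                continue
            percent = agreements.get((a, b), agreements.get((b, a)))
            matrix[a][b] = None if percent is None else agreement_to_distance(percent)
    return matrix


# Hypothetical percentages, not taken from any of the sources.
sample = {("A", "B"): 62.5, ("A", "C"): 80.0, ("B", "C"): 71.4}
distances = distance_matrix(["A", "B", "C"], sample)
```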

Analysis cannot proceed if a distance matrix has missing entries. This problem can be avoided by manually producing multiple distance matrices from the same source data, each omitting a particular witness whose inclusion would create an empty cell. This is done for a number of the distance matrices presented below, including Brooks’ table for John (where there is a missing cell for C and Old Latin j) and Fee’s table for John 1-8 (which lacks cells where the first hand and corrector intersect for P66 and Aleph).
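The workaround for an empty cell can be sketched as follows. This Python fragment is an illustration under stated assumptions, not a description of how the matrices in this article were actually prepared (the text says that was done manually). The function names are hypothetical and the distances are invented.

```python
def drop_witness(matrix, witness):
    """Return a copy of a nested-dict distance matrix without one witness."""
    return {
        row: {col: value for col, value in cells.items() if col != witness}
        for row, cells in matrix.items()
        if row != witness
    }


def has_missing(matrix):
    """True if any cell of the matrix is empty (None)."""
    return any(v is None for cells in matrix.values() for v in cells.values())


def split_on_missing(matrix):
    """For the first empty cell found, return one matrix per omitted witness."""
    for row, cells in matrix.items():
        for col, value in cells.items():
            if value is None:
                return {omitted: drop_witness(matrix, omitted) for omitted in (row, col)}
    return {}


# Invented distances with an empty cell where C and it-j intersect.
distances = {
    "B": {"B": 0.0, "C": 0.21, "it-j": 0.55},
    "C": {"B": 0.21, "C": 0.0, "it-j": None},
    "it-j": {"B": 0.55, "C": None, "it-j": 0.0},
}
variants = split_on_missing(distances)  # keyed by the witness that was omitted
```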

Distance matrices are normally obtained by applying the default vetting algorithm, which drops the least defined witness of each pair used to calculate a distance until all distances are calculated from the minimum acceptable number of variation sites where both are defined, which is normally fifteen. In some cases, an alternative approach is used which forces a particular witness to be retained provided it has enough defined variation sites at the outset. Examples include UBS2 distance matrices for Matthew, Mark, John, and Acts where Alexandrinus (A), Ephraemi Rescriptus (C), Sinaiticus (Aleph), and P45 have been retained due to their importance.
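One possible reading of the vetting procedure is sketched below in Python. It is an interpretation of the prose description, not a transcription of the author's R script: the function names, the order in which offending pairs are handled, and the `keep` argument (standing in for the alternative approach that forces a witness to be retained) are all assumptions.

```python
from itertools import combinations


def defined_sites(codes):
    """Number of variation sites where a witness has a defined state."""
    return sum(1 for x in codes if x is not None)


def shared_sites(a, b):
    """Number of variation sites where both witnesses are defined."""
    return sum(1 for x, y in zip(a, b) if x is not None and y is not None)


def vet(data, minimum=15, keep=()):
    """Drop witnesses until every pair shares at least `minimum` defined sites.

    data: dict mapping witness name to a list of codes (None = undefined).
    keep: witnesses to retain where possible (hypothetical option).
    """
    kept = dict(data)
    while True:
        offending = [
            (p, q)
            for p, q in combinations(kept, 2)
            if shared_sites(kept[p], kept[q]) < minimum
        ]
        if not offending:
            return kept
        p, q = offending[0]
        # Prefer dropping a witness that is not marked for retention;
        # if both are marked, fall back to the less defined of the two.
        candidates = [w for w in (p, q) if w not in keep] or [p, q]
        loser = min(candidates, key=lambda w: defined_sites(kept[w]))
        del kept[loser]


# Toy data with a minimum of two shared sites instead of fifteen.
toy = {
    "P45": [1, None, None, 2],
    "X": [None, 1, 1, 2],
    "Y": [1, 1, 2, 2],
}
default_result = vet(toy, minimum=2)
forced_result = vet(toy, minimum=2, keep=("P45",))
```

In the toy data, P45 and X share only one defined site. The default run drops P45 as the less defined of the pair, while the forced run retains it and drops X instead.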

It is helpful to know what analysis results look like when there is no clustering among the objects being analysed. (Generic terms such as object, observation, case, or item may be used for the things being compared when they are not necessarily New Testament witnesses.) We have a natural facility for recognising group structure but are also prone to mistake a purely random distribution of items for a cluster. One way to avoid this kind of error is to be familiar with analysis results produced from a data set that has no group structure. With this purpose in mind, a control data set may be generated which is analogous to its model data set in various respects (e.g. number of objects, number of variables, mean distance between objects) but has no actual clustering among its objects.

A control data set is generated by performing c trials to randomly select one of two possible states (1 and 2) then repeating this r times to produce a data matrix with r rows of objects and c columns of variables. The generator aims to produce objects which have a mean distance of d between them. Values for r, c, and d are derived from the model: r is the number of objects in the model distance matrix; c is the rounded mean number of variables in the objects from which the model distance matrix was calculated; and d is the mean of distances in the model distance matrix. The control data matrix is then used to calculate a control distance matrix which has the same number of objects as the model and approximately the same mean distance between objects.7
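A generator of this kind might be sketched as follows, in Python for illustration. The text does not say here how the target mean distance is achieved, so the approach below is an assumption: each cell is set to state 1 with probability p and to state 2 otherwise, and because two independent cells then differ with probability 2p(1 - p), solving 2p(1 - p) = d gives p = (1 - sqrt(1 - 2d)) / 2. This only works for d of at most 0.5. The function names are hypothetical.

```python
import math
import random


def state_probability(d):
    """Probability of state 1 that gives an expected pairwise distance of d.

    Assumes independent two-state cells, for which P(differ) = 2p(1 - p).
    """
    if not 0 <= d <= 0.5:
        raise ValueError("with two states the expected distance cannot exceed 0.5")
    return (1 - math.sqrt(1 - 2 * d)) / 2


def control_matrix(r, c, d, seed=None):
    """Generate r objects, each with c randomly chosen states (1 or 2)."""
    rng = random.Random(seed)
    p = state_probability(d)
    return [[1 if rng.random() < p else 2 for _ in range(c)] for _ in range(r)]


# Made-up parameters: 30 objects, 200 variables, target mean distance 0.32.
control = control_matrix(r=30, c=200, d=0.32, seed=1)
```

Supplying a seed makes a control data set reproducible, which is convenient when results for the control and its model are to be compared more than once.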

The binomial distribution predicts the range of distances expected to occur between pairs of objects generated in this way. A 95% confidence interval is the range of distances expected to occur for 95% of randomly generated cases. Only 5% of distances between two randomly generated objects fall outside the upper and lower limits defined by this interval. A distance outside this range, either less or more, is statistically significant in the sense that it is unlikely to happen by chance (though there is a 5% chance it will). A distance outside the normal range defined by the 95% confidence interval indicates an adjacent or opposite relationship between two objects: adjacent if the distance is less than normal and opposite if greater.8
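The normal range can be computed directly from the binomial distribution. The Python sketch below is illustrative only: it takes the limits to be the 2.5% and 97.5% quantiles of a binomial distribution with n trials and success probability d, which is one common convention and may differ in detail from the author's calculation. The function names and the labels returned by `classify` are assumptions.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space (needs 0 < p < 1)."""
    log_pmf = (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )
    return exp(log_pmf)


def binomial_quantile(q, n, p):
    """Smallest k such that P(X <= k) >= q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binomial_pmf(k, n, p)
        if cumulative >= q:
            return k
    return n


def normal_range(n, d, level=0.95):
    """Lower and upper distance limits for n sites and mean distance d."""
    tail = (1 - level) / 2
    return binomial_quantile(tail, n, d) / n, binomial_quantile(1 - tail, n, d) / n


def classify(distance, n, d):
    """Label a distance relative to the normal range."""
    lower, upper = normal_range(n, d)
    if distance < lower:
        return "adjacent"
    if distance > upper:
        return "opposite"
    return "not significant"


narrow = normal_range(400, 0.5)  # more sites compared
wide = normal_range(100, 0.5)    # fewer sites compared
```

Comparing `narrow` with `wide` shows the range tightening as the number of sites grows, so the same distance can be significant in a larger data set and not in a smaller one.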

While distances outside the normal range are unlikely to occur by chance, a distance inside that range does not necessarily imply lack of relationship between two objects: a relationship between the two may exist but it is not possible to say so with confidence. The relative size of the normal range contracts as the number of places compared increases so a distance which is not statistically significant in one data set may be statistically significant in another which includes more variation sites.

The following table presents the data sets and their sources. Links in the table provide access to data and distance matrices which are formatted as comma-separated values (CSV) files so that they can be downloaded and imported into a spreadsheet program. A distance matrix is always provided but a data matrix is only included if one has been constructed. If there is no data matrix then NA for not available is entered in the relevant column.

Table 2. Data sets and their sources

| Source | Description | Section | Data matrix | Distance matrix |
|---|---|---|---|---|
| Brooks | Tables of percentage agreement from James Brooks’ New Testament Text of Gregory of Nyssa covering: Matthew (table 1, 58-9); Luke (table 7, 90-1); John (table 13, 138-9); and Paul’s Letters (table 18, 254-5). These were transcribed by Richard Mallett. | Matthew | NA | → |
| | | Luke | NA | → |
| | | John (C) | NA | → |
| | | John (it-j) | NA | → |
| | | Paul’s Letters | NA | → |
| CB | Data matrices for each Gospel compiled by Richard Mallett using Comfort’s New Testament Text and Translation Commentary and Comfort and Barrett’s Text of the Earliest New Testament Greek Manuscripts. | Matthew | → | → |
| | | Mark | → | → |
| | | Luke | → | → |
| | | John | → | → |
| Cosaert | Data matrices for each Gospel compiled from apparatus entries in Carl P. Cosaert’s Text of the Gospels in Clement of Alexandria. | Matthew | → | → |
| | | Mark | → | → |
| | | Luke | → | → |
| | | John | → | → |
| Cunningham | Tables of percentage agreement for the Gospel of John and Paul’s Letters from Arthur Cunningham’s “New Testament Text of St. Cyril of Alexandria,” 421-2 and 753. Associated tables of counts are on pages 423-4 and 754. | John | NA | → |
| | | Paul’s Letters | NA | → |
| Donker | Data matrices for Acts, the General Letters, and Paul’s Letters from Gerald Donker’s Text of the Apostolos in Athanasius of Alexandria. Gerald Donker and the SBL have made this data available through an archive located at sbl-site.org/assets/pdfs/pubs/Donker/Athanasius.zip. May their respective tribes increase! | Acts (all) | → | → |
| | | Acts 1-12 | → | → |
| | | Acts 13-28 | → | → |
| | | General Letters | → | → |
| | | Paul’s Letters | → | → |
| | | Romans | → | → |
| | | 1 Corinthians | → | → |
| | | 2 Cor. - Titus | → | → |
| | | Hebrews | → | → |
| EFH | Data used by Jared Anderson for his ThM thesis, “Analysis of the Fourth Gospel in the Writings of Origen.” The data was originally collected by Bart D. Ehrman, Gordon D. Fee, and Michael W. Holmes for their Text of the Fourth Gospel in the Writings of Origen. (Bruce Morrill did the statistical analysis presented in that volume.) A revised version of Anderson’s thesis will be published in SBL’s New Testament in the Greek Fathers series. | John | → | → |
| Ehrman | Table of percentage agreement for the Gospel of Matthew from Bart Ehrman’s Didymus the Blind and the Text of the Gospels. This was transcribed by Richard Mallett. | Matthew | NA | → |
| Fee | Tables of percentage agreement from three articles by Gordon Fee: (1) a table covering Luke 10 from “The Myth of Early Textual Recension in Alexandria”; (2) tables covering John 1-8, John 4, and John 9 from “Codex Sinaiticus in the Gospel of John”; (3) another table covering John 4 but including patristic data from “The Text of John in Origen and Cyril of Alexandria.” Two distance matrices are produced for each table of percentage agreement with a blank entry for agreement between the first hand and corrector of a manuscript. | Luke 10 | NA | → |
| | | John 1-8 | NA | → |
| | | John 1-8 (corr.) | NA | → |
| | | John 4 | NA | → |
| | | John 4 (corr.) | NA | → |
| | | John 4 (pat.) | NA | → |
| | | John 4 (pat., corr.) | NA | → |
| | | John 9 | NA | → |
| | | John 9 (corr.) | NA | → |
| Hurtado | Tables of percentage agreement from Larry Hurtado’s Text-Critical Methodology and the Pre-Caesarean Text. There is one table for each of the first fourteen chapters of the Gospel of Mark, one for Mark 15.1-16.8, and another for places where P45 is legible. Data from an augmented version of Hurtado’s P45 table is presented below in the Mullen source entry. | Mark 1 | NA | → |
| | | Mark 2 | NA | → |
| | | Mark 3 | NA | → |
| | | Mark 4 | NA | → |
| | | Mark 5 | NA | → |
| | | Mark 6 | NA | → |
| | | Mark 7 | NA | → |
| | | Mark 8 | NA | → |
| | | Mark 9 | NA | → |
| | | Mark 10 | NA | → |
| | | Mark 11 | NA | → |
| | | Mark 12 | NA | → |
| | | Mark 13 | NA | → |
| | | Mark 14 | NA | → |
| | | Mark 15.1-16.8 | NA | → |
| | | Mark (P45) | NA | → |
| INTF-General | Distance matrices derived from information in a database related to the INTF’s Novum Testamentum Graecum: Editio Critica Maior: Catholic Letters volumes. The INTF kindly granted access to this data. | James | NA | → |
| | | 1 Peter | NA | → |
| | | 2 Peter | NA | → |
| | | 1 John | NA | → |
| | | 2 John | NA | → |
| | | 3 John | NA | → |
| | | Jude | NA | → |
| INTF-Parallel | Distance matrices made from tables located at http://intf.uni-muenster.de/PPApparatu