In the following post, Dr. Michaela Vance writes about her pilot project undertaken with the CDHU: an author attribution study of Frances Brooke’s libretto Marian.
About six months ago, in a moment of unbridled optimism, I set out to investigate whether the second, performed version of Frances Brooke’s libretto Marian was actually and truly written by her, or if the changes to the same could be reasonably assumed to be the work of a second (unknown) writer. Since then, I have come to have a greater appreciation of the difficulties involved in attributing authorship to short texts. In the following, I will give a short account of a workshop on the subject, hosted and organized by the Centre for Digital Humanities at Uppsala University, following my successful pilot project application to the same.
In preparation for the workshop I made a corpus of texts of a similar era and genre, against which we could run the two librettos in question. These included some of Brooke’s earlier texts such as the libretto to Rosina, the tragedies Virginia and Siege of Sinope, and sections of her novels. In addition, I included a selection of texts by other writers, such as Dibdin’s Shepherdess of the Alps, O’Keeffe’s The Wicklow Mountains, Bickerstaff’s Love in a Village, and Hannah More’s tragedies Percy and Fatal Falsehood. An initial problem was the OCR quality of these texts – firstly, because there exists no OCR .txt copy of the manuscript of Marian, and secondly, because the automatic OCR renderings of old printed texts are often far from ideal. However, I managed to make a decent OCR copy of the manuscript version of Marian with the help of Transkribus, and Ekta Vats made a brilliant OCR copy of the printed version of Marian. The rest of the texts were found at the excellent open source project ECCO-TCP, which hosts a large collection of SGML/XML-encoded texts.
Having addressed the OCR issues in the week before the workshop, project coordinator Karl Berglund, research engineer Marie Dubremetz and I set out to get some initial results in Stylo (run in R) when we met at Uppsala. We first tried a cluster analysis with Cosine Delta, which grouped the texts based on stylometric similarity. As can be seen in Figure 1, this worked rather well in many ways, but it did not give enough of an indication that Marian differed to any substantial degree from Brooke’s other texts in terms of authenticity. We then moved on to the second method, “rolling classify”, but here we repeatedly ran into issues because the text in question was too short. Intriguingly, both methods cluster the texts by women (Brooke and More) together, while the texts written by men are quite clearly clustered on a separate branch.
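For readers unfamiliar with Cosine Delta, the idea can be sketched in a few lines of Python. This is not the code we ran (we used Stylo in R, whose implementation offers further options such as culling and sampling); it is a minimal illustration under simplifying assumptions. The function names `tokenize` and `cosine_delta`, the crude tokenizer, and the default of 100 most frequent words are all choices made for this sketch.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def cosine_delta(texts, n_mfw=100):
    """Return {(label_a, label_b): distance} for every pair of texts.

    texts: dict mapping a label to the raw text.
    n_mfw: number of most frequent words used as features (an assumed default).
    """
    if len(texts) < 2:
        raise ValueError("need at least two texts")
    counts = {name: Counter(tokenize(body)) for name, body in texts.items()}
    totals = {name: sum(c.values()) for name, c in counts.items()}
    if min(totals.values()) == 0:
        raise ValueError("every text must contain at least one word")

    # Features: the most frequent words across the whole corpus.
    corpus = Counter()
    for c in counts.values():
        corpus.update(c)
    features = [word for word, _ in corpus.most_common(n_mfw)]

    # z-score the relative frequency of each feature across the texts.
    names = list(texts)
    z = {name: [] for name in names}
    for word in features:
        column = [counts[name][word] / totals[name] for name in names]
        mean = sum(column) / len(column)
        sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
        for name, x in zip(names, column):
            z[name].append((x - mean) / sd if sd > 0 else 0.0)

    # Cosine distance (1 - cosine similarity) between the z-score vectors.
    distances = {}
    for a, b in combinations(names, 2):
        dot = sum(x * y for x, y in zip(z[a], z[b]))
        norm = math.sqrt(sum(x * x for x in z[a])) * math.sqrt(sum(y * y for y in z[b]))
        # The distance is undefined for a zero vector; this sketch returns 0.0 then.
        distances[(a, b)] = 1 - dot / norm if norm > 0 else 0.0
    return distances
```

The resulting pairwise distances are what a hierarchical clustering then turns into a tree of the kind shown in the figures. With very short texts the relative frequencies rest on few observations, which is one reason results for librettos of this length should be read cautiously.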
In the next step, we tried to tackle the issue of the similarity of the two librettos by distilling the later version of Marian into a document that only contains the changes – a kind of super edition here called “disputed Marian new additions cleaned”. Having tried a variety of programs to help me spot and separate changes to the newer version of Marian, I finally gave up and did it myself, placing the versions next to each other and marking each change by hand. Given the very limited length of the texts and the fact that the librettos were interspersed with songs, we had low hopes of getting clear results. Nevertheless, as can be seen in Figures 3-5, some interesting results came out of the new Stylo effort. First, the way the texts are grouped varies notably depending on the method, yet the women writers continue to group together consistently. Second, in Figure 4 the distilled document with all of the isolated changes to Marian is further away from the manuscript version of Marian than Rosina is. This is intriguing, as it indicates greater variety of style between the two versions of Marian than between two entirely different librettos by Brooke. Caution needs to be exercised, however: the texts cluster on the level of author probability rather than text-specific features as such, and both versions of Marian and Rosina are situated on the same branch. Getting less ambiguous results would require further testing, with even greater attention to stylistic, syntactic, and morphological data, especially as we cannot speculate as to who might have made the changes to the original manuscript, and therefore do not have access to a reliable selection of comparison texts.
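As an aside, the kind of automatic comparison I attempted before turning to manual marking can be sketched with Python's standard `difflib` module. This is not one of the programs I tried, and it is not how the distilled document was made; the function name `added_passages` and the choice to compare word by word on whitespace-split text are assumptions of this sketch.

```python
import difflib


def added_passages(old_text, new_text):
    """Return the word sequences found in new_text that have no match in old_text."""
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False stops difflib from ignoring very frequent words in long texts.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    additions = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" marks new material; "replace" marks old material that was rewritten.
        if tag in ("insert", "replace"):
            additions.append(" ".join(new_words[j1:j2]))
    return additions
```

On OCR output, a word-level comparison like this tends to flag spelling variants, line-break hyphenation, and recognition errors as changes, so its results would still need checking by hand.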
I went into the project feeling fairly certain that Brooke had not made the changes to Marian herself, but now, even with such uncertain results, I find myself more open to the idea that she did indeed make those changes. Part of this has to do with the level of attention I paid to the smallest details of the two scripts as I prepared them for the Stylo experiments. The method encouraged me to consider aspects of the changes that I had not really registered before – an incidental but positive aspect of stylometry that, somewhat ironically, brings it back to the methods that predate the computer. I hope that I will be able to discuss the changes and what they might mean for how we understand Brooke’s ideas about class and genre in a different outlet in the not-too-distant future.
If you would like to read more about stylometric methods and short texts, I recommend checking out “The Dynamiter” project at http://thedynamiter.llc.ed.ac.uk/ and three articles: Hirst and Feiguina’s “Bigrams of Syntactic Labels for Authorship Discrimination of Short Texts” in Literary and Linguistic Computing (2007), Gorman’s “Author identification of short texts using dependency treebanks without vocabulary” in Digital Scholarship in the Humanities, and Corrinne Harol, Brynn Lewis and Subhash Lele’s “Who Wrote It? The Woman of Colour and Adventures in Stylometry” in Eighteenth-Century Fiction.
-Dr. Michaela Vance