Lately I have taken an interest in stylometry. After attending some very interesting panels on stylometry at DH2013, I wondered if I could further develop my experiments with Raymond Queneau’s matrix analysis. I had already applied a method using Markov chains to reduce texts to a simplified representation of their syntactic structure according to the schema proposed by Queneau. This method works fairly well for authorship attribution. I have been playing with stylo and learning about cluster analysis, principal component analysis, and other statistical techniques to measure stylistic differences among texts in a corpus. I wondered if, after transposing texts to sequences of the letters F, S, B and P, I could still discern patterns specific to particular authors using standard stylometric techniques.
Christof Schöch has produced some interesting analyses of a corpus of seventeenth-century French plays, and because he has generously made his corpus available online, I decided to see what I could do with it. First I transformed the texts into sequences of letters using Queneau’s schema along with P for punctuation. Here’s what the first few lines of Molière’s Tartuffe look like:
F P S P S P F F F F B P B F B F F F B S B P S P B P S P B F F F P B B F F F F B P F F F F F B F F F B P S B P B S F B F F P B F F F B F B F P S F F B F B S S P F P B F F F F B P F F B F B S P S F B F P F B F P S B F F B B S P F P
Even though there are only four letters used in this reduction of a text, those letters can still be read. I performed the following cluster analysis of Schöch’s corpus (based on 5-grams of words, where each word is one and only one of the letters F, S, B and P):
At first glance the texts clustered somewhat according to author, but upon closer examination I noticed that the corpus clustered perfectly into groups of verse texts (marked with ‘-V-’) and prose texts (‘-P-’). I did not expect this. Traditional verse is determined by meter and rhyme, but Queneau’s schema reduces a text to four letters representing its parts of speech and punctuation. In order to determine what was distinguishing verse from prose, I needed to take a closer look at the matrices.
Given any text, one can produce a transition matrix that represents the frequencies of Markov chains of bigrams based on Queneau’s schema. Here is the transition matrix for Tartuffe:
from \ to | S | F | B | P |
S | 0.2158505 | 0.2651418 | 0.2738402 | 0.2451675 |
F | 0.0000000 | 0.3850806 | 0.4949597 | 0.1199597 |
B | 0.2442071 | 0.2218416 | 0.2063268 | 0.3276244 |
P | 0.3977865 | 0.3662109 | 0.2063802 | 0.0296224 |
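As a rough illustration of how such a matrix can be computed, the following Python sketch counts adjacent pairs in a letter sequence and normalises each row so that it sums to 1. It is not the code used for the analysis described here: the function name transition_matrix is hypothetical, and the sample is only the opening tokens of the Tartuffe sequence quoted earlier, so its numbers will not match the full-text table.

```python
from collections import Counter

STATES = ["S", "F", "B", "P"]  # same row/column order as the table

def transition_matrix(tokens):
    """Return row-normalised bigram frequencies as a dict of dicts."""
    # Count every adjacent pair (first letter, second letter).
    pair_counts = Counter(zip(tokens, tokens[1:]))
    matrix = {}
    for a in STATES:
        row_total = sum(pair_counts[(a, b)] for b in STATES)
        # Each row holds the share of transitions leaving state a.
        matrix[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in STATES
        }
    return matrix

# Opening tokens of the Tartuffe sequence shown earlier in this post.
sample = ("F P S P S P F F F F B P B F B F F F B S B P S P "
          "B P S P B F F F P").split()
matrix = transition_matrix(sample)

for a in STATES:
    print(a, " ".join(f"{matrix[a][b]:.3f}" for b in STATES))
```

Each printed row is a probability distribution over the next letter, which is why the cells in a row always add up to 1.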
This gives us sixteen possible bigram combinations, although in reality there are only fifteen because FS never occurs (FS = B). We can assign the frequency of each bigram as a distinct measurement of a text and then analyze all the texts in the corpus as 15-dimensional vectors.
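To make the idea of a 15-dimensional vector concrete, here is a small Python sketch that flattens a transition matrix into a list of fifteen frequencies, skipping FS. The names BIGRAMS and to_vector are hypothetical, and the ordering of the dimensions is an arbitrary choice; the values are the Tartuffe frequencies from the table above.

```python
from itertools import product

STATES = ["S", "F", "B", "P"]
# All ordered pairs except FS, which never occurs in the schema (FS = B).
BIGRAMS = [a + b for a, b in product(STATES, repeat=2) if a + b != "FS"]

def to_vector(matrix):
    """Flatten a transition matrix into one frequency per allowed bigram."""
    return [matrix[bigram[0]][bigram[1]] for bigram in BIGRAMS]

# Transition matrix for Tartuffe, copied from the table above.
tartuffe = {
    "S": {"S": 0.2158505, "F": 0.2651418, "B": 0.2738402, "P": 0.2451675},
    "F": {"S": 0.0000000, "F": 0.3850806, "B": 0.4949597, "P": 0.1199597},
    "B": {"S": 0.2442071, "F": 0.2218416, "B": 0.2063268, "P": 0.3276244},
    "P": {"S": 0.3977865, "F": 0.3662109, "B": 0.2063802, "P": 0.0296224},
}

vector = to_vector(tartuffe)
for bigram, value in zip(BIGRAMS, vector):
    print(bigram, value)
```

Every text in the corpus would get one such vector, with the dimensions in the same order, so that the texts can be compared as points in the same space.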
Here is where PCA is very handy. Jonathon Shlens has written a very helpful and accessible explanation of Principal Component Analysis as a method of reducing the complexity of multi-dimensional data spaces in order to more easily visualize underlying structures. There is no way I can visualize data in fifteen dimensions, but I should be able to do it in two or three dimensions as long as I can transform the data to remove redundancies. PCA is appropriate because the data are linear (if you add up the cells in each row of a transition matrix, you always get 1).
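For readers who want to see the mechanics, here is a minimal Python sketch of the core of PCA: centre the data, build the covariance matrix, and find the direction of greatest variance. It is an illustration under stated assumptions, not the analysis reported in this post. The four three-dimensional rows are made-up toy values rather than corpus data, the function names are hypothetical, and power iteration is swapped in as a simple way to find only the first component.

```python
import math

def center(rows):
    """Subtract each column's mean so every variable is centred on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def first_component(cov, iterations=200):
    """Power iteration: approximates the top eigenvector of cov."""
    dims = len(cov)
    # Arbitrary non-zero start; it must not be orthogonal to the answer.
    start = [float(i + 1) for i in range(dims)]
    start_norm = math.sqrt(sum(x * x for x in start))
    v = [x / start_norm for x in start]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance in the data at all
            break
        v = [x / norm for x in w]
    return v

# Hypothetical toy data: four "texts", three bigram frequencies each.
texts = [
    [0.30, 0.20, 0.10],
    [0.28, 0.22, 0.11],
    [0.10, 0.40, 0.12],
    [0.12, 0.38, 0.09],
]

centered = center(texts)
pc1 = first_component(covariance(centered))
# Score of each text = its centred vector projected onto the component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]

print("PC1 loadings:", [round(w, 3) for w in pc1])
print("PC1 scores:  ", [round(s, 3) for s in scores])
```

In this toy example the first two variables move in opposite directions, so they load on the first component with opposite signs, and the first two texts land on the opposite side of zero from the last two. The overall sign of a component is arbitrary.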
As a novice user of the R statistics package, I found help from Emily Mankin’s tutorial, Steve Pittard’s videos and Aaron Schumacher’s explanation of 3D graphs. After running prcomp() on the entire corpus, I determined that there are not two but three significant principal components within my 15D vector space. On the one hand, this was a significant reduction that I could visualize, but on the other it required a triplot (a graph of three principal components) that would not be easy to render on a screen. It is possible, however, to project biplots of each pair of principal components from the triplot. The black dots are prose texts and the red dots are verse. The green lines represent the rotations of the 15 variables. I need at least three images of biplots to represent all the relationships between PC1, PC2 and PC3:
The significant rotations for PC1 are SP, PF, FF, BF and FP negatively correlated with BB, SS, BS, FB, SB and PS; those for PC2 are BF, SF and FF negatively correlated with FP, SS, PB and BP; and for PC3 the significant rotations are PP, FP and SF negatively correlated with FB, FF, BB and PB. I’m still trying to sort this all out but the next image clearly shows how prose and verse texts separate in the triplot:
There is a higher tendency among verse texts toward SS (consecutive signifiers), PS (initial signifiers after punctuation), SB and BS (signifiers and bi-words in either order). Prose texts tend toward higher frequencies of SP (signifiers with no preceding formatives, followed by punctuation), FF (consecutive formatives), PB (initial bi-words after punctuation), PF (punctuation followed by formatives) and BF (bi-words followed by formatives). From these observations we could extrapolate further and say that in the syntactical structure of a text, verse tends to feature signifiers and prose tends to feature formatives.
These results are of course preliminary and I need to examine the PCA results further, but there seems to be a definite measurable difference between verse and prose, at least in French. And what is remarkable is that this difference does not depend on specific word choice, meter or rhyme, even though those are the qualities readers appreciate in verse. I have completed a comparable analysis with the ABU corpus (over 200 works in French spanning many centuries) and the results are similar: verse and prose are measurably different according to a purely syntactical analysis, with no explicit reference to semantics, phonetics or scansion. Monsieur Jourdain would be pleased.