Lately I have taken an interest in stylometry. After attending some very interesting panels on stylometry at DH2013, I wondered if I could further develop my experiments with Raymond Queneau’s matrix analysis. I had already applied a method using Markov chains to reduce texts to a simplified representation of their syntactic structure according to the schema proposed by Queneau. This method works fairly well for authorship attribution. I have been playing with stylo and learning about cluster analysis, principal component analysis, and other statistical techniques to measure stylistic differences among texts in a corpus. I wondered if, after transposing texts to sequences of the letters F, S, B and P, I could still discern patterns specific to particular authors using standard stylometric techniques.
Christof Schöch has produced some interesting analyses of a corpus of seventeenth-century French plays, and because he has generously made his corpus available online, I decided to see what I could do with it. First I transformed the texts into sequences of letters using Queneau’s schema along with P for punctuation. Here’s what the first few lines of Molière’s Tartuffe look like:
F P S P S P F F F F B P B F B F F F B S B P S P B P S P B F F F P B B F F F F B P F F F F F B F F F B P S B P B S F B F F P B F F F B F B F P S F F B F B S S P F P B F F F F B P F F B F B S P S F B F P F B F P S B F F B B S P F P
Even though there are only four letters used in this reduction of a text, those letters can still be read. I performed the following cluster analysis of Schöch’s corpus (based on 5-grams of words, where each word is one and only one of the letters F, S, B and P):
At first glance the texts clustered somewhat according to author, but upon closer examination I noticed that the corpus clustered perfectly into groups of verse texts (marked with ‘-V-’) and prose texts (‘-P-’). I did not expect this. Traditional verse is determined by meter and rhyme, but Queneau’s schema reduces a text to four letters representing its parts of speech and punctuation. In order to determine what was distinguishing verse from prose, I needed to take a closer look at the matrices.
Given any text, one can produce a transition matrix that represents the frequencies of Markov chains of bigrams based on Queneau’s schema. Here is the transition matrix for Tartuffe:
|   | S | F | B | P |
| S | 0.2158505 | 0.2651418 | 0.2738402 | 0.2451675 |
| F | 0.0000000 | 0.3850806 | 0.4949597 | 0.1199597 |
| B | 0.2442071 | 0.2218416 | 0.2063268 | 0.3276244 |
| P | 0.3977865 | 0.3662109 | 0.2063802 | 0.0296224 |
This gives us sixteen possible bigram combinations, although in reality there are only fifteen because FS never occurs (FS = B). We can assign the frequency of each bigram as a distinct measurement of a text and then analyze all the texts in the corpus as 15-dimensional vectors.
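To make that step concrete, here is a minimal Python sketch of how such a matrix and vector could be computed. It is not the code used for this post. The function names are hypothetical, and it assumes that the fifteen measurements are the cells of the row-normalised matrix (each row summing to 1, as in the Tartuffe table) with the FS cell left out. The sample is the opening of the Tartuffe sequence quoted above.

```python
from collections import Counter

LETTERS = ["S", "F", "B", "P"]  # same order as the Tartuffe table

def transition_matrix(sequence):
    """Row-normalised bigram counts: matrix[a][b] is the share of a's followed by b."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    matrix = {}
    for a in LETTERS:
        row_total = sum(pair_counts[(a, b)] for b in LETTERS)
        matrix[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in LETTERS
        }
    return matrix

def feature_vector(matrix):
    """Flatten the matrix to 15 values, skipping FS, which never occurs (FS = B)."""
    return [
        matrix[a][b]
        for a in LETTERS
        for b in LETTERS
        if (a, b) != ("F", "S")
    ]

# Opening of the Tartuffe sequence shown earlier in the post.
sample = "F P S P S P F F F F B P B F B F F F B S B P S P B P S P B F F F P".split()
m = transition_matrix(sample)
vec = feature_vector(m)
print(len(vec))  # 15
```

Each text in the corpus would then contribute one such vector, and the whole corpus becomes a table with one row per text and fifteen columns.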
Here is where PCA is very handy. Jonathon Shlens has written a very helpful and accessible explanation of Principal Component Analysis as a method of reducing the complexity of multi-dimensional data spaces in order to more easily visualize underlying structures. There is no way I can visualize data in fifteen dimensions, but I should be able to do it in two or three dimensions as long as I can transform the data to remove redundancies. PCA is appropriate because the variables are linearly related (if you add up the cells in each row of a transition matrix, you always get 1), so some of the fifteen dimensions are redundant.
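For readers who want to see the idea in miniature, here is a hedged Python sketch of PCA using only the standard library. It is an illustration and not the procedure used for this post: it finds components by power iteration on the covariance matrix, a simple technique swapped in for readability, and all names and the toy data are hypothetical. The toy rows each sum to 1, like a row of a transition matrix, so one direction carries all of the variation.

```python
import math
import random

def center(rows):
    """Subtract each column's mean so variation is measured around the average row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def top_component(cov, iterations=500):
    """Power iteration: (variance, unit direction) of the dominant component."""
    d = len(cov)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return variance, v

def pca(rows, k):
    """Return the first k (variance, direction) pairs and the total variance."""
    cov = covariance(center(rows))
    total = sum(cov[i][i] for i in range(len(cov)))
    components = []
    for _ in range(k):
        variance, v = top_component(cov)
        components.append((variance, v))
        # Deflation: remove the component just found before looking for the next.
        d = len(v)
        cov = [[cov[i][j] - variance * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components, total

# Toy data (hypothetical): every row sums to 1, and only one direction varies.
toy = [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4], [0.4, 0.3, 0.3], [0.5, 0.3, 0.2]]
components, total = pca(toy, k=2)
for variance, direction in components:
    print(round(variance / total, 3), [round(x, 3) for x in direction])
```

In this toy case the first component accounts for essentially all of the variance, which is the kind of redundancy that lets a high-dimensional table be viewed in two or three dimensions.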
As a novice user of the R statistics package, I found help from Emily Mankin’s tutorial, Steve Pittard’s videos and Aaron Schumacher’s explanation of 3D graphs. After running prcomp() on the entire corpus, I determined that there are not two but three significant principal components within my 15D vector space. On the one hand, this was a significant reduction that I could visualize, but on the other it required a triplot (a graph of three principal components) that would not be easy to render on a screen. It is possible, however, to project biplots of each pair of principal components from the triplot. The black dots are prose texts and the red dots are verse. The green lines represent the rotations of the 15 variables. I need at least three images of biplots to represent all the relationships between PC1, PC2 and PC3:
The significant rotations for PC1 are SP, PF, FF, BF and FP negatively correlated with BB, SS, BS, FB, SB and PS; those for PC2 are BF, SF and FF negatively correlated with FP, SS, PB and BP; and for PC3 the significant rotations are PP, FP and SF negatively correlated with FB, FF, BB and PB. I’m still trying to sort this all out but the next image clearly shows how prose and verse texts separate in the triplot:
There is a higher tendency among verse texts toward SS (consecutive signifiers), PS (initial signifiers after punctuation), SB and BS (signifiers and bi-words in either order). Prose texts tend toward lower frequencies of SP (signifiers with no preceding formatives, followed by punctuation), FF (consecutive formatives), PB (initial bi-words after punctuation), PF (punctuation followed by formatives) and BF (bi-words followed by formatives). From these observations we could extrapolate further and say that in the syntactical structure of a text, verse tends to feature signifiers and prose tends to avoid formatives.
These results are of course preliminary and I need to examine the PCA results further, but there seems to be a definite measurable difference between verse and prose, at least in French. And what is remarkable is that this difference does not depend on specific word choice, meter or rhyme, even though those are the qualities readers appreciate in verse. I have completed a comparable analysis with the ABU corpus (over 200 works in French spanning many centuries) and the results are similar: verse and prose are measurably different according to a purely syntactical analysis, with no explicit reference to semantics, phonetics or scansion. Monsieur Jourdain would be pleased.
After thinking about this some more I made some quick edits. There is a negative correlation for the vectors pointing in the direction of the prose data points, which would mean that prose tends to show a diminished presence of SP, SF, FF, BF, FP and PP compared to what one would find in verse.
This is so cool! Thanks for linking back – really neat to see this application!
Dear Mark, thanks so much for posting this, very intriguing! Just some thoughts:
First, if it weren’t for Queneau’s authority, I would ask why you would want to reduce the information in your data with such drastic measures in the first place. Just reducing a text to its parts of speech is already quite drastic. More to the point, your previous post mentions F, S and B as possible results of the transformation, but here you seem to have F, S, B and P. What’s the P?
Second, you show that contrary to what Queneau suspected, the author signal does suffer considerably from the information reduction, while the form signal (verse vs. prose) does not. Of course, Queneau’s reasoning was still sound in theory, saying that those aspects of style not consciously controlled may be the best indicators of authorship. But the Ps and Bs here are in fact much richer than function words, it seems.
My take on the verse / prose issue would be to say that this is such a deep distinction that it has consequences on many or all levels of language, not just rhyme and meter, but also word choice and syntactical choices; indeed, these are often related to meter. I’m thinking of the fact that in French plays, you will find quite a few inversions of the “normal” sequence of words for the sake of meter. For instance, putting the reflexive pronoun further to the front instead of right in front of the verb: “Et retire son bras pour me mieux accabler” where “me mieux accabler” was syntactically correct at the time but already unusual in prose, and has the advantage of avoiding the elision of “me” in front of “accabler” which would make the verse one syllable short. (To be honest, checking this in Racine’s “La Thébaide” gave me fewer examples of this than I expected after reading it, but the principle probably holds; this would need to be checked more closely.)
Anyway, great piece and I’m definitely hoping to see more such pieces.
Thanks Christof for your thoughts on this analysis. Queneau introduced the idea of matrix analysis at the end of his book Bâtons, chiffres et lettres and he anticipated the potential of computation for literary analysis. Like others, I consider Queneau in particular and the Oulipo in general as precursors of digital humanists and I think we need to take a closer look at how the Oulipo imagined the potentiality of computation.
To answer your question, I have added P to Queneau’s schema in order to account for punctuation. It would otherwise be impossible to represent text where punctuation separates a formative and a signifier without violating the rule FS = B. For example, « Alors, venez me voir » is rendered as FPSB.
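A small Python sketch may make the role of P clearer. It is only an illustration, not the pipeline used for the post: the function name is hypothetical, and the word-by-word tagging of the example (Alors = F, venez = S, me = F, voir = S) is an assumption inferred from the FPSB result. The only rule applied is FS = B, scanning left to right.

```python
def collapse(tags):
    """Merge each F that is immediately followed by an S into a single B."""
    out = []
    i = 0
    while i < len(tags):
        if tags[i] == "F" and i + 1 < len(tags) and tags[i + 1] == "S":
            out.append("B")  # formative + signifier = bi-word
            i += 2
        else:
            out.append(tags[i])  # lone F, lone S, existing B, or punctuation P
            i += 1
    return out

# « Alors, venez me voir » with the comma kept as P (tagging assumed, see above)
with_punctuation = collapse(["F", "P", "S", "F", "S"])
# The same words with the comma dropped
without_punctuation = collapse(["F", "S", "F", "S"])

print("".join(with_punctuation))     # FPSB
print("".join(without_punctuation))  # BB
```

Without the P, « Alors » and « venez » would be merged into a bi-word even though a comma separates them, which is the case the extra letter is meant to prevent.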
In preparation for my poster at DH2014, I have created a ShinyApp for the 3D triplot. You can manipulate the image to see different angles. My data pertain to a corpus of nineteenth-century texts and I have munged the data better using a more recent version of TreeTagger.