Oulipian Code

Aphorismes de Mark Wolff

These aphorisms were generated with code developed by the Oulipo.

In the Atlas de littérature potentielle (1981, rev. 1988) the Oulipo mentions a number of experiments with computers as tools for exploring algorithmic constraints on writing. One example is the complete text of a computer program written by Paul Braffort that generates aphorisms (311-315). Today such programs are textbook exercises for learning computer languages, but Braffort wrote the program for a mainframe in the 1970s using the language APL (A Programming Language). Developed by Kenneth Iverson at IBM in the 1960s, APL is one of the earliest computer languages (after Fortran and Algol) designed to manipulate data as matrices. Although it is still in use by some programmers working in financial analysis, APL today is a fairly obscure language for which there are few compilers and interpreters.

In the 1981 edition of the Atlas, Braffort extols the virtues of APL not only as a system of notations for formalizing literary structures but also as code that executes complex algorithms (113). Although he claims his computer program provides “a thoroughly complete analysis of the procedures used” to generate aphorisms, it needs to be executed in order to test the analysis and observe how the algorithms work. To this end I have transcribed the code published in the Atlas so that an APL interpreter can execute it. The code consists of specific functions and pre-loaded variables. To run the code, you need to )LOAD this file into an APL interpreter such as APLX (there are other interpreters out there, but APLX is the only one I have successfully installed on OS X and Ubuntu). At the prompt, enter your name and the code will deliver an aphorism for each character you type (including the space between your first and last names).

If you manage to get the code to run, you may wish to understand how it works. For that I recommend APLX’s online tutorial.

Visualizing Text Spaces


For the DH2014 conference in Lausanne, Switzerland I prepared an interactive visualization of the small corpus of seventeenth-century French plays I had analyzed using Raymond Queneau’s matrix analysis of language. The visualization shows that Queneau’s matrix analysis can distinguish verse from prose fairly well by syntax alone, without any direct measurement of meter, rhyme or word choice.

There were some limitations with that initial visualization, however. First, it assumes that a text is either verse or prose when many texts are mixed. It should be possible to account for a spectrum of texts varying from exclusively prose to exclusively verse, with most somewhere in between. Second, with 72 texts it represents a relatively small sample of data. A much larger set (631 texts) is available. Third, while text type (verse or prose) is the dominant signal in the corpus, it should be possible to observe whether other parameters such as author or date determine a text’s relationship to other texts in the corpus.

I have created a visualization that attempts to overcome these limitations. Using WebGL and Shiny, the visualization offers an interactive three-dimensional representation that combines three biplots for three principal components. Please see my earlier post where I explain how matrix analysis works and how I represent it with “triplots”.

Instead of using a binary color scheme to indicate prose or verse, I have calculated a percentage of verse in each text, with blue representing verse and red representing prose. Mixed texts appear as various shades of violet. As one would expect, the texts represented by violet spheres inhabit a zone between red (prose) and blue (verse) spheres. This suggests that although prose and verse texts do separate largely according to their syntax, there is a continuum from one to the other.
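The color scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `verse_color` and the linear blend between red and blue are assumptions, not the code behind the visualization.

```python
def verse_color(verse_fraction: float) -> str:
    """Map a verse fraction in [0, 1] to a hex color between red and blue."""
    if not 0.0 <= verse_fraction <= 1.0:
        raise ValueError("verse_fraction must be between 0 and 1")
    red = round(255 * (1.0 - verse_fraction))  # full red for pure prose
    blue = round(255 * verse_fraction)         # full blue for pure verse
    return f"#{red:02x}00{blue:02x}"


# Pure prose, an evenly mixed text, and pure verse.
for share in (0.0, 0.5, 1.0):
    print(share, verse_color(share))
```

With a linear blend of this kind, an evenly mixed text lands on a violet halfway between the two poles, which matches the intermediate zone the spheres occupy.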

Although text type (prose/verse) remains the strongest signal in matrix analysis, the texts by some authors do tend to cluster in the visualization. One can search for texts by Pierre Corneille and observe a predominance of PF and SF. Texts by Molière exhibit a relative paucity of FF and BF. The five plays by Jean-Jacques Rousseau, however, are dispersed in the visualization and suggest a varied syntax in his play-writing.

Apart from the fact that most plays written before 1690 are in verse, it is difficult to see correlations between dates and text types. Further exploration may reveal correlations, however.

By experimenting with visualizing these data, I have found that three-dimensional images with meaningful chromatics allow for effective interaction with a fairly complex set of textual data. Corpora appear as spaces where distances between textual objects depend on how one defines relationships between the objects. One could imagine other bases for constructing spaces of texts, such as semantics, phonology, geography (not necessarily determined by a preexisting map), and thematics.

Matrix Analysis and Monsieur Jourdain

« Par ma foi ! il y a plus de quarante ans que je dis de la prose sans que j’en susse rien, et je vous suis le plus obligé du monde de m’avoir appris cela. »

(“By my faith! For more than forty years I have been speaking prose without knowing anything about it, and I am most obliged to you for having taught me that.”)

Lately I have taken an interest in stylometry.  After attending some very interesting panels on stylometry at DH2013, I wondered if I could further develop my experiments with Raymond Queneau’s matrix analysis.  I had already applied a method using Markov chains to reduce texts to a simplified representation of their syntactic structure according to the schema proposed by Queneau.  This method works fairly well for authorship attribution.  I have been playing with stylo and learning about cluster analysis, principal component analysis, and other statistical techniques to measure stylistic differences among texts in a corpus.  I wondered if, after transposing texts to sequences of the letters F, S, B and P, I could still discern patterns specific to particular authors using standard stylometric techniques.

Christof Schöch has produced some interesting analyses of a corpus of seventeenth-century French plays, and because he has generously made his corpus available online, I decided to see what I could do with it.  First I transformed the texts into sequences of letters using Queneau’s schema along with P for punctuation. Here’s what the first few lines of Molière’s Tartuffe look like:


Even though there are only four letters used in this reduction of a text, those letters can still be read.  I performed the following cluster analysis of Schöch’s corpus (based on 5-grams of words, where each word is one and only one of the letters F, S, B and P):

Cluster Analysis of 17C Theatre Corpus
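
The 5-gram features used for the cluster analysis can be illustrated with a minimal Python sketch. It assumes a text has already been reduced to a string of the letters F, S, B and P. The function name `letter_ngrams` and the sample string are hypothetical, and this is not how stylo computes its features internally.

```python
from collections import Counter


def letter_ngrams(sequence: str, n: int = 5) -> Counter:
    """Count overlapping n-grams in a string of F, S, B and P letters."""
    # Ignore anything that is not one of the four letters (spaces, newlines).
    letters = [ch for ch in sequence if ch in "FSBP"]
    return Counter(
        "".join(letters[i:i + n]) for i in range(len(letters) - n + 1)
    )


# Hypothetical reduced text; real input would come from a whole play.
sample = "SFBPSSFFBP"
for gram, count in letter_ngrams(sample).most_common():
    print(gram, count)
```

Each distinct 5-gram becomes one feature of the text, and its count (or relative frequency) is the value that a clustering method compares across the corpus.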

At first glance the texts clustered somewhat according to author, but upon closer examination I noticed that the corpus clustered perfectly into groups of verse texts (marked with ‘-V-’) and prose texts (‘-P-’).  I did not expect this.  Traditional verse is determined by meter and rhyme, but Queneau’s schema reduces a text to four letters representing its parts of speech and punctuation.  In order to determine what was distinguishing verse from prose, I needed to take a closer look at the matrices.

Given any text, one can produce a transition matrix that represents the frequencies of Markov chains of bigrams based on Queneau’s schema.  Here is the transition matrix for Tartuffe:

          S         F         B         P
S 0.2158505 0.2651418 0.2738402 0.2451675
F 0.0000000 0.3850806 0.4949597 0.1199597
B 0.2442071 0.2218416 0.2063268 0.3276244
P 0.3977865 0.3662109 0.2063802 0.0296224

This gives us sixteen possible bigram combinations, although in reality there are only fifteen because FS never occurs (FS = B).  We can assign the frequency of each bigram as a distinct measurement of a text and then analyze all the texts in the corpus as 15-dimensional vectors.
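
As a rough illustration of the step just described, here is a minimal Python sketch that builds a row-normalized transition matrix from a sequence of letters and flattens it into a 15-value vector by skipping the FS cell. The names `transition_matrix` and `as_vector` and the sample sequence are hypothetical. The actual analysis was done in R.

```python
from collections import Counter

STATES = "SFBP"


def transition_matrix(sequence: str) -> dict:
    """Return bigram frequencies as {first: {second: share of the row}}."""
    pairs = Counter(zip(sequence, sequence[1:]))
    matrix = {}
    for first in STATES:
        row_total = sum(pairs[(first, second)] for second in STATES)
        matrix[first] = {
            # Each row is divided by its own total, so it sums to 1
            # (or stays all zeros if the letter never starts a bigram).
            second: (pairs[(first, second)] / row_total if row_total else 0.0)
            for second in STATES
        }
    return matrix


def as_vector(matrix: dict) -> list:
    """Flatten the matrix into 15 values, leaving out the FS cell."""
    return [
        matrix[a][b] for a in STATES for b in STATES if (a, b) != ("F", "S")
    ]


sample = "SFBPSSFFBP"  # hypothetical reduced text
matrix = transition_matrix(sample)
for first in STATES:
    print(first, [round(matrix[first][second], 3) for second in STATES])
print(len(as_vector(matrix)), "measurements per text")
```

Because each row is normalized separately, the values are conditional frequencies: the S row answers the question "given a signifier, what comes next?"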

Here is where PCA is very handy.  Jonathon Shlens has written a very helpful and accessible explanation of Principal Component Analysis as a method of reducing the complexity of multi-dimensional data spaces in order to more easily visualize underlying structures.  There is no way I can visualize data in fifteen dimensions, but I should be able to do it in two or three dimensions as long as I can transform the data to remove redundancies.  PCA is appropriate because the data are linear (if you add up the cells in each row of a transition matrix, you always get 1).
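
To make the idea concrete, here is a small pure-Python sketch of PCA. It finds the leading components by power iteration on the covariance matrix and then removes each one in turn (deflation). This is a teaching sketch under stated assumptions, not what R's prcomp() does (prcomp() uses a singular value decomposition). All function names and the sample data are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract each column mean so every variable is centered on zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (divisor n - 1) of centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def top_component(cov, iterations=500, seed=0):
    """Power iteration: return (variance, unit vector) of the dominant axis."""
    d = len(cov)
    rng = random.Random(seed)
    # Random start: an all-equal start vector would be orthogonal to every
    # component when the variables of each text sum to a constant.
    vec = [rng.random() - 0.5 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    variance = sum(vec[i] * cov[i][j] * vec[j]
                   for i in range(d) for j in range(d))
    return variance, vec


def pca(rows, k):
    """Return the k leading (variance, direction) pairs of the data."""
    cov = covariance(center(rows))
    d = len(cov)
    components = []
    for _ in range(k):
        variance, vec = top_component(cov)
        components.append((variance, vec))
        # Deflation: remove the component just found before the next search.
        cov = [[cov[i][j] - variance * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return components


# Hypothetical data: three variables, the second is twice the first.
data = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
for variance, direction in pca(data, 2):
    print(round(variance, 4), [round(x, 4) for x in direction])
```

In the sample data the second variable is fully determined by the first, so a single component carries all of the variance. The row-sum constraint on a transition matrix creates redundancy of the same kind among the fifteen bigram frequencies.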

As a novice user of the R statistics package, I found help from Emily Mankin’s tutorial, Steve Pittard’s videos and Aaron Schumacher’s explanation of 3D graphs.  After running prcomp() on the entire corpus, I determined that there are not two but three significant principal components within my 15D vector space.  On the one hand, this was a significant reduction that I could visualize, but on the other it required a triplot (a graph of three principal components) that would not be easy to render on a screen.  It is possible, however, to project biplots of each pair of principal components from the triplot.  The black dots are prose texts and the red dots are verse.  The green lines represent the rotations of the 15 variables.  I need at least three images of biplots to represent all the relationships between PC1, PC2 and PC3:

Projection of PC1 and PC2 from a PCA triplot

Projection of PC1 and PC3 from a PCA triplot

Projection of PC2 and PC3 from a PCA triplot

The significant rotations for PC1 are SP, PF, FF, BF and FP negatively correlated with BB, SS, BS, FB, SB and PS;  those for PC2 are BF, SF and FF negatively correlated with FP, SS, PB and BP; and for PC3 the significant rotations are PP, FP and SF negatively correlated with FB, FF, BB and PB.  I’m still trying to sort this all out but the next image clearly shows how prose and verse texts separate in the triplot:


Angled projection of PCA triplot

There is a higher tendency among verse texts toward SS (consecutive signifiers), PS (initial signifiers after punctuation), SB and BS (signifiers and bi-words in either order).  Prose texts tend toward lower frequencies of SP (signifiers with no preceding formatives, followed by punctuation), FF (consecutive formatives), PB (initial bi-words after punctuation), PF (punctuation followed by formatives) and BF (bi-words followed by formatives).  From these observations we could extrapolate further and say that in the syntactical structure of a text, verse tends to feature signifiers and prose tends to avoid formatives.

These results are of course preliminary and I need to examine the PCA analysis further, but there seems to be a definite measurable difference between verse and prose, at least in French.  And what is remarkable is that this difference does not depend on specific word choice, meter or rhyme, even though those are the qualities readers appreciate in verse.  I have completed a comparable analysis with the ABU corpus (over 200 works in French spanning many centuries) and the results are similar:  verse and prose are measurably different according to a purely syntactical analysis, with no explicit reference to semantics, phonetics or scansion.  Monsieur Jourdain would be pleased.

Reading with machines

This fall I will teach a First-Year Seminar on computer-assisted methods of text analysis.  Students will experiment with various digital tools to discover patterns in texts and use the results to inform their interpretations.

Students will first read the novel Candide by Voltaire in print or in eBook format.  They will then write and use computer programs to perform various analyses (word frequencies, distributions, co-occurrences, etc.) to determine whether and how computers can give them additional insights for understanding the novel.  They will finally build collections of documents to see how computers can help them discover patterns on a larger scale.

Once students become familiar with various computational techniques, they will apply them to a digital archive of Hartwick student newspapers.  They will build a website allowing users to browse and search the newspapers, and they will run computational analyses to determine recurring topics and trends among Hartwick students over many decades. The results of this research will be of interest to other students, faculty, staff, and alumni.

By experimenting with computers to read texts, students will learn the challenges and opportunities of project-oriented research in the humanities.  Much of the work in the Digital Humanities involves effective collaboration of people using machines.  Students will develop skills in working as part of a team as well as applying new technologies to humanities research.

No prior experience with programming is required.  Students should have a Math Placement Test score of L2 or higher, and they should feel comfortable writing simple computer programs by following examples.

More information about the course is available here.

Immerse yourself in France

In January 2014 I will offer a language immersion program in Tours, France.  Students will use the French they learn as they step outside the classroom and interact with their host families, other international students, and local merchants in the royal city of Tours.  Students will also travel to Paris and be able to explore all that the City of Light has to offer.  The program will fulfill the Hartwick College language requirement:  no additional course is required.  The program is open to all students, including those who have not studied French previously.  For more information, visit the College’s website.