Last month Taper #6 was published, featuring two code poems I wrote on the theme of “a throw of the dice,” in reference to Mallarmé’s poem Un coup de dés. In a version that includes results from my experiments with anaglyphic text, I modified Des coups d’Un coup de dés so that sequences of nouns from Mallarmé’s poem appear in three dimensions (you will need special glasses for the effect).
Category Archives: French
Computation and Rhetorical Invention: Finding Things to Say With word2vec
Is it possible to make a sonnet based on a theme without having to attend simultaneously to form? Can one focus on inventio to generate a poem and leave much of the elocutio to algorithms?
Below is a web prototype for generating sonnets in French, English or Spanish using verses from the Théâtre Classique‘s collection of plays, Allison Parish’s compilation of poetry from Project Gutenberg, or the Corpus of Spanish Golden-Age Sonnets. The code, available here, makes use of natural language processing tools to enable a user to invent (in the rhetorical sense of finding things to say with language) a sonnet with an initial verse and a pair of words as the basis of an analogy using word2vec. For now the initial verse is selected with a quick word search against the corpus, and only the first 50 random verses that match the query are retrieved (if there are 50 or less verses that match the query, all of them are retrieved). The pair of words can be anything in the corpus, and while terms such as femme and homme establish a clear binary opposition with the same part of speech, any two words can be used. The pair serves as the basis of an analogy (one of Aristotle’s topoi for rhetorical invention) for systematically transforming a verse, word by word. The procedure (explained more fully in the code repo) takes a verse, modifies it by analogy, finds another verse in the corpus that most closely matches the transformed verse and adheres to a specific rhyme scheme for sonnets (abba abba ccd eed, with alternating masculine and feminine rhymes in French), and repeats the process until 14 verses are selected.
Go ahead and generate a sonnet, using the defaults if you wish, and see what happens:
You may invent imperfect sonnets with the generator. All the verses in French should be alexandrines, and those in English pentameters, but sometimes the rhymes are off, either because the same rhyming word is used repeatedly or two words that are supposed to rhyme do not. This is a bug I am working on.
You will notice that if you move the pointer over a verse, green italicized text will appear. This text is the result of transforming the verse by the analogy based on the pair of words. In this way you can begin to infer how each verse was selected from the corpus to generate the sonnet. If you supply other words as parameters, you can find verses with different analogies.
Another recently added feature is the ability to edit the last verse selected. The code attempts to verify that the edited verse complies with the defined rules for rhyme and scansion for this particular kind of sonnet, and if the edited verse does not comply, it is rejected and the original verse is restored.
The idea for selecting verses from other poems in order to assemble a new poem is not new. As early as the third or fourth century C.E. authors were recycling verses from Virgil in a form identified by Ausonius as the cento.
This is very much alpha code. It may be possible to produce an interesting sonnet, but what I find interesting in this project is the way one can model a particular approach to inventing poetry to observe how tools usually deployed for computational analysis (word embeddings, tf-idf vectors, phonetic transliteration) can contribute to creative synthesis.
From Elocutio to Inventio with Vector Space Models of Words
In his 1966 essay “Rhétorique et enseignement,” Gérard Genette observes that literary studies did not always emphasize the reading of texts. Before the end of the nineteenth century, the study of literature revolved around the art of writing. Texts were not objects to interpret but models to imitate. The study of literature emphasized elocutio, or style and the arrangement of words. With the rise of literary history, academic reading approached texts as objects to be explained. Students learned to read in order to write essays (dissertations) where they analyzed texts according to prescribed methods. This new way of studying literature stressed dispositio, or the organization of ideas.
Recent developments in information technology have challenged these paradigms for reading literature. Digital tools and resources allow for the study of large collections of texts using quantitative methods. Various computational methods of distant as well as close reading facilitate investigations into fundamental questions of the possibilities for literary creation. Technology has the potential for exploring inventio, or the finding of ideas that can be expressed through writing.
The Word Vector Text Modulator is an attempt to test if technology can foster inventio as a mode of reading. It is a Python script that makes use of vector space models of vocabularies mapped from a corpus of over 1,300 nineteenth-century documents in order to transform a text semantically according to how language was used within the corpus. An experiment such as this explores the potentiality of language as members of the Oulipo have done with techniques such as Jean Lescure’s S+7 method, Marcel Bénabou’s aphorism formulas and the ALAMO’s rimbaudelaire poems. With technology we can investigate not only how something was written and why it was written, but also what was possible to write given an historical linguistic context.
Oulipian Code
In the Atlas de littérature potentielle (1981, rev. 1988) the Oulipo mentions a number of experiments with computers as tools for exploring algorithmic constraints on writing. One example is the complete text of a computer program written by Paul Braffort that generates aphorisms (311-315). Today such programs are textbook exercises for learning computer languages, but Braffort wrote the program for a mainframe in the 1970s using the language APL (A Programming Language). Developed by Kenneth Iverson at IBM in the 1960s, APL is one of the earliest computer languages (after Fortran and Algol) designed to manipulate data as matrices. Although it is still in use by some programmers working in financial analysis, APL today is a fairly obscure language for which there are few compilers and interpreters.
In the 1981 edition of the Atlas Braffort extols the virtues of APL not only as a system of notations for formalizing literary structures but also as code that executes complex algorithms (113). Although he claims his computer program provides “a thoroughly complete analysis of the procedures used” to generate aphorisms, it needs to be executed in order to test the analysis and observe how the algorithms work. To this end I have transcribed the code published in the Atlas so that an APL interpreter can compile and execute it. The code is comprised of specific functions and pre-loaded variables. To run the code, you need to )LOAD this file into an APL interpreter such as APLX (there are other interpreters out there but APLX is the only one I have successfully installed in OSX and Ubuntu). At the prompt enter your name and the code will deliver an aphorism for each character you type (including the space between your first and last names).
If you manage to get the code to run, you may wish to understand how it works. For that I recommend APLX’s online tutorial.
Visualizing Text Spaces
For the DH2014 conference in Lausanne, Switzerland I prepared an interactive visualization of the small corpus of seventeenth-century French plays I had analyzed using Raymond Queneau’s matrix analysis of language. The visualization shows that Queneau’s matrix analysis can distinguish verse from prose fairly well by syntax alone, without any direct measurement of meter, rhyme or word choice.
There were some limitations with that initial visualization, however. First, it assumes that a text is either verse or prose when many texts are mixed. It should be possible to account for a spectrum of texts varying from exclusively prose to exclusively text, with most somewhere in-between. Second, with 72 texts it represents a relatively small sample of data. A much larger set (631 texts) is available. Third, while text type (verse or prose) is the dominant signal in the corpus, it should be possible to observe if other parameters such as author or date determine a text’s relationship to other texts in the corpus.
I have created a visualization that attempts to overcome these limitations. Using WebGL and Shiny, the visualization offers an interactive three-dimensional representation that combines three biplots for three principal components. Please see my earlier post where I explain how matrix analysis works and how I represent it with “triplots”.
Instead of using a binary color scheme to indicate prose or verse, I have calculated a percentage of verse in each text, with blue representing verse and red representing prose. Mixed texts appear as various shades of violet. As one would expect, the texts represented by violet spheres inhabit a zone between red (prose) and blue (verse) spheres. This suggests that although prose and verse texts do separate largely according to their syntax, there is a continuum from one to the other.
Although text type (prose/verse) remains the strongest signal in matrix analysis, the texts by some authors do tend to cluster in the visualization. One can search for texts by Pierre Corneille and observe a predominance of PF and SF. Texts by Molière exhibit a relative paucity of FF and BF. The five plays by Jean-Jacques Rousseau, however, are dispersed in the visualization and suggest a varied syntax in his play-writing.
Apart from the fact that most plays written before 1690 are in verse, it is difficult to see correlations between dates and text types. Further exploration may reveal correlations, however.
By experimenting with visualizing these data, I have found that three-dimensional images with meaningful chromatics allow for effective interaction with a fairly complex set of textual data. Corpora appear as spaces where distances between textual objects depend on how one defines relationships between the objects. One could imagine other bases for constructing spaces of texts, such as semantics, phonology, geography (not necessarily determined by a preexisting map), and thematics.
Matrix Analysis and Monsieur Jourdain
Lately I have taken an interest in stylometry. After attending some very interesting panels on stylometry at DH2013, I wondered if I could further develop my experiments with Raymond Queneau’s matrix analysis. I had already applied a method using Markov chains to reduce texts to a simplified representation of their syntactic structure according to the schema proposed by Queneau. This method works fairly well for authorship attribution. I have been playing with stylo and learning about cluster analysis, principle component analysis, and other statistical techniques to measure stylistic differences among texts in a corpus. I wondered if, after transposing texts to sequences of the letters F, S, B and P, I could still discern patterns specific to particular authors using standard stylometric techniques.
Christof Schöch has produced some interesting analyses of a corpus of seventeenth-century French plays, and because he has generously made his corpus available online, I decided to see what I could do with it. First I transformed the texts into sequences of letters using Queneau’s schema along with P for punctuation. Here’s what the first few lines of Molière’s Tartuffe look like:
F P S P S P F F F F B P B F B F F F B S B P S P B P S P B F F F P B B F F F F B P F F F F F B F F F B P S B P B S F B F F P B F F F B F B F P S F F B F B S S P F P B F F F F B P F F B F B S P S F B F P F B F P S B F F B B S P F P
Even though there are only four letters used in this reduction of a text, those letters can still be read. I performed the following cluster analysis of Schöch’s corpus (based on 5-grams of words, where each word is one and only one of the letters F, S, B and P):
At first glance the texts clustered somewhat according to author, but upon closer examination I noticed that the corpus clustered perfectly into groups of verse texts (marked with ‘-V-‘) and prose texts (‘-P-‘). I did not expect this. Traditional verse is determined by meter and rhyme, but Queneau’s schema reduces a text to four letters representing its parts of speech and punctuation. In order to determine what was distinguishing verse from prose, I needed to take a closer look at the matrices.
Given any text, one can produce a transition matrix that represents the frequencies of Markov chains of bigrams based on Queneau’s schema. Here is the transition matrix for Tartuffe:
S | F | B | P | |
S | 0.2158505 | 0.2651418 | 0.2738402 | 0.2451675 |
F | 0.0000000 | 0.3850806 | 0.4949597 | 0.1199597 |
B | 0.2442071 | 0.2218416 | 0.2063268 | 0.3276244 |
P | 0.3977865 | 0.3662109 | 0.2063802 | 0.0296224 |
This gives us sixteen possible bigram combinations, although in reality there are only fifteen because FS never occurs (FS = B). We can assign the frequency of each bigram as a distinct measurement of a text and then analyze all the texts in the corpus as 15-dimensional vectors.
Here is where PCA is very handy. Jonathon Shlens has written a very helpful and accessible explanation of Principle Component Analysis as a method of reducing the complexity of multi-dimensional data spaces in order to more easily visualize underlying structures. There is no way I can visualize data in fifteen dimensions, but I should be able to do it in two or three dimensions as long as I can transform the data to remove redundancies. PCA is appropriate because the data are linear (if you add up the cells in each row of a transition matrix, you always get 1).
As a novice user of the R statistics package, I found help from Emily Mankin’s tutorial, Steve Pittard’s videos and Aaron Schumacher’s explanation of 3D graphs. After running prcomp() on the entire corpus, I determined that there are not two but three significant principle components within my 15D vector space. On the one hand, this was a significant reduction that I could visualize, but on the other it required a triplot (a graph of three principle components) that would not be easy to render on a screen. It is possible, however, to project biplots of each pair of principle components from the triplot. The black dots are prose texts and the red dots are verse. The green lines represent the rotations of the 15 variables. I need at least three images of biplots to represent the all the relationships between PC1, PC2 and PC3:
The significant rotations for PC1 are SP, PF, FF, BF and FP negatively correlated with BB, SS, BS, FB, SB and PS; those for PC2 are BF, SF and FF negatively correlated with FP, SS, PB and BP; and for PC3 the significant rotations are PP, FP and SF negatively correlated with FB, FF, BB and PB. I’m still trying to sort this all out but the next image clearly shows how prose and verse texts separate in the triplot:
There is a higher tendancy among verse texts toward SS (consecutive signifiers), PS (initial signifiers after punctuation), SB and BS (signifiers and a bi-words in either order). Prose texts tend toward higher lower frequencies of SP (signifiers with no preceding formatives, followed by punctuation), FF (consecutive formatives), PB (initial bi-words after punctuation), PF (punctuation followed by formatives) and BF (bi-words followed by formatives). From these observations we could extrapolate further and say that in the syntactical structure of a text, verse tends to feature signifiers and prose tends to feature avoid formatives.
These results are of course preliminary and I need to examine the PCA analysis further, but there seems to be a definite measurable difference between verse and prose, at least in French. And what is remarkable is that this difference does not depend on specific word choice, meter or rhyme, even though those are the qualities readers appreciate in verse. I have completed a comparable analysis with the ABU corpus (over 200 works in French spanning many centuries) and the results are similar: verse and prose are measurably different according to a purely syntactical analysis, with no explicit reference to semantics, phonetics or scansion. Monsieur Jourdain would be pleased.
Immerse yourself in France
In January 2014 I will offer a language immersion program in Tours, France. Students will use the French they learn as they step outside the classroom and interact with their host families, other international students, and local merchants in the royal city of Tours. Students will also travel to Paris and be able to explore all that the City of Light has to offer. The program will fulfill the Hartwick College language requirement: no additional course is required. The program is open to all students, including those who have not studied French previously. For more information, visit the College’s website.