Monthly Archives: August 2014

Visualizing Text Spaces

3D_matrix_analysis_Image

For the DH2014 conference in Lausanne, Switzerland I prepared an interactive visualization of the small corpus of seventeenth-century French plays I had analyzed using Raymond Queneau’s matrix analysis of language. The visualization shows that Queneau’s matrix analysis can distinguish verse from prose fairly well by syntax alone, without any direct measurement of meter, rhyme or word choice.

There were some limitations with that initial visualization, however. First, it assumes that a text is either verse or prose when many texts are mixed. It should be possible to account for a spectrum of texts varying from exclusively prose to exclusively text, with most somewhere in-between. Second, with 72 texts it represents a relatively small sample of data. A much larger set (631 texts) is available. Third, while text type (verse or prose) is the dominant signal in the corpus, it should be possible to observe if other parameters such as author or date determine a text’s relationship to other texts in the corpus.

I have created a visualization that attempts to overcome these limitations. Using WebGL and Shiny, the visualization offers an interactive three-dimensional representation that combines three biplots for three principal components. Please see my earlier post where I explain how matrix analysis works and how I represent it with “triplots”.

Instead of using a binary color scheme to indicate prose or verse, I have calculated a percentage of verse in each text, with blue representing verse and red representing prose. Mixed texts appear as various shades of violet. As one would expect, the texts represented by violet spheres inhabit a zone between red (prose) and blue (verse) spheres. This suggests that although prose and verse texts do separate largely according to their syntax, there is a continuum from one to the other.

Although text type (prose/verse) remains the strongest signal in matrix analysis, the texts by some authors do tend to cluster in the visualization. One can search for texts by Pierre Corneille and observe a predominance of PF and SF. Texts by Molière exhibit a relative paucity of FF and BF. The five plays by Jean-Jacques Rousseau, however, are dispersed in the visualization and suggest a varied syntax in his play-writing.

Apart from the fact that most plays written before 1690 are in verse, it is difficult to see correlations between dates and text types. Further exploration may reveal correlations, however.

By experimenting with visualizing these data, I have found that three-dimensional images with meaningful chromatics allow for effective interaction with a fairly complex set of textual data. Corpora appear as spaces where distances between textual objects depend on how one defines relationships between the objects. One could imagine other bases for constructing spaces of texts, such as semantics, phonology, geography (not necessarily determined by a preexisting map), and thematics.