Visualizing textual sequence alignment

Update: I recently migrated this site to a new machine, and many of the links below are broken. I have, however, developed an experimental interface for exploring a relatively small corpus of nineteenth-century French adventure novels using the HEB (hierarchical edge bundling) visualization I describe below along with standard text retrieval and topic modeling.

In my dissertation research on Gabriel Ferry, a relatively unknown nineteenth-century author who wrote western novels in French, I included a chapter on his position in the literary field and situated his work among other authors who wrote in the same genre. I identified nine authors in all, who collectively produced over 200 novels. After reading a few of these novels, I observed that they recycled the same plot devices, descriptions and characters. Access to many of these novels was limited, and tracking them down and reading them would have taken a considerable amount of time. Instead, I analyzed their titles using publication data to determine common themes and representations. In 1998 this was, I thought, a satisfactory approach.

Since that time, Gallica has digitized most of the novels, so access is no longer a problem. There was still the problem of reading 198 novels, however, and encoding them with TEI for a search engine like PhiloLogic would be a major undertaking. Recently I discovered Text::Pair, a Perl module that can index a corpus of texts and then compare each text to all the others to identify shared word sequences, what we might call recycled or even plagiarized text. With Text::Pair I am able to determine what the common text sequences are and which authors used them in their work. Text::Pair will work with any text file, including plain .txt files with no encoding.
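To give a rough idea of what "comparing each text to all the others" involves, here is a minimal Python sketch of shared word-sequence detection using fixed-length word n-grams. It is not Text::Pair's algorithm or API, only an illustration of the general idea; the function names, the window size n and the sample sentences are all made up for the example.

```python
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, n=5):
    """Return the set of n-word sequences in a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_sequences(corpus, n=5):
    """Map each pair of document ids to the n-grams the two documents share.

    corpus is a dict of {document id: plain text}. This is an illustrative
    stand-in, not what Text::Pair does internally.
    """
    # Inverted index: n-gram -> ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for gram in shingles(text, n):
            index[gram].add(doc_id)

    # Any n-gram found in two or more documents yields one or more pairings.
    pairs = defaultdict(set)
    for gram, doc_ids in index.items():
        for a, b in combinations(sorted(doc_ids), 2):
            pairs[(a, b)].add(gram)
    return dict(pairs)


if __name__ == "__main__":
    # Invented sample sentences, not taken from the corpus.
    sample = {
        "novel_a": "the hunter crossed the wide prairie at dawn and rested",
        "novel_b": "At dawn the hunter crossed the wide prairie, alone.",
        "novel_c": "a ship left the harbour",
    }
    for pair, grams in shared_sequences(sample, n=4).items():
        print(pair, len(grams))
```

A real alignment tool also has to merge overlapping n-grams into longer passages and tolerate small variations in wording, which this sketch does not attempt.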

After indexing the corpus, Text::Pair identified 6,992 pairs of text sequences shared between at least two different novels. Many of these sequences were shared by more than two novels. In order to get an overview of all the pairings between texts, I wanted to find a way to visualize them. Using the d3.js library, I have produced some visualizations of the Text::Pair results:

  • The first, based on this implementation of Holten’s algorithm for hierarchical edge bundling, shows pairings as dependencies between texts. This visualization is based on one for showing dependencies between classes in a software package. Each line in the image linking one text to another in the corpus represents a textual pairing. The visualization does not indicate which document is the source and which is the target for the textual import, but it does show which texts are related by pairings. Clearly, Aimard’s La Grande Flibuste of 1860 is linked to many documents in the corpus. I will need to read this one more closely because it contains text sequences shared by many other documents.

    Update: you can now use the pointer to highlight paths between documents. A red line shows that the selected document imports from another; a green line shows that the selected document is a source document.

  • The second, based on two chord diagrams (here and here), gives a better idea of the imports between texts by the same author and by different authors. The colors of the ribbons correspond to the authors of the source text (move the pointer in your browser to the outer rings to see the ribbons better for each author). The visualization indicates that Aimard mostly recycled his own text, whereas most of the text pairings in Ferry’s work linked it to other authors.

    Update: the d3.js chord diagrams are based on the design of Circos. Here is an image generated from my data using the Circos online table viewer.

  • New (2012 March 04) I have hacked an adaptation of Bostock’s treemap to display the texts in the corpus either by size (i.e. how often each text is recycled by later texts) or by count. Click the buttons to move from one visualization to the other. The author and title of each work appear when the pointer hovers over a rectangle in the treemap.
  • New (2012 March 04) I have hacked two sunburst visualizations, one with author and title labels when the pointer hovers over a segment in the outer ring, the other with no labels. The first replicates the treemap “Count” function, but if you click on the “Size” button nothing happens. The second sunburst will redraw itself when you click on the size button. Bostock’s sunburst displays parent/child relationships nicely (in this case, author/title) but labels would be very helpful to see these relationships.
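All of these visualizations start from the same list of pairings reshaped in different ways. The Python sketch below shows one way that reshaping might be done: a list of named nodes with "imports" for the edge bundling, a square author-by-author matrix for the chord diagram, and a nested author/title tree with a "size" value for the treemap and sunburst. The record layout, the function names and the placeholder authors and titles are assumptions made for illustration, not my actual data or scripts, and the sketch assumes each pairing has already been given a source and a target (for example by publication date).

```python
import json
from collections import defaultdict

# Hypothetical pairings: (source author, source title, target author, target title).
PAIRS = [
    ("Author A", "Title 1", "Author A", "Title 2"),
    ("Author A", "Title 1", "Author B", "Title 3"),
    ("Author B", "Title 3", "Author A", "Title 2"),
]


def node_name(author, title):
    # Dotted names encode the hierarchy corpus > author > title.
    # Assumes authors and titles contain no "." themselves.
    return f"corpus.{author}.{title}"


def bundle_nodes(pairs):
    """One record per text, listing the texts it imports from."""
    imports = defaultdict(set)
    for s_author, s_title, t_author, t_title in pairs:
        source = node_name(s_author, s_title)
        target = node_name(t_author, t_title)
        imports[target].add(source)      # the target imports from the source
        imports.setdefault(source, set())  # make sure pure sources get a node too
    return [{"name": name, "imports": sorted(srcs)}
            for name, srcs in sorted(imports.items())]


def chord_matrix(pairs):
    """Square matrix: row = source author, column = target author."""
    authors = sorted({a for sa, _, ta, _ in pairs for a in (sa, ta)})
    position = {author: i for i, author in enumerate(authors)}
    matrix = [[0] * len(authors) for _ in authors]
    for s_author, _, t_author, _ in pairs:
        matrix[position[s_author]][position[t_author]] += 1
    return authors, matrix


def reuse_tree(pairs):
    """Author/title tree; size = how often a text is the source of a pairing."""
    reuse = defaultdict(lambda: defaultdict(int))
    for s_author, s_title, t_author, t_title in pairs:
        reuse[s_author][s_title] += 1
        reuse[t_author][t_title] += 0  # keep texts that are never a source
    return {"name": "corpus", "children": [
        {"name": author, "children": [
            {"name": title, "size": n} for title, n in sorted(titles.items())]}
        for author, titles in sorted(reuse.items())]}


if __name__ == "__main__":
    print(json.dumps(bundle_nodes(PAIRS), indent=2))
    print(chord_matrix(PAIRS))
    print(json.dumps(reuse_tree(PAIRS), indent=2))
```

The diagonal of the matrix counts pairings between texts by the same author, which is what makes an author's self-recycling visible in the chord diagram.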

I need to look at these visualizations more closely before I reach any definitive conclusions. They are pretty cool, though, and suggest that there are ways to look at relationships between texts other than with tables.
