Visualizing textual sequence alignment

In my dissertation research on Gabriel Ferry, a relatively unknown nineteenth-century author who wrote western novels in French, I included a chapter on his position in the literary field and situated his work among other authors who wrote in the same genre. I identified nine authors who together produced over 200 novels. After reading a few of these novels, I observed that they recycled the same plot devices, descriptions and characters. Access to many of these novels was limited, and tracking them down and reading them would have taken a considerable amount of time. Instead, I analyzed their titles, using publication data, to determine common themes and representations. In 1998 this was, I thought, a satisfactory approach.

Since that time, Gallica has digitized most of the novels, so access is no longer a problem. There was still the problem of reading 198 novels, however, and encoding them with TEI for a search engine like PhiloLogic would be a major undertaking. Recently I discovered Text::Pair, a Perl module that can index a corpus of texts and then compare each text to all the others to identify shared word sequences, what we might call recycled or even plagiarized text. With Text::Pair I am able to determine what the common text sequences are and which authors used them in their work. Text::Pair will work with any text file, including plain .txt files with no markup.
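To give a rough idea of what this kind of comparison involves, here is a minimal sketch in Python. It is not Text::Pair's own algorithm or interface: it simply indexes every run of n consecutive words (a "shingle") and reports the runs that occur in more than one document. The function names, the window size of five words and the three toy texts are all assumptions made up for illustration; none of them come from the novels or from the module.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase the text and keep runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def shingles(tokens, n):
    """Return the set of all n-word sequences in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_sequences(corpus, n=5):
    """Map each pair of document names to the n-word sequences they share."""
    index = defaultdict(set)  # shingle -> names of the documents containing it
    for name, text in corpus.items():
        for gram in shingles(tokenize(text), n):
            index[gram].add(name)

    pairs = defaultdict(set)
    for gram, names in index.items():
        # Every pair of documents containing this shingle counts as a pairing.
        for a, b in combinations(sorted(names), 2):
            pairs[(a, b)].add(" ".join(gram))
    return dict(pairs)


# Toy texts invented for this example (hypothetical, not from the corpus).
corpus = {
    "novel_a": "The hunter crossed the wide prairie at dawn and rode on",
    "novel_b": "At dusk the hunter crossed the wide prairie at last",
    "novel_c": "Nothing in common here at all whatsoever",
}

shared = shared_sequences(corpus, n=5)
for (a, b), sequences in shared.items():
    print(a, b, sorted(sequences))
```

A simple exact-match index like this misses passages that were lightly reworded, so it should be read only as an illustration of the idea of pairing texts by shared sequences.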

After indexing the corpus, Text::Pair identified 6,992 pairs of text sequences shared between at least two different novels. Many of these sequences were shared by more than two novels. In order to get an overview of all the pairings between texts, I wanted to find a way to visualize them. Using the d3.js library, I have produced some visualizations of the Text::Pair results:

  • The first, based on this implementation of Holten’s algorithm for hierarchical edge bundling, shows pairings as dependencies between texts. This visualization is based on one for showing dependencies between classes in a software package. Each line in the image linking one text to another in the corpus represents a textual pairing. The visualization does not indicate which document is the source and which is the target for the textual import, but it does show which texts are related by pairings. Clearly, Aimard’s La Grande Flibuste of 1860 is linked to many documents in the corpus. I will need to read this one more closely because it contains text sequences shared by many other documents.
  • The second, based on two chord diagrams (here and here), gives a better idea of the imports between texts by the same author and by different authors. The colors of the ribbons correspond to the authors of the source text (hover over the outer ring in your browser to see each author's source ribbons more clearly). The visualization indicates that much of Aimard’s work appeared first in other authors’ works. Paul Duplessis appears to be the most original of all these authors: he is the source of sequences but not the target of any (with the exception of Cooper, for whom the sample is very small).
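
For readers curious about the data behind a chord diagram: the d3.js chord layout takes a square matrix in which the cell at row i, column j holds the flow from group i to group j. The sketch below shows one way such a matrix could be built from a list of pairings, assuming each pairing has already been reduced to a (source author, target author) tuple. That record format, the function names and the placeholder authors are hypothetical and are not the actual Text::Pair output or my results.

```python
def chord_matrix(pairings):
    """Count pairings per (source author, target author) in a square matrix."""
    authors = sorted({author for pair in pairings for author in pair})
    position = {author: i for i, author in enumerate(authors)}
    matrix = [[0] * len(authors) for _ in authors]
    for source, target in pairings:
        # Rows are source authors, columns are target authors.
        matrix[position[source]][position[target]] += 1
    return authors, matrix


def source_only(authors, matrix):
    """Authors who appear as a source of sequences but never as a target."""
    return [
        author
        for i, author in enumerate(authors)
        if sum(matrix[i]) > 0 and sum(row[i] for row in matrix) == 0
    ]


# Placeholder pairings (hypothetical data, for illustration only).
pairings = [
    ("Author A", "Author B"),
    ("Author A", "Author B"),
    ("Author C", "Author B"),
    ("Author B", "Author B"),  # an author reusing his own text
]

authors, matrix = chord_matrix(pairings)
print(authors)
for row in matrix:
    print(row)
print(source_only(authors, matrix))
```

The matrix can then be written out as JSON and handed to the chord layout. The `source_only` check mirrors the kind of observation made above about an author who is a source of sequences but not a target of any.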
