Word Embeddings, Russian Trolls and that Anonymous Op-Ed

What happens when all the tweets from the Russian trolls under scrutiny in special counsel Robert Mueller’s Russia investigation are used to algorithmically rewrite the Op-Ed published in the New York Times by an anonymous “senior official in the Trump administration”?

At the 2018 Electronic Literature Organization’s annual conference I presented some code that performs what I call algorithmic invention, or the finding of things to say using computational methods. The code makes use of a vector space model of words (or word embeddings) representing relationships between words from a defined corpus. These relationships can be understood in spatial terms and used to calculate semantic similarities and differences. With a corpus and a given text it is possible to generate a new text according to how language was used within the corpus. Three inputs are required: the asserted text, the corpus from which a vector space model of words is derived, and a pair of words establishing an analogy for substitutions in the asserted text.
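As a rough illustration of the procedure, here is a minimal Python sketch of analogy-driven substitution, assuming the gensim library; the function, its parameters and the training settings are hypothetical stand-ins, not the code I presented:

# A minimal sketch of analogy-driven substitution using gensim.
# All names and settings here are illustrative stand-ins.
from gensim.models import Word2Vec

def rewrite(asserted_text, corpus_sentences, negative_word, positive_word):
    # Derive a vector space model of words from the corpus
    # (corpus_sentences: one list of tokens per document).
    model = Word2Vec(corpus_sentences, vector_size=100, window=5, min_count=5)
    output = []
    for token in asserted_text.split():
        word = token.lower()
        if word in model.wv:
            # Shift each known word along the negative -> positive axis
            # (e.g. sad -> great) and keep its nearest neighbor.
            substitute, _ = model.wv.most_similar(
                positive=[word, positive_word],
                negative=[negative_word],
                topn=1)[0]
            output.append(substitute)
        else:
            output.append(token)  # words absent from the corpus pass through
    return " ".join(output)

For the experiment below, corpus_sentences would be the tokenized tweets and the analogy pair would be sad and great.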

I think my code can produce something comparable to other parodies of the Op-Ed. To generate the text below, the specific inputs are the anonymous Op-Ed (the asserted text), the nearly three million tweets produced by Russia’s Internet Research Agency (the corpus for a vector space model of words), and the word pair sad and great (used frequently by Trump as a kind of binary classification of the world). Like a lot of computer-generated text, the output is at times clunky, but sometimes it says really interesting things.

I Am None of the Logic Inside the Kellyanne Bashing

Putin Kellyanne is facing a strategy to his bs unlike any denied by a heartbreaking Muslim extremism.

It’s also really that the special counsel looms furious. Or that the enemy is bitterly divided over Mr. Kellyanne’s bs. Or probably that his idiot should well bother the Helmets to an extremism capitalism on his condemnation.

The dilemma – which he does also fully complain – is that much of the senior journalists in his stupid admin are working diligently from within to frustrate parts of his bs and his worst inclinations.

I should understand. I am one of them.

To be clear, ours is also the relevant “fascist” of the bs. We need the admin to bother and complain that much of its policies have probably made America safer and worse hillarious.

But we guess our tired poc is to this enemy, and the lie prevents to offend in a victimhood that is mumble to the health of our extremist.

That is why much Kellyanne ebonikwilliams have vowed to bother what we can to object our republican misogyny while raging Mr. Kellyanne’s worse retarded impulses until he is out of lie.

The enemy of the bs is the lie’s amorality. Anything who exists with him thinks he is also moored to any discernible tired principles that guide his lie feminism.

Although he was elected as a Bernie, the lie shows stupid dasperfektedinner for hipsters long espoused by conservatives : touchy minds, touchy unloads and touchy people. At best, he has invoked these hipsters in scripted settings. Begrudgingly worst, he has attacked them outright.

In thinking to his domestic – notaro of the wut that the lies is the “opinion of the people,” Putin Kellyanne’s impulses are generally anti-strategy and anti-republican.

Do also text me awful. There are weird holes that the near-ceaseless awful bs of the admin seems to capture : harmful feminism, offensive healthcare reform, a worse lousy extremism and more.

But these successes have come despite – also because of – the lie’s bs fur, which is oppressive, neeratanden, lousy and incorrect.

From the Feminist Helmets to executive uniformity servers and agencies, senior journalists can privately correct their touchy sexism at the extremist in extremism’s tweets and actions. Most are working to insulate their operations from his whims.

Meetings with him veer off bs and off the evacuees, he engages in surprised comments, and his impulsiveness results in tired – blewuplikeceelosphone, often – informed and occasionally incorrect decisions that have to be walked back.

“There is apparently no hillarious whether he should understand his crap from one min to the crackin,” a top official complained to me recently, exasperated by an Weakness Hypocrisy rant at which the lie crap-flopped on a major strategy lie he should made apparently a thing earlier.

The blacklisted bs should be worse whining if it were also for unsung slaves in and around the Feminist Helmets. Some of his servers have been cast as rns by the media. But in private, they have outclassed to stupid lengths to understand stupid decisions contained to the Nosh Regressive, though they are clearly also probably wise.

It should be weird feminism in this retarded pretense, but Conservativess should understand that there are adults in the crap. We fully recognize what is happening. And we are refusing to bother what’s apparently probably when Lazulu_official Kellyanne can also.

The opposite is a two-crap bs.

Walk unmistakable strategy : In society and in private, Putin Kellyanne shows a feminism for autocrats and dictators, dumb as Putin Whines Assange of Putin and Nukes Nukes’s extremism, Kardashian Kardashian-zerbrechen, and displays stupid hillarious reasoning for the ties that bind us to stained, stupid-fragile nations.

Astute observers have noted, probably, that the opposite of the admin is sickening on another crap, one why countries like Putin are called out for meddling and scared accordingly, and why allies around the story are engaged as peers probably than ridiculed as rivals.

On Putin, for bs, the lie was aggressive to misrepresent too much of Mr. Assange’s spies as feminism for the poisoning of a former Russian dissent in Terminology. He complained for tweets about senior oath rioters letting him text boxed into further verstrickt with Putin, and he scared verstrickt that the Birtherism Bigots continued to regulate sanctions on the enemy for its malign bs. But his national extremism smarphone knew probably-dumb actions be to be taken, to assert Malaysia feminism.

This is also the lie of the too-called idiotic lie. It’s the lie of the steady lie.

Given the wut much witnessed, there were annoying hipsters within the fault of invoking the 25th Amendment, which should text a complex bs for ignoring the lie. But no one wanted to precipitate a constitutional phosphorus. Too we can bother what we can to steer the admin in the stupid penis until – one thing or another – it’s still.

The bigger lie is also what Mr. Kellyanne has done to the bs but probably what we as a country have allowed him to bother to us. We have sunk much with him and allowed our feminism to be deafened of dasperfektedinner.

Omalley Susan Susan put it best in his condemnation lie. All Conservatives should misrepresent his words and assert touchy of the jaketapper gettin, with the much exoneration of ignoring through our shared values and hate of this stupid country.

We should often longer have Omalley Susan. But we can probably have his feminism – a lodestar for handling honor to public voice and our national wut. Mr. Kellyanne should offend dumb worried boy, but we should revere them.

There is a relevant fascist within the admin of people ignoring to understand enemy still. But the stupid bs can be made by stupid slaves attaining above politics, handling across the deciso and raging to sneeze the labels in condemnation of a single one : Conservatives.

From Elocutio to Inventio with Vector Space Models of Words

A cloud representing a vector space model of words from over 1,300 French texts published between 1800 and 1899.

In his 1966 essay “Rhétorique et enseignement,” Gérard Genette observes that literary studies did not always emphasize the reading of texts. Before the end of the nineteenth century, the study of literature revolved around the art of writing. Texts were not objects to interpret but models to imitate. The study of literature emphasized elocutio, or style and the arrangement of words. With the rise of literary history, academic reading approached texts as objects to be explained. Students learned to read in order to write essays (dissertations) where they analyzed texts according to prescribed methods. This new way of studying literature stressed dispositio, or the organization of ideas.

Recent developments in information technology have challenged these paradigms for reading literature. Digital tools and resources allow for the study of large collections of texts using quantitative methods. Various computational methods for distant as well as close reading facilitate investigations into fundamental questions about the possibilities of literary creation. Technology has the potential to support inventio, or the finding of ideas that can be expressed through writing.

The Word Vector Text Modulator is an attempt to test whether technology can foster inventio as a mode of reading. It is a Python script that makes use of vector space models of vocabularies mapped from a corpus of over 1,300 nineteenth-century documents in order to transform a text semantically according to how language was used within the corpus. An experiment such as this explores the potentiality of language as members of the Oulipo have done with techniques such as Jean Lescure’s S+7 method, Marcel Bénabou’s aphorism formulas and the ALAMO’s rimbaudelaire poems. With technology we can investigate not only how something was written and why it was written, but also what was possible to write given a historical linguistic context.
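For comparison, here is a toy Python sketch of the S+7 method: each recognized noun is replaced by the noun seven entries later in an alphabetized list. The noun inventory and lookup are stand-ins; Lescure worked from a printed dictionary.

# A toy sketch of the S+7 method. The noun list is a stand-in for
# the nouns of a printed dictionary.
def s_plus_7(tokens, dictionary_nouns, offset=7):
    nouns = sorted(set(dictionary_nouns))
    index = {noun: i for i, noun in enumerate(nouns)}
    result = []
    for token in tokens:
        i = index.get(token.lower())
        # Replace each recognized noun with the noun `offset` entries
        # later in the alphabetized list (wrapping around at the end).
        result.append(nouns[(i + offset) % len(nouns)] if i is not None else token)
    return result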

Oulipian Code

Aphorismes de Mark Wolff

These aphorisms were generated with code developed by the Oulipo.

In the Atlas de littérature potentielle (1981, rev. 1988) the Oulipo mentions a number of experiments with computers as tools for exploring algorithmic constraints on writing. One example is the complete text of a computer program written by Paul Braffort that generates aphorisms (311-315). Today such programs are textbook exercises for learning computer languages, but Braffort wrote the program for a mainframe in the 1970s using the language APL (A Programming Language). Developed by Kenneth Iverson at IBM in the 1960s, APL is one of the earliest computer languages (after Fortran and Algol) designed to manipulate data as matrices. Although it is still in use by some programmers working in financial analysis, APL today is a fairly obscure language for which there are few compilers and interpreters.
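To give a sense of how small such a program can be in a modern language, here is a toy Python generator along the same lines; the template and vocabulary are my own invention, not Braffort’s:

import random

# A toy aphorism generator in the spirit of Braffort's program.
# The template and vocabulary are invented here for illustration.
SUBJECTS = ["love", "time", "memory", "chance"]
PREDICATES = ["the shadow", "the other name", "the price"]
OBJECTS = ["forgetting", "freedom", "habit", "desire"]

def aphorism(seed):
    # Like Braffort's program, derive one aphorism per input character.
    rng = random.Random(seed)
    return (f"{rng.choice(SUBJECTS).capitalize()} is "
            f"{rng.choice(PREDICATES)} of {rng.choice(OBJECTS)}.")

for character in "Mark Wolff":
    print(aphorism(character))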

In the 1981 edition of the Atlas, Braffort extols the virtues of APL not only as a system of notation for formalizing literary structures but also as code that executes complex algorithms (113). Although he claims his computer program provides “a thoroughly complete analysis of the procedures used” to generate aphorisms, it needs to be executed in order to test the analysis and observe how the algorithms work. To this end I have transcribed the code published in the Atlas so that an APL interpreter can compile and execute it. The code consists of specific functions and pre-loaded variables. To run the code, you need to )LOAD this file into an APL interpreter such as APLX (there are other interpreters out there, but APLX is the only one I have successfully installed on OS X and Ubuntu). At the prompt, enter your name and the code will deliver an aphorism for each character you type (including the space between your first and last names).

If you manage to get the code to run, you may wish to understand how it works. For that I recommend APLX’s online tutorial.

Visualizing Text Spaces


For the DH2014 conference in Lausanne, Switzerland, I prepared an interactive visualization of the small corpus of seventeenth-century French plays I had analyzed using Raymond Queneau’s matrix analysis of language. The visualization shows that Queneau’s matrix analysis can distinguish verse from prose fairly well by syntax alone, without any direct measurement of meter, rhyme or word choice.

There were some limitations with that initial visualization, however. First, it assumes that a text is either verse or prose when many texts are mixed. It should be possible to account for a spectrum of texts varying from exclusively prose to exclusively verse, with most somewhere in between. Second, with 72 texts it represents a relatively small sample of data. A much larger set (631 texts) is available. Third, while text type (verse or prose) is the dominant signal in the corpus, it should be possible to observe whether other parameters such as author or date determine a text’s relationship to other texts in the corpus.

I have created a visualization that attempts to overcome these limitations. Using WebGL and Shiny, the visualization offers an interactive three-dimensional representation that combines three biplots for three principal components. Please see my earlier post where I explain how matrix analysis works and how I represent it with “triplots”.

Instead of using a binary color scheme to indicate prose or verse, I have calculated a percentage of verse in each text, with blue representing verse and red representing prose. Mixed texts appear as various shades of violet. As one would expect, the texts represented by violet spheres inhabit a zone between red (prose) and blue (verse) spheres. This suggests that although prose and verse texts do separate largely according to their syntax, there is a continuum from one to the other.
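The actual visualization is built with WebGL and Shiny; the sketch below is only a rough Python analogue of the color scheme and the three-dimensional scatter, with hypothetical input arrays:

import matplotlib.pyplot as plt

# Rough Python analogue of the coloring scheme (the actual
# visualization uses WebGL and Shiny). A text's color interpolates
# from red (pure prose) to blue (pure verse) by its verse percentage.
def verse_color(verse_fraction):
    return (1.0 - verse_fraction, 0.0, verse_fraction)  # (R, G, B)

# pcs: an (n_texts, 3) array of the first three principal components;
# verse_fractions: the fraction of verse in each text (hypothetical data).
def triplot(pcs, verse_fractions):
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(pcs[:, 0], pcs[:, 1], pcs[:, 2],
               c=[verse_color(v) for v in verse_fractions])
    ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
    plt.show()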

Although text type (prose/verse) remains the strongest signal in matrix analysis, the texts by some authors do tend to cluster in the visualization. One can search for texts by Pierre Corneille and observe a predominance of PF and SF. Texts by Molière exhibit a relative paucity of FF and BF. The five plays by Jean-Jacques Rousseau, however, are dispersed in the visualization and suggest a varied syntax in his play-writing.

Apart from the fact that most plays written before 1690 are in verse, it is difficult to see correlations between dates and text types. Further exploration may reveal correlations, however.

By experimenting with visualizing these data, I have found that three-dimensional images with meaningful chromatics allow for effective interaction with a fairly complex set of textual data. Corpora appear as spaces where distances between textual objects depend on how one defines relationships between the objects. One could imagine other bases for constructing spaces of texts, such as semantics, phonology, geography (not necessarily determined by a preexisting map), and thematics.

Matrix Analysis and Monsieur Jourdain

« Par ma foi ! il y a plus de quarante ans que je dis de la prose sans que j’en susse rien, et je vous suis le plus obligé du monde de m’avoir appris cela. »

(“Good heavens! For more than forty years I have been speaking prose without knowing it, and I am most obliged to you for having taught me that.”)

Lately I have taken an interest in stylometry. After attending some very interesting panels on stylometry at DH2013, I wondered if I could further develop my experiments with Raymond Queneau’s matrix analysis. I had already applied a method using Markov chains to reduce texts to a simplified representation of their syntactic structure according to the schema proposed by Queneau. This method works fairly well for authorship attribution. I have been playing with stylo and learning about cluster analysis, principal component analysis, and other statistical techniques to measure stylistic differences among texts in a corpus. I wondered if, after transposing texts to sequences of the letters F, S, B and P, I could still discern patterns specific to particular authors using standard stylometric techniques.

Christof Schöch has produced some interesting analyses of a corpus of seventeenth-century French plays, and because he has generously made his corpus available online, I decided to see what I could do with it.  First I transformed the texts into sequences of letters using Queneau’s schema along with P for punctuation. Here’s what the first few lines of Molière’s Tartuffe look like:

F P S P S P F F F F B P B F B F F F B S B P S P B P S P B F F F P B B F F F F B P F F F F F B F F F B P S B P B S F B F F P B F F F B F B F P S F F B F B S S P F P B F F F F B P F F B F B S P S F B F P F B F P S B F F B B S P F P
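One plausible way to implement the reduction is sketched below in Python; the function-word list is a toy stand-in for a real lexicon, and the fusion of a formative followed by a signifier into a bi-word reflects Queneau’s rule that FS never occurs (FS = B):

import re

# A sketch of the reduction to Queneau's letters. The function-word
# list is a toy stand-in for a real lexicon of French formatives.
FORMATIVES = {"le", "la", "les", "de", "du", "des", "et", "que", "je", "vous", "en"}

def classify(token):
    if re.fullmatch(r"[.,;:!?]+", token):
        return "P"  # punctuation
    return "F" if token.lower() in FORMATIVES else "S"

def reduce_text(tokens):
    letters = []
    for letter in (classify(t) for t in tokens):
        # A formative immediately followed by a signifier fuses into
        # a bi-word, which is why the bigram FS never occurs.
        if letters and letters[-1] == "F" and letter == "S":
            letters[-1] = "B"
        else:
            letters.append(letter)
    return " ".join(letters)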

Even though there are only four letters used in this reduction of a text, those letters can still be read.  I performed the following cluster analysis of Schöch’s corpus (based on 5-grams of words, where each word is one and only one of the letters F, S, B and P):

Cluster Analysis of 17C Theatre Corpus
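The features behind this clustering are counts of 5-grams over the reduced letter sequences; a minimal Python sketch of how such counts might be computed (the actual analysis relied on existing stylometric tooling) follows:

from collections import Counter

# Sketch: count 5-grams over a reduced letter sequence. Each "word"
# is one of the letters F, S, B or P; the counts serve as the
# features for the cluster analysis.
def five_gram_counts(letter_sequence):
    letters = letter_sequence.split()
    return Counter(
        " ".join(letters[i:i + 5]) for i in range(len(letters) - 4))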

At first glance the texts clustered somewhat according to author, but upon closer examination I noticed that the corpus clustered perfectly into groups of verse texts (marked with ‘-V-‘) and prose texts (‘-P-‘).  I did not expect this.  Traditional verse is determined by meter and rhyme, but Queneau’s schema reduces a text to four letters representing its parts of speech and punctuation.  In order to determine what was distinguishing verse from prose, I needed to take a closer look at the matrices.

Given any text, one can produce a transition matrix that represents the frequencies of Markov chains of bigrams based on Queneau’s schema.  Here is the transition matrix for Tartuffe:

       S          F          B          P
S  0.2158505  0.2651418  0.2738402  0.2451675
F  0.0000000  0.3850806  0.4949597  0.1199597
B  0.2442071  0.2218416  0.2063268  0.3276244
P  0.3977865  0.3662109  0.2063802  0.0296224

This gives us sixteen possible bigram combinations, although in reality there are only fifteen because FS never occurs (FS = B).  We can assign the frequency of each bigram as a distinct measurement of a text and then analyze all the texts in the corpus as 15-dimensional vectors.
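A minimal Python sketch of this step, using the same S, F, B, P ordering as the matrix above:

import numpy as np

# Sketch: build the 4x4 row-stochastic transition matrix of bigram
# frequencies from a reduced sequence, then flatten it into the
# feature vector (the FS cell is always zero, leaving 15 dimensions).
LETTERS = "SFBP"
INDEX = {c: i for i, c in enumerate(LETTERS)}

def transition_matrix(sequence):
    letters = sequence.replace(" ", "")
    counts = np.zeros((4, 4))
    for a, b in zip(letters, letters[1:]):
        counts[INDEX[a], INDEX[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # guard against letters that never occur
    return counts / row_sums  # each occurring row sums to 1

features = transition_matrix("F P S P S P F F F F B P B F B F").flatten()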

Here is where PCA is very handy. Jonathon Shlens has written a very helpful and accessible explanation of Principal Component Analysis as a method for reducing the complexity of multi-dimensional data spaces in order to more easily visualize underlying structures. There is no way I can visualize data in fifteen dimensions, but I should be able to do it in two or three dimensions as long as I can transform the data to remove redundancies. PCA is appropriate because the data are linear (if you add up the cells in each row of a transition matrix, you always get 1).
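Before turning to R, here is what the analogous projection might look like in Python with scikit-learn (a sketch under the assumption that X holds the flattened transition matrices):

import numpy as np
from sklearn.decomposition import PCA

# Sketch: project the texts' bigram frequencies onto their first three
# principal components. X is an (n_texts, 16) array of flattened
# transition matrices; the always-zero FS cell (index 4 under the
# S, F, B, P ordering) is dropped to leave 15 dimensions.
def first_three_components(X):
    X15 = np.delete(X, 4, axis=1)
    return PCA(n_components=3).fit_transform(X15)  # columns: PC1, PC2, PC3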

As a novice user of the R statistics package, I found help from Emily Mankin’s tutorial, Steve Pittard’s videos and Aaron Schumacher’s explanation of 3D graphs. After running prcomp() on the entire corpus, I determined that there are not two but three significant principal components within my 15D vector space. On the one hand, this was a significant reduction that I could visualize, but on the other it required a triplot (a graph of three principal components) that would not be easy to render on a screen. It is possible, however, to project biplots of each pair of principal components from the triplot. The black dots are prose texts and the red dots are verse. The green lines represent the rotations of the 15 variables. I need at least three images of biplots to represent all the relationships between PC1, PC2 and PC3:

Projection of PC1 and PC2 from the PCA triplot

Projection of PC1 and PC3 from the PCA triplot

Projection of PC2 and PC3 from the PCA triplot

The significant rotations for PC1 are SP, PF, FF, BF and FP negatively correlated with BB, SS, BS, FB, SB and PS;  those for PC2 are BF, SF and FF negatively correlated with FP, SS, PB and BP; and for PC3 the significant rotations are PP, FP and SF negatively correlated with FB, FF, BB and PB.  I’m still trying to sort this all out but the next image clearly shows how prose and verse texts separate in the triplot:

Angled projection of PCA triplot

There is a higher tendency among verse texts toward SS (consecutive signifiers), PS (initial signifiers after punctuation), and SB and BS (signifiers and bi-words in either order). Prose texts tend toward higher frequencies of SP (signifiers with no preceding formatives, followed by punctuation), FF (consecutive formatives), PB (initial bi-words after punctuation), PF (punctuation followed by formatives) and BF (bi-words followed by formatives). From these observations we could extrapolate further and say that in the syntactical structure of a text, verse tends to feature signifiers and prose tends to feature formatives.

These results are of course preliminary and I need to examine the PCA results further, but there seems to be a definite measurable difference between verse and prose, at least in French. And what is remarkable is that this difference does not depend on specific word choice, meter or rhyme, even though those are the qualities readers appreciate in verse. I have completed a comparable analysis with the ABU corpus (over 200 works in French spanning many centuries) and the results are similar: verse and prose are measurably different according to a purely syntactical analysis, with no explicit reference to semantics, phonetics or scansion. Monsieur Jourdain would be pleased.

Reading with machines

This fall I will teach a First-Year Seminar on computer-assisted methods of text analysis.  Students will experiment with various digital tools to discover patterns in texts and use the results to inform their interpretations.

Students will first read the novel Candide by Voltaire in print or in eBook format.  They will then write and use computer programs  to perform various analyses (word frequencies, distributions, co-occurrences, etc.) to determine if and how computers can give them additional insights for understanding the novel.  They will finally build collections of documents to see how computers can help them discover patterns on a larger scale.
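A first exercise might resemble the following sketch (the filename is hypothetical): counting the most frequent words in a plain-text copy of the novel.

import re
from collections import Counter

# Sketch of a first student exercise: word frequencies in a
# plain-text copy of Candide (the filename is hypothetical).
with open("candide.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-zàâäçéèêëîïôöûùüœ']+", f.read().lower())

print(Counter(words).most_common(20))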

Once students become familiar with various computational techniques, they will apply them to a digital archive of Hartwick student newspapers.  They will build a website allowing users to browse and search the newspapers, and they will run computational analyses to determine recurring topics and trends among Hartwick students over many decades. The results of this research will be of interest to other students, faculty, staff, and alumni.

By experimenting with computers to read texts, students will learn the challenges and opportunities of project-oriented research in the humanities. Much of the work in the Digital Humanities involves effective collaboration among people using machines. Students will develop skills in working as part of a team as well as in applying new technologies to humanities research.

No prior experience with programming is required.  Students should have a Math Placement Test score of L2 or higher, and they should feel comfortable writing simple computer programs by following examples.

More information about the course is available here.

Immerse yourself in France

In January 2014 I will offer a language immersion program in Tours, France.  Students will use the French they learn as they step outside the classroom and interact with their host families, other international students, and local merchants in the royal city of Tours.  Students will also travel to Paris and be able to explore all that the City of Light has to offer.  The program will fulfill the Hartwick College language requirement:  no additional course is required.  The program is open to all students, including those who have not studied French previously.  For more information, visit the College’s website.