Encoding Hilltops

For the next two weeks we will try our hand at text encoding. At first this assignment may seem tedious: each of us will create an XML document representing one issue of Hilltops, using text files and image files from the scanned documents to decide how to apply tags according to OHCO, the Ordered Hierarchy of Content Objects (see this assignment for an explanation). For a computer to process text effectively, however, humans have to encode the text so that its structure is well defined, and very often it is not obvious how a text's structure should be defined. You will need to copy the scanned text to a file, compare it to its page image, and decide how to represent it, following the example XML file that I have prepared. I will explain how I encoded that file, but you may encounter different structures that require different encoding. We will work as a class to decide what the OHCO structure should be and then encode the text to represent that structure.

Here is what you need to do to get started with text encoding Hilltops:

  1. Download and install jEdit, a text editor that offers tools for XML text encoding. jEdit requires Java, so you may need to install Java first if your computer does not already have it.
  2. Install the XML plugin for jEdit.
  3. Open the file Hilltops_19281200.xml in jEdit. We will go through this file together so that you understand how it is structured.
  4. Begin working on encoding another issue of Hilltops using text and image files in the GitHub repository for the course (you should resync your copy of the repository to get updates since the last class).
  5. First, make a new XML file from the Hilltops_19XXXXXX.xml template file and rename it to indicate the specific year and month of your issue (e.g. "Hilltops_19281200.xml" represents the December 1928 issue). Then copy and paste the text from the text file for each page of your assigned issue into the new file, between the <body> and </body> tags.
  6. Here are the first and last files for each issue we will edit:
    • Issue 1 (Dec. 1928): 1928-1929_0000027 to 1928-1929_0000045
      (Mark Wolff)
    • Issue 2 (Jan. 1929): 1928-1929_0000051 to 1928-1929_0000070
      (Brendan Andrews)
    • Issue 3 (Feb. 1929): 1928-1929_0000079 to 1928-1929_0000098
      (Joseph Avelino)
    • Issue 4 (Mar. 1929): 1928-1929_0000105 to 1928-1929_0000126
      (Shane Black)
    • Issue 5 (Apr. 1929): 1928-1929_0000135 to 1928-1929_0000154
      (Walter Casey)
    • Issue 6 (May 1929): 1928-1929_0000163 to 1928-1929_0000185
      (Nicholas Checchia)
    • Issue 7 (Nov. 1929): 1929-1930_0000195 to 1929-1930_0000214
      (Nicholas Couvaris)
    • Issue 8 (Dec. 1929): 1929-1930_0000217 to 1929-1930_0000238
      (Jayden Feliciano)
    • Issue 9 (Jan. 1930): 1929-1930_0000241 to 1929-1930_0000255
      (Christopher Gamber)
    • Issue 10 (Feb. 1930): 1929-1930_0000257 to 1929-1930_0000270
      (Germain Nicole)
    • Issue 11 (Mar. 1930): 1929-1930_0000273 to 1929-1930_0000290
      (Gina Grauer)
    • Issue 12 (Apr. 1930): 1929-1930_0000293 to 1929-1930_0000309
      (Maria Iqbal)
    • Issue 13 (May 1930): 1929-1930_0000311 to 1929-1930_0000327
      (Benjamin Johnson)
    • Issue 14 (Nov. 1930): 1930-1931_0000338 to 1930-1931_0000360
      (David Lee)
    • Issue 15 (Dec. 1930): 1930-1931_0000362 to 1930-1931_0000390
      (Liam Martin)
    • Issue 16 (Jan. 1931): 1930-1931_0000392 to 1930-1931_0000422
      (Suzanne Phillips)
    • Issue 17 (Feb. 1931): 1930-1931_0000426 to 1930-1931_0000445
      (Joseph Saracino)
    • Issue 18 (Mar. 1931): 1930-1931_0000448 to 1930-1931_0000464
      (Jared Hoff)
    • Issue 19 (Apr. 1931): 1930-1931_0000466 to 1930-1931_0000489
      (Arlind Malziu)
    • Issue 20 (May 1931): 1930-1931_0000492 to 1930-1931_0000522
  7. Insert tags like <p> and </p> where you see document structure. Here is a brief list of the tags used in Hilltops_19281200.xml:
    • <div1 type="poetry">...</div1> : a section of text
    • <div2 type="poem">...</div2> : a subsection of text
      (if necessary, you can use <div3>, <div4>, etc. for sub-subsections)
    • <head>...</head> : a header of a (sub)section
    • <byline>...</byline> : the author of an article, poem, etc.
    • <lg>...</lg> : line group (of poetry), used to indicate stanzas
    • <l>...</l> : line (of poetry)
    • <list>...</list> : list
    • <item>...</item> : item in a list
    • <p>...</p> : paragraph
    • <q>...</q> : quote
    • <table>...</table> : table
    • <row>...</row> : row in a table
    • <cell>...</cell> : cell in a row in a table
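
To see how the tags listed above nest inside one another, here is a minimal sketch in Python, which is not one of the tools used in this assignment. It assumes only the standard-library xml.etree.ElementTree module. The poem title, author, and lines are placeholders rather than text from Hilltops, and the outline helper is a hypothetical name introduced just for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the title, byline, and lines are placeholders,
# not text from an actual Hilltops issue.
SAMPLE = """\
<div1 type="poetry">
  <head>Poetry</head>
  <div2 type="poem">
    <head>Placeholder Title</head>
    <byline>Placeholder Author</byline>
    <lg>
      <l>First line of the first stanza,</l>
      <l>second line of the first stanza.</l>
    </lg>
    <lg>
      <l>First line of the second stanza,</l>
      <l>second line of the second stanza.</l>
    </lg>
  </div2>
</div1>
"""


def outline(element, depth=0):
    """Return one indented line per element, showing how the tags nest."""
    label = element.tag
    if "type" in element.attrib:
        label += ' type="' + element.attrib["type"] + '"'
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
structure = outline(root)
print("\n".join(structure))
```

The printed outline makes the hierarchy visible: each <l> sits inside an <lg>, each <lg> inside a <div2>, and the <div2> inside a <div1>. No element overlaps another, which is what OHCO requires.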
  8. A tag like <pb n="1" id="1928-1929_0000027"/> is used to indicate a page break. Notice the slash "/" at the end of the tag: we are not indicating a chunk of text with the tag but a milestone in the text. Milestones do not follow OHCO because they do not represent the structure of a segment of text; instead, they mark a "milestone" or point in the text (in this case, a new page). We will use the id attribute to display the page image with server software that understands what <pb id="..."/> means. See the Hilltops_19281200.xml file for an example of how the milestone tag is used.
  9. You can check that your encoding is well-formed (that is, every tag is closed and properly nested, with no mistakes in the OHCO structure) using the "Parse as XML" command in jEdit. If there is an error, jEdit will show a message that helps you find and correct it.
  10. When your encoded file for an issue of Hilltops is complete, it should parse with no problems. At that point you can push your new file to the GitHub repository and I will review it. If it looks good, I will pull your new file into the repository and it will become part of the master copy.
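
The kind of check described in step 9 can also be sketched outside jEdit. The following is a rough Python illustration, not a replacement for the "Parse as XML" command. It assumes the standard-library xml.etree.ElementTree module; check_well_formed is a hypothetical helper name, and the two sample strings are placeholders rather than text from Hilltops. The first sample includes a page-break milestone to show that a self-closing tag parses without a matching end tag; the second closes <p> before <q>, so the tags overlap.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser's message includes the line and column of the problem.
        return False, str(err)
    return True, None


# Placeholder content: a milestone tag followed by a properly nested paragraph.
GOOD = '<body><pb n="1" id="1928-1929_0000027"/><p>Placeholder paragraph.</p></body>'

# Placeholder content: </p> appears before </q>, so the elements overlap.
BAD = '<body><p>Placeholder <q>quote</p></q></body>'

print(check_well_formed(GOOD))
print(check_well_formed(BAD))
```

A file that fails this kind of check breaks the hierarchy in the same way the second sample does, and the position reported in the error message is usually the quickest way to find the misplaced tag.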