TEI Embedded Transcription support in EVT

EVT originated as part of the Digital Vercelli Book project (http://vbd.humnet.unipi.it/), so it was developed to handle the XML encoding of the texts prepared for that project, which uses the TEI P5 parallel transcription method (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html#PH-bov). With this method, information about the scan, and possibly the coordinates of sensitive areas, is kept separate from the transcription and aligned with it through linking attributes.

[Figure: The Digital Vercelli Book beta version using EVT]

As the TEI Guidelines explain, however, a scholar can choose to emphasize the importance of the physical surface and encode words and other written traces as subcomponents of the XML elements representing the surface that carries them, rather than independently of them. This encoding scheme is known as embedded transcription (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html#PHZLAB). Thanks to support from EADH (see EADH Small Grant: Call for Proposals at http://www.eadh.org/support/eadh-small-grants-call-proposals), this feature was added to the EVT software; development took place between May and July 2014.

Main changes to the original software

The main changes concern identifying the folios and splitting the text among them, and creating the structure for the image-text linking tool.

Because the Vercelli Book transcription is encoded entirely with the parallel transcription method, we needed other texts as proper examples of embedded transcription. In particular, we used the TEI examples available in the Guidelines (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html#PHZLAB) and the encoded text of the Slovenian «Tri Pridige O Jeziku (three sermons on language)» (http://nl.ijs.si/e-zrc/slomsek/index-en.html).

[Figure: The Tri Pridige O Jeziku edition, an ET transcription, using EVT]

First, we added automatic detection of the encoding scheme used in the text being transformed (Parallel Transcription, PT, or Embedded Transcription, ET). Detection is based on the presence or absence of the <sourceDoc> element, which is used only in ET.
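
The detection rule can be sketched in a few lines. EVT itself performs this step in its XSLT stylesheets; the Python below is only an illustration of the rule as described here, and the function name `detect_scheme` and the two sample documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def detect_scheme(tei_xml: str) -> str:
    """Return "ET" if the document contains at least one <sourceDoc>, else "PT".

    Hypothetical helper: it mirrors the rule described in the post,
    not EVT's actual XSLT code.
    """
    root = ET.fromstring(tei_xml)
    # iter() walks the root and all descendants in document order.
    first_source_doc = next(root.iter(f"{{{TEI_NS}}}sourceDoc"), None)
    return "ET" if first_source_doc is not None else "PT"


# Minimal, made-up sample documents.
ET_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sourceDoc><surface><zone>some text</zone></surface></sourceDoc>
</TEI>"""

PT_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile><surface/></facsimile>
  <text><body><p>some text</p></body></text>
</TEI>"""

print(detect_scheme(ET_SAMPLE))
print(detect_scheme(PT_SAMPLE))
```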

If the system finds at least one <sourceDoc> element, the text is treated as encoded in ET: each <sourceDoc> is handled as a separate document, and each <surface> element, whether it is a child of <sourceDoc> or of a <surfaceGrp> element, is used to generate a single textual fragment.

We decided to treat the <surfaceGrp> element as a generic division in the code that produces no output of its own in the interface. Moreover, although <surfaceGrp> elements can be nested to any depth, the software currently supports only two levels of nesting.
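
As a rough illustration of this splitting step, the sketch below collects the <surface> elements of each <sourceDoc>, looking through at most two levels of <surfaceGrp>. It is written in Python rather than EVT's XSLT. The names `surfaces_of` and `split_into_fragments`, the use of @n as a label, and the choice to silently skip surfaces nested more deeply are all assumptions; the post does not say what EVT does with a third level.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
MAX_GROUP_DEPTH = 2  # levels of <surfaceGrp> nesting described as supported


def surfaces_of(node, depth=0):
    """Yield <surface> children of node, looking through <surfaceGrp> wrappers."""
    for child in node:
        if child.tag == TEI + "surface":
            yield child
        elif child.tag == TEI + "surfaceGrp" and depth < MAX_GROUP_DEPTH:
            # <surfaceGrp> produces no output itself; only its surfaces matter.
            yield from surfaces_of(child, depth + 1)


def split_into_fragments(tei_xml):
    """Map each <sourceDoc> (by its @n) to the labels of its surfaces."""
    root = ET.fromstring(tei_xml)
    return {
        doc.get("n", str(i)): [s.get("n") for s in surfaces_of(doc)]
        for i, doc in enumerate(root.iter(TEI + "sourceDoc"), start=1)
    }


# Made-up sample: two documents, one with three levels of <surfaceGrp>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sourceDoc n="A">
    <surface n="1r"/>
    <surfaceGrp>
      <surface n="1v"/>
      <surfaceGrp>
        <surface n="2r"/>
        <surfaceGrp><surface n="too-deep"/></surfaceGrp>
      </surfaceGrp>
    </surfaceGrp>
  </sourceDoc>
  <sourceDoc n="B"><surface n="1r"/></sourceDoc>
</TEI>"""

fragments = split_into_fragments(SAMPLE)
print(fragments)
```

In this sketch the surface inside the third-level <surfaceGrp> is simply left out, which is one possible reading of "only two levels are supported".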

The most important element after <sourceDoc> and <surface> is the <zone> element. This will be used to create the elements required for the activation of the image-text linking tool and of the hotspot tool.

A <zone> can be an empty node linked to one or more textual nodes by means of the <line> element, or it can contain the text directly, without any further sub-elements. We handle both cases, so the XSLT transformation for the image-text linking tool is activated by:

  • a <zone> element that contains some text and has the spatial coordinate attributes @ulx, @uly, @lrx and @lry. In this case, each sensitive area of the image identified by these coordinates will highlight all the text nested in the <zone>, even if it is spread over several lines.

[Figures: TEI example 1 displayed in EVT, and its XML encoding]

  • an empty <zone> element that has the spatial coordinate attributes and a reference to the particular <line> element it is linked to. As in the previous case, each sensitive area of the image identified by the coordinates will highlight the text inside the element linked to that <zone>.
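
The two cases above can be sketched as follows, again in Python rather than EVT's XSLT. The post does not say which attribute carries the reference from an empty <zone> to its <line>; the sketch assumes a @corresp attribute on the <zone> pointing at the @xml:id of the <line>, which is purely a guess for illustration. The helper names and the sample fragment are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
COORDS = ("ulx", "uly", "lrx", "lry")


def text_of(element):
    """All text inside element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def link_areas(surface):
    """Return (box, text) pairs for every zone that can drive image-text linking."""
    lines = {ln.get(XML_ID): ln for ln in surface.iter(TEI + "line")}
    areas = []
    for zone in surface.iter(TEI + "zone"):
        if not all(name in zone.attrib for name in COORDS):
            continue  # no coordinates, so no sensitive area
        box = tuple(float(zone.get(name)) for name in COORDS)
        own_text = text_of(zone)
        if own_text:
            # Case 1: the zone contains the text itself.
            areas.append((box, own_text))
        else:
            # Case 2: empty zone; follow the ASSUMED @corresp reference.
            target = lines.get(zone.get("corresp", "").lstrip("#"))
            if target is not None:
                areas.append((box, text_of(target)))
    return areas


# Made-up fragment covering both cases.
SAMPLE = """<surface xmlns="http://www.tei-c.org/ns/1.0">
  <zone ulx="10" uly="10" lrx="200" lry="60">first line
    second line</zone>
  <zone ulx="10" uly="70" lrx="200" lry="100" corresp="#l2"/>
  <line xml:id="l2">linked line</line>
</surface>"""

areas = link_areas(ET.fromstring(SAMPLE))
for box, text in areas:
    print(box, text)
```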

When the <zone> is missing the spatial coordinate attributes, the text-linking tool will not work, but the corresponding text (whether it is inside or outside the <zone> itself) will still be rendered in the interface, nested in a dedicated HTML container (<div class="*edition_level*-Zone">), so that the user can visually distinguish the different <zone> elements. The specific class (one for each edition level configured) makes it easy to customize how the <zone> is displayed in the browser.

If, instead, the <zone> element is an empty node and the reference between it and the textual node is missing or broken, the text will still appear properly on the page, but the image-text linking tool will not work for it, even if the <zone> has the spatial coordinate attributes.
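
The fallback behaviour described in the last two paragraphs might look like the sketch below. It is an illustration in Python, not EVT's XSLT. It assumes the container class is the edition level name followed by a plain hyphen and `Zone`, that `diplomatic` is a configured edition level, and that an empty <zone> refers to its <line> through a @corresp attribute; none of these details is confirmed by the post. Namespaces are omitted to keep the fragments short.

```python
from html import escape
import xml.etree.ElementTree as ET

COORDS = ("ulx", "uly", "lrx", "lry")


def text_of(element):
    """All text inside element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def is_linkable(zone, line_ids):
    """True if the image-text linking tool can work for this zone."""
    if not all(name in zone.attrib for name in COORDS):
        return False  # missing coordinates: text is shown, but not linked
    if text_of(zone):
        return True  # the zone carries its own text
    # Empty zone: the ASSUMED @corresp reference must resolve to a known line.
    return zone.get("corresp", "").lstrip("#") in line_ids


def render_zone(zone, edition_level):
    """Wrap the zone's text in the per-edition-level container (assumed class name)."""
    return f'<div class="{escape(edition_level)}-Zone">{escape(text_of(zone))}</div>'


# Made-up zones: no coordinates, broken reference, valid reference.
no_coords = ET.fromstring("<zone>text without coordinates</zone>")
broken = ET.fromstring('<zone ulx="0" uly="0" lrx="9" lry="9" corresp="#missing"/>')
valid = ET.fromstring('<zone ulx="0" uly="0" lrx="9" lry="9" corresp="#l1"/>')
line_ids = {"l1"}

print(is_linkable(no_coords, line_ids), render_zone(no_coords, "diplomatic"))
print(is_linkable(broken, line_ids))
print(is_linkable(valid, line_ids))
```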

As mentioned above, in some cases the <zone> element is used to generate a hotspot, that is, a sensitive area on the image directly linked to an HTML pop-up window. We decided to treat as a hotspot:

  • every <zone> that has the spatial coordinates attributes and is nested inside another <zone>; in this case the textual box will contain the text of the innermost zone;

[Figure: TEI example 2]

  • every <zone> (even if a direct child of <surface>) that contains a <graphic> element with a @url attribute; in this case the textual box will contain the image referenced by the <graphic> element and the text inside the <zone> itself (if present).

[Figure: TEI example 3]
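
The two hotspot rules can be expressed as a small classifier. The Python below is a sketch of the rules as stated in this post, not EVT's XSLT. It reads "contains a <graphic>" as "has a <graphic> as a direct child" and does not require coordinates for the second rule, since the post does not mention them there; both readings are assumptions, as are the function name and the sample fragment.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
COORDS = ("ulx", "uly", "lrx", "lry")


def hotspot_ids(surface):
    """Return the xml:id of every zone that would be treated as a hotspot."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in surface.iter() for child in parent}
    ids = []
    for zone in surface.iter(TEI + "zone"):
        has_coords = all(name in zone.attrib for name in COORDS)
        parent = parent_of.get(zone)
        nested = parent is not None and parent.tag == TEI + "zone"
        # Direct <graphic> children only (an assumption about "contains").
        has_graphic = any(g.get("url") for g in zone.findall(TEI + "graphic"))
        if (has_coords and nested) or has_graphic:
            ids.append(zone.get(XML_ID))
    return ids


# Made-up fragment: z2 is a nested zone, z3 holds a graphic, z1 and z4 are plain.
SAMPLE = """<surface xmlns="http://www.tei-c.org/ns/1.0">
  <zone xml:id="z1" ulx="0" uly="0" lrx="100" lry="100">main text
    <zone xml:id="z2" ulx="10" uly="10" lrx="20" lry="20">marginal note</zone>
  </zone>
  <zone xml:id="z3" ulx="0" uly="110" lrx="100" lry="200">
    <graphic url="detail.jpg"/>caption
  </zone>
  <zone xml:id="z4" ulx="0" uly="210" lrx="100" lry="300">plain text</zone>
</surface>"""

hotspots = hotspot_ids(ET.fromstring(SAMPLE))
print(hotspots)
```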

All hotspots that were handled by means of the @rendition attribute in texts encoded in PT will likewise work with texts encoded in ET.

Future developments

The remarkable variety of encodings possible with the Embedded Transcription method has made supporting it more complicated than we expected. As is clear from the section above, at least in this phase of development our support is somewhat “prescriptive”, in the sense that not every possible encoding is supported. Text encoded according to “reasonable” principles will very likely work, while in other cases it may or may not work and may therefore require some modifications to the encoded text. Since this is the first version of EVT supporting the Embedded Transcription method, we expect further improvements on the basis of users’ feedback: as always, feel free to contact us with your remarks, suggestions and feature requests. We have contacted the authors of the «Tri Pridige O Jeziku (three sermons on language)» edition mentioned above, and will experiment together with them with the goal of fine-tuning ET support in EVT.

Contact

EVT Project editionvisualizationtechnology@gmail.com
Roberto Rosselli Del Turco roberto.rossellidelturco@gmail.com

List of participants

Chiara Di Pietro (dipi.chiara@gmail.com)
Julia Kenny (julia.kenny90@gmail.com)
Raffaele Masotti (raffaele.masotti@gmail.com)

References

Digital Vercelli Book project: http://vbd.humnet.unipi.it/.
EADH Supported activities and reports: http://www.eadh.org/support/supported-activities-and-reports [full report submitted to EADH].
Edition Visualization Technology: http://sourceforge.net/projects/evt-project/.
Digital Vercelli Book beta version using EVT: http://vbd.humnet.unipi.it/beta.
TEI P5 Parallel Transcription: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html#PH-bov.
TEI P5 Embedded Transcription: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html#PHZLAB.
Tri Pridige O Jeziku (three sermons on language), http://nl.ijs.si/e-zrc/slomsek/index-en.html.

Post based on the final report by Chiara Di Pietro and Julia Kenny, revised and modified by R. Rosselli Del Turco.


Published on November 11, 2014 in articles, evt

 


Edward Vanhoutte’s Blog: So You Think You Can Edit? The Masterchef

“Both this blogpost and the journal article come without the chocolates.” 😉

http://edwardvanhoutte.blogspot.com/2011/10/so-you-think-you-can-edit-masterchef.html

 

Published on October 14, 2011 in Uncategorized

 
