Akoma Ntoso and Computer Assisted Translation (CAT) Systems

One problem with translating legal documents produced in MS-Word into other languages using Computer Assisted Translation (CAT) tools is that the translated document has to be re-edited and reformatted to fix its layout.

This problem is particularly evident when translating between languages written in different scripts, e.g. English to Chinese or English to Arabic. The layout issues would not be a deal breaker if the documents were intended only for online publication. However, in most cases these documents are translated for print production, and some are time sensitive and have to be printed and published promptly after translation.

The Problem

Documents authored in MS-Word mix presentation with content, so when they are translated the layout gets “translated” too, and we end up with a document that does not look the way we expected.

I have seen cases where translated documents had to be reformatted over and over again. During the production workflow the original English document changed significantly and had to be retranslated at several intermediate print publication steps (e.g. 1st reading, 2nd reading, committee review), and at each step the translated document had to be reformatted all over again.

For example, below is an English “Provisional Agenda” document. Note the styling information that is part of the content (marked and named in violet).

Word Document (FAO Provisional Agenda) with content and style (ENGLISH)

And see below the same document translated into Arabic:

Word Document (FAO Provisional Agenda) with content and style (ARABIC)

Notice how the Arabic translation differs subtly from the original (apart from the obvious right-to-left rendering of text) in font faces, font sizes, alignments and so on. Such changes are not applied automatically when the English document is translated into Arabic; they have to be applied by editing the translated document.

All this work and rework is needed just to produce one output format, for print. There are now many other publication channels (mobile, web, tablets, external services and data consumers), and each of them requires further re-engineering and conversion from Word to the target platform. Typically, so much effort goes into producing the correct print format in so many languages that no resources or energy are left to do these other channels justice.

What if the legal documents were drafted in Akoma Ntoso XML instead of MS-Word? Could that work better with CAT systems?

How Akoma Ntoso helps here

If the English legal document had been drafted in Akoma Ntoso format, manual reformatting of the Arabic translation would not have been required.

This is because the Akoma Ntoso XML legal document does not contain any styling or page-dimension-dependent information such as font faces, font sizes, margins, header and footer sizes, or page numbers. These are defined in a separate file, an XSLT stylesheet, which when applied to the XML document produces the desired output:

How XSLT Works

These rules are applied in an automated manner, so the output is always predictable and consistent. Any formatting issues can be addressed centrally in the XSLT rules and all the target documents can be reproduced with the correct formatting literally at the press of a button.
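To make the idea concrete, here is a rough sketch in Python using only the standard library. It is not XSLT, just a stand-in for what a stylesheet does: the content carries no styling, and a separate table of rules per language decides direction, font and size. The element names, the `STYLE_RULES` values and the `render_html` function are all assumptions made up for this illustration, not Akoma Ntoso markup or FAO house style.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, heavily simplified documents: structure and text only,
# with no fonts, sizes, or text direction anywhere in the content.
DOCS = {
    "en": "<doc><heading>Provisional Agenda</heading><p>Opening of the session</p></doc>",
    "ar": "<doc><heading>جدول الأعمال المؤقت</heading><p>افتتاح الدورة</p></doc>",
}

# Presentation rules live apart from the content, one set per language.
# The values are illustrative assumptions only.
STYLE_RULES = {
    "en": {"dir": "ltr", "font": "Times New Roman", "size": "11pt"},
    "ar": {"dir": "rtl", "font": "Traditional Arabic", "size": "14pt"},
}

# Mapping from (hypothetical) source elements to HTML elements.
TAG_MAP = {"heading": "h1", "p": "p"}


def render_html(xml_text, lang):
    """Apply the style rules for `lang` to a content-only XML document."""
    rules = STYLE_RULES[lang]
    style = f"font-family: {rules['font']}; font-size: {rules['size']}"
    direction = rules["dir"]
    lines = [f'<div lang="{lang}" dir="{direction}" style="{style}">']
    for child in ET.fromstring(xml_text):
        tag = TAG_MAP[child.tag]
        lines.append(f"<{tag}>{escape(child.text or '')}</{tag}>")
    lines.append("</div>")
    return "\n".join(lines)


if __name__ == "__main__":
    for lang, xml_text in DOCS.items():
        print(render_html(xml_text, lang))
```

Changing a font size for every Arabic document means editing one entry in the rules table and re-running the rendering; none of the translated documents is touched by hand.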

Print Ready outputs

For producing documents specifically for printing, the XSL family of standards includes XSL-FO (XSL Formatting Objects), which allows producing print-ready PDF documents with precise layouts, page numbering, margins, and support for footnotes, tables of contents etc.

How XSL-FO Works

Other targets

XSLT can produce any other text or markup based target from the Akoma Ntoso XML legal document, with one XSLT per target format. One common requirement is to render the XML as HTML for presentation on a web page, and XSLT is a very easy way to do that. It is significantly simpler than converting the XML to JSON and then rendering the JSON to HTML via JavaScript (I mention this because I see many developers using that approach).
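The "one set of rules per target" idea can be sketched in a few lines of Python. Again this is only an analogue of XSLT template rules written with the standard library, not XSLT itself; the `TARGETS` table, the `render` function and the sample fragment are hypothetical names and data chosen for the illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# A small fragment with nested inline elements (illustrative only).
SOURCE = '<p eId="p_1">Plain, <b>bold with <i>italic</i> inside</b>, plain.</p>'

# One rule table per target: element name -> (opening text, closing text).
TARGETS = {
    "html": {"p": ("<p>", "</p>"), "b": ("<strong>", "</strong>"), "i": ("<em>", "</em>")},
    "text": {"p": ("", "\n"), "b": ("*", "*"), "i": ("_", "_")},
}


def render(node, target):
    """Recursively render `node`, keeping text and child elements in document order."""
    rules = TARGETS[target]
    # Only the HTML target needs character escaping.
    quote = escape if target == "html" else (lambda s: s)
    opening, closing = rules[node.tag]
    out = [opening, quote(node.text or "")]
    for child in node:
        out.append(render(child, target))
        # Text that follows a child element belongs to the parent.
        out.append(quote(child.tail or ""))
    out.append(closing)
    return "".join(out)


html_out = render(ET.fromstring(SOURCE), "html")
text_out = render(ET.fromstring(SOURCE), "text")
```

Adding another target is a matter of adding another rule table; the source document and the traversal stay the same.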

Why is Akoma Ntoso in XML and not HTML or JSON or PDF?

This is a common question from implementers new to Akoma Ntoso: why was Akoma Ntoso required, and why could it not have been expressed as an HTML document or a JSON document, since those are easier to work with? I will start by stating that Akoma Ntoso XML is a document format for representing legal documents.

HTML

Let's look at HTML first. It is a simple format designed primarily to support presentation, while Akoma Ntoso is intended not just for presentation but also for print publication and semantic service access. HTML has limited support for representing structure, and semantics related to legislation are not an explicit part of its schema. HTML by its nature and use imposes few structural rules, and even those are not enforced very strictly. For these reasons HTML is not a good substitute for Akoma Ntoso XML.

PDF

PDF originated as a proprietary Adobe format and is now an ISO standard (ISO 32000); its primary purpose is to guarantee the visual presentation of documents for print production. It offers little practical support for document structure, or for semantics attached to that structure. PDF is relevant only when the most important requirement is that the visual presentation of the document be retained and consistent.

JSON

JSON is currently perhaps the most popular format among developers; a lot of software development nowadays involves JavaScript-based frameworks and NoSQL databases like MongoDB, which store JSON-like documents. However, none of this is a valid reason to use JSON instead of XML for Akoma Ntoso. JSON was designed primarily as a data-interchange format rather than a long-term storage format. XML provides mature infrastructure for validating the integrity of a document (using XML Schema rules and business rule languages like Schematron), and XML via XSLT and XSL-FO provides a pipeline of technologies to convert the XML to different formats for both online and print production. Doing any of these things in JSON presently means, for the most part, writing your own validation or conversion code.

JSON is great for representing data structures, but not suited for representing documents and their content. For example, the following is a fragment of text from a typical legal document:

This is random text with bold text and within it some italic text and again bold in the same sentence

That is, some text containing bold text, with some italic text inside the bold text. In XML this can be represented as below:

<p eId="p_1">
    This is random text with <strong class="bigger">bold text and within it
        <i class="slanted">some italic text</i> and again bold in the same sentence</strong>.
</p>
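As a quick check that nothing is lost in this form, the sketch below parses the same fragment (written on one line, to keep whitespace out of the way) with Python's standard `xml.etree.ElementTree`. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    '<p eId="p_1">This is random text with <strong class="bigger">bold text '
    'and within it <i class="slanted">some italic text</i> and again bold in '
    'the same sentence</strong>.</p>'
)

p = ET.fromstring(FRAGMENT)

# Attributes (local metadata) are kept apart from the content.
attributes = dict(p.attrib)

# itertext() yields the text of the element and of everything nested
# inside it, in document order, so the sentence can be read straight back.
sentence = "".join(p.itertext())

# The nesting is preserved too: the italic run sits inside the bold run.
italic = p.find("strong/i").text
```

Here `sentence` comes back as the original sentence, `attributes` holds only the `eId`, and `italic` is the text of the nested run, without any extra conventions on top of the XML itself.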

If you had to represent the same in JSON, it would look like:

{"p": {
    "eId": "p_1",
    "content": [
        "This is random text with",
        "."
    ],
    "strong": {
        "class": "bigger",
        "content": [
            "bold text and within it",
            {"i": {
                "class": "slanted",
                "content": "some italic text"
            }},
            "and again bold in the same sentence"
        ]
    }
}}

Apart from the representation itself being more complicated than the XML one, there are other problems with it:

Since JSON does not distinguish between attributes (i.e. local metadata) and the content itself, we need to introduce a named structure called “content” for the content, and assume that all other properties are the equivalent of attributes.