Okapi Components - TMX Language Duplicates Splitting Utility

Overview

- The utility set identifier for this utility is: oku_set04
- The utility identifier is: tmxsplittingdup

The TMX Language Duplicates Splitting utility lets you take the <tuv> elements that share the same language inside a given <tu> element and split them into separate <tu> elements.

For example, if you have the following TMX file:

<tmx version="1.4">
 <header creationtool="XYZ" creationtoolversion="1.0"
  datatype="plaintext" segtype="sentence"
  adminlang="en-US" srclang="en-US" o-tmf="WXYTool">
 </header>
 <body>
  <tu tuid="1">
   <tuv xml:lang="en-US">
    <seg>Efficiency is intelligent laziness.</seg>
   </tuv>
   <tuv xml:lang="fr-FR">
    <seg>L'efficacité c'est la paresse intelligente.</seg>
   </tuv>
   <tuv xml:lang="es-ES">
    <seg>La eficacia es la holgazanería inteligente.</seg>
   </tuv>
   <tuv xml:lang="fr-FR">
    <seg>L'efficacité est la fille de la paresse.</seg>
   </tuv>
  </tu>
  <tu tuid="2" srclang="fr">
   <tuv xml:lang="en">
    <seg>Item text</seg>
   </tuv>
   <tuv xml:lang="fr">
    <seg>Texte de l'article</seg>
   </tuv>
   <tuv xml:lang="en">
    <seg>Text of the article</seg>
   </tuv>
  </tu>
 </body>
</tmx>

The utility will generate the following output:

<tmx version="1.4">
 <header creationtool="XYZ" creationtoolversion="1.0"
  datatype="plaintext" segtype="sentence"
  adminlang="en-US" srclang="en-US" o-tmf="WXYTool">
 </header>
 <body>
  <tu tuid="1">
   <tuv xml:lang="en-US">
    <seg>Efficiency is intelligent laziness.</seg>
   </tuv>
   <tuv xml:lang="fr-FR">
    <seg>L'efficacité c'est la paresse intelligente.</seg>
   </tuv>
   <tuv xml:lang="es-ES">
    <seg>La eficacia es la holgazanería inteligente.</seg>
   </tuv>
  </tu>
  <tu tuid="2" srclang="fr">
   <tuv xml:lang="en">
    <seg>Item text</seg>
   </tuv>
   <tuv xml:lang="fr">
    <seg>Texte de l'article</seg>
   </tuv>
  </tu>
  <tu tuid="1">
   <tuv xml:lang="en-US">
    <seg>Efficiency is intelligent laziness.</seg>
   </tuv>
   <tuv xml:lang="fr-FR">
    <seg>L'efficacité est la fille de la paresse.</seg>
   </tuv>
  </tu>
  <tu tuid="2" srclang="fr">
   <tuv xml:lang="fr">
    <seg>Texte de l'article</seg>
   </tuv>
   <tuv xml:lang="en">
    <seg>Text of the article</seg>
   </tuv>
  </tu>
 </body>
</tmx>
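
To see in advance which <tu> elements in a file would be affected, you can count the languages of the <tuv> elements in each <tu>. The following Python sketch is not part of the utility; the function names are hypothetical, and the sample is an abbreviated copy of the input above (header omitted).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def duplicated_languages(tu):
    """Return the languages that occur more than once among the <tuv> of one <tu>."""
    counts = Counter(tuv.get(XML_LANG) for tuv in tu.findall("tuv"))
    return [lang for lang, n in counts.items() if n > 1]


def find_duplicates(tmx_text):
    """Map each tuid to its duplicated languages; <tu> without duplicates are omitted."""
    root = ET.fromstring(tmx_text)
    result = {}
    for tu in root.iter("tu"):
        dups = duplicated_languages(tu)
        if dups:
            result[tu.get("tuid")] = dups
    return result


# Abbreviated copy of the example input (hypothetical variable name).
sample = """<tmx version="1.4"><body>
 <tu tuid="1">
  <tuv xml:lang="en-US"><seg>Efficiency is intelligent laziness.</seg></tuv>
  <tuv xml:lang="fr-FR"><seg>L'efficacité c'est la paresse intelligente.</seg></tuv>
  <tuv xml:lang="es-ES"><seg>La eficacia es la holgazanería inteligente.</seg></tuv>
  <tuv xml:lang="fr-FR"><seg>L'efficacité est la fille de la paresse.</seg></tuv>
 </tu>
 <tu tuid="2" srclang="fr">
  <tuv xml:lang="en"><seg>Item text</seg></tuv>
  <tuv xml:lang="fr"><seg>Texte de l'article</seg></tuv>
  <tuv xml:lang="en"><seg>Text of the article</seg></tuv>
 </tu>
</body></tmx>"""

print(find_duplicates(sample))
```

For the sample, the check reports fr-FR as duplicated in the first <tu> and en in the second, which are the two entries the utility splits in the output above.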

Some things to keep in mind when using this utility:

- The utility assumes that the source language never has duplicates.
- The utility assumes that, when there are only two <tuv> elements in a <tu> element, there are never duplicates (because it assumes one is in the source language and the other is in a target language).
- Output files are created only for input files in which at least one duplicated <tuv> element has been found.
- The new <tu> elements have the same tuid attribute as the original <tu>.
- The new <tu> elements may be located anywhere in the file (not necessarily near the original <tu>).
- The new <tu> elements contain only the source <tuv> and the duplicated language <tuv> of the original <tu>.
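
The notes above can be read as a small algorithm. The following Python sketch illustrates that reading; it is not the utility's actual implementation. Several points are assumptions made for the illustration: the function name is hypothetical, the source language is taken from the srclang attribute of the <tu> (falling back to the header), language codes are compared case-insensitively, only the attributes of the original <tu> are copied to the new one, and the new <tu> elements are appended at the end of <body>.

```python
import copy
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_language_duplicates(root):
    """Move repeated-language <tuv> elements into new <tu> elements.

    Returns the number of new <tu> elements created.
    """
    header = root.find("header")
    default_src = header.get("srclang", "") if header is not None else ""
    body = root.find("body")
    new_tus = []
    for tu in body.findall("tu"):
        tuvs = tu.findall("tuv")
        if len(tuvs) <= 2:
            continue  # assumed to be one source and one target, so no duplicates
        src_lang = tu.get("srclang", default_src).lower()
        source = next(
            (t for t in tuvs if t.get(XML_LANG, "").lower() == src_lang), None
        )
        if source is None:
            continue  # sketch only: leave the <tu> alone if no source <tuv> is found
        seen = set()
        for tuv in tuvs:
            lang = tuv.get(XML_LANG, "").lower()
            # The first <tuv> of each language stays in place; the source
            # language is assumed never to be duplicated and is never moved.
            if lang == src_lang or lang not in seen:
                seen.add(lang)
                continue
            # Repeated target language: new <tu> with the same attributes
            # (including tuid), holding a copy of the source plus the duplicate.
            tu.remove(tuv)
            new_tu = ET.Element("tu", dict(tu.attrib))
            new_tu.append(copy.deepcopy(source))
            new_tu.append(tuv)
            new_tus.append(new_tu)
    body.extend(new_tus)  # assumption: new entries go at the end of <body>
    return len(new_tus)


# Second <tu> of the example above, with an abbreviated header.
sample = (
    '<tmx version="1.4"><header srclang="en-US"/><body>'
    '<tu tuid="2" srclang="fr">'
    '<tuv xml:lang="en"><seg>Item text</seg></tuv>'
    '<tuv xml:lang="fr"><seg>Texte de l\'article</seg></tuv>'
    '<tuv xml:lang="en"><seg>Text of the article</seg></tuv>'
    '</tu></body></tmx>'
)
root = ET.fromstring(sample)
created = split_language_duplicates(root)
for tu in root.iter("tu"):
    print(tu.get("tuid"), [(v.get(XML_LANG), v.findtext("seg")) for v in tu.findall("tuv")])
```

A caller could use the returned count to decide whether to write an output file at all, in line with the note that output is created only when a duplicate was found.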

Common Parameters

The common parameters are the options set in the application that calls the utility, rather than in the options dialog box of the utility itself. For this utility, the common parameters you need to specify are the following:

Files of the first input list        - Needed (the TMX files to process)
Root for the first input list        - Not Needed
Files of the second input list       - Not Needed
Root for the second input list       - Not Needed
Files of the third input list        - Not Needed
Root for the third input list        - Not Needed
Input language                       - Not Needed
Output language                      - Not Needed
Input default encoding               - Not Needed
Output default encoding              - Not Needed
Location and names for output files  - Needed

This utility has no options.