Okapi Components - Text Extraction Utility

- Overview
- Common Parameters
- Options - Format Tab
- XLIFF Output Options
- Table Output Options
- TMX Output Options
- Options - Options Tab
- Options - Package Tab
- Generated XLIFF

Overview

- The utility set identifier for this utility is: oku_set01
- The utility identifier is: extraction

The Text Extraction utility allows you to separate translatable text from the non-translatable parts of an input file and put the result in a format that can be used by translation tools.

Several output formats are available: XLIFF, Original+RTF, Table, TMX, XLIFF+RTF, and OmegaT. The output to XLIFF, OmegaT, and (with an option) Table can be merged back into the original format using the Text Merging utility.

The list of input files can contain files that have no associated filters. These files are simply copied into the Target folder of the package. For example, when preparing a set of HTML files, you can include all the image and style-sheet files that are used; they will be copied into the package.

Common Parameters

The common parameters are the options specified from the application calling the utility rather than in the options dialog box of the utility itself. For this utility the common parameters you need to specify are the following:

Files of the first input list	- Needed (the files to extract)
Root for the first input list	- Needed
Files of the second input list	- Not Needed
Root for the second input list	- Not Needed
Files of the third input list	- Not Needed
Root for the third input list	- Not Needed
Input language	- Needed
Output language	- Needed
Input default encoding	- Needed
Output default encoding	- Needed
Location and names for output files	- Not Needed

Options - Format Tab

Extract the input files into the following format -- Select the format you want to generate. There are several choices:

XLIFF (XML Localisation Interchange File Format)
Each input file is extracted into a corresponding XLIFF document, with as much information as the filter used for the file provides (i.e. identifiers, coordinates, font information, etc.). XLIFF is a standard XML format to transport localizable data. The generated XLIFF documents can be edited in any XLIFF-enabled or XML-enabled editor, then merged back into the original files using the Text Merging utility.
For more information, see the XLIFF Output Options.

Original Format + RTF Layer
Each input file is extracted into an RTF file where all the non-translatable parts and inline codes are marked up with special styles. The file is compatible with translation tools working with RTF documents such as Trados Translator's Workbench, SDLX or Wordfast. Note that not all filters can generate this type of output, because they may not be text-based or may not have RTF output capability. In these cases, the utility automatically changes the type of output to XLIFF+RTF for the files using the given filter.

Tab-delimited Table
Each input file is extracted into a corresponding tab-delimited table. The generated file is always encoded in UTF-8 and does not support inline codes.
For more information, see the Table Output Options.

TMX (Translation Memory eXchange)
Each input file is extracted into a corresponding TMX document. TMX is a standard XML format for exchanging translation memories.
For more information, see the TMX Output Options.

XLIFF + RTF Layer
Each input file is extracted into a corresponding XLIFF document, and an RTF layer is put on top of the XLIFF document: all the non-translatable parts and inline codes are marked up with special styles. The file is compatible with translation tools working with RTF documents such as Trados Translator's Workbench, SDLX or Wordfast.
For more information, see the XLIFF + RTF Output Options.

XLIFF for OmegaT
Each input file is extracted into a corresponding XLIFF file designed to be processed by OmegaT. The file can be merged back into its original format using the Text Merging utility.

Format Output -- Click this button to access the dialog box where you can specify the options associated with the output format currently selected, if any are available. For details see: XLIFF Output Options, Table Output Options, TMX Output Options, and XLIFF + RTF Output Options.

XLIFF Output Options

XLIFF version -- Select the XLIFF version of the output file.

Include a <target> element for each <trans-unit> -- Set this option to output a <target> element in each <trans-unit> element generated. The content of this <target> element depends on the filter and on the options you select.

The existing translation if available, a copy of the source otherwise -- Select this option to set the target text to the existing translation of the item when it is available. If no corresponding translation is found in the input file, a copy of the source text is used for the target text.

The existing translation if available, no text otherwise -- Select this option to set the target text to the existing translation of the item when it is available. If no corresponding translation is found in the input file, the target text is left empty.

A copy of the source text (even if there is a translation available) -- Select this option to always set the target text to the source text, even when a corresponding translation is found in the input file.
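
The three choices above can be summarized in a short Python sketch. This is only an illustration of the described behavior, not the utility's own code; the mode names and the function name are hypothetical, and an unavailable translation is assumed to be represented by None.

```python
from typing import Optional

# Hypothetical names for the three choices; the utility presents them
# as options in a dialog box, not as identifiers.
EXISTING_OR_SOURCE = "existing_or_source"
EXISTING_OR_EMPTY = "existing_or_empty"
ALWAYS_SOURCE = "always_source"
MODES = {EXISTING_OR_SOURCE, EXISTING_OR_EMPTY, ALWAYS_SOURCE}


def choose_target_text(source: str, existing: Optional[str], mode: str) -> str:
    """Return the text to place in the target of one item.

    `existing` is the translation found in the input file, or None when
    there is none (an assumption of this sketch).
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == ALWAYS_SOURCE:
        # The source is copied even when a translation is available.
        return source
    if existing is not None:
        # Both remaining modes prefer the existing translation.
        return existing
    # No translation found: fall back to the source or to empty text.
    return source if mode == EXISTING_OR_SOURCE else ""
```

For example, choose_target_text("Hello", None, EXISTING_OR_EMPTY) yields an empty target, while the same call with EXISTING_OR_SOURCE yields "Hello".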

Set any <trans-unit> with existing translation to translate='no' -- Set this option to flag all translation units that have a <target> element containing an existing translation extracted from the input file as not translatable.

Include an <alt-trans> element for each existing translation -- Set this option to generate an <alt-trans> element for each entry that has an existing translation. The content of the <target> element in this <alt-trans> element is always set to the target text, regardless of which option is specified for the text of the <target> element at the <trans-unit> level.

Include notes when available -- Set this option to generate <note> elements when the filter provides such information.

Use placeholder notation (<g></g> and <x/>) -- Set this option to output the inline codes using XLIFF placeholder elements <g> and <x/> rather than the encapsulating elements like <bpt>, <ept> and <ph>.

Include word-count -- Set this option to include the word-count of each item in its corresponding <source> element.

The output to XLIFF and Table can be merged back into the original format using the Text Merging utility. For more information about the type of XLIFF markup generated by this utility see the section Generated XLIFF. For more information about XLIFF in general see the XLIFF Web site.

Table Output Options

The Table output format is a simple output that uses the extracted text as it is, that is, without isolating possible inline codes in special markers as the other formats can. Whether this works depends on the original file format: some filters (such as those for XML or HTML) may apply special escapes when merging back, and inline codes "seen" as text, as in the Table output format, may cause incorrect merging.

In addition, because the raw characters Tab, Carriage-Return and Line-Feed would break the table layout, they are always escaped (Tab=<\T>, CR+LF=<\RN>, CR=<\R>, and LF=<\N>). The file is always in UTF-8 and may have an optional header line. The layout of the table is the following:

Column 1: The ID of the item.
Column 2: The source text.
Optional next column: The resname of the item (or its ID if there is no resname)
Optional next column: The target text (according to the options you specify).
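
The escaping and the column layout described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the source does not say how text that already contains a literal sequence such as <\T> is handled, so that case is not covered here.

```python
from typing import Optional


def escape_cell(text: str) -> str:
    """Escape the characters that would break the table layout."""
    # CR+LF must be replaced before the lone CR and LF.
    return (text.replace("\r\n", "<\\RN>")
                .replace("\r", "<\\R>")
                .replace("\n", "<\\N>")
                .replace("\t", "<\\T>"))


def unescape_cell(text: str) -> str:
    """Reverse escape_cell()."""
    return (text.replace("<\\RN>", "\r\n")
                .replace("<\\R>", "\r")
                .replace("<\\N>", "\n")
                .replace("<\\T>", "\t"))


def make_row(item_id: str, source: str,
             resname: Optional[str] = None,
             target: Optional[str] = None,
             include_resname: bool = False,
             include_target: bool = False) -> str:
    """Build one tab-delimited row: ID, source, then the optional columns."""
    cells = [item_id, source]
    if include_resname:
        # The ID is used when the item has no resname.
        cells.append(resname if resname is not None else item_id)
    if include_target:
        cells.append(target if target is not None else "")
    return "\t".join(escape_cell(cell) for cell in cells)
```

Because every cell is escaped before the cells are joined, the only raw tabs left in a row are the column separators.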

Include a header line -- Set this option to output a line of column titles at the top of the table. This line is mandatory if you want to merge the extracted text back into its original format later on.

Include a column with the resname value -- Set this option to output, just after the source column, a column containing the value of the resname property (i.e. same as the resname attribute in XLIFF) for each text item. If no resname is provided the sequential id value of the item (same as the id attribute in XLIFF) is used instead.

Include a last column with the target text -- Set this option to output a target column at the right end of the table. The content of this column depends on the filter and on the options you select.

TMX Output Options

Include a target <tuv> element for each <tu> -- Set this option to output a <tuv> element set to the target language in each <tu> element generated. The content of this <tuv> element depends on the filter and on the options you select.

The existing translation if available, a copy of the source otherwise -- Select this option to set the target text with the existing translation of the item when it is available. If no corresponding translation is found in the input file, a copy of the source text is used for the target text.

The existing translation if available, no text otherwise -- Select this option to set the target text with the existing translation of the item when it is available. If no corresponding translation is found in the input file, the target text is left empty.

A copy of the source text (even if there is a translation available) -- Select this option to always set the target text with the source text, even when a corresponding translation is found in the input file.

Remove leading and trailing white-spaces -- Set this option to ensure that all leading and trailing white-spaces are removed from the TMX segments. Strictly compliant TMX segments have no leading or trailing white-spaces. Note that this may result in empty entries.

For more information about TMX, see the TMX Web site.

XLIFF + RTF Output Options

Include notes when available -- Set this option to generate <note> elements when the filter provides such information.

If you strip out the RTF layer, the resulting XLIFF document can be merged back into the original format using the Text Merging utility. For more information about the type of XLIFF markup generated by this utility see the section Generated XLIFF. For more information about XLIFF in general see the XLIFF Web site.

Options - Options Tab

Extract only the text items with a leading marker -- Set this option to extract only the items that start with the text specified in the Markers edit field. For example "[$TBT$]Text to localize" where "[$TBT$]" is the marker. The marker itself is removed during the extraction, so the text is presented ready for translation. When this option is set, the text items that do not have the marker are set as non-translatable.

Markers -- Enter one or more markers that will be used as a flag to know what items need to be translated. The text is case-sensitive. If you have several markers they should be separated by a comma or a semi-colon. All white-spaces between markers are ignored. (e.g. "[$TBT>; [$TBE>").

Example of input file with markers. Only the entries with "[$TBT$]" will be translated:

# messages.properties v1.3
# Last update: April-26

error.internal.badmsg = Erreur interne. Message invalide.
error.filenotfound    = [$TBT$]File {0} not found.
error.badinst.nojre   = Erreur d'installation.
itemfound             = [$TBT$]Item '{0}' has been found in '{1}'.

Resulting RTF output file (translatable text in black, things not to touch in gray):

# messages.properties v1.3
# Last update: April-26

error.internal.badmsg = Erreur interne. Message invalide.
error.filenotfound    = File {0} not found.
error.badinst.nojre   = Erreur d'installation.
itemfound             = Item '{0}' has been found in '{1}'.
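
The marker rules described above (comma or semi-colon separators, white-spaces between markers ignored, case-sensitive match at the very beginning of the text, marker removed) can be sketched in Python. The function names are hypothetical and this is not the utility's own implementation; "[$REV$]" is a made-up second marker used only for illustration.

```python
import re
from typing import List, Tuple


def parse_markers(field: str) -> List[str]:
    """Split the Markers field on commas or semi-colons.

    White-space around each marker is dropped and empty entries are skipped.
    """
    return [part.strip() for part in re.split(r"[,;]", field) if part.strip()]


def apply_markers(text: str, markers: List[str]) -> Tuple[bool, str]:
    """Return (translatable, text) for one extracted item.

    The match is case-sensitive and only tested at the start of the text.
    When a marker matches, it is removed from the returned text.
    """
    for marker in markers:
        if text.startswith(marker):
            return True, text[len(marker):]
    return False, text


markers = parse_markers("[$TBT$]; [$REV$]")  # "[$REV$]" is hypothetical
for entry in ("[$TBT$]File {0} not found.", "Erreur interne. Message invalide."):
    print(apply_markers(entry, markers))
```

With the sample file above, only the two entries that start with "[$TBT$]" come back as translatable, and their text no longer carries the marker.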

Create a TMX output file with any pre-translated entries found -- Set this option to generate a TMX file that contains all the source+target pairs found during the extraction process. This option is used for bilingual input files such as PO files, where some translations may be already available.

Filename -- Enter the name of the TMX output file that will be generated. The file will be placed in the package folder. If you do not specify an extension, the .tmx extension will be added automatically. This is a filename, not a path. If you enter a path, only the filename part will be taken into account. The generated TMX file will overwrite any existing file with the same name at the given location.
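
As a rough illustration of the filename rules above, the following Python sketch keeps only the filename part of the entered value and adds .tmx when no extension is present. The function name is hypothetical, and treating the value as a Windows-style path (which also accepts forward slashes) is an assumption based on the paths shown elsewhere in this page.

```python
from pathlib import PureWindowsPath


def tmx_output_name(entered: str) -> str:
    """Return the TMX filename that would be used for the entered value."""
    # Keep only the last path component; any folder part is ignored.
    name = PureWindowsPath(entered).name
    # Add the default extension only when none was given.
    if not PureWindowsPath(name).suffix:
        name += ".tmx"
    return name


for value in ("pretrans", "C:\\work\\pretrans", "memories/project.tmx"):
    print(value, "->", tmx_output_name(value))
```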

List font information when available -- Set this option to generate a list of the fonts used (with extracted items) in the processed files. Note that such information may not be available with all filters.

List fonts for each input file -- Set this option to list the fonts for each input file (when it is available). If this option is not set, only a summary of all fonts used at least once in all files is generated. This option is enabled only when the option List font information when available is set.

The two options above work for filters that return a UsedFonts property when reaching the end of the file (i.e. when ReadItem() returns ENDINPUT). The property value must be a list of all the fonts used by translatable items in the file that was processed. The font names must be separated by a tab character.
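
The difference between the per-file list and the summary can be sketched in Python. This is only an illustration: the function names are hypothetical, and the input is assumed to be a mapping from each input file to the tab-separated UsedFonts value its filter returned.

```python
from typing import Dict, List, Union


def split_used_fonts(value: str) -> List[str]:
    """Split a tab-separated UsedFonts value into font names."""
    return [name for name in value.split("\t") if name]


def font_report(used_fonts: Dict[str, str],
                per_file: bool) -> Union[Dict[str, List[str]], List[str]]:
    """Return the fonts per input file, or a summary across all files."""
    if per_file:
        return {path: split_used_fonts(value) for path, value in used_fonts.items()}
    # Summary: every font used at least once in any file, without repeats.
    summary = set()
    for value in used_fonts.values():
        summary.update(split_used_fonts(value))
    return sorted(summary)


# Hypothetical file names and font lists, for illustration only.
fonts = {"dialogs.rc": "Arial\tTahoma", "menus.rc": "Tahoma\tVerdana"}
print(font_report(fonts, per_file=False))
```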

Add reference information using the following lookup file -- Set this option to add additional information along with each extracted item. The data to associate is found in the tab-delimited file you specify below the check box. This option applies currently only to Original Format + RTF Layer output format.

The tab-delimited lookup file is expected to be in UTF-16 or UTF-8 encoding and to have the following format: the first column is the data, and the second column is the resname value of the associated item. If an item has several data entries, they are all included. Values of the columns should not be between quotes. Any line without a tab is considered a comment and is skipped.
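
A minimal Python sketch of reading such a lookup file is shown below. It works on lines that have already been decoded, so the choice between UTF-16 and UTF-8 is left to the caller. The function name and the sample data are hypothetical, and ignoring any columns after the second is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_lookup_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each resname to the list of data entries associated with it."""
    references = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if "\t" not in line:
            continue  # A line without a tab is a comment.
        data, rest = line.split("\t", 1)
        # Assumption: anything after a further tab is ignored.
        resname = rest.split("\t", 1)[0]
        references[resname].append(data)
    return dict(references)


sample = [
    "Reference notes for messages.properties",
    "Max. 40 characters\terror.filenotfound",
    "Shown in the status bar\terror.filenotfound",
    "Dialog title\titemfound",
]
print(parse_lookup_lines(sample))
```

An item with several entries, such as error.filenotfound in the sample, keeps all of them in the order they appear in the file.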

Generate any ancillary data available -- Set this option to generate any ancillary output the filters may be able to produce. Some file formats, for example, have embedded graphics, and their filters may be able to create separate files for these graphics.

Options - Package Tab

Output folder -- Enter the path of the folder where the package should be created. The folder and any required parent folders will be created automatically if they do not exist yet.

Package name -- Enter the name of the package to create. A sub-folder of that name will be created in the output folder, and the different output will be generated under that sub-folder.

For both the output folder and the package name you can use the variable placeholders <SrcLangCode> and <TrgLangCode> to represent the current input language code and the current output language code (or <SrcLangCodeU> and <TrgLangCodeU> for the codes in uppercase). The placeholders are replaced by their values when the extraction is executed. For example, the package name Pack_<TrgLangCode> will be generated as Pack_fr-FR if the current output language is defined as fr-FR.

The generated files will be distributed in the following structure:

[Output Folder]\[Package Name]\Work\<all the extracted files>
[Output Folder]\[Package Name]\Original\<all the original files>
[Output Folder]\[Package Name]\Target\<all the original non-extractable files>
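
The placeholder replacement and the resulting folder layout can be sketched in Python. The function names are hypothetical, the paths are treated as Windows-style paths as in the structure above, and the uppercase variants are assumed to be a plain uppercase conversion of the whole language code.

```python
from pathlib import PureWindowsPath
from typing import Dict


def expand_placeholders(template: str, src_lang: str, trg_lang: str) -> str:
    """Replace the language placeholders in a folder or package name."""
    return (template.replace("<SrcLangCodeU>", src_lang.upper())
                    .replace("<TrgLangCodeU>", trg_lang.upper())
                    .replace("<SrcLangCode>", src_lang)
                    .replace("<TrgLangCode>", trg_lang))


def package_folders(output_folder: str, package_name: str,
                    src_lang: str, trg_lang: str) -> Dict[str, str]:
    """Return the Work, Original and Target folders of the package."""
    root = (PureWindowsPath(expand_placeholders(output_folder, src_lang, trg_lang))
            / expand_placeholders(package_name, src_lang, trg_lang))
    return {name: str(root / name) for name in ("Work", "Original", "Target")}


print(expand_placeholders("Pack_<TrgLangCode>", "en-US", "fr-FR"))
print(package_folders("C:\\Out", "Pack_<TrgLangCode>", "en-US", "fr-FR"))
```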

Copy the original files in the output package -- Set this option if you want the original files to be copied in the Original sub-folder created under the package folder.

Create a Rainbow project for merging any XLIFF output -- Set this option to automatically generate a Rainbow project file for merging back any of the output files extracted to XLIFF. If this option is set but no file is output to a format that can be merged, no Rainbow file is generated. If a Rainbow file is generated, it is placed in the Target folder with the name _Merge.rbp, along with a _Merge.bat batch file to execute it.

In addition, all the parameters files used for the extraction are copied into this folder as well, and the merging information in each extracted file points to these parameters files. Note that even if you use default parameters for the extraction, a parameters file will be generated in the Target folder. This ensures that the merging is done with exactly the same parameters as the extraction, even if some shared parameters files are modified on your machine between the extraction and the merging.

For the XML Filter parameters files: If the options are using a declared ITS rules document, the given rules--along with any linked rules--are compiled into a new rules document that is copied to the Target folder. The parameters file itself is also copied there and is modified to point to this new rules document. ITS rules documents linked from the original source document are not touched.

Create an Horizon settings file (.hrs) -- Set this option if you want to generate a settings file that can be used with Horizon to browse through the prepared files. When this option is set, the option Copy the original files in the output package is forced and becomes inaccessible. The Horizon settings file generated will have the same name as the package with a .hrs extension. This option is useful when extracting the input files into RTF or XLIFF+RTF.

Note that there are a few compatibility issues with Horizon version 3.x:

Rainbow version 5.x allows you to specify languages that are not, by default, available in Horizon version 3.x, for example language-only or user-defined languages. When the language code specified in Rainbow does not exist in Horizon, the default language for Horizon is substituted.
Some encoding names of Rainbow version 5.x are different from the ones used in Horizon version 3.x. When the encoding name specified in Rainbow does not exist in Horizon, the default encoding name for the selected language is used.
Rainbow version 5.x allows you to define an output encoding for each input file if desired, while Horizon version 3.x uses the same output encoding for all files. When creating the Horizon settings file, Rainbow uses the encoding defined for the first input file (which may or may not be the default output encoding, depending on the properties for that input file).

Most of the time, these issues do not arise. When they do, you can correct them in Horizon by specifying the correct information after the Horizon settings file has been generated. In Horizon, select the Settings option from the File menu, then select the Conversion tab, and if necessary, reset the language and encoding information.

Create an OmegaT project file -- Set this option to generate a project file and the relevant folders for OmegaT. OmegaT is an open-source translation tool. When you select this option, the Work output folder is called source and the Target output folder is called target; the optional TMX output is placed in the tm folder, and the empty folders omegat and glossary are also generated.

Generated XLIFF

Whether they have an RTF layer or not, the XLIFF files generated by the Text Extraction utility have the following characteristics:

Okapi XLIFF Extensions

The files use a few Okapi-specific attributes that are defined through the okapi-framework:xliff-extensions namespace. This namespace usually uses the okp prefix.

okp:settings - Stores the settings string that should be used to re-process the original input file. This is always used during the merging process. For more information about the format of this string, see Filter settings string parameter of the LoadSettings method of the Filter Interface. Note that the value of this attribute depends on whether the option Create a Rainbow project for merging any XLIFF output is set or not. If it is set, the value points to the parameters file created in the Target folder along with the merge project. If it is not set, the value is the same as the settings used for the extraction.
okp:encoding - Indicates the default encoding of the original input file. This may or may not be used during the merging. The value must be an IANA charset name.

The files must have these attributes intact when you merge back the XLIFF documents with the Text Merging utility.
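
One way to check that these attributes survived editing is to scan the XLIFF document for attributes in the Okapi namespace before merging. The Python sketch below does this with the standard library. The function name is hypothetical, and the sample document is only an assumption about what such a file might look like: in particular, placing the attributes on the <file> element and the settings value shown are illustrative guesses, not taken from this page.

```python
import xml.etree.ElementTree as ET
from typing import Dict

OKP_NS = "okapi-framework:xliff-extensions"


def find_okapi_attributes(xliff_text: str) -> Dict[str, str]:
    """Collect all attributes of the Okapi extensions namespace."""
    prefix = "{" + OKP_NS + "}"
    found = {}
    # Every element is scanned, so the sketch does not depend on which
    # element actually carries the attributes.
    for element in ET.fromstring(xliff_text).iter():
        for name, value in element.attrib.items():
            if name.startswith(prefix):
                found[name[len(prefix):]] = value
    return found


# Hypothetical sample: attribute placement and values are assumptions.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
    xmlns:okp="okapi-framework:xliff-extensions">
  <file original="messages.properties" source-language="en-US"
      target-language="fr-FR" datatype="x-sample"
      okp:settings="hypothetical-settings-string" okp:encoding="UTF-8">
    <body/>
  </file>
</xliff>"""

attributes = find_okapi_attributes(SAMPLE)
missing = {"settings", "encoding"} - set(attributes)
print(attributes, "missing:", sorted(missing))
```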

Adding Inline Codes

You can add inline codes in the XLIFF documents by using <bpt>/<ept> or <ph/> elements with an id attribute set to zero. For example, in the entry below the codes <BOLD> and </BOLD> have been added to the entry:

<target xml:lang="fr-FR">This section <bpt id="0">&lt;BOLD></bpt>must<ept id="0">&lt;/BOLD></ept>
be read by the user.</target>

The added content must result in valid codes for the original format (XML, HTML, etc.) once it has been merged back. The tools adding such inline codes are responsible for creating the correct codes.

Removing Inline Codes

You can remove inline codes in the XLIFF documents by deleting the content of the relevant <bpt>/<ept> or <ph/> elements. If you delete the content of a <bpt> element, you must also delete the content of its corresponding <ept> element (both have the same id value). For example, in the entry below the codes stored in <bpt> and <ept> have been removed.

<target xml:lang="fr-FR">This section <bpt id="1"></bpt>must<ept id="1"></ept>
be read by the user.</target>

The modified entry must result in a valid entry for the original format once it has been merged back.
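
The pairing rule above (when the content of a <bpt> is deleted, the content of the <ept> with the same id must be deleted too) can be checked with a short Python sketch. The function name is hypothetical, the sketch works on a single <target> fragment without namespaces, and it assumes each id value is used by at most one <bpt>/<ept> pair in the entry, which may not hold for added codes that all use id 0.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def check_paired_codes(target_xml: str) -> List[str]:
    """Return a list of problems found in the <bpt>/<ept> pairs."""
    opened: Dict[str, bool] = {}  # id -> True when the <bpt> has content
    closed: Dict[str, bool] = {}  # id -> True when the <ept> has content
    for element in ET.fromstring(target_xml).iter():
        has_content = bool((element.text or "").strip())
        if element.tag == "bpt":
            opened[element.get("id", "")] = has_content
        elif element.tag == "ept":
            closed[element.get("id", "")] = has_content
    problems = []
    for code_id in sorted(set(opened) | set(closed)):
        if code_id not in opened or code_id not in closed:
            problems.append(f"id {code_id}: missing <bpt> or <ept>")
        elif opened[code_id] != closed[code_id]:
            problems.append(f"id {code_id}: only one side of the pair was emptied")
    return problems


entry = ('<target xml:lang="fr-FR">This section <bpt id="1"></bpt>must'
         '<ept id="1">&lt;/BOLD></ept> be read by the user.</target>')
print(check_paired_codes(entry))
```

In this sample only the <bpt> was emptied, so the check reports the pair with id 1.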