Okapi Components - PO Filter

- Overview
- Filter Properties
- Processing Details
- Parameters - Options Tab
- Parameters - Inline Codes Tab
- Parameters - Output Tab
- Credits

Overview

The PO Filter is an Okapi component that implements the Okapi Filter Interface for PO (Portable Object) resource files.

The filter is based on the PO specifications found in the GNU gettext manual.

The following is an example of a very simple PO file. The translatable text is marked in bold. Note also the header information in the first entry (the one with an empty msgid line), where encoding and plural information may be found.

# PO file for myApp

msgid ""
msgstr ""
"Project-Id-Version: myApp 1.0.0\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2005-10-02 05:16+0200\n"
"PO-Revision-Date: 2005-03-21 11:28/-0600\n"
"Last-Translator: unknown <email@address>\n"
"Language-Team: unknown <email@address>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "diverging after version %d of %s"
msgstr ""

msgid "You have selected %d file for deletion"
msgid_plural "You have selected %d files for deletion"
msgstr[0] ""
msgstr[1] ""

domain ErrorMsg

msgid "Cannot find %s."
msgstr ""

Filter Properties

The properties for the PO Filter are the following:

Property	This Filter
INPUTFILE	Yes
INPUTSTRING	Yes
BILINGUALINPUT	Yes
TEXTBASED	Yes
OUTPUTFILE	Yes
OUTPUTSTRING	Yes
ANCILLARYOUTPUT	No
XMLOUTPUT	No
RTFOUTPUT	Yes
USEKEY	No
ISINDEMOMODE	No

Processing Details

Input Encoding

The filter decides which encoding to use for the input file using the following logic:

If the file has a Unicode Byte-Order-Mark:
- The corresponding encoding (e.g. UTF-8, UTF-16) is used. If the header entry declares a different encoding, a warning is generated and the encoding of the Byte-Order-Mark is still used.
Else, if a header entry with a charset declaration exists in the first 2048 characters of the file:
- The declared encoding is used.
Otherwise, the input encoding of the common parameters is used.
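
The decision order above can be sketched in a few lines of Python. This is only an illustration of the logic, not the filter's code: the function name guess_po_encoding and the default argument are hypothetical, the 2048 limit is counted in bytes here rather than characters, and the mismatch warning is left out.

```python
import codecs
import re

# Byte-Order-Marks recognized by this sketch (UTF-8 and UTF-16 only).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Charset declaration as it appears in the Content-Type header line.
_CHARSET = re.compile(rb"charset=([A-Za-z0-9_.:-]+)")


def guess_po_encoding(data: bytes, default: str = "utf-8") -> str:
    """Pick an input encoding: BOM first, then header charset, then default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Assumption: the search window is the first 2048 bytes of the file.
    match = _CHARSET.search(data[:2048])
    if match:
        return match.group(1).decode("ascii")
    return default
```

The default argument stands in for the input encoding of the common parameters.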

Output Encoding

If the file has a header entry with a charset declaration, the declaration is automatically updated in the output to reflect the encoding selected for the output.
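
As a rough illustration of that update, the following Python sketch rewrites the charset value of a header line. The function name update_charset is hypothetical, and the sketch assumes the declaration uses the usual "charset=NAME" form shown in the example header.

```python
import re

# Matches "charset=" followed by the encoding name, stopping at
# whitespace, a backslash, a quote, or a semicolon.
_CHARSET_DECL = re.compile(r'(charset=)([^\s\\";]+)')


def update_charset(header_line: str, output_encoding: str) -> str:
    """Return the header line with its charset value replaced."""
    return _CHARSET_DECL.sub(
        lambda m: m.group(1) + output_encoding, header_line, count=1
    )
```

A line without a charset declaration is returned unchanged.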

Output Language

No language information is updated in the PO header entry. If the output language is different from the input language, any language-related information needs to be updated manually in the header entry of the generated PO file.

Localization Directives

The PO Filter supports localization directives. They are special comments you can use to override the default behavior of the filter regarding the parts to extract. Such directives have little use in a PO file as the format is already geared toward localization, so the processing of localization directives is not turned on by default. Set the option Use localization directives when they are present to turn it on.

The syntax and behavior of the directives are the same across all Okapi filters. See the Localization Directives pages for detailed information about what you can do with the mechanism.

Note that any directives within an entry (i.e. anywhere between msgid and msgstr or between several msgstr of a plural form) are ignored.

Correct entry:

#_skip
msgid "do not translate"
msgstr "do not translate"

Incorrect, the directive will be ignored:

msgid "do not translate"
#_skip
msgstr "do not translate"

Plural Forms

The filter supports plural form entries with the assumption that they are in sequential order, that is, msgstr[0] comes first, then msgstr[1], etc. All the msgstr strings of a given plural entry are processed as part of a single group that has its restype value set to 'x-gettext-plurals'.

The resname value generated for plural form entries has an additional index indicator. For example, the following plural form entry:

msgid "untranslated-singular"
msgid_plural "untranslated-plural"
msgstr[0] "translated-singular"
msgstr[1] "translated-plural-form1"
msgstr[2] "translated-plural-form2"

Will generate the items for constructing the following XLIFF block:

<group restype="x-gettext-plurals">
 <trans-unit id="1" resname="P3ADE34F0-0" xml:space="preserve" translate="no">
  <source xml:lang="en-US">untranslated-plural</source>
  <target xml:lang="fr-FR">translated-singular</target>
 </trans-unit>
 <trans-unit id="2" resname="P3ADE34F0-1" xml:space="preserve" translate="no">
  <source xml:lang="en-US">untranslated-plural</source>
  <target xml:lang="fr-FR">translated-plural-form1</target>
 </trans-unit>
 <trans-unit id="3" resname="P3ADE34F0-2" xml:space="preserve" translate="no">
  <source xml:lang="en-US">untranslated-singular</source>
  <target xml:lang="fr-FR">translated-plural-form2</target>
 </trans-unit>
</group>
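
The sequential-order assumption can be illustrated with a short Python sketch that collects the msgstr[n] strings of one plural entry. The function name collect_plural_forms is hypothetical, and raising an error on an out-of-sequence index is a choice made for this sketch; this page does not say how the filter reacts in that case.

```python
import re

# One indexed msgstr line, e.g.: msgstr[1] "translated-plural-form1"
_PLURAL = re.compile(r'^msgstr\[(\d+)\]\s+"(.*)"$')


def collect_plural_forms(entry_lines):
    """Return the msgstr[n] strings of a plural entry, in index order."""
    forms = []
    for line in entry_lines:
        match = _PLURAL.match(line.strip())
        if not match:
            continue  # msgid, msgid_plural, comments, etc.
        index = int(match.group(1))
        # Assumption of this sketch: reject anything not in sequence.
        if index != len(forms):
            raise ValueError(f"msgstr[{index}] is out of sequence")
        forms.append(match.group(2))
    return forms
```

Each returned string corresponds to one trans-unit of the 'x-gettext-plurals' group, with its list position matching the index suffix of the resname.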

Domains

The domains are supported as groups, with the restype value set to 'x-gettext-domain' and the resname value set to the group identifier.

For example, the following entry:

domain TheDomain1
msgid "Text 1 in domain 'TheDomain1'"
msgstr "Texte 1 dans le domain 'TheDomain1'"

Will generate the items for constructing the following XLIFF block:

<group resname="TheDomain1" restype="x-gettext-domain">
 <trans-unit id="1" resname="N9D1999AB" xml:space="preserve">
  <source xml:lang="en-US">Text 1 in domain 'TheDomain1'</source>
  <target xml:lang="fr-FR">Texte 1 dans le domain 'TheDomain1'</target>
 </trans-unit>
</group>

Line-Breaks

The line-break type of the output is the same as that of the original input.

Parameters - Options Tab

Bilingual mode -- Select this option for processing PO files that are configured as bilingual files. In such files, the msgid entry contains the text of the source language and is used as identifier, and the msgstr entry contains the target text (or a copy of the source text, or is empty). Bilingual files are, by far, the most used format.

msgid "Cannot open the input file."
msgstr "Fichier d'entrée non trouvé."

When reading in bilingual mode, the source text is the text found in the msgid string, and the target text is the text found in the msgstr string.

Create resname value from the hash code of the source text -- Set this option to create an alphanumeric resname value generated from the hash code of the source text of each entry. If this option is not selected, the resname value is a sequential number corresponding to the order of the entries in the file. This option is enabled only when the Bilingual mode option is selected.

Important:

Keep in mind that hash code values are generated from the text. Depending on the .NET library (and sometimes the version of the library) that is used, the values may differ, so the same file can get different resname values depending on which tool or version of the tool processes it.
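
The following Python sketch shows the general idea of a hash-based resname and why it is tied to the hash function in use. It is not the filter's algorithm: zlib.crc32 is only a stand-in for the .NET hash code, the function name make_resname is hypothetical, and the "N"/"P" prefixes and the index suffix are modelled on the resname values shown in the examples on this page.

```python
import zlib


def make_resname(source_text: str, plural_index=None) -> str:
    """Build a resname from a hash of the source text (illustrative only)."""
    # Stand-in hash: a different hash function yields different names,
    # which is exactly the portability issue described above.
    digest = zlib.crc32(source_text.encode("utf-8")) & 0xFFFFFFFF
    prefix = "N" if plural_index is None else "P"
    name = f"{prefix}{digest:08X}"
    if plural_index is not None:
        name += f"-{plural_index}"
    return name
```

The same text always yields the same name with a given hash function, but nothing guarantees that two different hash implementations agree.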

Monolingual mode -- Select this option for processing PO files that are configured as monolingual files. In such files, the msgid entry contains an abstract identifier rather than the text of an original language, and the source text is in the msgstr entry.

msgid "IDS_CANTOPENFILE"
msgstr "Cannot open the input file."

When reading in monolingual mode, the source text is the text found in msgstr and no target text is assumed. So, if a file has entries in different languages, they will all be assumed to be source text. When writing out, the target text will replace the source text.
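
The difference between the two modes can be summarized with a small Python sketch. The Unit type and the read_entry function are hypothetical names used only for this illustration; they are not part of the Okapi API.

```python
from typing import NamedTuple, Optional


class Unit(NamedTuple):
    identifier: str
    source: str
    target: Optional[str]


def read_entry(msgid: str, msgstr: str, bilingual: bool = True) -> Unit:
    """Map one PO entry to source/target text according to the mode."""
    if bilingual:
        # msgid holds the source text (and serves as identifier);
        # msgstr holds the target text, or is empty.
        return Unit(msgid, msgid, msgstr or None)
    # Monolingual: msgid is an abstract identifier, msgstr is the source,
    # and no target text is assumed.
    return Unit(msgid, msgstr, None)
```

For example, read_entry("IDS_CANTOPENFILE", "Cannot open the input file.", bilingual=False) gives a unit whose source is the msgstr text and whose target is empty.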

Use localization directives when they are present -- Set this option to enable the filter to recognize localization directives. If this option is not set, any localization directive in the input file will be ignored.

Extract items outside the scope of localization directives -- Set this option to extract any translatable item that is not within the scope of a localization directive. Choosing whether or not to extract outside the directives lets you mark up only the exceptions, and therefore fewer parts of the source document. This option is enabled only when the Use localization directives when they are present option is set.

See the Localization Directives section for more details on how the filter deals with directives.

Parameters - Inline Codes Tab

Mark as inline codes the text parts matching this regular expression -- Set this option to apply the specified regular expression to the text of the extracted items. Any match will be converted to an inline code. By default the expression is:

((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])
|(\\a|\\b|\\f|\\n|\\r|\\t|\\v))

This matches the C-style printf variables (e.g. "%s", "%2.3f", "%04X", "%1$d", etc.) and the escape sequences: "\r\n", "\a", "\b", "\f", "\n", "\r", "\t", and "\v".
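
To see what the default expression picks up, it can be tried outside the filter. The sketch below uses Python's re module, whose syntax accepts this particular expression unchanged; the filter itself uses its own regular expression engine, and the function name find_inline_codes is hypothetical.

```python
import re

# The default expression from above, joined into a single pattern.
DEFAULT_PATTERN = re.compile(
    r"((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])"
    r"|(\\a|\\b|\\f|\\n|\\r|\\t|\\v))"
)


def find_inline_codes(text: str):
    """Return every span of text that the default expression matches."""
    return [m.group(0) for m in DEFAULT_PATTERN.finditer(text)]
```

Applied to "diverging after version %d of %s" from the sample file, this returns the two spans "%d" and "%s", which the filter would convert to inline codes.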

Edit Expression -- Click this button to edit the regular expression and its options.

See the Regular Expressions section for more information about the syntax and rules for building regular expression patterns.

Parameters - Output Tab

Reformat in multiple lines the msgstr entries that have \n markers -- Set this option to generate multi-line msgstr entries for the cases where the string contains "\n".

Note that this option has no effect when the output is set to RTF.
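
The kind of reformatting this option performs can be sketched as follows in Python. The exact layout written by the filter is not described on this page; the sketch assumes the usual gettext convention of an empty first string followed by one quoted string per line, as in the header entry of the sample file. The function name format_msgstr is hypothetical, and an escaped backslash directly followed by "n" is not handled.

```python
def format_msgstr(text: str) -> str:
    r"""Render an escaped msgstr value, breaking after each \n marker."""
    marker = "\\n"  # the two characters backslash + n, as stored in the PO file
    parts = text.split(marker)
    # Re-attach the marker to every piece except the last one.
    pieces = [part + marker for part in parts[:-1]]
    if parts[-1]:
        pieces.append(parts[-1])
    if len(pieces) <= 1:
        # Nothing to split: keep the entry on a single line.
        return f'msgstr "{text}"'
    return "\n".join(['msgstr ""'] + [f'"{piece}"' for piece in pieces])
```

A string with two "\n" markers therefore becomes an empty msgstr line followed by two quoted lines, each ending with its marker.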

Credits

Special thanks to Asgeir Frimannsson for his information on the PO format, for some sample files, and for the detailed work on the "XLIFF Representation Guide for Gettext PO", a complete cookbook on the best practices to represent PO files in XLIFF.

You can find a link to the guide as well as a PO-to-XLIFF filter on the XLIFF Tools project Web site.