Okapi Components: PO Filter

Overview
The PO Filter is an Okapi component that implements the Okapi Filter Interface for PO (Portable Object) resource files.
The filter is based on the PO specifications found in the GNU gettext manual.
The following is an example of a very simple PO file. The translatable text is the quoted text of the msgid and msgid_plural lines. Note also the header information in the first entry (the one with an empty msgid line), where encoding and plural information may be found.
    # PO file for myApp
    msgid ""
    msgstr ""
    "Project-Id-Version: myApp 1.0.0\n"
    "Report-Msgid-Bugs-To: \n"
    "POT-Creation-Date: 2005-10-02 05:16+0200\n"
    "PO-Revision-Date: 2005-03-21 11:28/-0600\n"
    "Last-Translator: unknown <email@address>\n"
    "Language-Team: unknown <email@address>\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "Plural-Forms: nplurals=2; plural=(n != 1);\n"

    msgid "diverging after version %d of %s"
    msgstr ""

    msgid "You have selected %d file for deletion"
    msgid_plural "You have selected %d files for deletion"
    msgstr[0] ""
    msgstr[1] ""

    domain ErrorMsg

    msgid "Cannot find %s."
    msgstr ""
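As an illustration of where the encoding and plural information sits, the following Python sketch reads the header entry of a PO file into a dictionary. It is not the filter's own code: the function name `parse_po_header` and the shortened `SAMPLE` text are assumptions made for this example.

```python
import re

# Shortened copy of the header entry from the example above. A raw string is
# used so that each \n stays a literal backslash followed by "n", as in a PO file.
SAMPLE = r'''# PO file for myApp
msgid ""
msgstr ""
"Project-Id-Version: myApp 1.0.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

msgid "Cannot find %s."
msgstr ""
'''


def parse_po_header(po_text):
    """Return the fields of the first entry that has an empty msgid."""
    fields = {}
    lines = [line.strip() for line in po_text.splitlines()]
    for i, line in enumerate(lines):
        if line == 'msgid ""' and i + 1 < len(lines) and lines[i + 1] == 'msgstr ""':
            # The header fields are the quoted continuation lines that follow.
            for cont in lines[i + 2:]:
                if not (len(cont) >= 2 and cont.startswith('"') and cont.endswith('"')):
                    break
                content = cont[1:-1]
                if content.endswith("\\n"):
                    content = content[:-2]  # drop the trailing \n marker
                name, sep, value = content.partition(":")
                if sep:
                    fields[name.strip()] = value.strip()
            break
    return fields


header = parse_po_header(SAMPLE)
charset = re.search(r"charset=([^\s;]+)", header.get("Content-Type", ""))
nplurals = re.search(r"nplurals=(\d+)", header.get("Plural-Forms", ""))
```

Here `charset.group(1)` and `nplurals.group(1)` hold the declared encoding and the number of plural forms as strings.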
The properties for the PO Filter are the following:
Property | This Filter |
---|---|
INPUTFILE | Yes |
INPUTSTRING | Yes |
BILINGUALINPUT | Yes |
TEXTBASED | Yes |
OUTPUTFILE | Yes |
OUTPUTSTRING | Yes |
ANCILLARYOUTPUT | No |
XMLOUTPUT | No |
RTFOUTPUT | Yes |
USEKEY | No |
ISINDEMOMODE | No |
The filter decides which encoding to use for the input file using a sequence of checks. One of them is whether a charset declaration exists in the first 2048 characters of the file.

If the file has a header entry with a charset declaration, the declaration is automatically updated in the output to reflect the encoding selected for the output.
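The charset lookup and the update of the declaration can be sketched as follows. This is only a rough Python illustration under stated assumptions: the function names, the regular expression, and the way the 2048-character window is applied are hypothetical, and the rest of the filter's decision logic is not reproduced.

```python
import re

# Matches "charset=" followed by the encoding name. The name stops at
# whitespace, a backslash (start of the \n marker), a quote, or a semicolon.
CHARSET_RE = re.compile(r'(charset=)([^\s\\";]+)')


def declared_charset(po_text, window=2048):
    """Return the charset declared in the first `window` characters, or None."""
    match = CHARSET_RE.search(po_text[:window])
    return match.group(2) if match else None


def update_charset(po_text, output_encoding, window=2048):
    """Rewrite the first charset declaration to name the output encoding."""
    head, tail = po_text[:window], po_text[window:]
    head = CHARSET_RE.sub(lambda m: m.group(1) + output_encoding, head, count=1)
    return head + tail


po_input = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=ISO-8859-1\\n"\n'
)
po_output = update_charset(po_input, "UTF-8")
```

A file without any declaration is returned unchanged by `update_charset`, and `declared_charset` returns `None` for it.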
No language information is updated in the PO header entry. If the output language is different from the input language, any language-related information needs to be updated manually in the header entry of the generated PO file.
The PO Filter supports localization directives. They are special comments you can use to override the default behavior of the filter regarding the parts to extract. Such directives have little use in a PO file as the format is already geared toward localization, so the processing of localization directives is not turned on by default. Set the option Use localization directives when they are present to turn it on.
The syntax and behavior of the directives are the same across all Okapi filters. See the Localization Directives pages for detailed information about what you can do with the mechanism.
Note that any directives within an entry (i.e. anywhere between msgid and msgstr or between several msgstr of a plural form) are ignored.

Correct entry:

    #_skip
    msgid "do not translate"
    msgstr "do not translate"

Incorrect, the directive will be ignored:

    msgid "do not translate"
    #_skip
    msgstr "do not translate"
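The placement rule can be checked mechanically. The Python sketch below is a hypothetical helper, not part of the filter: it assumes that a directive is a comment line starting with `#_` (as in the `#_skip` example) and reports whether each directive sits outside an entry, where it would be honored, or inside one, where it would be ignored.

```python
def keyword_of(line):
    """Return the PO keyword a line starts with, or None."""
    for keyword in ("msgid_plural", "msgid", "msgstr"):
        if line.startswith(keyword):
            return keyword
    return None


def classify_directives(po_text):
    """Return (directive, honored) pairs for each '#_' comment line."""
    lines = [line.strip() for line in po_text.splitlines()]
    results = []
    for i, line in enumerate(lines):
        if not line.startswith("#_"):
            continue
        # Nearest keyword lines before and after the directive.
        prev_kw = next((keyword_of(l) for l in reversed(lines[:i]) if keyword_of(l)), None)
        next_kw = next((keyword_of(l) for l in lines[i + 1:] if keyword_of(l)), None)
        inside_entry = (
            prev_kw in ("msgid", "msgid_plural")                 # between msgid and msgstr
            or (prev_kw == "msgstr" and next_kw == "msgstr")     # between plural msgstr lines
        )
        results.append((line, not inside_entry))
    return results


correct = '#_skip\nmsgid "do not translate"\nmsgstr "do not translate"\n'
incorrect = 'msgid "do not translate"\n#_skip\nmsgstr "do not translate"\n'
```

Running `classify_directives` on the two samples marks the directive as honored in the first and ignored in the second.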
The filter supports plural form entries, with the assumption that they are in sequential order; that is, msgstr[0] comes first, then msgstr[1], etc. All the msgstr strings of a given plural entry are processed as part of a single group that has its restype value set to 'x-gettext-plurals'.
The resname value generated for plural form entries has an additional index indicator. For example, the following plural form entry:

    msgid "untranslated-singular"
    msgid_plural "untranslated-plural"
    msgstr[0] "translated-singular"
    msgstr[1] "translated-plural-form1"
    msgstr[2] "translated-plural-form2"

Will generate the items for constructing the following XLIFF block:

    <group restype="x-gettext-plurals">
      <trans-unit id="1" resname="P3ADE34F0-0" xml:space="preserve" translate="no">
        <source xml:lang="en-US">untranslated-plural</source>
        <target xml:lang="fr-FR">translated-singular</target>
      </trans-unit>
      <trans-unit id="2" resname="P3ADE34F0-1" xml:space="preserve" translate="no">
        <source xml:lang="en-US">untranslated-plural</source>
        <target xml:lang="fr-FR">translated-plural-form1</target>
      </trans-unit>
      <trans-unit id="3" resname="P3ADE34F0-2" xml:space="preserve" translate="no">
        <source xml:lang="en-US">untranslated-singular</source>
        <target xml:lang="fr-FR">translated-plural-form2</target>
      </trans-unit>
    </group>
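The index indicator and the sequential-order assumption can be illustrated with a short Python sketch. The function `plural_group` and its dictionary layout are hypothetical; the base resname is passed in as a parameter because the way the filter computes it is not shown here.

```python
import re

# One indexed msgstr line, e.g.: msgstr[1] "translated-plural-form1"
MSGSTR_INDEXED = re.compile(r'^msgstr\[(\d+)\]\s+"(.*)"$')


def plural_group(base_resname, entry_lines):
    """Collect the msgstr[n] strings of one plural entry into a single group."""
    units = []
    for line in entry_lines:
        match = MSGSTR_INDEXED.match(line.strip())
        if match is None:
            continue  # msgid, msgid_plural, comments, and so on
        index = int(match.group(1))
        if index != len(units):
            # The entries are assumed to be sequential: 0, 1, 2, ...
            raise ValueError(f"msgstr[{index}] is out of sequence")
        units.append({"resname": f"{base_resname}-{index}", "target": match.group(2)})
    return {"restype": "x-gettext-plurals", "units": units}


entry = [
    'msgid "untranslated-singular"',
    'msgid_plural "untranslated-plural"',
    'msgstr[0] "translated-singular"',
    'msgstr[1] "translated-plural-form1"',
    'msgstr[2] "translated-plural-form2"',
]
group = plural_group("P3ADE34F0", entry)
```

The three units in `group` carry the resname values `P3ADE34F0-0`, `P3ADE34F0-1`, and `P3ADE34F0-2`.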
The domains are supported as groups, with the restype value set to 'x-gettext-domain' and the resname value set to the group identifier.

For example, the following entry:

    domain TheDomain1
    msgid "Text 1 in domain 'TheDomain1'"
    msgstr "Texte 1 dans le domain 'TheDomain1'"

Will generate the items for constructing the following XLIFF block:

    <group resname="TheDomain1" restype="x-gettext-domain">
      <trans-unit id="1" resname="N9D1999AB" xml:space="preserve">
        <source xml:lang="en-US">Text 1 in domain 'TheDomain1'</source>
        <target xml:lang="fr-FR">Texte 1 dans le domain 'TheDomain1'</target>
      </trans-unit>
    </group>
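A minimal Python sketch of the domain-to-group mapping follows. The function `group_by_domain` and the dictionary layout are assumptions for illustration; entries that appear before any domain line are simply left out of this sketch.

```python
def group_by_domain(po_lines):
    """Group msgid strings under the most recent 'domain' line."""
    groups = []
    current = None
    for raw in po_lines:
        line = raw.strip()
        if line.startswith("domain "):
            current = {
                "resname": line[len("domain "):].strip(),
                "restype": "x-gettext-domain",
                "msgids": [],
            }
            groups.append(current)
        elif line.startswith('msgid "') and line.endswith('"') and current is not None:
            # Keep the text between the opening and closing quotes.
            current["msgids"].append(line[len('msgid "'):-1])
    return groups


po_lines = [
    "domain TheDomain1",
    "msgid \"Text 1 in domain 'TheDomain1'\"",
    "msgstr \"Texte 1 dans le domain 'TheDomain1'\"",
]
groups = group_by_domain(po_lines)
```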
The line-break type of the output is the same as that of the original input.
Bilingual mode -- Select this option for processing PO files that are configured as bilingual files. That is, files where the msgid entry contains the text of the source language and is used as identifier, and where the msgstr entry contains the target text (or a copy of the source text, or is empty). Bilingual files are, by far, the most used format.

    msgid "Cannot open the input file."
    msgstr "Fichier d'entrée non trouvé."

When reading in bilingual mode, the source text is the text found in the msgid string, and the target text is the text found in the msgstr string.
Create resname value from the hash code of the source text -- Set this option to create an alphanumeric resname value generated from the hash code of the source text of each entry. If this option is not selected, the resname value is a sequential number corresponding to the order of the entries in the file. This option is enabled only when the Bilingual mode option is selected.
Important: Keep in mind that hash code values are values generated from the text. Depending on the .NET library (or even sometimes the version of the library) that is used, the values may be different, causing the same file to have different values depending on which tool or version of the tool is used.
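To make the dependency on the hash function concrete, here is a small Python sketch of the two naming schemes. It is an assumption-laden stand-in: CRC-32 from `zlib` replaces the .NET hash code, so the values it produces will not match those of the filter, and the `N` prefix, the eight hexadecimal digits, and the sequence starting at 1 are guesses modeled on the resname values shown in the examples.

```python
import zlib


def hashed_resname(source_text, prefix="N"):
    """Build a resname from a checksum of the source text (CRC-32 stand-in)."""
    checksum = zlib.crc32(source_text.encode("utf-8")) & 0xFFFFFFFF
    return f"{prefix}{checksum:08X}"


def sequential_resnames(count):
    """Build resname values from the order of the entries (assumed to start at 1)."""
    return [str(number) for number in range(1, count + 1)]


name = hashed_resname("Cannot open the input file.")
same_name = hashed_resname("Cannot open the input file.")
```

A fixed, documented checksum gives the same value on every run and platform, which is the property the note warns a library-provided hash code does not guarantee. Python's own built-in `hash()` for strings, for example, changes between processes unless hash randomization is disabled.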
Monolingual mode -- Select this option for processing PO files that are configured as monolingual files. That is, files where the msgid entry contains an abstract identifier rather than the text of an original language, and where the source text is in the msgstr entry.

    msgid "IDS_CANTOPENFILE"
    msgstr "Cannot open the input file."

When reading in monolingual mode, the source text is the text found in msgstr and no target text is assumed. So, if a file has entries in different languages, they will all be assumed to be source. When writing out, the target text will replace the source text.
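The difference between the two modes comes down to which PO field is treated as source. The following Python sketch shows that mapping; `read_entry` and `write_entry` are hypothetical names, and the returned dictionary is only an illustrative structure.

```python
def read_entry(msgid, msgstr, bilingual=True):
    """Map one PO entry to identifier, source, and target according to the mode."""
    if bilingual:
        # msgid is both the identifier and the source text; msgstr is the target.
        return {"id": msgid, "source": msgid, "target": msgstr or None}
    # Monolingual: msgid is an abstract identifier, msgstr holds the source text,
    # and no target text is assumed.
    return {"id": msgid, "source": msgstr, "target": None}


def write_entry(msgid, translation):
    """Write an entry back: in both modes the translation ends up in msgstr."""
    return f'msgid "{msgid}"\nmsgstr "{translation}"'


bilingual_entry = read_entry("Cannot open the input file.", "Fichier d'entrée non trouvé.")
monolingual_entry = read_entry("IDS_CANTOPENFILE", "Cannot open the input file.", bilingual=False)
written = write_entry("IDS_CANTOPENFILE", "Fichier d'entrée non trouvé.")
```

In the monolingual case, `written` shows the target text replacing the source text in msgstr while the identifier in msgid is kept.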
Use localization directives when they are present -- Set this option to enable the filter to recognize localization directives. If this option is not set, any localization directive in the input file will be ignored.
Extract items outside the scope of localization directives -- Set this option to extract any translatable item that is not within the scope of a localization directive. Choosing whether or not to extract outside the scope of the directives lets you mark up fewer parts of the source document. This option is enabled only when the Use localization directives when they are present option is set.
See the Localization Directives section for more details on how the filter deals with directives.
Mark as inline codes the text parts matching this regular expression -- Set this option to apply the specified regular expression to the text of the extracted items. Any match will be converted to an inline code. By default the expression is:
((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])|(\\a|\\b|\\f|\\n|\\r|\\t|\\v))
This matches the C-style printf variables (e.g. "%s", "%2.3f", "%04X", "%1$d", etc.) and the escaped sequences: "\r\n", "\a", "\b", "\f", "\n", "\r", "\t", and "\v".
Edit Expression -- Click this button to edit the regular expression and its options.
See the Regular Expressions section for more information about the syntax and rules for building regular matching patterns.
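The effect of the default expression can be tried out with Python's `re` module, as in the sketch below. The expression is the default one written without any whitespace; the function `split_inline_codes` and the ("text" / "code") tuples are hypothetical and only illustrate which parts would become inline codes. Regular expression engines differ in details, so behavior in the filter's own engine may not be identical.

```python
import re

# The default expression: printf-style variables or escaped sequences.
INLINE_CODES = re.compile(
    r"((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])"
    r"|(\\a|\\b|\\f|\\n|\\r|\\t|\\v))"
)


def split_inline_codes(text):
    """Split text into ("text", ...) and ("code", ...) parts."""
    parts = []
    position = 0
    for match in INLINE_CODES.finditer(text):
        if match.start() > position:
            parts.append(("text", text[position:match.start()]))
        parts.append(("code", match.group(0)))
        position = match.end()
    if position < len(text):
        parts.append(("text", text[position:]))
    return parts


# Raw string: \n here is a backslash followed by "n", as stored in a PO file.
parts = split_inline_codes(r"Cannot find %s.\n")
```

With this input, `%s` and the `\n` marker come out as codes and the remaining pieces as text. A `\r\n` sequence is found as two consecutive codes, `\r` and `\n`.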
Reformat in multiple lines the msgstr entries that have \n markers -- Set this option to generate multi-line msgstr entries for the cases where the string contains "\n".
Note that this option has no effect when the output is set to RTF.
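The reformatting can be pictured with the Python sketch below. The exact layout the filter writes is an assumption here: the sketch follows the common gettext convention of an empty first string followed by one quoted segment per line, and `format_msgstr` is a hypothetical name. It does not handle an escaped backslash that happens to precede an "n".

```python
def format_msgstr(text, multiline=True):
    """Format a msgstr value, splitting after each \\n marker when requested."""
    marker = "\\n"  # the two characters backslash and "n", as stored in a PO file
    if not multiline or marker not in text:
        return f'msgstr "{text}"'
    pieces = text.split(marker)
    # Re-attach the marker to every piece that was followed by one.
    segments = [piece + marker for piece in pieces[:-1]]
    if pieces[-1]:
        segments.append(pieces[-1])
    return "\n".join(['msgstr ""'] + [f'"{segment}"' for segment in segments])


single = format_msgstr("No line break here.")
multi = format_msgstr("First line\\nSecond line\\n")
```

Printing `multi` gives an empty `msgstr ""` line followed by one quoted line per segment, while `single` stays on one line.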
Special thanks to Asgeir Frimannsson for his information on the PO format, for some sample files, and for the detailed work on the "XLIFF Representation Guide for Gettext PO", a complete cookbook on the best practices to represent PO files in XLIFF.
You can find a link to the guide as well as a PO-to-XLIFF filter on the XLIFF Tools project Web site.