Okapi Components - Text Extraction Utility
- Overview
- The utility set identifier for this utility is: oku_set01
- The utility identifier is: extraction
The Text Extraction utility allows you to separate translatable text from the non-translatable parts of an input file and put the result in a format that can be used by translation tools.
Several output formats are available: XLIFF, Original+RTF, Table, TMX, XLIFF+RTF, and OmegaT. The output to XLIFF, OmegaT, and (with an option) Table can be merged back into the original format using the Text Merging utility.
The list of input files can contain files that have no associated filters. These files are simply copied into the Target folder of the package. For example, when preparing a set of HTML files, you can include all the images and style-sheet files that are used: they will be copied into the package.
The common parameters are the options specified from the application calling the utility rather than in the options dialog box of the utility itself. For this utility the common parameters you need to specify are the following:
Files of the first input list | - Needed (the files to extract) |
Root for the first input list | - Needed |
Files of the second input list | - Not Needed |
Root for the second input list | - Not Needed |
Files of the third input list | - Not Needed |
Root for the third input list | - Not Needed |
Input language | - Needed |
Output language | - Needed |
Input default encoding | - Needed |
Output default encoding | - Needed |
Location and names for output files | - Not Needed |
Extract the input files into the following format -- Select the format you want to generate. There are several choices:
Format Output -- Click this button to access the dialog box where you can specify the options associated with the output format currently selected, if any are available. For details see: XLIFF Output Options, Table Output Options, TMX Output Options, and XLIFF + RTF Output Options.
XLIFF version -- Select the XLIFF version of the output file.
Include a <target> element for each <trans-unit> -- Set this option to output a <target> element in each <trans-unit> element generated. The content of this <target> element depends on the filter and on the options you select.
The existing translation if available, a copy of the source otherwise -- Select this option to set the target text to the existing translation of the item when it is available. If no corresponding translation is found in the input file, a copy of the source text is used for the target text.
The existing translation if available, no text otherwise -- Select this option to set the target text to the existing translation of the item when it is available. If no corresponding translation is found in the input file, the target text is left empty.
A copy of the source text (even if there is a translation available) -- Select this option to always set the target text to the source text, even when a corresponding translation is found in the input file.
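The three choices above amount to a small decision rule. The following Python sketch is only an illustration of that rule as described here, not the utility's actual code; the mode names and the function choose_target_text are hypothetical.

```python
from typing import Optional

# Hypothetical names for the three options described above.
EXISTING_OR_SOURCE = "existing_or_source"
EXISTING_OR_EMPTY = "existing_or_empty"
ALWAYS_SOURCE = "always_source"
MODES = (EXISTING_OR_SOURCE, EXISTING_OR_EMPTY, ALWAYS_SOURCE)


def choose_target_text(source: str, existing: Optional[str], mode: str) -> str:
    """Return the text to place in the target for one extracted item.

    `existing` is the translation found in the input file, or None
    when the input file has no translation for this item.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == ALWAYS_SOURCE:
        # The source text is used even if a translation is available.
        return source
    if existing is not None:
        # Both remaining modes prefer the existing translation.
        return existing
    # No translation found: fall back to the source text or to no text.
    return source if mode == EXISTING_OR_SOURCE else ""
```

The same rule applies wherever this group of three options appears for the other output formats.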
Set any <trans-unit> with existing translation to translate='no' -- Set this option to flag as not translatable all translation units that have a <target> element containing an existing translation extracted from the input file.
Include an <alt-trans> element for each existing translation -- Set this option to generate an <alt-trans> element for each entry that has an existing translation. The content of the <target> element in this <alt-trans> element is always set to the target text, regardless of which option is specified for the text of the <target> element at the <trans-unit> level.
Include notes when available -- Set this option to generate <note> elements when the filter provides such information.
Use placeholder notation (<g></g> and <x/>) -- Set this option to output the inline codes using the XLIFF placeholder elements <g> and <x/> rather than encapsulating elements such as <bpt>, <ept>, and <ph>.
Include word-count -- Set this option to include the word count of each item in its corresponding <source> element.
The output to XLIFF and Table can be merged back into the original format using the Text Merging utility. For more information about the type of XLIFF markup generated by this utility see the section Generated XLIFF. For more information about XLIFF in general see the XLIFF Web site.
The Table output format is a simple format that uses the extracted text as it is, that is, without isolating inline codes in special markers as the other formats do. Whether this works depends on the original file format: some filters (such as the XML or HTML filters) apply special escapes when merging back, so inline codes that are treated as plain text, as they are in the Table output format, may cause incorrect merging.
In addition, because the raw Tab, Carriage-Return, and Line-Feed characters would break the table layout, they are always escaped (Tab=<\T>, CR+LF=<\RN>, CR=<\R>, and LF=<\N>). The file is always in UTF-8 and may have an optional header line. The layout of the table is the following:
Column 1: The ID of the item.
Column 2: The source text.
Optional next column: The resname of the item (or its ID if there is no resname)
Optional next column: The target text (according to the options you specify).
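As an illustration of the escaping and column layout described above, here is a minimal Python sketch. It assumes that the columns are separated by a tab character, which the text implies but does not state; the function names escape_cell and format_row are hypothetical.

```python
from typing import Optional


def escape_cell(text: str) -> str:
    """Escape the characters that would break the table layout."""
    # CR+LF is replaced first so the pair is not split into <\R> and <\N>.
    return (
        text.replace("\r\n", "<\\RN>")
        .replace("\r", "<\\R>")
        .replace("\n", "<\\N>")
        .replace("\t", "<\\T>")
    )


def format_row(
    item_id: str,
    source: str,
    resname: Optional[str] = None,
    target: Optional[str] = None,
    *,
    with_resname: bool = False,
    with_target: bool = False,
) -> str:
    """Build one table row: ID, source, then the optional columns."""
    cells = [item_id, source]
    if with_resname:
        # The ID is used when the item has no resname.
        cells.append(resname if resname else item_id)
    if with_target:
        cells.append(target or "")
    # Assumption: columns are tab-separated.
    return "\t".join(escape_cell(cell) for cell in cells)
```

For example, format_row("1", "Line one\nLine two", with_resname=True) produces a single row in which the line break appears as <\N> and the third column falls back to the ID.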
Include a header line -- Set this option to output a line of column titles at the top of the table. This line is mandatory if you want to merge the extracted text back into its original format later on.
Include a column with the resname value -- Set this option to output, just after the source column, a column containing the value of the resname property (i.e. the same as the resname attribute in XLIFF) for each text item. If no resname is provided, the sequential id value of the item (the same as the id attribute in XLIFF) is used instead.
Include a last column with the target text -- Set this option to output a target column at the right end of the table. The content of this column depends on the filter and on the options you select.
The existing translation if available, a copy of the source otherwise -- Select this option to set the target text to the existing translation of the item when it is available. If no corresponding translation is found in the input file, a copy of the source text is used for the target text.
The existing translation if available, no text otherwise -- Select this option to set the target text to the existing translation of the item when it is available. If no corresponding translation is found in the input file, the target text is left empty.
A copy of the source text (even if there is a translation available) -- Select this option to always set the target text to the source text, even when a corresponding translation is found in the input file.
Include a target <tuv> element for each <tu> -- Set this option to output a <tuv> element set to the target language in each <tu> element generated. The content of this <tuv> element depends on the filter and on the options you select.
The existing translation if available, a copy of the source otherwise -- Select this option to set the target text with the existing translation of the item when it is available. If no corresponding translation is found in the input file, a copy of the source text is used for the target text.
The existing translation if available, no text otherwise -- Select this option to set the target text with the existing translation of the item when it is available. If no corresponding translation is found in the input file, the target text is left empty.
A copy of the source text (even if there is a translation available) -- Select this option to always set the target text with the source text, even when a corresponding translation is found in the input file.
Remove leading and trailing white-spaces -- Set this option to ensure that all leading and trailing white spaces are removed from the TMX segments. Segments without leading and trailing white spaces are needed for strict TMX compliance. Note that this may result in empty entries.
For more information about TMX, see the TMX Web site.
Include a <target> element for each <trans-unit> -- Set this option to output a <target> element in each <trans-unit> element generated. The content of this <target> element depends on the filter and on the options you select.
The existing translation if available, a copy of the source otherwise -- Select this option to set the target text to the existing translation of the item when it is available. If no corresponding translation is found in the input file, a copy of the source text is used for the target text.
The existing translation if available, no text otherwise -- Select this option to set the target text to the existing translation of the item when it is available. If no corresponding translation is found in the input file, the target text is left empty.
A copy of the source text (even if there is a translation available) -- Select this option to always set the target text to the source text, even when a corresponding translation is found in the input file.
Set any <trans-unit> with existing translation to translate='no' -- Set this option to flag as not translatable all translation units that have a <target> element containing an existing translation extracted from the input file.
Include an <alt-trans> element for each existing translation -- Set this option to generate an <alt-trans> element for each entry that has an existing translation. The content of the <target> element in this <alt-trans> element is always set to the target text, regardless of which option is specified for the text of the <target> element at the <trans-unit> level.
Include notes when available -- Set this option to generate <note> elements when the filter provides such information.
If you strip out the RTF layer, the resulting XLIFF document can be merged back into the original format using the Text Merging utility. For more information about the type of XLIFF markup generated by this utility see the section Generated XLIFF. For more information about XLIFF in general see the XLIFF Web site.
Extract only the text items with a leading marker -- Set this option to extract only the items that start with the text specified in the Markers edit field. For example, in "[$TBT$]Text to localize", "[$TBT$]" is the marker. The marker itself is removed during the extraction, so the text is presented ready for translation. When this option is set, the text items that do not have the marker are set as non-translatable.
Markers -- Enter one or more markers that will be used as flags to indicate which items need to be translated. The text is case-sensitive. If you have several markers, they should be separated by a comma or a semi-colon. All white spaces between markers are ignored (e.g. "[$TBT>; [$TBE>").
Example of an input file with markers. Only the entries with "[$TBT$]" will be translated:
# messages.properties v1.3
# Last update: April-26
error.internal.badmsg = Erreur interne. Message invalide.
error.filenotfound = [$TBT$]File {0} not found.
error.badinst.nojre = Erreur d'installation.
itemfound = [$TBT$]Item '{0}' has been found in '{1}'.
Resulting RTF output file (translatable text in black, things not to touch in gray):
# messages.properties v1.3
# Last update: April-26
error.internal.badmsg = Erreur interne. Message invalide.
error.filenotfound = File {0} not found.
error.badinst.nojre = Erreur d'installation.
itemfound = Item '{0}' has been found in '{1}'.
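The marker handling described above can be sketched in a few lines of Python. This is an illustration of the documented behavior only, under the assumption that matching is a plain case-sensitive prefix test; the function names parse_markers and extract_item are hypothetical, as is the second marker used in the usage note.

```python
import re
from typing import List, Tuple


def parse_markers(field: str) -> List[str]:
    """Split the Markers field on commas or semi-colons.

    White space around each marker is ignored and empty entries are dropped.
    """
    return [part.strip() for part in re.split(r"[,;]", field) if part.strip()]


def extract_item(text: str, markers: List[str]) -> Tuple[str, bool]:
    """Return (text, translatable) for one text item.

    An item is translatable only if it starts with one of the markers
    (case-sensitive). The marker is removed from the returned text.
    """
    for marker in markers:
        if text.startswith(marker):
            return text[len(marker):], True
    # No leading marker: the item is kept as is and flagged non-translatable.
    return text, False
```

With markers parsed from "[$TBT$]; [$REVIEW$]" (the second marker is made up for the example), the item "[$TBT$]File {0} not found." yields the text "File {0} not found." flagged as translatable, while "Erreur interne. Message invalide." is returned unchanged and flagged as non-translatable.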
Create a TMX output file with any pre-translated entries found -- Set this option to generate a TMX file that contains all the source+target pairs found during the extraction process. This option is used for bilingual input files such as PO files, where some translations may be already available.
Filename -- Enter the name of the TMX output file that will be generated. The file will be placed in the package folder. If you do not specify an extension, the .tmx extension will be added automatically. This is a filename, not a path. If you enter a path, only the filename part will be taken into account. The generated TMX file will overwrite any existing file with the same name at the given location.
List font information when available -- Set this option to generate a list of the fonts used (with extracted items) in the processed files. Note that such information may not be available with all filters.
List fonts for each input file -- Set this option to list the fonts for each input file (when it is available). If this option is not set, only a summary of all fonts used at least once in all files is generated. This option is enabled only when the option List font information when available is set.
The two options above work for filters that return a UsedFonts property when reaching the end of the file (i.e. when ReadItem() returns ENDINPUT). The property value must be a list of all the fonts used by translatable items in the file that was processed. The font names must be separated by a tab character.
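To make the difference between the per-file list and the summary concrete, here is a small Python sketch. It assumes the UsedFonts values have already been collected in a dictionary keyed by input file; the function names and the "(all files)" key are hypothetical and are not part of the utility.

```python
from typing import Dict, List


def parse_used_fonts(value: str) -> List[str]:
    """Split a UsedFonts property value into font names (tab-separated)."""
    return [name for name in value.split("\t") if name]


def font_report(used_fonts: Dict[str, str], per_file: bool) -> Dict[str, List[str]]:
    """Build the font list, either for each input file or as one summary.

    `used_fonts` maps an input file name to its UsedFonts property value.
    """
    by_file = {
        path: sorted(set(parse_used_fonts(value)))
        for path, value in used_fonts.items()
    }
    if per_file:
        return by_file
    # Summary: every font used at least once in any of the files.
    summary = sorted({font for fonts in by_file.values() for font in fonts})
    return {"(all files)": summary}
```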
Add reference information using the following lookup file -- Set this option to add extra information along with each extracted item. The data to associate is found in the tab-delimited file you specify below the check box. This option currently applies only to the Original Format + RTF Layer output format.
The tab-delimited lookup file is expected to be in UTF-16 or UTF-8 encoding and to have the following format: the first column is the data, and the second column is the resname value of the associated item. If an item has several data entries, they are all included. The column values should not be enclosed in quotes. Any line without a tab is considered a comment and is skipped.
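The lookup file format described above is simple enough to sketch a reader for it. The Python below is an illustrative sketch, not the utility's own parser; the function name parse_lookup is hypothetical, and decoding the file (UTF-16 or UTF-8) is left to the caller.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_lookup(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each resname to the list of data entries associated with it."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if "\t" not in line:
            # Any line without a tab is a comment and is skipped.
            continue
        columns = line.split("\t")
        data, resname = columns[0], columns[1]
        # An item may have several data entries: all of them are kept.
        lookup[resname].append(data)
    return dict(lookup)
```

For instance, two lines that both end with the resname error.filenotfound produce a single entry whose list holds both data values in file order.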
Generate any ancillary data available -- Set this option to generate any ancillary output the filters may be able to produce. For example, some file formats have embedded graphics, and their filters may be able to create separate files for these graphics.
Output folder -- Enter the path of the folder where the package should be created. The folder and any required parent folders will be created automatically if they do not exist yet.
Package name -- Enter the name of the package to create. A sub-folder of that name will be created in the output folder, and all the output will be generated under that sub-folder.
For both the output folder and the package name you can use the variable placeholders <SrcLangCode> and <TrgLangCode> to represent the current input language code and the current output language code (or <SrcLangCodeU> and <TrgLangCodeU> for the codes in uppercase).
The placeholders are replaced by their values when the extraction is executed. For example, the package name Pack_<TrgLangCode> will be generated as Pack_fr-FR if the current output language is defined as fr-FR.
The generated files will be distributed in the following structure:
[Output Folder]\[Package Name]\Work\<all the extracted files>
[Output Folder]\[Package Name]\Original\<all the original files>
[Output Folder]\[Package Name]\Target\<all the original non-extractable files>
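The placeholder replacement and the resulting folder structure can be sketched as follows. This Python example only illustrates the behavior described above; the function names are hypothetical, and it assumes the uppercase variants are the plain uppercase form of the language codes.

```python
from pathlib import PureWindowsPath
from typing import Dict


def expand_placeholders(template: str, src_lang: str, trg_lang: str) -> str:
    """Replace the language placeholders in a folder or package name."""
    replacements = {
        "<SrcLangCode>": src_lang,
        "<TrgLangCode>": trg_lang,
        # Assumption: the U variants are the uppercase form of the codes.
        "<SrcLangCodeU>": src_lang.upper(),
        "<TrgLangCodeU>": trg_lang.upper(),
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template


def package_layout(
    output_folder: str, package_name: str, src_lang: str, trg_lang: str
) -> Dict[str, PureWindowsPath]:
    """Return the Work, Original, and Target folders of the package."""
    root = PureWindowsPath(
        expand_placeholders(output_folder, src_lang, trg_lang)
    ) / expand_placeholders(package_name, src_lang, trg_lang)
    return {name: root / name for name in ("Work", "Original", "Target")}
```

With the output folder C:\Out (an example path), the package name Pack_<TrgLangCode>, and fr-FR as the output language, the Work folder resolves to C:\Out\Pack_fr-FR\Work.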
Copy the original files in the output package -- Set this option if you want the original files to be copied into the Original sub-folder created under the package folder.
Create a Rainbow project for merging any XLIFF output -- Set this option to automatically generate a Rainbow project file for merging back any of the output files extracted to XLIFF. If this option is set but no file is output to a format that can be merged, no Rainbow file is generated. If a Rainbow file is generated, it is placed in the Target folder with the name _Merge.rbp, along with a _Merge.bat batch file to execute it.
In addition, all the parameters files used to do the extraction are copied into this folder as well, and the merging information in each extracted file points to these parameters files. Note that even if you use default parameters for the extraction, a parameters file will be generated in the Target folder. This ensures that the merging is done with exactly the same parameters as the extraction, even if some shared parameters files are modified on your machine between the extraction and the merging.
For the XML Filter parameters files: If the options use a declared ITS rules document, the given rules (along with any linked rules) are compiled into a new rules document that is copied to the Target folder. The parameters file itself is also copied there and is modified to point to this new rules document. ITS rules documents linked from the original source document are not touched.
Create a Horizon settings file (.hrs) -- Set this option if you want to generate a settings file that can be used with Horizon to browse through the prepared files. When this option is set, the option Include the original files in the output package is forced and becomes inaccessible. The generated Horizon settings file will have the same name as the package, with a .hrs extension. This option is useful when extracting the input files into RTF or XLIFF+RTF.
Note that there are a few compatibility issues with Horizon version 3.x:
Most of the time, these issues do not arise. When they do, they can be corrected in Horizon by specifying the correct information after the Horizon settings file has been generated. In Horizon, select the Settings option from the File menu, then select the Conversion tab, and if necessary, reset the language and encoding information.
Create an OmegaT project file -- Set this option to generate a project file and the relevant folders for OmegaT.
OmegaT is an open-source translation tool. When you select this option, the Work output folder is called source, the Target output folder is called target, the optional TMX output is placed in the tm folder, and the empty folders omegat and glossary are also generated.
Whether they have an RTF layer or not, the XLIFF files generated by the Text Extraction utility have the following characteristics:
The files use a few Okapi-specific attributes that are defined through the okapi-framework:xliff-extensions namespace. This namespace usually uses the okp prefix.
okp:settings - Stores the settings string that should be used to re-process the original input file. This is always used during the merging process. For more information about the format of this string, see the Filter settings string parameter of the LoadSettings method of the Filter Interface. Note that the value of this attribute depends on whether the option Create a Rainbow project for merging any XLIFF output is set or not. If it is set, the value points to the parameters file created in the Target folder along with the merge project. If it is not set, the value is the same as the settings used for the extraction.
okp:encoding - Indicates the default encoding of the original input file. This may or may not be used during the merging. The value must be an IANA charset name.
The files must have these attributes intact when you merge back the XLIFF documents with the Text Merging utility.
You can add inline codes in the XLIFF documents by using <bpt>/<ept> or <ph/> elements with an id attribute set to zero.
For example, in the entry below the codes <BOLD> and </BOLD> have been added to the entry:
<target xml:lang="fr-FR">This section <bpt id="0">&lt;BOLD&gt;</bpt>must<ept id="0">&lt;/BOLD&gt;</ept> be read by the user.</target>
The added content must result in valid codes for the original format (XML, HTML, etc.) once it has been merged back. The tools adding such inline codes are responsible for creating the correct codes.
You can remove inline codes in the XLIFF documents by deleting the content of the relevant <bpt>/<ept> or <ph/> elements. If you delete the content of a <bpt> element, you must also delete the content of its corresponding <ept> element (both have the same id value). For example, in the entry below the codes stored in <bpt> and <ept> have been removed.
<target xml:lang="fr-FR">This section <bpt id="1"></bpt>must<ept id="1"></ept> be read by the user.</target>
The modified entry must result in a valid entry for the original format once it has been merged back.
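A tool that removes inline codes needs to empty both halves of a pair. The following Python sketch shows one way to do that with the standard xml.etree.ElementTree module. It is an illustration only: the function name remove_inline_code is hypothetical, and the sample fragment in the usage note is made up for the example.

```python
import xml.etree.ElementTree as ET


def remove_inline_code(target: ET.Element, code_id: str) -> int:
    """Empty every <bpt>, <ept>, or <ph> element that has the given id.

    The elements themselves are kept; only their content is deleted.
    Returns the number of elements that were emptied.
    """
    matches = [
        element
        for element in target.iter()
        # Compare local names so namespaced XLIFF elements match too.
        if element.tag.rsplit("}", 1)[-1] in ("bpt", "ept", "ph")
        and element.get("id") == code_id
    ]
    for element in matches:
        for child in list(element):
            element.remove(child)
        # The tail (the text after the element) is left untouched.
        element.text = None
    return len(matches)
```

Applied with the id "1" to a <target> whose <bpt id="1"> and <ept id="1"> elements hold an opening and a closing code, the function empties both elements and leaves the surrounding text unchanged. Whether the result is still valid for the original format remains the responsibility of the calling tool.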