Okapi Components: Trados Text Filter

- Overview
The Trados Text Filter is an Okapi component that implements the Okapi Filter Interface for Trados Text translation memory files. See the Trados Web site for more information on the Trados tools.
The following is an example of a very simple Trados Text TM file. The source text is marked in bold blue and the target text is marked in bold green.
Not implemented yet: The filter currently handles only codes in the internal style, not RTF-only codes such as \b or other RTF objects in the segment. Those are stripped out.
<RTF Preamble>
<FontTable>
{\fonttbl
{\f1 \fmodern\fprq1 \fcharset0 Courier New;}
{\f2 \fswiss\fprq2 \fcharset0 Arial;}}
<StyleSheet>
{\stylesheet
{\St \s0 {\StN Normal}}
{\St \cs1 {\StB \v\f1\fs24\sub\cf12 }{\StN tw4winMark}}
{\St \cs2 {\StB \cf4\fs40\f1 }{\StN tw4winError}}
{\St \cs3 {\StB \f1\cf11\lang1024 }{\StN tw4winPopup}}
{\St \cs4 {\StB \f1\cf10\lang1024 }{\StN tw4winJump}}
{\St \cs5 {\StB \f1\cf15\lang1024 }{\StN tw4winExternal}}
{\St \cs6 {\StB \f1\cf6\lang1024 }{\StN tw4winInternal}}
{\St \cs7 {\StB \cf2 }{\StN tw4winTerm}}
{\St \cs8 {\StB \f1\cf13\lang1024 }{\StN DO_NOT_TRANSLATE}}}
</RTF Preamble>
<TrU>
<CrD>18042005, 15:05:27
<CrU>DC
<ChD>18042005, 15:05:27
<ChU>BJ
<UsC>13
<Att L=Component>HD
<Seg L=EN-US>Some text in {\cs6\f1\cf6\lang1024 <b>}bold{\cs6\f1\cf6\lang1024 </b>}.
<Seg L=FR-FR>Du texte en {\cs6\f1\cf6\lang1024 <b>}gras{\cs6\f1\cf6\lang1024 </b>}.
</TrU>
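The distinction between internal-style codes and RTF-only codes can be illustrated with a rough Python sketch. It is not the filter's actual code: the function name `split_segment`, the `[n]` placeholder notation, and the assumption that internal codes always use the `\cs6` style number (as in the sample style sheet above) are all made up for illustration.

```python
import re

# Assumption: an internal-style code is an RTF group that starts with \cs6
# (tw4winInternal in the sample style sheet), then a space, then the code text.
INTERNAL_CODE = re.compile(r"\{\\cs6\b[^ {}]* ([^{}]*)\}")
# Assumption: every other control word (for example \b or \b0) is RTF-only.
RTF_CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")


def split_segment(rtf_segment):
    """Return (text, codes); each internal code becomes a [n] placeholder."""
    codes = []

    def keep_code(match):
        codes.append(match.group(1))
        return "[%d]" % len(codes)

    text = INTERNAL_CODE.sub(keep_code, rtf_segment)
    # Remaining control words are dropped together with their delimiter space.
    text = RTF_CONTROL_WORD.sub("", text)
    return text, codes


if __name__ == "__main__":
    sample = r"Some text in {\cs6\f1\cf6\lang1024 <b>}bold{\cs6\f1\cf6\lang1024 </b>}."
    print(split_segment(sample))
```

In a real file the style number should be looked up in the style sheet rather than hard-coded.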
The properties for the Trados Text Filter are the following:
Property | This Filter |
---|---|
INPUTFILE | Yes |
INPUTSTRING | No |
BILINGUALINPUT | Yes |
TEXTBASED | Yes |
OUTPUTFILE | Yes |
OUTPUTSTRING | No |
ANCILLARYOUTPUT | No |
XMLOUTPUT | No |
RTFOUTPUT | No |
USEKEY | No |
ISINDEMOMODE | No |
If UTF-8 or UTF-16 has been auto-detected as the encoding of the input file, that encoding is used; otherwise the encoding specified by the user is used.
Be careful when selecting the input encoding: Trados Text files are bilingual files, and often the correct encoding of the input file is the one corresponding to the target/output language.
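The input-encoding rule can be sketched as follows. This is a minimal Python illustration that assumes auto-detection relies on a byte-order mark, which this page does not state; `choose_input_encoding` is a hypothetical helper, not part of the filter.

```python
import codecs


def choose_input_encoding(raw_bytes, user_encoding):
    """Return the auto-detected Unicode encoding, else the user's choice."""
    # Assumption: detection is based on the byte-order mark at the file start.
    if raw_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # Python codec that also strips the BOM
    if raw_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return user_encoding


if __name__ == "__main__":
    print(choose_input_encoding(b"<TrU>", "windows-1252"))
```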
The translation unit entries are in RTF format. RTF provides different ways to write out extended characters:
- As a raw character.
- \'hh where hh is the hexadecimal value of the character in the current font encoding.
- \uDDD where DDD is the decimal value of the character in Unicode.

The last two forms can be read correctly without problem, but the raw character form requires switching encodings as the file is read, depending on the font. The filter currently does not support this: raw characters are read using the encoding specified by the user. Note that in the Text TM files for Trados 7 the raw characters are in UTF-8, so they do not have this issue of multiple possible encodings within the same file.
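As an illustration, the two escaped forms can be decoded along the following lines. This Python sketch makes several simplifying assumptions that are not taken from the filter: a single code page (`cp1252` by default) stands in for the current font encoding, and each \uDDD escape is assumed to be followed by at most one fallback character (the RTF default of \uc1). The name `decode_rtf_escapes` is hypothetical.

```python
import re

# Either \'hh (two hex digits), or \uDDD (signed decimal) with an optional
# delimiter space and one optional fallback character (assumes \uc1).
ESCAPE = re.compile(
    r"\\'([0-9a-fA-F]{2})|\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?"
)


def decode_rtf_escapes(text, codepage="cp1252"):
    """Replace \\'hh and \\uDDD escapes with the characters they stand for."""

    def replace(match):
        if match.group(1) is not None:
            # \'hh: one byte in the (assumed) font code page
            return bytes.fromhex(match.group(1)).decode(codepage)
        value = int(match.group(2))
        if value < 0:
            value += 65536  # RTF writes values above 32767 as negative numbers
        return chr(value)

    return ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(decode_rtf_escapes(r"caf\'e9 and caf\u233\'e9"))
```

Raw characters are left untouched here, which mirrors the limitation described above.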
The encoding of the output is the one specified by the user, except when UTF-8 or UTF-16 has been auto-detected as the input encoding. In that case, the output encoding used is the same as the input encoding.
When writing out extended characters, the filter uses the latest RTF syntax, which combines the hexadecimal and the Unicode escape mechanisms. This allows faster and safer reading. Each entry starts with a \ucN command to reset the number of hexadecimal characters to skip when reading Unicode values. This is necessary because of a bug in the way all versions of Trados before version 7 read the Text TM: the \ucN value is not reset to its normal default, and Trados ends up reading both escape forms, resulting in duplicated extended characters everywhere in the imported TM.
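A possible shape for this output is sketched below in Python. The exact sequence the filter writes is not given here, so the details are assumptions: the entry text is prefixed with \uc1, each extended character is written as \uDDD followed by a single \'hh fallback in the chosen code page, and a `?` is used when the character has no single-byte form. The names `escape_entry_text` and `utf16_units` are hypothetical.

```python
def utf16_units(ch):
    """Split one character into its UTF-16 code units."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def escape_entry_text(text, codepage="cp1252"):
    """Write text with a leading \\uc1 and paired \\uDDD + fallback escapes."""
    parts = ["\\uc1 "]  # assumption: one fallback character per \uDDD
    for ch in text:
        if ch in "\\{}":
            parts.append("\\" + ch)  # RTF special characters are escaped
        elif ord(ch) < 128:
            parts.append(ch)
        else:
            try:
                encoded = ch.encode(codepage)
            except UnicodeEncodeError:
                encoded = b""
            # Use \'hh only when the code page gives exactly one byte.
            fallback = "\\'%02x" % encoded[0] if len(encoded) == 1 else "?"
            for unit in utf16_units(ch):
                signed = unit - 65536 if unit > 32767 else unit
                parts.append("\\u%d%s" % (signed, fallback))
    return "".join(parts)


if __name__ == "__main__":
    print(escape_entry_text("caf\u00e9"))
```

A reader that honors \uc1 takes the \uDDD value and skips the fallback, while a reader that ignores Unicode escapes can still use the \'hh form.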
The user-specified source language code is checked against the first <Seg> found. A warning is generated if they are not identical, but the process continues. In the same way, the user-specified target language code is checked against the second <Seg> found, with the same warning and the process continues.
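The language check can be pictured with a short Python sketch. It assumes an exact string comparison and collects the warnings in a list instead of using any real logging facility; `check_languages` is a hypothetical name.

```python
import re

SEG_LANGUAGE = re.compile(r"<Seg L=([^>]+)>")


def check_languages(entry, source_language, target_language):
    """Return warning messages; an empty list means both codes match."""
    found = SEG_LANGUAGE.findall(entry)
    expected = [("source", source_language), ("target", target_language)]
    warnings = []
    # The first <Seg> is compared to the source code, the second to the target.
    for (role, wanted), actual in zip(expected, found):
        if actual != wanted:
            warnings.append(
                "%s language mismatch: expected %s, found %s" % (role, wanted, actual)
            )
    return warnings  # processing continues regardless of the result


if __name__ == "__main__":
    entry = "<TrU>\n<Seg L=EN-US>Some text.\n<Seg L=FR-FR>Du texte.\n</TrU>"
    print(check_languages(entry, "EN-US", "DE-DE"))
```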
The date and time information in Trados Text TMs is in local time. Ideally the filter should convert the TM date/time to UTC. However, because the entries cover dates that can fall in both daylight saving and non-daylight saving periods, using a single time difference may result in erroneous data anyway. In addition, as TMs circulate between translators, the entries may be in different local times. Overall, since there is no certainty that a conversion would be accurate, the filter does not make any.
As a result, the filter simply assumes the date and time information is in UTC.
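In practice this means the value is only reformatted, never shifted. The Python sketch below shows the idea, assuming the TM layout is ddMMyyyy, HH:mm:ss (as the sample value 18042005, 15:05:27 suggests) and that the result uses the "yyyyMMddTHHmmssZ" format of this filter's date properties. `to_property_date` is a hypothetical helper.

```python
from datetime import datetime


def to_property_date(trados_value):
    """Reformat a TM date/time, treating it as UTC without any offset."""
    # Assumption: the TM stores day, month, year, then a 24-hour time.
    parsed = datetime.strptime(trados_value.strip(), "%d%m%Y, %H:%M:%S")
    return parsed.strftime("%Y%m%dT%H%M%SZ")


if __name__ == "__main__":
    print(to_property_date("18042005, 15:05:27"))  # 20050418T150527Z
```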
This filter implements support for a few item properties that give you access to all the information set for each translation unit entry. These properties are:
Property | Description |
---|---|
CreationDate | This corresponds to the <CrD> field. The property value is in the format "yyyyMMddTHHmmssZ". |
CreationUser | This corresponds to the <CrU> field. |
ChangeDate | This corresponds to the <ChD> field. The property value is in the format "yyyyMMddTHHmmssZ". |
ChangeUser | This corresponds to the <ChU> field. |
UsageCount | This corresponds to the <UsC> field. |
UsageDate | This corresponds to the <UsD> field. The property value is in the format "yyyyMMddTHHmmssZ". |
@A Name | Attribute with pick-list, where Name is the name of the attribute. This corresponds to the <Att L=Name> fields. |
@T Name | Attribute with text value, where Name is the name of the attribute. This corresponds to the <Txt L=Name> fields. |
Note that not all of these fields may be available for every entry. The IFilterItem method GetProperty() returns null when the property does not exist. You can use the ListProperties() method to get a semicolon-delimited list of all properties in the current filter item.
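The contract of these two methods can be mimicked with a small Python stand-in. This is not the Okapi API: `FilterItemSketch`, `get_property` and `list_properties` are hypothetical names that only mirror the behavior described above, and the property name "@A Component" assumes that Name is replaced by the attribute's own name.

```python
class FilterItemSketch:
    """Hypothetical stand-in for a filter item that carries entry properties."""

    def __init__(self, properties):
        self._properties = dict(properties)

    def get_property(self, name):
        # Mirrors GetProperty(): None (null) when the property does not exist.
        return self._properties.get(name)

    def list_properties(self):
        # Mirrors ListProperties(): a semicolon-delimited list of names.
        return ";".join(self._properties)


if __name__ == "__main__":
    item = FilterItemSketch({
        "CreationDate": "20050418T150527Z",
        "CreationUser": "DC",
        "@A Component": "HD",
    })
    usage_date = item.get_property("UsageDate")
    if usage_date is None:
        print("No usage date for this entry")
    print(item.list_properties())
```

Checking for a missing value before using a property avoids surprises with entries that lack optional fields such as <UsD>.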
TODO: Recognize the restype, resname, and flag attributes.
Special thanks to Gerrit Sanders and Jean-Christophe Helary for their help with Trados version 7.