- Overview
- Filter Properties
- Processing Details
- Rewriting
- ITS Processing
- Extensions to ITS
- Parameters - Options Tab
- Parameters - Inline Codes Tab
- Parameters - Output Tab

Overview

IMPORTANT NOTE: The XML Filter is only a BETA version.

The XML Filter is an Okapi component that implements the Okapi Filter Interface for XML documents.

The following is an example of a very simple XML document. The source text is underlined. The ITS Processing section describes which parts of the file are extractable.

<?xml version="1.0" encoding="utf-8"?>
<myDoc>
 <prolog>
  <author>Zebulon Fairfield</author>
  <version>version 12, revision 2 - 2006-08-14</version>
  <keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
  <storageKey>articles-6D272BA9-3B89CAD8</storageKey>
 </prolog>
 <body>
  <title>Appaloosa</title>
  <p>The Appaloosas are rugged horses originally bred by
the <kw>Nez-Perce</kw> tribe in the US Northwest.</p>
  <p>They are often characterized by their spotted coats.</p>
 </body>
</myDoc>

Filter Properties

The properties for the XML Filter are the following:

Property          This Filter
INPUTFILE         Yes
INPUTSTRING       Yes
BILINGUALINPUT    Yes
TEXTBASED         Yes
OUTPUTFILE        Yes
OUTPUTSTRING      Yes
ANCILLARYOUTPUT   No
XMLOUTPUT         No
RTFOUTPUT         Yes
USEKEY            No
ISINDEMOMODE      No

Processing Details

Input Encoding

The filter detects the encoding of the input file automatically, using the XML encoding declaration mechanism. The encoding specified by the user in the call to the filter is not used at all.
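
As an illustration of detection based on the XML declaration, here is a minimal Python sketch. It is not the filter's actual code: the function name detect_xml_encoding is hypothetical, and the sketch only covers a byte-order mark and an ASCII-compatible declaration, falling back to the XML default of UTF-8.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_DECL = re.compile(rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def detect_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes (hypothetical helper)."""
    # A byte-order mark, when present, identifies the Unicode encoding.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise look for encoding="..." in the XML declaration.
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    # XML defaults to UTF-8 when no encoding is declared.
    return "utf-8"
```

The point of the sketch is that the result depends only on the document bytes, never on a caller-supplied encoding.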

Output Encoding

The filter uses the encoding specified by the user for the output file. If the file already has an encoding declaration, it is updated; otherwise, an encoding declaration is added if the option Add XML declaration if needed is set.

TODO: Special case for XHTML files.

Input and Output Languages

The filter ignores any language information (i.e. xml:lang attributes), unless it is explicitly part of an ITS rule. The filter also does not modify language information in the output.

Rewriting

When writing out the processed XML document, the filter may make some changes to the document:

XML Declaration

If the original document does not have an XML declaration (i.e. <?xml version=... ?>), one is automatically added if the option Add XML declaration if needed is set.

Attribute Formatting

Any formatting characters (multiple white-spaces, line-breaks, etc.) in start tags or empty-element tags are removed and the attributes are reformatted. For example, the following source document:

<para   attr1   = "123"
        attr2 =   'abc'
>Text</para>

gets re-written as:

<para attr1="123" attr2="abc">Text</para>
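
The reformatting described above can be approximated with the following Python sketch. It is an illustration only, not the filter's implementation: the function name normalize_start_tag is hypothetical, the tag is parsed with regular expressions rather than a real XML parser, and double-quotes are assumed as the output delimiter.

```python
import re

# One attribute: name, optional white space around '=', then a quoted value.
_ATTR = re.compile(r"""([^\s=<>/]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")
# The element name right after the opening '<'.
_NAME = re.compile(r"<\s*([^\s/>]+)")


def normalize_start_tag(tag: str) -> str:
    """Rewrite a start or empty-element tag with single spaces and double-quoted values."""
    name = _NAME.match(tag).group(1)
    parts = [name]
    for match in _ATTR.finditer(tag):
        # Group 2 holds a double-quoted value, group 3 an apostrophe-quoted one.
        value = match.group(2) if match.group(2) is not None else match.group(3)
        # A literal double-quote must be escaped once the value is double-quoted.
        value = value.replace('"', "&quot;")
        parts.append(f'{match.group(1)}="{value}"')
    closing = "/>" if tag.rstrip().endswith("/>") else ">"
    return "<" + " ".join(parts) + closing
```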

Attribute Quotes

Attribute values are delimited by either double-quotes or apostrophes, regardless of which delimiter was used in the original document. You can select either character through the Quote for attributes option.

For example, if you choose double-quotes:

<Elem attr1='value1' attr2="value2">...

gets re-written as:

<Elem attr1="value1" attr2="value2">...

Empty Elements

Empty elements are all re-written the same way, but you can control how. They can be re-written as <elem/>, <elem />, or <elem></elem>. You can also set exceptions to your main choice. See the Output Tab parameters for more information.

White Spaces

White spaces in the content of elements that have an xml:space="preserve" attribute are always preserved. Note also that you can use the ITS extension rule whiteSpaces to specify that an element should have its white spaces preserved.

White spaces in the content of non-extractable elements are preserved whether or not the elements have an xml:space="preserve" attribute.

Other content has its white spaces preserved or not according to the corresponding information found in the ITS file used to process the document. The exception is when the Preserve as many white-spaces as possible option is set, in which case as many white spaces as possible are preserved.

Extra line-breaks may be added at specific places if the Ensure line-breaks after extractable non-inline elements option is set.

ITS Processing

The XML Filter supports the ITS features necessary for correctly accessing translatable text inside XML documents. ITS (Internationalization Tag Set) is a W3C namespace created to help internationalize XML schemas and documents. The latest ITS specification is available here: http://www.w3.org/TR/its/. See also this introduction and quick reference.

The XML Filter implementation of ITS is as follows:

Features \ Selection   Global   Local
Translate              Yes      Yes
Directionality         No       No
Terminology            No       No
Element Within Text    Yes      N/A
Ruby                   No       No
Localization Note      Yes      Yes

When processing a document, the filter...

  1. Assumes that all element content is translatable, and none of the attribute values are translatable.
  2. Applies the global rules found in the ITS file associated with the input file using the parameters file.
  3. Applies the global rules found in the document.
  4. And finally, applies the local rules within the document.

For example: Assuming that ITSForDoc.xml is the ITS file associated with the input file Document.xml, the translatable text is listed below.

ITSForDoc.xml:

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:translateRule selector="//head" translate="no"/>
 <its:withinTextRule selector="//b|//code|//img" withinText="yes"/>
</its:rules>

Document.xml:

<doc>
 <head>
  <update>2006-12-25</update>
  <author>Mirabelle McIntosh</author>
 </head>
 <body>
  <p>Paragraph with <img ref="eg.png"/> and <b>bolded text</b>.</p>
  <p>Paragraph with <code>data codes</code> and text.</p>
 </body>
</doc>

Translatable text and inline codes:

1: "Paragraph with <x id='1'/> and <g id='2'>bolded text</g>."
2: "Paragraph with <g id='1'><x id='2'/></g> and text."
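
The order of precedence described above (default, then global rules, then local markup) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function translatable_elements is hypothetical, selectors are limited to the path subset that xml.etree.ElementTree supports (for example ".//head" instead of "//head"), and attributes and withinText rules are ignored.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"


def translatable_elements(doc_xml, global_rules):
    """Return the names of the translatable elements, in document order.

    global_rules is a list of (path, translate) pairs: rules from the external
    ITS file first, then rules found in the document. Later rules win.
    """
    root = ET.fromstring(doc_xml)
    explicit = {}
    for path, translate in global_rules:
        for element in root.findall(path):
            explicit[element] = translate

    flags = {}

    def walk(element, inherited):
        # Start from the inherited value, then apply a global rule if one matched.
        flag = explicit.get(element, inherited)
        # A local its:translate attribute overrides any global rule.
        local = element.get(f"{{{ITS_NS}}}translate")
        if local is not None:
            flag = local == "yes"
        flags[element] = flag
        for child in element:
            walk(child, flag)

    # Step 1: all element content is assumed translatable by default.
    walk(root, True)
    return [element.tag for element in root.iter() if flags[element]]
```

For instance, calling translatable_elements with the rule (".//head", False) on Document.xml would exclude head and the elements it contains while leaving the body elements translatable.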

Extensions to ITS

In addition to the standard ITS features, the filter also supports a few extensions designed to work with ITS. These extensions are in the namespace identified by the URI "okapi-framework:its-extensions" (prefixed "ext" in this document). The following extensions are available:

The idPointer Extension

The ext:idPointer attribute in an its:translateRule element specifies the node that holds the ID of the item to extract. The value must be an XPath expression relative to the position of the node selected by the selector attribute of the its:translateRule element.

This rule applies only to the nodes selected by the selector attribute. It is not inherited by their child nodes or by their attributes.

If the XPath expression is prefixed with a character $, the ID uses the name of the node pointed instead of its value. If the XPath expression is followed by a character |, the computed value of the ID has the text after the character | used as a suffix. For example, the following rules:

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0"
 xmlns:ext="okapi-framework:its-extensions">
 <its:translateRule selector="//*" translate="no"/>
 <its:translateRule selector="//name/@text" translate="yes" ext:idPointer="../id|Name"/>
 <its:translateRule selector="//description/@text" translate="yes" ext:idPointer="../id|Desc"/>
 <its:translateRule selector="//*/@msg" translate="yes" ext:idPointer="$."/>
</its:rules>

Applied on:

<doc>
 <item>
  <id>001</id>
  <name text="Text of the label"/>
  <description text="Text of the description"/>
 </item>
 <MSG_WARNING_001 msg='Some text to translate'/>
</doc>

Will result in:

<trans-unit id="1" resname="001Name">
 <source>Text of the label</source>
</trans-unit>
<trans-unit id="2" resname="001Desc">
 <source>Text of the description</source>
</trans-unit>
<trans-unit id="3" resname="MSG_WARNING_001">
 <source>Some text to translate</source>
</trans-unit>

Note that the example above is an extreme case: You should avoid using translatable attributes and element names that are IDs.
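
The $ prefix and | suffix conventions of ext:idPointer can be illustrated with a short Python sketch. The names IdPointer, parse_id_pointer and build_id are hypothetical, the XPath evaluation itself is left out, and the sketch assumes the XPath expression does not itself contain a | character.

```python
from typing import NamedTuple


class IdPointer(NamedTuple):
    xpath: str           # XPath expression, relative to the selected node
    use_node_name: bool  # True when the value started with '$'
    suffix: str          # text found after '|', or '' when there is none


def parse_id_pointer(raw: str) -> IdPointer:
    """Split an ext:idPointer value into its XPath, '$' flag and suffix."""
    use_node_name = raw.startswith("$")
    if use_node_name:
        raw = raw[1:]
    # Assumption: the first '|' separates the expression from the suffix.
    xpath, _, suffix = raw.partition("|")
    return IdPointer(xpath, use_node_name, suffix)


def build_id(pointer: IdPointer, node_name: str, node_value: str) -> str:
    """Compute the ID from the node the XPath expression points to."""
    base = node_name if pointer.use_node_name else node_value
    return base + pointer.suffix
```

With these helpers, a pointer of "../id|Desc" applied to an id node whose value is "001" yields "001Desc", which matches the resname shown in the example.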

The whiteSpaces Extension

The ext:whiteSpaces attribute in an its:translateRule element specifies how to treat white-spaces. If the value is set to "preserve", all the white-spaces within the selection are preserved. The value "default" indicates that the white-spaces are treated as the processing application decides.

This property does not apply to attributes. To override this property locally, use the xml:space attribute.

TODO: inheritance for xml:space not implemented.

The targetPointer Extension

The ext:targetPointer attribute in an its:translateRule element specifies that the extracted text must be merged in the specified node. The value must be an XPath expression relative to the position of the node selected by the selector attribute of the its:translateRule element.

If the target node does not already exist, the filter tries to create it. Note that more complex XPath expressions may not allow the filter to create a new node. For example, ext:targetPointer="../target" is fine, but something like ext:targetPointer="../item[3]" will not work if the node does not exist yet.

Important: the targetPointer extension currently does not work properly if the content of the target node has inline elements and the target text is already present.

If the target node already contains content that differs from the text to merge, the option Overwrite existing target content determines whether the new text is written.

For example, the following document has the source text in the <source> element, and the merged text should go in the <target> element.

<document xmlns:its="http://www.w3.org/2005/11/its">
 <head>
  <its:rules version="1.0" xmlns:ext="okapi-framework:its-extensions">
   <its:withinTextRule withinText="yes" selector="//b"/>
   <its:translateRule translate="no" selector="//target"/>
   <its:translateRule translate='yes' selector="//msg/source"
    ext:targetPointer="../target" ext:idPointer="../@name"/>
  </its:rules>
 </head>
 <body>
  <msg name="msg1">
   <source>First message.</source>
  </msg>
  <msg name="msg2">
   <source>Text of <b>msg2</b>.</source>
   <target/>
  </msg>
 </body>
</document>
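
The merge behavior described in this section (create the target node if it is missing, and honor the Overwrite existing target content option) can be sketched in Python as follows. This is an assumption-laden illustration, not the filter's code: merge_targets is a hypothetical function, the msg, target and name names are taken from the example above, and only plain text without inline elements is handled.

```python
import xml.etree.ElementTree as ET


def merge_targets(doc_xml, translations, overwrite=False):
    """Write translated text into the <target> child of each <msg> element.

    translations maps the value of the msg name attribute to the text to merge.
    """
    root = ET.fromstring(doc_xml)
    for msg in root.iter("msg"):
        new_text = translations.get(msg.get("name"))
        if new_text is None:
            continue
        target = msg.find("target")
        if target is None:
            # The target node does not exist yet: create it.
            target = ET.SubElement(msg, "target")
        existing = (target.text or "").strip()
        # Keep different existing content unless overwriting is allowed.
        if existing and existing != new_text and not overwrite:
            continue
        target.text = new_text
    return root
```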

The cdataOutput Extension

TODO

Parameters - Options Tab

Localization Properties

Use auto-detection -- Select this option to have the localization properties information determined from the auto-detection mechanism of the XML Filter.

TODO

Use the following ITS file -- Select this option to force the XML file to be processed using the specified ITS file.

Whenever possible, you should use macros instead of hard-coded paths. Macros are case-sensitive. The following macros are available:

If only a filename is specified, with no folder, the ITS file is expected to be in the same folder as the current parameters file. This is often the preferred way to declare the ITS file.

Edit -- Click this button to edit the ITS file currently specified.

New -- Click this button to create a new ITS file in a specified folder, to make it the one to use, and to edit it. The default file created is based on a template located in the Okapi Shared folder (usually C:\Program Files\Okapi\Shared) and called ITS-Template.xml. You can customize it if needed.

Check -- Click this button to verify if the current ITS file is well-formed and to perform a basic validation of the ITS markup. Note that this check will not catch all possible issues. Possible syntax errors are reported in the Log.

ITS Help -- Click this button to get an introduction and a quick reference on ITS.

Ignore DTD -- Set this option to ignore any DTD declaration at the top of the XML document. This feature is used for example when you have to process a file with a DTD declaration but do not have access to the given DTD.

Use a custom resolver for DTD -- Set this option and enter the full path of the resolver file if you want to use a custom resolver instead of the DTD file possibly specified in the input documents. This feature is used for example when you have to process a file with a DTD declaration but do not have access to the given DTD. You can create your own list of entity declarations to resolve any entity references in the document.

A default resolver is provided (XmlDefaults.ent) and contains the entity declarations of the traditional HTML character entities. For performance reasons, it is often recommended to use this resolver rather than the XHTML Web-based DTDs.

Parameters - Inline Codes Tab

Mark as inline codes the text parts matching this regular expression -- Set this option to apply the specified regular expression to the text of the extracted items. Any match is converted to an inline code. No default expression is defined.

Edit Expression -- Click this button to edit the regular expression and its options.

See the Regular Expressions section for more information about the syntax and rules for building regular matching patterns.
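
To illustrate how matches can become inline codes, here is a minimal Python sketch. The function mark_inline_codes is hypothetical, the placeholder notation follows the <x id='1'/> form used earlier in this document, and the pattern uses Python regular expression syntax, which may differ from the syntax the filter accepts.

```python
import re


def mark_inline_codes(text, pattern):
    """Replace each match of pattern with a numbered placeholder.

    Returns the rewritten text and the list of original matched strings.
    """
    codes = []

    def replace(match):
        # Remember the original text so it can be restored when merging.
        codes.append(match.group(0))
        return f"<x id='{len(codes)}'/>"

    return re.sub(pattern, replace, text), codes
```

For example, the pattern r"\{\d+\}|%[sd]" applied to "Hello {0}, you have %d files" protects both {0} and %d as inline codes.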

Parameters - Output Tab

Preserve as many white-spaces as possible -- Set this option to output the XML document with as many as possible of the white spaces found in the input document. Note that some white spaces inside the tags are always normalized, but when this option is set, all white spaces inside extractable content are preserved, even if the ITS settings associated with the file call for normalizing the content of some elements.

Ensure line-breaks after extractable non-inline elements -- Set this option to ensure that there is at least one line-break after the closing tags of all extractable non-inline elements (if no other white-space already exists there). This option is useful when the original file has no line-break at all (e.g. the content.xml document in many OpenOffice .sxw files). Documents that are all on a single line may be difficult to work with in some contexts.

Add XML declaration if needed -- Set this option to automatically add an XML declaration (e.g. <?xml version="1.0"?>) at the top of the output document if one does not exist yet.

Overwrite existing target content -- Set this option to allow any existing text to be overwritten when the content of a target node is different from the text to merge. This is used with the targetPointer ITS extension.

Quote for attributes -- Select the type of quote to use when writing out the attribute values. You have the choice between "Double-quote (")" and "Apostrophe (')". If you select double-quotes, any double-quote in an attribute value is escaped as &quot; and apostrophes are not escaped. If you select apostrophes, any double-quote in an attribute value is escaped as &quot; and any apostrophe is escaped as &#39;. This behavior allows better compatibility with XML and XHTML parsers. It is recommended to use double-quotes.
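
The escaping rules for the two quote choices can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that & and < in the value have already been escaped elsewhere.

```python
def escape_attribute_value(value: str, quote: str = '"') -> str:
    """Escape quote characters in an attribute value for the chosen delimiter."""
    if quote not in ('"', "'"):
        raise ValueError("quote must be a double-quote or an apostrophe")
    # Double-quotes are escaped with either delimiter.
    escaped = value.replace('"', "&quot;")
    # Apostrophes are escaped only when the apostrophe is the delimiter.
    if quote == "'":
        escaped = escaped.replace("'", "&#39;")
    return escaped


def write_attribute(name: str, value: str, quote: str = '"') -> str:
    """Return name=value text using the chosen delimiter."""
    return f"{name}={quote}{escape_attribute_value(value, quote)}{quote}"
```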

Escape '>' into &gt; -- Select this option to always escape the character '>' into the predefined character reference &gt;.

Write empty elements with an end tag -- Set this option to output empty elements in the start-tag + end-tag notation (i.e. <elem></elem>). If this option is not set, empty elements are written in their short notation (i.e. <elem/>).

Write empty elements with an ending space -- Set this option to output empty elements with a space between the tag name and the closing slash (i.e. <elem />). If this option is not set, no ending space is added (i.e. <elem/>). This is useful for backward compatibility with SGML parsers in some cases.

Exception list for the end tag option -- Enter the names of the elements for which the Write empty elements with an end tag option should be reversed. That is, if the option is set, the elements listed here will not have an end tag; if the option is not set, the elements listed here will have an end tag. The names are case-sensitive. The names must be separated by a space, a comma or a semi-colon.
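
The interaction of the three empty-element options can be summarized with a small Python sketch. The function names and parameter names are hypothetical, and attributes are left out for brevity.

```python
import re


def parse_exceptions(raw: str) -> set:
    """Split the exception list on spaces, commas or semi-colons."""
    return {name for name in re.split(r"[ ,;]+", raw) if name}


def write_empty_element(name, end_tag=False, ending_space=False, exceptions=""):
    """Return the text of an empty element according to the output options."""
    use_end_tag = end_tag
    # Elements in the exception list get the opposite of the main choice.
    if name in parse_exceptions(exceptions):
        use_end_tag = not use_end_tag
    if use_end_tag:
        return f"<{name}></{name}>"
    return f"<{name} />" if ending_space else f"<{name}/>"
```

For example, with the end tag option not set and "script; textarea" as the exception list, br is written as <br/> while script is written as <script></script>.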