Okapi Components: XML Filter
Overview
IMPORTANT NOTE: The XML Filter is only a BETA version.
The XML Filter is an Okapi component that implements the Okapi Filter Interface for XML documents.
The following is an example of a very simple XML document. The source text is underlined. The information about what parts of the file are extractable or not is described in the ITS Processing section.
<?xml version="1.0" encoding="utf-8"?>
<myDoc>
  <prolog>
    <author>Zebulon Fairfield</author>
    <version>version 12, revision 2 - 2006-08-14</version>
    <keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
    <storageKey>articles-6D272BA9-3B89CAD8</storageKey>
  </prolog>
  <body>
    <title>Appaloosa</title>
    <p>The Appaloosas are rugged horses originally breed by the <kw>Nez-Perce</kw> tribe in the US Northwest.</p>
    <p>They are often characterized by their spotted coats.</p>
  </body>
</myDoc>
The properties for the XML Filter are the following:
Property | This Filter |
---|---|
INPUTFILE | Yes |
INPUTSTRING | Yes |
BILINGUALINPUT | Yes |
TEXTBASED | Yes |
OUTPUTFILE | Yes |
OUTPUTSTRING | Yes |
ANCILLARYOUTPUT | No |
XMLOUTPUT | No |
RTFOUTPUT | Yes |
USEKEY | No |
ISINDEMOMODE | No |
The filter uses the XML encoding declaration mechanism to automatically detect the encoding of the input file. The encoding defined by the user in the call to the filter is not used at all.
The filter uses the encoding specified by the user for the output file. If the file already has an encoding declaration, it is updated; otherwise (if the option Add XML declaration if needed is set) an encoding declaration is added.
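The encoding handling described above can be illustrated with a short Python sketch. This is not the filter's implementation: the function names `declared_encoding` and `set_output_declaration` are hypothetical, the fallback to UTF-8 when no declaration is present is an assumption based on the XML default, and byte-order marks and UTF-16 input are deliberately not handled.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the encoding name, if one is declared.
_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    rb'(?:\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'])?'
)

def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration of `data` (bytes).

    Assumption: falls back to `default` when nothing is declared.
    """
    m = _DECL.match(data)
    if m and m.group(1):
        return m.group(1).decode("ascii")
    return default

def set_output_declaration(text, encoding, add_if_missing=True):
    """Update the declared encoding of `text`, or add a declaration.

    `add_if_missing` stands in for the "Add XML declaration if needed" option.
    """
    m = re.match(r"<\?xml\s[^?]*\?>", text)
    if m is None:
        if not add_if_missing:
            return text
        return '<?xml version="1.0" encoding="%s"?>\n%s' % (encoding, text)
    decl = m.group(0)
    if re.search(r"\sencoding\s*=", decl):
        # Replace the value of the existing encoding pseudo-attribute.
        decl = re.sub(r'(\sencoding\s*=\s*)(["\'])[^"\']*\2',
                      r'\g<1>"%s"' % encoding, decl)
    else:
        # Insert the encoding right after the version pseudo-attribute.
        decl = re.sub(r'(\sversion\s*=\s*(["\'])[^"\']*\2)',
                      r'\1 encoding="%s"' % encoding, decl)
    return decl + text[m.end():]
```

The user-supplied encoding only appears in `set_output_declaration`, which mirrors the rule that it applies to the output file and never to input detection.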
TODO: Special case for XHTML files.
The filter ignores any language information (i.e. xml:lang attributes), unless it is explicitly part of an ITS rule. The filter also does not modify language information in the output.
When writing out the processed XML document, the filter may make some changes to the document:
If the original document does not have an XML declaration (i.e. <?xml version=... ?>), one is automatically added if the option Add XML declaration if needed is set.
Any formatting characters (multiple white-spaces, line-breaks, etc.) inside start tags or empty-element tags are removed, and attributes are reformatted. For example, the following source document:
<para attr1 = "123" attr2 = 'abc' >Text</para>
gets re-written as:
<para attr1="123" attr2="abc">Text</para>
Attribute values are delimited by either double-quotes or apostrophes, regardless of which delimiter was used in the original document. You can select either character through the Quote for attributes option.
For example, if you choose double-quotes:
<Elem attr1='value1' attr2="value2">...
gets re-written as:
<Elem attr1="value1" attr2="value2">...
Empty elements are all re-written the same way, but you can control how. They can be re-written as <elem/>, <elem />, or <elem></elem>. You can also set exceptions to your main choice. See the Output Tab parameters for more information.
White spaces in the content of elements that have an xml:space="preserve" attribute are always preserved. Note also that you can use the ITS extension rule whiteSpaces to specify that an element should have its white spaces preserved.
White spaces in the content of non-extractable elements are preserved regardless of whether they have an xml:space="preserve" attribute or not.
Other content has its white spaces preserved or not according to the corresponding information found in the ITS file used to process the document, except if the Preserve as many white-spaces as possible option is set, in which case as many white spaces as possible are preserved.
Extra line-breaks may be added at specific places if the Ensure line-breaks after extractable non-inline elements option is set.
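The white-space rules above can be summarized as a small decision function. The following Python sketch is only an illustration: the function name and parameters are hypothetical, and it assumes that "not preserved" means collapsing each run of white space into a single space, which the documentation does not state explicitly.

```python
import re

def output_whitespace(text, extractable=True, xml_space=None,
                      its_rule="default", preserve_all=False):
    """Return `text` as it would be written out under the rules above.

    extractable  -- False for content of non-extractable elements
    xml_space    -- value of the xml:space attribute, if any
    its_rule     -- "preserve" or "default" (ITS whiteSpaces information)
    preserve_all -- the "Preserve as many white-spaces as possible" option
    """
    if not extractable:
        return text                  # always preserved
    if xml_space == "preserve":
        return text                  # always preserved
    if its_rule == "preserve" or preserve_all:
        return text
    # Assumption: normalization collapses runs of white space.
    return re.sub(r"\s+", " ", text)

print(repr(output_whitespace("Line one\n   line two")))
print(repr(output_whitespace("Line one\n   line two", xml_space="preserve")))
```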
The XML Filter supports the ITS features necessary for correctly accessing translatable text inside XML documents. ITS (Internationalization Tag Set) is a W3C namespace created to help internationalize XML schemas and documents. The latest ITS specification is available here: http://www.w3.org/TR/its/. See also this introduction and quick reference.
The XML Filter implementation of ITS is as follows:
Features \ Selection | Global | Local |
---|---|---|
Translate | Yes | Yes |
Directionality | No | No |
Terminology | No | No |
Element Within Text | Yes | N/A |
Ruby | No | No |
Localization Note | Yes | Yes |
When processing a document, the filter...
For example: Assuming that ITSForDoc.xml is the ITS file associated with the input file Document.xml, the translatable text is listed below.
ITSForDoc.xml:
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
  <its:translateRule selector="//head" translate="no"/>
  <its:withinTextRule selector="//b|//code|//img" withinText="yes"/>
</its:rules>
Document.xml:
<doc>
  <head>
    <update>2006-12-25</update>
    <author>Mirabelle McIntosh</author>
  </head>
  <body>
    <p>Paragraph with <img ref="eg.png"/> and <b>bolded text</b>.</p>
    <p>Paragraph with <code>data codes</code> and text.</p>
  </body>
</doc>
Translatable text and inline codes:
1: "Paragraph with <x id='1'/> and <g id='2'>bolded text</g>."
2: "Paragraph with <g id='1'><x id='2'/></g> and text."
In addition to the standard ITS features, the filter also supports a few extensions designed to work with ITS. These extensions are in the namespace identified by the URI "okapi-framework:its-extensions" (prefixed "ext" in this document). The following extensions are available:
idPointer Extension
The ext:idPointer attribute in an its:translateRule element specifies the node where the ID of the item to extract is located. The value must be an XPath expression relative to the position of the node selected by the selector attribute of the its:translateRule element.
This rule applies only to the nodes selected by the selector attribute. It is not inherited by their child nodes or by their attributes.
If the XPath expression is prefixed with the character $, the ID uses the name of the node pointed to instead of its value. If the XPath expression is followed by the character |, the text after the | is appended to the computed ID as a suffix. For example, the following rules:
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:ext="okapi-framework:its-extensions">
  <its:translateRule selector="//*" translate="no"/>
  <its:translateRule selector="//label/@text" translate="yes" ext:idPointer="../id|Name"/>
  <its:translateRule selector="//description/@text" translate="yes" ext:idPointer="../id|Desc"/>
  <its:translateRule selector="//*/@msg" translate="yes" ext:idPointer="$."/>
</its:rules>
Applied on:
<doc>
  <item>
    <id>001</id>
    <label text="Text of the label"/>
    <description text="Text of the description"/>
  </item>
  <MSG_WARNING_001 msg='Some text to translate'/>
</doc>
Will result in:
<trans-unit id="1" resname="001Name">
  <source>Text of the label</source>
</trans-unit>
<trans-unit id="2" resname="001Desc">
  <source>Text of the description</source>
</trans-unit>
<trans-unit id="3" resname="MSG_WARNING_001">
  <source>Some text to translate</source>
</trans-unit>
Note that the example above is an extreme case: You should avoid using translatable attributes and element names that are IDs.
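To make the $ prefix and | suffix conventions concrete, here is a minimal Python sketch that computes the IDs for the example document. It is an illustration, not the filter's code: `parse_id_pointer` and `resolve_id` are hypothetical names, only leading "../" steps and simple child paths are supported (ElementTree implements a small XPath subset), and it assumes that for an attribute the pointer is evaluated from the element carrying the attribute, which is what the results shown above suggest.

```python
import xml.etree.ElementTree as ET

def parse_id_pointer(pointer):
    """Split an ext:idPointer value into (use_name, xpath, suffix)."""
    use_name = pointer.startswith("$")   # '$' prefix: use the node name
    if use_name:
        pointer = pointer[1:]
    xpath, _, suffix = pointer.partition("|")  # text after '|' is a suffix
    return use_name, xpath, suffix

def resolve_id(root, owner, pointer):
    """Compute the ID for an item extracted from element `owner`."""
    use_name, xpath, suffix = parse_id_pointer(pointer)
    # ElementTree has no parent links, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = owner
    while xpath.startswith("../"):
        node, xpath = parents[node], xpath[3:]
    if xpath not in ("", "."):
        node = node.find(xpath)
    base = node.tag if use_name else (node.text or "")
    return base + suffix

doc = ET.fromstring(
    "<doc><item><id>001</id>"
    "<label text='Text of the label'/>"
    "<description text='Text of the description'/></item>"
    "<MSG_WARNING_001 msg='Some text to translate'/></doc>")
label = doc.find("item/label")
print(resolve_id(doc, label, "../id|Name"))                # 001Name
print(resolve_id(doc, doc.find("MSG_WARNING_001"), "$."))  # MSG_WARNING_001
```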
whiteSpaces Extension
The ext:whiteSpaces attribute in an its:translateRule element specifies how to treat white-spaces. If the value is set to "preserve", all the white-spaces within the selection are preserved. The value "default" indicates that the white-spaces are treated as the processing application decides.
This property does not apply to attributes. To override this property locally, use the xml:space attribute.
TODO: inheritance for xml:space not implemented.
targetPointer Extension
The ext:targetPointer attribute in an its:translateRule element specifies that the extracted text must be merged into the specified node. The value must be an XPath expression relative to the position of the node selected by the selector attribute of the its:translateRule element.
If the target node does not already exist, the filter tries to create it. Note that some more complex XPath expressions may not lead to the creation of a new node. For example, ext:targetPointer="../target" is fine, but something like ext:targetPointer="../item[3]" will not work if the node does not exist yet.
Important: the targetPointer extension currently does not work properly if the content of the target node has inline elements and the target text is already present.
If the target node already contains content different from the text to merge, the Overwrite existing target content option determines whether the new text is written or not.
For example, the following document has the source text in the <source> element, and the merged text should go in the <target> element.
<document xmlns:its="http://www.w3.org/2005/11/its">
  <head>
    <its:rules version="1.0" xmlns:ext="okapi-framework:its-extensions">
      <its:withinTextRule withinText="yes" selector="//b"/>
      <its:translateRule translate="no" selector="//target"/>
      <its:translateRule translate='yes' selector="//msg/source" ext:targetPointer="../target" ext:idPointer="../@name"/>
    </its:rules>
  </head>
  <body>
    <msg name="msg1">
      <source>First message.</source>
    </msg>
    <msg name="msg2">
      <source>Text of <b>msg2</b>.</source>
      <target/>
    </msg>
  </body>
</document>
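The merge behavior for a pointer such as ext:targetPointer="../target" can be sketched in Python as follows. This is an assumption-laden illustration rather than the filter's logic: `merge_target` is a hypothetical helper, it handles plain text only (no inline elements, in line with the limitation noted above), and the `msg3` entry is an invented test case that is not part of the example document.

```python
import xml.etree.ElementTree as ET

def merge_target(msg, text, overwrite=False):
    """Write `text` into the <target> child of `msg`, creating it if needed.

    `overwrite` stands in for the "Overwrite existing target content" option.
    Returns True if the text was written, False if existing content was kept.
    """
    target = msg.find("target")
    if target is None:
        target = ET.SubElement(msg, "target")   # simple path: can be created
    existing = target.text or ""
    if existing and existing != text and not overwrite:
        return False                            # keep the existing content
    target.text = text
    return True

body = ET.fromstring(
    "<body>"
    "<msg name='msg1'><source>First message.</source></msg>"
    "<msg name='msg3'><source>Third message.</source>"
    "<target>Old text</target></msg>"           # hypothetical entry
    "</body>")
msg1, msg3 = body.findall("msg")

merge_target(msg1, "Translated first message.")   # <target> is created
merge_target(msg3, "Translated third message.")   # kept: overwrite is off
print(ET.tostring(body, encoding="unicode"))
```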
cdataOutput Extension
TODO
Use auto-detection -- Select this option to have the localization properties information determined from the auto-detection mechanism of the XML Filter.
TODO
Use the following ITS file -- Select this option to force the XML file to be processed using the specified ITS file.
Whenever possible, you should use a macro instead of a hard-coded path. Macros are case-sensitive. The following macros are available:
- <SystemParametersFolder>: Indicates the ITS file is in the System Parameters folder.
- <UserParametersFolder>: Indicates the ITS file is in the User Parameters folder.
- <ProjectParametersFolder>: Indicates the ITS file is in the Project Parameters folder.
If no folder is specified but only a filename, the ITS file is expected to be in the same folder as the current parameters file. This is often the preferred way to declare the ITS file.
Edit -- Click this button to edit the ITS file currently specified.
New -- Click this button to create a new ITS file in a specified folder, to make it the one to use, and to edit it. The default file created is based on a template located in the Okapi Shared folder (usually C:\Program Files\Okapi\Shared) and called ITS-Template.xml. You can customize it if needed.
Check -- Click this button to verify if the current ITS file is well-formed and to perform a basic validation of the ITS markup. Note that this check will not catch all possible issues. Possible syntax errors are reported in the Log.
ITS Help -- Click this button to get an introduction and a quick reference on ITS.
Ignore DTD -- Set this option to ignore any DTD declaration at the top of the XML document. This feature is used for example when you have to process a file with a DTD declaration but do not have access to the given DTD.
Use a custom resolver for DTD -- Set this option and enter the full path of the resolver file if you want to use a custom resolver instead of the DTD file possibly specified in the input documents. This feature is used for example when you have to process a file with a DTD declaration but do not have access to the given DTD. You can create your own list of entity declarations to resolve any entity references in the document.
A default resolver is provided (XmlDefaults.ent) and contains the entity declarations of the traditional HTML character entities. For performance reasons, it is often recommended to use this resolver rather than the XHTML Web-based DTDs.
Mark as inline codes the text parts matching this regular expression -- Set this option to apply the specified regular expression to the text of the extracted items. Any match is converted to an inline code. No default expression is defined.
Edit Expression -- Click this button to edit the regular expression and its options.
See the Regular Expressions section for more information about the syntax and rules for building regular matching patterns.
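As an illustration of this option, the Python sketch below replaces every match of a pattern with a placeholder code. The helper name `mark_inline_codes`, the sample pattern for `{0}`-style variables, and the `<x id='N'/>` placeholder notation (borrowed from the ITS example earlier on this page) are all assumptions; the filter's own regular expression syntax and code representation may differ.

```python
import re

def mark_inline_codes(text, pattern):
    """Replace each match of `pattern` in `text` with a placeholder code.

    Returns the converted text and the list of original matched strings.
    """
    codes = []

    def to_code(match):
        codes.append(match.group(0))
        return "<x id='%d'/>" % len(codes)

    return re.sub(pattern, to_code, text), codes

# Hypothetical pattern: numbered variables such as {0} or {12}.
converted, codes = mark_inline_codes("Copied {0} of {1} files", r"\{\d+\}")
print(converted)   # Copied <x id='1'/> of <x id='2'/> files
print(codes)       # ['{0}', '{1}']
```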
Preserve as many white-spaces as possible -- Set this option to output the XML document with as many as possible of the white spaces found in the input document. Note that white spaces inside the tags are always normalized, but when this option is set, all white spaces inside extractable content are preserved even if the ITS settings associated with the file call for normalizing the content of some elements.
Ensure line-breaks after extractable non-inline elements -- Set this option to ensure that there is at least one line-break after the closing tags of all extractable non-inline elements (if no other white-space already exists there). This option is useful when the original file has no line-break at all (e.g. the content.xml document in many OpenOffice .sxw files). Documents that are all in a single line may be difficult to work with in some contexts.
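The effect of the line-break option can be approximated with a regular expression, as in the following Python sketch. It is a rough illustration under stated assumptions: `ensure_line_breaks` is a hypothetical helper, the caller supplies the names of the extractable non-inline elements, and the real filter works on parsed XML rather than on raw text.

```python
import re

def ensure_line_breaks(xml, block_names):
    """Add a line-break after closing tags of `block_names` when the tag
    is not already followed by white space."""
    names = "|".join(re.escape(name) for name in block_names)
    return re.sub(r"(</(?:%s)>)(?!\s)" % names, r"\1\n", xml)

single_line = "<body><p>First <b>bold</b> text.</p><p>Second.</p>\n</body>"
print(ensure_line_breaks(single_line, ["p"]))
```

In this sample the inline `</b>` tag is left alone, and the second `</p>` is unchanged because a line-break already follows it.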
Add XML declaration if needed -- Set this option to automatically add an XML declaration processing instruction (e.g. <?xml version="1.0"?>) at the top of the output document if one does not exist yet.
Overwrite existing target content -- Set this option to allow any existing text to be overwritten when the content of a target node is different from the text to merge. This is used with the targetPointer ITS extension.
Quote for attributes -- Select the type of quote to use when writing out the attribute values. You have the choice between "Double-quote (")" and "Apostrophe (')".
If you choose double-quotes, any double-quote in an attribute value is escaped as &quot; and apostrophes are not escaped. If you choose apostrophes, any double-quote in an attribute value is escaped as &quot; and any apostrophe is escaped as &apos;.
This behavior allows better compatibility with XML and XHTML parsers. It is recommended to use double-quotes.
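The escaping rules for the two quote choices can be sketched in Python as follows. The function `write_attribute` is hypothetical, the use of &quot; and &apos; as the escaped forms is an assumption, and the escaping of & and < is added only because well-formed XML requires it; the text above does not describe it.

```python
def write_attribute(name, value, quote='"'):
    """Serialize one attribute using the chosen delimiter (" or ')."""
    # Required for well-formed XML (assumption: not described above).
    value = value.replace("&", "&amp;").replace("<", "&lt;")
    # Double-quotes are escaped with either delimiter.
    value = value.replace('"', "&quot;")
    # Apostrophes are escaped only when they are the delimiter.
    if quote == "'":
        value = value.replace("'", "&apos;")
    return "%s=%s%s%s" % (name, quote, value, quote)

print(write_attribute("title", 'Say "hi" & don\'t'))
print(write_attribute("title", 'Say "hi" & don\'t', quote="'"))
```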
Escape '>' into &gt; -- Select this option to always escape the character '>' into the predefined entity reference &gt;.
Write empty elements with an end tag -- Set this option to output empty elements in the start-tag + end-tag notation (i.e. <elem></elem>). If this option is not set, empty elements are written in their short notation (i.e. <elem/>).
Write empty elements with an ending space -- Set this option to output empty elements with a space between the tag name and the closing slash (i.e. <elem />). If this option is not set, no ending space is added (i.e. <elem/>). This is useful for backward compatibility with SGML parsers in some cases.
Exception list for the end tag option -- Enter the names of the elements for which the Write empty elements with an end tag option should be reversed. That is, if the option is set, the elements listed here will not have an end tag; if the option is not set, the elements listed here will have an end tag. The names are case-sensitive. The names must be separated by a space, a comma or a semi-colon.
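The interaction between the three empty-element options can be sketched in Python as follows. The function `write_empty_element` and its parameter names are hypothetical, and the sketch assumes that the ending space only matters for the short notation.

```python
import re

def write_empty_element(name, with_end_tag=False, ending_space=False,
                        exceptions=""):
    """Serialize an empty element according to the three options above.

    `exceptions` is the raw exception list: names separated by spaces,
    commas or semi-colons, compared case-sensitively.
    """
    names = [n for n in re.split(r"[ ,;]+", exceptions) if n]
    if name in names:
        with_end_tag = not with_end_tag   # listed names reverse the option
    if with_end_tag:
        return "<%s></%s>" % (name, name)
    return "<%s />" % name if ending_space else "<%s/>" % name

print(write_empty_element("br"))                               # <br/>
print(write_empty_element("br", ending_space=True))            # <br />
print(write_empty_element("script", exceptions="script, textarea"))
```

The last call prints `<script></script>`, because `script` is in the exception list while the end tag option is off.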