Okapi Components - HTML Filter

- Overview
- Filter Properties
- Processing Details
- Parameters - Options Tab
- Parameters - Elements Tab
- Parameters - Inline Codes Tab
- Parameters - Output Tab
- Credits

Overview

This filter is still at the ALPHA stage. Use it for testing only, not for real projects. For real projects, use the HTML Filter of Rainbow v4 instead.

The HTML Filter is an Okapi component that implements the Okapi Filter Interface for HTML documents.

The following is an example of a simple HTML file.

<html> 
 <head>
  <meta http-equiv="Content-Language" content="en-us">
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
  <title>Okapi Components - HTML Filter</title>
 </head>
 <body>
  <p>This is a simple paragraph with some formatting like a word
in <strong>bold</strong>.</p>
 </body>
</html>

Filter Properties

The properties for the HTML Filter are the following:

Property	This Filter
INPUTFILE	Yes
INPUTSTRING	No
BILINGUALINPUT	No
TEXTBASED	Yes
OUTPUTFILE	Yes
OUTPUTSTRING	No
ANCILLARYOUTPUT	No
XMLOUTPUT	No
RTFOUTPUT	Yes
USEKEY	No
ISINDEMOMODE	No

Processing Details

Input Encoding

The filter decides which encoding to use for the input file using the following logic:

The filter checks for Byte-Order-Mark (BOM) at the beginning of the file:
- If a BOM is found: The encoding used is the one corresponding to the BOM form, and the "auto-detected" flag is set.
- Otherwise: The encoding used is the default input encoding specified by the user.
TODO: XML encoding for XHTML cases
When a <meta> element with its http-equiv attribute set to "content-type" is found, the filter checks for a charset declaration in the content attribute.
- If one is found and the option Ignore the encoding declaration in the input file is not set:
  - If the current encoding was auto-detected and is different from the encoding specified in the charset declaration, a warning message is issued and the processing continues with the auto-detected encoding.
  - Otherwise: The new encoding is used for processing the rest of the file.
- Otherwise: The encoding is not modified.
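
The decision logic above can be sketched in a few lines of Python. This is only an illustration, not the filter's actual code: the function name choose_input_encoding, its parameters, and the simplified regular expressions are assumptions made for the example, and encoding names are compared as plain lowercase strings without resolving aliases.

```python
import codecs
import re

# BOM signatures, UTF-32 first so that UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified patterns (assumptions): a real parser would tokenize the tags.
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
HTTP_EQUIV = re.compile(r"http-equiv\s*=\s*[\"']?content-type", re.IGNORECASE)
CHARSET = re.compile(r"charset\s*=\s*([\w.:-]+)", re.IGNORECASE)


def choose_input_encoding(data, default_encoding, ignore_declaration=False):
    """Return (encoding, auto_detected, warnings) for the raw bytes of a file."""
    encoding, auto_detected, warnings = default_encoding, False, []

    # Step 1: a BOM wins over the user default and sets the auto-detected flag.
    for bom, name in BOMS:
        if data.startswith(bom):
            encoding, auto_detected = name, True
            break

    # The "ignore the encoding declaration" option skips step 2 entirely.
    if ignore_declaration:
        return encoding, auto_detected, warnings

    # Step 2: look for a charset declaration in a content-type <meta> element.
    text = data.decode(encoding, errors="replace")
    for tag in META_TAG.finditer(text):
        if not HTTP_EQUIV.search(tag.group(0)):
            continue
        match = CHARSET.search(tag.group(0))
        if not match:
            continue
        declared = match.group(1).lower()
        if auto_detected and declared != encoding:
            # Conflict: warn and keep the auto-detected encoding.
            warnings.append(
                "declared encoding '%s' ignored, using auto-detected '%s'"
                % (declared, encoding)
            )
        else:
            encoding = declared
        break

    return encoding, auto_detected, warnings
```

For a file without a BOM that declares charset=windows-1252, this sketch returns windows-1252 whatever the default is. For the same file saved with a UTF-8 BOM, it keeps utf-8 and records one warning.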

Output Encoding

The encoding of the output file is the one specified by the user. If a charset declaration is found in the file, the encoding value of that declaration is changed in the output to reflect the output encoding.
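
As a rough illustration of this behavior, the Python sketch below rewrites the charset value inside any <meta> tag and then encodes the text. The function name encode_output and the regular expressions are hypothetical, chosen for the example. The filter itself may locate the declaration differently.

```python
import re

# Simplified patterns (assumptions): only charset values inside <meta> tags
# are touched, so the word "charset=" in body text is left alone.
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
CHARSET = re.compile(r"(charset\s*=\s*)[\w.:-]+", re.IGNORECASE)


def encode_output(text, output_encoding):
    """Update the charset declaration, then encode with the output encoding."""

    def fix_tag(match):
        # Keep "charset=" as written and replace only the encoding name.
        return CHARSET.sub(lambda m: m.group(1) + output_encoding, match.group(0))

    return META_TAG.sub(fix_tag, text).encode(output_encoding)
```

Updating the declaration before encoding keeps the two in agreement, so a browser reading the output file does not fall back on a stale charset value.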

Localization Directives

TODO: NOT SUPPORTED YET

The filter supports localization directives. They are special comments you can use to override the default behavior of the filter regarding the parts to extract. Such directives have little use in an HTML file as the format is already geared toward localization, so the processing of localization directives is not turned on by default. Set the option Use localization directives when they are present to turn it on.

The syntax and behavior of the directives are the same across all Okapi filters. See the Localization Directives pages for detailed information about what you can do with the mechanism.

Line-Breaks

TODO

Parameters - Options Tab

Ignore the encoding declaration in the input file -- Set this option to ignore the encoding declaration inside the input file (in a <meta> element), and use either the user-specified encoding, or the encoding auto-detected using the Byte-Order-Mark.

Use localization directives when they are present -- Set this option to enable the filter to recognize localization directives. If this option is not set, any localization directive in the input file will be ignored. NOT IMPLEMENTED YET

Extract items outside the scope of localization directives -- Set this option to extract any translatable item that is not within the scope of a localization directive. Choosing whether or not to extract content outside localization directives lets you mark up fewer parts of the source document. This option is enabled only when the Use localization directives when they are present option is set. NOT IMPLEMENTED YET

See the Localization Directives section for more details on how the filter deals with directives. Note also that if your HTML documents are XHTML, the preferred way to use localization directives is to use the Internationalization Tag Set (ITS) markup and process the files with the XML Filter.

Use Do-Not-Localize list if a DNL file is present -- Set this option to enable the filter to use any Do-Not-Localize list file found along with a given input file. The DNL file has the same path and name as the input file, with an additional .dnl extension. It contains a list of entries that should not be extracted. Each entry is made of the resname, restype and text of a filter item. Use the DNL List Editing utility to create and maintain DNL files.

Parameters - Elements Tab

Inline elements -- List all the elements that should be treated as part of the text (for example <b>, <strong>, etc.). Enter each element by its name only (no brackets), in lowercase. Separate the elements with spaces or line-breaks.

Any element not listed here is treated as non-inline and breaks the text into several translation units. For example, given the following HTML code:

<p>This is <strong>an important aspect</strong> of life.</p>

If <strong> is listed as inline, you will get one translation unit:

"This is <1>an important aspect</1> of life."

If <strong> is not listed as inline, you will get three translation units:

"This is "
"an important aspect"
" of life."

Use Defaults -- Click this button to use the default list of inline elements.
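
The effect of the inline list on the <strong> example above can be reproduced with a small Python sketch built on the standard html.parser module. The class UnitExtractor and the function extract_units are hypothetical names for this illustration. The real filter has its own parser and handles many cases (void elements, attributes, nesting errors) that this sketch ignores.

```python
from html.parser import HTMLParser


class UnitExtractor(HTMLParser):
    """Split HTML text into translation units, keeping listed inline elements."""

    def __init__(self, inline_elements):
        super().__init__()
        self.inline = set(inline_elements)  # lowercase names, no brackets
        self.units = []      # finished translation units
        self.buffer = ""     # text of the unit being built
        self.open_ids = []   # placeholder ids of inline elements still open
        self.next_id = 1

    def flush(self):
        # A non-inline boundary ends the current unit, if it holds any text.
        if self.buffer.strip():
            self.units.append(self.buffer)
        self.buffer = ""
        self.open_ids = []
        self.next_id = 1

    def handle_starttag(self, tag, attrs):
        if tag in self.inline:
            # Inline element: replace the tag with a numbered placeholder.
            self.open_ids.append(self.next_id)
            self.buffer += "<%d>" % self.next_id
            self.next_id += 1
        else:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.inline and self.open_ids:
            self.buffer += "</%d>" % self.open_ids.pop()
        else:
            self.flush()

    def handle_data(self, data):
        self.buffer += data


def extract_units(html, inline_elements):
    parser = UnitExtractor(inline_elements)
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.units
```

Calling extract_units on the sample paragraph with ["strong"] as the inline list gives one unit with the <1>...</1> placeholder pair. Calling it with an empty list gives the three separate units.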

Parameters - Inline Codes Tab

Mark as inline codes the text parts matching this regular expression -- Set this option to apply the specified regular expression to the text of the extracted items. Any match is converted to an inline code. By default the option is not set, and the expression is:

((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])
|(\\a|\\b|\\f|\\n|\\r|\\t|\\v))

This matches the C-style printf variables (e.g. "%s", "%2.3f", "%04X", "%1$d", etc.) and the escape sequences: "\r\n", "\a", "\b", "\f", "\n", "\r", "\t", and "\v".
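
To see what the default expression catches, it can be tried with Python's re module, as in the sketch below. The expression is copied from above with its two lines joined. The function name find_inline_codes is an assumption for the example, and the filter's own regular expression engine may differ from Python's in other cases.

```python
import re

# The default expression from this page, with its two lines joined.
INLINE_CODE = re.compile(
    r"((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])"
    r"|(\\a|\\b|\\f|\\n|\\r|\\t|\\v))"
)


def find_inline_codes(text):
    """Return every part of the text that would become an inline code."""
    return [match.group(0) for match in INLINE_CODE.finditer(text)]


# A raw string, so \r and \n are a backslash followed by a letter,
# as they would appear in the text of an extracted item.
sample = r"%s wrote %04X bytes (%1$d of %2.3f)\r\n"
print(find_inline_codes(sample))
```

With this sample, the matches are %s, %04X, %1$d, %2.3f, and then \r and \n as two consecutive matches, because the expression lists each escape sequence separately.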

Edit Expression -- Click this button to edit the regular expression and its options.

See the Regular Expressions section for more information about the syntax and rules for building regular matching patterns.

Parameters - Output Tab

Force the non-breaking spaces to &nbsp; -- Set this option to output &nbsp; for all non-breaking space characters.

Unwrap and normalize white-spaces in translatable content -- Set this option to reduce, in translatable text content, all sequences of white-spaces (including line-breaks) to a single space character. If this option is not set, all white-spaces and line-breaks in the text are preserved.

Unwrap and normalize white-spaces inside inline tags -- Set this option to reduce, in the tags of all inline elements, all sequences of white-spaces (including line-breaks) to a single space character.
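
The two white-space options can be pictured with the following Python sketch. It is a minimal illustration under stated assumptions: the function names are hypothetical, the set of characters counted as white-space (space, tab, carriage return, line feed, form feed) is a guess that deliberately leaves non-breaking spaces alone, and the second function treats every tag in the fragment as an inline tag.

```python
import re

# Assumed white-space set: space, tab, CR, LF and form feed.
# U+00A0 (non-breaking space) is intentionally not included.
WHITESPACE_RUN = re.compile(r"[ \t\r\n\f]+")
TAG = re.compile(r"<[^>]+>")


def normalize_text(text):
    """Reduce each run of white-space in translatable text to one space."""
    return WHITESPACE_RUN.sub(" ", text)


def normalize_inline_tags(markup):
    """Reduce white-space runs inside tags only; text between tags is kept."""
    return TAG.sub(lambda m: WHITESPACE_RUN.sub(" ", m.group(0)), markup)
```

For example, normalize_text turns "a word\n   in  bold" into "a word in bold", while normalize_inline_tags unwraps a multi-line <a ...> tag but leaves the line-breaks in the surrounding text untouched.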