Okapi Components - HTML Filter
Overview
This filter is only at the ALPHA stage: use it for testing only, not for real projects. Use the HTML Filter of Rainbow v4 instead.
The HTML Filter is an Okapi component that implements the Okapi Filter Interface for HTML documents.
The following is an example of a simple HTML file.
<html>
<head>
  <meta http-equiv="Content-Language" content="en-us">
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
  <title>Okapi Components - HTML Filter</title>
</head>
<body>
  <p>This is a simple paragraph with some formatting like a word in <strong>bold</strong>.</p>
</body>
</html>
The properties for the HTML Filter are the following:
Property | This Filter |
---|---|
INPUTFILE | Yes |
INPUTSTRING | No |
BILINGUALINPUT | No |
TEXTBASED | Yes |
OUTPUTFILE | Yes |
OUTPUTSTRING | No |
ANCILLARYOUTPUT | No |
XMLOUTPUT | No |
RTFOUTPUT | Yes |
USEKEY | No |
ISINDEMOMODE | No |
The filter decides which encoding to use for the input file using the following logic:
If a <meta> element with its http-equiv attribute set to "content-type" is found, the filter checks for a charset declaration in the content attribute.
The encoding of the output file is the one specified by the user. If a charset declaration is found in the file, the encoding value of that declaration is changed in the output to reflect the output encoding.
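The charset lookup and the rewrite of the declaration described above can be sketched in Python. This is only an illustration, not the filter's actual code: the helper names `declared_charset` and `rewrite_charset` are hypothetical, and the simplified patterns ignore unusual attribute quoting.

```python
import re

# Simplified patterns (assumptions for this sketch, not the filter's own rules).
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
HTTP_EQUIV = re.compile(r"""http-equiv\s*=\s*["']?content-type["']?""", re.IGNORECASE)
CHARSET = re.compile(r"""(charset\s*=\s*)([\w.:-]+)""", re.IGNORECASE)


def declared_charset(html_text):
    """Return the charset declared in a content-type <meta> element, or None."""
    for tag in META_TAG.finditer(html_text):
        if HTTP_EQUIV.search(tag.group(0)):
            found = CHARSET.search(tag.group(0))
            if found:
                return found.group(2)
    return None


def rewrite_charset(html_text, output_encoding):
    """Make an existing charset declaration reflect the output encoding."""
    def fix_tag(tag):
        if not HTTP_EQUIV.search(tag.group(0)):
            return tag.group(0)  # not a content-type <meta>: leave untouched
        return CHARSET.sub(lambda m: m.group(1) + output_encoding,
                           tag.group(0), count=1)
    return META_TAG.sub(fix_tag, html_text)


sample = ('<meta http-equiv="Content-Language" content="en-us"> '
          '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">')
print(declared_charset(sample))
print(rewrite_charset(sample, "utf-8"))
```

A file without a charset declaration is returned unchanged by `rewrite_charset`, which matches the statement that only an existing declaration is updated.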
TODO: NOT SUPPORTED YET
The filter supports localization directives. They are special comments you can use to override the default behavior of the filter regarding the parts to extract. Such directives have little use in an HTML file as the format is already geared toward localization, so the processing of localization directives is not turned on by default. Set the option Use localization directives when they are present to turn it on.
The syntax and behavior of the directives are the same across all Okapi filters. See the Localization Directives pages for detailed information about what you can do with the mechanism.
TODO
Ignore the encoding declaration in the input file -- Set this option to ignore the encoding declaration inside the input file (in a <meta> element), and use either the user-specified encoding, or the encoding auto-detected using the Byte-Order-Mark.
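As a rough illustration of this option, the Python sketch below detects an encoding from a Byte-Order-Mark and then chooses the input encoding. The function names and the precedence between the Byte-Order-Mark, the declaration, and the user-specified encoding are assumptions made for the example; the text above does not specify that order.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def encoding_from_bom(raw):
    """Return the encoding implied by a leading Byte-Order-Mark, or None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


def choose_input_encoding(raw, user_encoding, declared, ignore_declaration):
    """Pick the input encoding (hypothetical precedence, see lead-in).

    `declared` is the charset found in the <meta> element, or None.
    """
    bom_encoding = encoding_from_bom(raw)
    if ignore_declaration:
        # The declaration is skipped: only the BOM or the user's choice is used.
        return bom_encoding or user_encoding
    return declared or bom_encoding or user_encoding


print(choose_input_encoding(b"<html>", "windows-1252", "utf-8", True))
```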
Use localization directives when they are present -- Set this option to enable the filter to recognize localization directives. If this option is not set, any localization directive in the input file will be ignored. NOT IMPLEMENTED YET
Extract items outside the scope of localization directives -- Set this option to extract any translatable item that is not within the scope of a localization directive. Choosing whether or not to extract items outside the directives lets you mark up fewer parts of the source document. This option is enabled only when the Use localization directives when they are present option is set. NOT IMPLEMENTED YET
See the Localization Directives section for more details on how the filter deals with directives. Note also that if your HTML documents are XHTML, the preferred way to use localization directives is to use the Internationalization Tag Set (ITS) markup and process the files with the XML Filter.
Use Do-Not-Localize list if a DNL file is present -- Set this option to enable the filter to use any Do-Not-Localize list file found along with a given input file. The DNL file has the same path and name as the input file, with an additional .dnl extension. It contains a list of entries that should not be extracted. Each entry is made of the resname, restype and text of a filter item. Use the DNL List Editing utility to create and maintain DNL files.
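The naming rule for the DNL file can be sketched as follows. The helper names are hypothetical, and the sketch only locates the file: the format of the entries inside it is not described here, so no parsing is attempted.

```python
from pathlib import Path


def dnl_path_for(input_path):
    """Return the input path with an additional '.dnl' extension appended."""
    path = Path(input_path)
    return path.with_name(path.name + ".dnl")


def find_dnl_file(input_path, use_dnl=True):
    """Return the DNL file path if the option is set and the file exists."""
    candidate = dnl_path_for(input_path)
    return candidate if use_dnl and candidate.is_file() else None


print(dnl_path_for("site/index.html").name)
```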
Inline elements -- List all the elements that should be treated as part of the text (for example <b>, <strong>, etc.). Each element must be listed by its name only (no brackets) and in lowercase. Separate the elements with spaces or line-breaks.
Any element not listed here is treated as non-inline and breaks the text into several translation units. For example, given the following HTML code:
<p>This is <strong>an important aspect</strong> of life.</p>
If <strong> is listed as inline, you will get one translation unit:
This is <1>an important aspect</1> of life.
If <strong> is not listed as inline, you will get three translation units:
This is
an important aspect
of life.
Use Defaults -- Click this button to use the default list of inline elements.
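The effect of the inline elements list on segmentation can be reproduced with a small Python sketch built on the standard `html.parser` module. It is not the filter's implementation: the `UnitSplitter` class is hypothetical, and it only mimics the numbered placeholder notation used in the example above.

```python
from html.parser import HTMLParser


class UnitSplitter(HTMLParser):
    """Collect translation units; inline elements become numbered codes."""

    def __init__(self, inline_elements):
        super().__init__(convert_charrefs=True)
        self.inline = set(inline_elements)
        self.units = []
        self.current = ""
        self.open_codes = []  # ids of inline elements still open
        self.next_id = 1

    def flush(self):
        """Close the current unit (if it has text) and reset the counters."""
        if self.current.strip():
            self.units.append(self.current)
        self.current = ""
        self.open_codes = []
        self.next_id = 1

    def handle_starttag(self, tag, attrs):
        if tag in self.inline:
            self.open_codes.append(self.next_id)
            self.current += f"<{self.next_id}>"
            self.next_id += 1
        else:
            self.flush()  # a non-inline element breaks the text

    def handle_endtag(self, tag):
        if tag in self.inline and self.open_codes:
            self.current += f"</{self.open_codes.pop()}>"
        else:
            self.flush()

    def handle_data(self, data):
        self.current += data


def split_units(html_text, inline_elements):
    splitter = UnitSplitter(inline_elements)
    splitter.feed(html_text)
    splitter.close()
    splitter.flush()
    return splitter.units


html = "<p>This is <strong>an important aspect</strong> of life.</p>"
print(split_units(html, ["strong"]))
print(split_units(html, []))
```

With `strong` in the list the sketch yields one unit containing the `<1>…</1>` pair; with an empty list it yields three units, keeping the spaces that surrounded the element.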
Mark as inline codes the text parts matching this regular expression -- Set this option to apply the specified regular expression to the text of the extracted items. Any match will be converted to an inline code. By default the option is not set, and the expression is:
((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])|(\\a|\\b|\\f|\\n|\\r|\\t|\\v))
This matches the C-style printf variables (e.g. "%s", "%2.3f", "%04X", "%1$d", etc.) and the escaped sequences: "\r\n", "\a", "\b", "\f", "\n", "\r", "\t", and "\v".
Edit Expression -- Click this button to edit the regular expression and its options.
See the Regular Expressions section for more information about the syntax and rules for building regular matching patterns.
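To see what the default expression catches, the Python sketch below applies it to a sample string and replaces each match with a numbered placeholder. The pattern is written without whitespace around the `|`; the `mark_inline_codes` helper and the `<n/>` placeholder notation are assumptions made for the example, not the filter's actual output format.

```python
import re

# The default expression, split over two raw strings for readability.
DEFAULT_PATTERN = re.compile(
    r"((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])"
    r"|(\\a|\\b|\\f|\\n|\\r|\\t|\\v))"
)


def mark_inline_codes(text, pattern=DEFAULT_PATTERN):
    """Replace every match with a numbered placeholder; return text and codes."""
    codes = []

    def replace(match):
        codes.append(match.group(0))
        return f"<{len(codes)}/>"

    return pattern.sub(replace, text), codes


# A raw string, so \n below is a backslash followed by the letter n.
marked, codes = mark_inline_codes(r"Copied %d files to %s\n")
print(marked)
print(codes)
```

Note that the sequence "\r\n" has no alternative of its own in the expression, so in this sketch it is matched as two consecutive codes.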
Force the non-breaking spaces to &nbsp; -- Set this option to output &nbsp; for all non-breaking space characters.
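A minimal sketch of this option, assuming the forced form is the `&nbsp;` entity and that only the U+00A0 character is affected (the function name is hypothetical):

```python
def force_nbsp_entity(text):
    """Write every non-breaking space (U+00A0) as the &nbsp; entity."""
    return text.replace("\u00a0", "&nbsp;")


print(force_nbsp_entity("10\u00a0km"))
```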
Unwrap and normalize white-spaces in translatable content -- Set this option to reduce, in translatable text content, all sequences of white-spaces (including line-breaks) to a single space character. If this option is not set, all white-spaces and line-breaks in the text are preserved.
Unwrap and normalize white-spaces inside inline tags -- Set this option to reduce, inside the tags of all inline elements, all sequences of white-spaces (including line-breaks) to a single space character.
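The two unwrap options can be illustrated with the Python sketch below. It assumes that white-space means space, tab, carriage return, line feed and form feed, so that non-breaking spaces are left alone; the function names and the simplified tag pattern are hypothetical.

```python
import re

# White-space run as assumed for this sketch (non-breaking spaces excluded).
WHITESPACE_RUN = re.compile(r"[ \t\r\n\f]+")


def normalize_text(text):
    """Reduce each white-space run in translatable text to one space."""
    return WHITESPACE_RUN.sub(" ", text)


def normalize_inline_tags(markup, inline_elements):
    """Reduce white-space runs inside the tags of the listed inline elements."""
    if not inline_elements:
        return markup
    names = "|".join(re.escape(name) for name in inline_elements)
    tag = re.compile(rf"</?(?:{names})\b[^>]*>", re.IGNORECASE)
    return tag.sub(lambda m: WHITESPACE_RUN.sub(" ", m.group(0)), markup)


print(normalize_text("This is\n   a  test."))
print(normalize_inline_tags('<a\n   href="x.html"\n   title="t">link</a>', ["a"]))
```

Tags of elements that are not in the inline list keep their original line-breaks in this sketch, since the second option only concerns inline elements.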