Okapi Components - Encoding Conversion Utility

Overview

- The utility set identifier for this utility is: oku_set01
- The utility identifier is: encodingconversion

The Encoding Conversion utility allows you to convert the input file from one encoding to another.

Any text-based file can be processed with this utility. It is designed to work only with ASCII-based encodings (i.e. encodings whose first 128 code points are the same as in ASCII, ISO-646).
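One way to check whether an encoding is ASCII-based in the sense described above is to decode the 128 ASCII byte values with it and compare the result with plain ASCII. The following Python sketch is only an illustration: the function name is hypothetical and the utility itself may not perform such a check.

```python
def is_ascii_based(encoding: str) -> bool:
    """Return True if bytes 0x00-0x7F decode to the same characters as in ASCII."""
    ascii_bytes = bytes(range(128))
    try:
        return ascii_bytes.decode(encoding) == ascii_bytes.decode("ascii")
    except UnicodeDecodeError:
        # The byte sequence is not even valid in this encoding.
        return False


if __name__ == "__main__":
    for name in ("utf-8", "iso-8859-1", "cp500", "utf-16"):
        print(name, is_ascii_based(name))
```

Encodings such as EBCDIC code pages (cp500) or UTF-16 fail this check and are outside the scope of the utility.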

The utility does not use any filter for processing the files. The filter association is used only to help identify XML or HTML files, as described below.

XML Documents

The utility recognizes XML documents...

if they have a .xml extension,
or if they have an XML declaration (i.e. <?xml...?>) in the first 100 characters of the file,
or if they are set to use the okf_xml filter.

Note that the search is done by regular expressions without knowledge of XML comments. Therefore if a commented declaration exists before the real declaration, that commented declaration will be used, not the real one.

If the file is identified as an XML document:

Its input encoding will be inferred from the standard XML encoding identification mechanism, except if the option Ignore auto-detection of the input encoding is set. In that case the input encoding specified by the user, not the one found in the file, will be used.
Its encoding declaration will be updated to reflect the new encoding (or added if the input file did not have an encoding declaration).
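The declaration handling described in this section can be sketched in Python as follows. The regular expressions and function names are assumptions made for illustration (the utility's actual expressions are not documented here), and adding a complete declaration when the file has none is also an assumption. The sketch looks only at the first 100 characters and, like the utility, has no knowledge of comments.

```python
import re

# Hypothetical patterns; the real expressions used by the utility may differ.
XML_DECL = re.compile(r"<\?xml\s[^>]*?\?>", re.DOTALL)
ENCODING = re.compile(r"""encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1""")
VERSION = re.compile(r"""version\s*=\s*(["']).*?\1""")


def find_declared_encoding(text, window=100):
    """Return the encoding named in the first XML declaration of the window."""
    decl = XML_DECL.search(text[:window])
    if decl is None:
        return None
    enc = ENCODING.search(decl.group(0))
    return enc.group(2) if enc else None


def update_declaration(text, new_encoding, window=100):
    """Update the encoding declaration, or add one if it is missing."""
    decl = XML_DECL.search(text[:window])
    if decl is None:
        # Assumption: a full declaration is added in front of the document.
        return f'<?xml version="1.0" encoding="{new_encoding}"?>\n' + text
    old = decl.group(0)
    if ENCODING.search(old):
        new = ENCODING.sub(
            lambda m: f"encoding={m.group(1)}{new_encoding}{m.group(1)}", old, count=1
        )
    else:
        version = VERSION.search(old)
        if version is None:
            return text  # malformed declaration: leave it untouched
        # The encoding pseudo-attribute must follow the version.
        new = old[: version.end()] + f' encoding="{new_encoding}"' + old[version.end():]
    return text[: decl.start()] + new + text[decl.end():]
```

Because the search ignores comments, a call such as find_declared_encoding('<!-- <?xml version="1.0" encoding="old"?> -->...') returns "old", which is the pitfall mentioned in the note above.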

HTML Documents

The utility recognizes HTML documents...

if they have a .htm or .html extension,
or if they are set to use the okf_html filter.

If the file is identified as an HTML document:

If a charset declaration exists within the first 500 characters of the input file, it will be used as input encoding, except if the option Ignore auto-detection of the input encoding is set. In that case the input encoding specified by the user, not the one found in the file, will be used.
If a charset declaration exists within the first 500 characters of the input file, it will be updated to reflect the new encoding. Otherwise, if a <head> or an <html> element exists in that area, a charset declaration will be inserted.

Note that the search for the charset declaration is done by regular expressions without knowledge of HTML comments. Therefore if a commented declaration exists before the real declaration, that commented declaration will be used, not the real one.
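The HTML behavior described in this section can be approximated by the following Python sketch. The regular expressions, the function names, the exact form of the inserted meta element, and its position (directly after the opening tag) are assumptions for illustration, not the utility's documented behavior.

```python
import re

# Hypothetical pattern: any "charset=..." found in the search window.
CHARSET = re.compile(r"""(charset\s*=\s*["']?)([A-Za-z0-9._:-]+)""", re.IGNORECASE)


def detect_charset(text, window=500):
    """Return the first charset value found in the window, or None."""
    m = CHARSET.search(text[:window])
    return m.group(2) if m else None


def set_charset(text, new_encoding, window=500):
    """Update the charset declaration, or insert one after <head> or <html>."""
    head = text[:window]
    m = CHARSET.search(head)
    if m:
        # Replace only the charset value, keep the rest of the declaration.
        return text[: m.start(2)] + new_encoding + text[m.end(2):]
    for tag in ("head", "html"):
        t = re.search(rf"<{tag}\b[^>]*>", head, re.IGNORECASE)
        if t:
            meta = (
                '<meta http-equiv="Content-Type" '
                f'content="text/html; charset={new_encoding}">'
            )
            return text[: t.end()] + meta + text[t.end():]
    return text  # no declaration and no anchor element: nothing is changed
```

As with the utility, the search does not skip HTML comments, so a commented-out declaration that appears first is the one that gets read and updated.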

Other Files

All other files are treated as plain-text files. Note that you may have to manually update the encoding declaration inside some files (RC, PO, etc.) as the utility will not change it automatically.

Common Parameters

The common parameters are the options specified from the application calling the utility rather than in the options dialog box of the utility itself. For this utility the common parameters you need to specify are the following:

Files of the first input list	- Needed (the files to convert)
Root for the first input list	- Not Needed
Files of the second input list	- Not Needed
Root for the second input list	- Not Needed
Files of the third input list	- Not Needed
Root for the third input list	- Not Needed
Input language	- Not Needed
Output language	- Not Needed
Input default encoding	- Needed
Output default encoding	- Needed
Location and names for output files	- Needed

Options - Option Tab

Escape all extended characters -- Set this option to use an escape sequence for each extended character. If this option is not set, only characters not supported by the output encoding will be escaped.
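The difference between the two modes can be illustrated with a short Python sketch. It assumes that "extended character" means any character above U+007F and uses the hexadecimal NCR notation purely as an example; the function names are hypothetical.

```python
def is_supported(ch, encoding):
    """Return True if the character can be represented in the encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def convert(text, encoding, escape_all=False):
    """Encode text, escaping extended characters according to the option."""
    out = []
    for ch in text:
        extended = ord(ch) > 0x7F  # assumption: extended means non-ASCII
        if extended and (escape_all or not is_supported(ch, encoding)):
            out.append(f"&#x{ord(ch):x};")  # notation chosen for illustration
        else:
            out.append(ch)
    return "".join(out).encode(encoding)
```

For example, with ISO-8859-1 as output encoding, "à" is written as a raw byte when the option is not set, while "Ł" is escaped in both cases because ISO-8859-1 cannot represent it.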

Loose the character -- Select this option to leave the characters to escape as they are, without escaping them. By default, unsupported characters will be lost and replaced by a question mark ("?") or by a close ASCII character (for example: "a" for "à", "á", "â", "ã", etc.)
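The kind of loss described here can be imitated in Python. This is only an approximation under stated assumptions: the utility's actual mapping to a "close ASCII character" is not documented here, so the sketch substitutes Unicode decomposition (NFKD) with the combining marks removed, and falls back to a question mark.

```python
import unicodedata


def lossy(ch, encoding):
    """Return ch if the encoding supports it, else a close ASCII character or '?'."""
    try:
        ch.encode(encoding)
        return ch
    except UnicodeEncodeError:
        pass
    # Assumption: the base letter of the decomposed form is the close character.
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if stripped and stripped.isascii():
        return stripped
    return "?"
```

With an ASCII output encoding this gives "a" for each of "à", "á", "â" and "ã", and "?" for characters without a decomposition, such as "Ł".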

Java-style escape sequence -- Select this option to write the escaped characters using the Java-style notation. For example, the character "à" would be written "\u00e0".

Numeric character reference in hexadecimal -- Select this option to write the escaped characters using the hexadecimal NCR notation. For example, the character "à" would be written "&#xe0;".

Numeric character reference in decimal -- Select this option to write the escaped characters using the decimal NCR notation. For example, the character "à" would be written "&#224;".

Character entity reference -- Select this option to write the escaped characters using character entity references when available, or the hexadecimal NCR notation when no character entity is defined for the given un-supported character. For example, the character "à" would be written "&agrave;" and the character "Ł" would be written "&#x141;".
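The four built-in notations above can be compared side by side with a small Python sketch. The function names are hypothetical, and the digit padding and letter case of the output are assumptions; the utility may format the values slightly differently. The entity lookup uses the HTML 4 entity table from the Python standard library.

```python
from html.entities import codepoint2name


def java_style(ch):
    """Java-style escape, e.g. \\u00e0 (16-bit code points only)."""
    return f"\\u{ord(ch):04x}"


def ncr_hex(ch):
    """Hexadecimal numeric character reference."""
    return f"&#x{ord(ch):x};"


def ncr_dec(ch):
    """Decimal numeric character reference."""
    return f"&#{ord(ch)};"


def entity(ch):
    """Character entity reference, falling back to a hexadecimal NCR."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else ncr_hex(ch)


if __name__ == "__main__":
    for ch in "àŁ":
        print(ch, java_style(ch), ncr_hex(ch), ncr_dec(ch), entity(ch))
```

The fallback in entity() mirrors the rule that a hexadecimal NCR is used when no character entity is defined for the character.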

User-defined escape sequence -- Select this option to write the escaped characters using the pattern of your choice. If this option is selected, you must enter in the following edit box the formatting pattern to use.

The pattern must be a valid formatting pattern as defined for the String class in C#. If the pattern is invalid you will get an error such as "Input string was not in a correct format" when processing the file. The pattern must be set for an integer parameter (the Unicode 16-bit code-point of the character).

Here are a few examples of valid patterns:

Pattern	Result for "à"
`{0}`	224
`{0:000000}`	000224
`\u{0:X4}`	\u00E0
`&#x{0:X4};`	&#x00E0;
`&#x{0:x4};`	&#x00e0;
`unsupported->{0:X4}<`	`unsupported->00E0<`

The general syntax for the pattern is: "{0[,alignment][:formatString]}". The simplest pattern is "{0}". To specify a literal opening brace ("{") use "{{", for a closing brace ("}") use "}}". For example, to get the output "${224}" (for an unsupported 'à') specify the pattern "${{{0}}}".

Note that if some characters of the pattern must be escaped in the output file, you must specify their escaped form in the pattern. For example: if you want the text ">00E0<" (for an unsupported 'à') to appear in an HTML or XML file, specify the pattern "&gt;{0:X4}&lt;" and not ">{0:X4}<".

In addition to the standard C# patterns, the utility also supports special patterns:

Pattern	Result for "à"	Description
`@8@`	340	Octal value
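The results listed in the pattern tables above can be cross-checked with Python. Note that Python's format specifiers are not the C# ones ("{0:X4}" in C# corresponds to "{0:04X}" in Python), so the sketch below only reproduces the expected values; it does not parse the utility's patterns. The function name is hypothetical.

```python
def expected_results(ch):
    """Map each documented pattern to the value it should produce for ch."""
    cp = ord(ch)  # the Unicode code point, 224 for "à"
    return {
        "{0}": str(cp),
        "{0:000000}": f"{cp:06d}",
        r"\u{0:X4}": f"\\u{cp:04X}",
        "&#x{0:X4};": f"&#x{cp:04X};",
        "&#x{0:x4};": f"&#x{cp:04x};",
        "@8@": f"{cp:o}",  # octal value
    }


if __name__ == "__main__":
    for pattern, result in expected_results("à").items():
        print(pattern, "->", result)
    # Brace doubling happens to work the same way in Python's str.format.
    print("${{{0}}}".format(ord("à")))
```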

Use byte values -- Set this option to use the byte values of the encoded character, rather than the Unicode value, as the value to escape in the user-defined sequence. For example, if the character to escape is 'á', the user-defined format "[{0}]", and the output encoding is UTF-8:

If the option is set, the result is: "[195][161]"
If the option is not set, the result is: "[225]"
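The two results above can be reproduced with a few lines of Python. The function name and parameters are hypothetical; the pattern "[{0}]" happens to be valid for both C# and Python formatting.

```python
def escape_user_defined(ch, pattern, encoding, use_bytes):
    """Apply the pattern to each encoded byte, or to the Unicode code point."""
    if use_bytes:
        # One escape sequence per byte of the character in the output encoding.
        return "".join(pattern.format(b) for b in ch.encode(encoding))
    return pattern.format(ord(ch))


if __name__ == "__main__":
    print(escape_user_defined("á", "[{0}]", "utf-8", use_bytes=True))
    print(escape_user_defined("á", "[{0}]", "utf-8", use_bytes=False))
```

In UTF-8, "á" (U+00E1, decimal 225) is encoded as the two bytes 0xC3 and 0xA1, which are 195 and 161 in decimal.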

For HTML and XML files, always escape to hexadecimal NCRs -- Set this option to always use the hexadecimal NCR notation for escaping characters in HTML and XML files. If this option is not set, the default notation you have selected will be used.

Warn when a character is not supported by the output encoding -- Set this option if you want to get a warning in the Log when a file contains characters that are not supported by the output encoding. The message will also display the number of characters affected. Note that, if this option is set, you will get the warning whether or not you have selected an escape notation.

Ignore auto-detection of the input encoding -- Set this option if you want to force the use of the input encoding you specify instead of allowing the encoding to be automatically detected when possible.

Make the output ASCII-byte safe -- Set this option to detect any multi-byte extended character that has one or more trailing bytes looking like an ASCII single byte. If this option is set and an escape notation has been selected, the character will be output in the escape notation instead of its raw form. This problem exists only in a few encodings (e.g. Big5) and the option should be used only in rare cases. This type of output could be used, for example, as the input for a tool that is not enabled for multi-byte encodings and has problems with 0x5C bytes seen as '\' characters. However, keep in mind that this cannot replace having the tool correctly enabled for multi-byte encodings.
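The detection described above can be sketched in Python. The sketch assumes that "looking like an ASCII single byte" means a trailing byte below 0x80, uses the hexadecimal NCR notation as the selected escape, and is meant for stateless encodings such as Big5; the function names are hypothetical.

```python
def has_ascii_trailing_byte(ch, encoding):
    """Return True if a multi-byte character has a trailing byte below 0x80."""
    raw = ch.encode(encoding)
    return len(raw) > 1 and any(b < 0x80 for b in raw[1:])


def make_byte_safe(text, encoding):
    """Encode text, escaping characters whose trailing bytes look like ASCII."""
    out = bytearray()
    for ch in text:
        if has_ascii_trailing_byte(ch, encoding):
            out += f"&#x{ord(ch):x};".encode("ascii")  # notation for illustration
        else:
            out += ch.encode(encoding)
    return bytes(out)
```

In Big5, for instance, the character "功" is encoded as the bytes 0xA5 0x5C, so its second byte is indistinguishable from a backslash for a tool that reads the file byte by byte.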

Un-escape input text -- Set this option to un-escape escaped forms of characters (such as NCRs) that are in the input file. If this option is set, the escaped characters will be converted to raw Unicode values, and these values will then be converted to the output encoding. If this option is not set, the escaped forms will be left as they are.
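As an illustration of the un-escaping step, the following Python sketch converts decimal and hexadecimal NCRs to raw characters before re-encoding. Which escaped forms the utility actually recognizes is not listed here, so handling only NCRs is an assumption, as are the names used.

```python
import re

# Matches &#xHHHH; (hexadecimal) or &#DDDD; (decimal) references.
NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def unescape_ncrs(text):
    """Replace numeric character references with the characters they denote."""

    def replace(match):
        cp = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        # Leave out-of-range values and surrogates as they are.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)
        return chr(cp)

    return NCR.sub(replace, text)


if __name__ == "__main__":
    raw = unescape_ncrs("caf&#xe9; &#224;")
    print(raw.encode("utf-8"))
```

Other escaped forms, such as the entity reference "&amp;" in this sketch, are left untouched.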