The segmentation engine provided with the Okapi Framework uses regular expressions to define where breaks should occur and to specify exceptions. You can find more information about pattern matching in the Regular Expressions section of the help.
Its mechanism is based on the Segmentation Rules eXchange format (SRX). See the SRX Specification pages for more information.
NOTE: The SRX specifications are currently undergoing some clarifications. The current Okapi implementation is therefore not stable.
There are two types of rules: Breaks and Exceptions. Both define two regular expression patterns that correspond to the pattern of text just before and after the point of the text where the break should be made, or avoided.
Type: Break Before: [\.?!]+ After: \s
The rule displayed above indicates that a break should be made at any point where the text just before is one or more periods, question marks, or exclamation points, and the text just after is a white-space. The break occurs just before the white-space:
This is a test. A simple one.
The order in which the break rules are defined is important.
Leading and trailing white-spaces are not included in the segments, but you should take them into account in your rules. For example, the text above with the rule above will result in the following two segments:
[This is a test.] [A simple one.]
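The break rule above can be tried out with a few lines of Python. This is only a minimal sketch of the idea, not the Okapi implementation: the function name split_on_breaks is hypothetical, and the "before" and "after" patterns are simply combined into one regular expression with a lookahead.

```python
import re

def split_on_breaks(text, before, after):
    """Break `text` wherever `before` matches just left of a position
    and `after` matches just right of it (illustrative sketch only)."""
    # The lookahead checks the "after" pattern without consuming it,
    # so the break falls right at the end of the "before" match.
    rule = re.compile(f"(?:{before})(?={after})")
    segments, start = [], 0
    for match in rule.finditer(text):
        # Leading and trailing white-spaces are not part of the segment.
        segments.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(split_on_breaks("This is a test. A simple one.", r"[\.?!]+", r"\s"))
# prints ['This is a test.', 'A simple one.']
```

Note that the final period does not produce a break, because no white-space follows it; the remaining text simply becomes the last segment.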
An exception rule defines a text pattern that occurs at the same place as a break but prevents the break. Exceptions are checked when a break has been found. If at least one of the exceptions matches the text at the location of the break, the break is not made, and the engine looks for the next break. Consider the following text:
Hello Mr. Jones. I hope you are well today.
With the break rule we defined earlier, the segments would be the following:
[Hello Mr.] [Jones.] [I hope you are well today.]
To avoid breaking after "Mr." we can define an exception rule:
Type: Exception Before: [Mm][Rr]\. After: \s
This means the exception rule has a match at the same position as one of the breaks (each match position is marked with | below):
Breaks: Hello Mr.| Jones.| I hope you are well today.
Exceptions: Hello Mr.| Jones. I hope you are well today.
And therefore we get the following segments:
[Hello Mr. Jones.] [I hope you are well today.]
Note that the exception pattern is made here to work with "Mr." but also with "mr.", "MR.", and "mR.".
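The interaction between break rules and exception rules described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual engine: the function name segment_with_exceptions is hypothetical, rules are given as (before, after) pattern pairs, and a break is dropped whenever any exception matches at the same position.

```python
import re

def match_positions(text, rules):
    """Return the positions where any (before, after) rule matches."""
    found = set()
    for before, after in rules:
        for m in re.finditer(f"(?:{before})(?={after})", text):
            found.add(m.end())
    return found

def segment_with_exceptions(text, breaks, exceptions):
    # A break is kept only if no exception matches at the same position.
    points = sorted(match_positions(text, breaks) - match_positions(text, exceptions))
    segments, start = [], 0
    for point in points:
        segments.append(text[start:point].strip())
        start = point
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

text = "Hello Mr. Jones. I hope you are well today."
breaks = [(r"[\.?!]+", r"\s")]
exceptions = [(r"[Mm][Rr]\.", r"\s")]
print(segment_with_exceptions(text, breaks, []))
# prints ['Hello Mr.', 'Jones.', 'I hope you are well today.']
print(segment_with_exceptions(text, breaks, exceptions))
# prints ['Hello Mr. Jones.', 'I hope you are well today.']
```

The sketch ignores the order of the rules, which the real engine takes into account.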
Some text may contain inline codes. For example, an HTML paragraph may include formatting such as <b> tags, software strings may have variable placeholders such as %s, etc. In all cases, the inline codes can occur anywhere within the text, including exactly at the break points of the segmentation rules.
The Okapi segmenter uses the FilterItem object to store the text to process. In that format, all inline codes are abstracted into three basic codes: opening, closing and isolated.
There are two things to take into consideration with regard to inline codes:
If an inline code occurs between the pattern before the break and the pattern after the break, should it be included in the left segment or in the right segment (for left-to-right scripts)?
This is a <b>test.</b> A simple one.
The segmenter uses three options for deciding where inline codes go. You can change the behavior as needed. By default:
So, with the default settings, the text is broken down as follows:
[This is a <b>test.</b>] [A simple one.]
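The placement of inline codes found at a break can be pictured with the sketch below. It is an assumption-laden illustration rather than the segmenter's real logic: the names code_kind and place_codes are hypothetical, codes are written as simple tags, and the set passed in the usage line is chosen only to reproduce the example above.

```python
import re

CODE = re.compile(r"<(/?)(\w+)(/?)>")

def code_kind(code):
    """Classify a tag as 'opening', 'closing', or 'isolated'."""
    m = CODE.fullmatch(code)
    if m.group(3):
        return "isolated"
    return "closing" if m.group(1) else "opening"

def place_codes(left, codes, right, include_left):
    """Attach the codes found at a break to the left or right segment.

    `include_left` is the set of code kinds kept in the left segment.
    Codes are moved left in order until one is not of an included kind
    (an assumption made here to preserve the original code order).
    """
    split = 0
    while split < len(codes) and code_kind(codes[split]) in include_left:
        split += 1
    return left + "".join(codes[:split]), "".join(codes[split:]) + right

print(place_codes("This is a <b>test.", ["</b>"], " A simple one.", {"closing"}))
# prints ('This is a <b>test.</b>', ' A simple one.')
```

With an empty set instead of {"closing"}, the closing code would start the right segment.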
In the Segmentation Rules Editor you can change the default settings in the Groups And Options dialog box. To test your rules with inline codes, use the generic tags <x> and </x> for opening and closing codes, and <x/> for isolated codes.
When breaking the text into different segments, some opening or closing codes may become orphans, losing their corresponding closing or opening code.
If you choose the option Add codes missing after segmentation in the Groups And Options dialog box, the segmenter will automatically add closing or opening codes at the end or the beginning of the segments as needed. Otherwise, any opening or closing code that becomes an orphan because of the segmentation is changed to an isolated code. The following example shows these different modifications:
Original text: <i>This is an <b>example.</b> A simple one.</i>
With added codes:
1=[<i>This is an <b>example.</b></i>]
2=[<i>A simple one.</i>]
Without modifications (isolated codes):
1=[<i>This is an <b>example.</b>]
2=[A simple one.</i>]
Note that adding codes may not always be the best choice: when the segments are put back together, any added code remains. For instance, in the case shown above, once the two segments are put back together the text would look like:
<i>This is an <b>example.</b></i> <i>A simple one.</i>
This may or may not be acceptable. It is up to the users to decide which is the best solution for them.
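The effect of adding the missing codes can be reproduced with a small sketch. It only illustrates the principle and assumes codes written as simple paired tags; the function name add_missing_codes is hypothetical and is not part of the Okapi API.

```python
import re

CODE = re.compile(r"<(/?)(\w+)>")

def add_missing_codes(segment):
    """Add the opening or closing codes a segment lost when it was split."""
    open_stack = []    # opening codes not yet closed within this segment
    missing_open = []  # closing codes whose opening code is in another segment
    for m in CODE.finditer(segment):
        is_closing, name = m.group(1) == "/", m.group(2)
        if not is_closing:
            open_stack.append(name)
        elif open_stack and open_stack[-1] == name:
            open_stack.pop()
        else:
            missing_open.append(name)
    # Outermost codes come first in the prefix and last in the suffix.
    prefix = "".join(f"<{name}>" for name in reversed(missing_open))
    suffix = "".join(f"</{name}>" for name in reversed(open_stack))
    return prefix + segment + suffix

segments = ["<i>This is an <b>example.</b>", "A simple one.</i>"]
fixed = [add_missing_codes(s) for s in segments]
print(fixed)
# prints ['<i>This is an <b>example.</b></i>', '<i>A simple one.</i>']
print(" ".join(fixed))
# prints <i>This is an <b>example.</b></i> <i>A simple one.</i>
```

The last line shows why the added codes remain visible once the segments are joined again.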
The Segmentation Rules Editor allows you to add, remove, and edit rules in a given SRX file. It provides you with an interface to test the rules one by one or together, and see the result of the segmentation process on a sample text you can modify.
Rules for the following language group -- Select the group of rules to modify. To add and remove groups, click the Groups And Options button.
Language -- Click this button to set the current group of rules based on a language code.
Groups And Options -- Click this button to open the Groups And Options dialog box.
Add -- Click this button to add a new rule to the group.
Edit -- Click this button to modify the current rule.
Remove -- Click this button to remove the current rule from the list.
Test -- Click this button to apply all the rules that are checked to the sample text. The results of the test are displayed in the Result box at the bottom of the window. The segments are displayed according to the format that is currently selected. You do not need to click this button if the option Test automatically after a modification is set.
To test your rules with inline codes, use the generic tags <x> and </x> for opening and closing codes, and <x/> for isolated codes.
Move Up -- Click this button to move the current rule upward.
Move Down -- Click this button to move the current rule downward.
You can display the results in one of the following formats:
N is the index of the inline code.
Test automatically after a modification -- Set this option to automatically apply the rules to the sample text. When this option is not set you can use the Test button to do this manually.
Regular Expression Help -- Click this button to get detailed help on the syntax and the usage of regular expressions. The button opens the Regular Expression help section.
Save As -- Use this command to save the current set of rules into a new file. The new file becomes the current file.
Load -- Use this command to load an existing SRX file in the editor.
This dialog box allows you to specify the general segmentation options, as well as to add or remove groups and language mappings.
Segment sub-flow content -- Set this option to apply the segmentation rules to the content of sub-flow entries, for example the value of the ALT attribute in an HTML file.
Add codes missing after segmentation -- Set this option to add the missing closing or opening codes to segments when a break separates paired codes.
Cascade group rules -- TODO
Include opening inline codes -- Set this option to include the opening inline codes in the left segment (in left-to-right scripts) when the codes are at the location of a break.
Include closing inline codes -- Set this option to include the closing inline codes in the left segment (in left-to-right scripts) when the codes are at the location of a break.
Include isolated inline codes -- Set this option to include the isolated inline codes in the left segment (in left-to-right scripts) when the codes are at the location of a break.
Note: See the Inline Codes section for more information on how to handle inline codes with segmentation.
Validate -- Click this button to check the rules and the mappings.
The language maps allow you to associate a language identifier pattern with a group of rules.
When using the SRX rules, the segmentation engine asks for a language identifier (for example: fr-CA). To know what group(s) of rules should be used with that language, the engine looks up the list of language
.*" which will match any language identifier.
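The lookup from a language identifier to one or more groups of rules can be sketched as follows. This is a hedged illustration, not the engine's code: the function name groups_for and the group names are hypothetical, patterns are assumed to match the whole identifier, and the cascade flag (keep looking after the first match or stop) is an assumption based on the cascade setting of the SRX format rather than on this documentation.

```python
import re

def groups_for(language, language_maps, cascade=True):
    """Return the names of the rule groups mapped to `language`.

    `language_maps` is an ordered list of (pattern, group name) pairs.
    """
    matched = []
    for pattern, group in language_maps:
        if re.fullmatch(pattern, language):
            matched.append(group)
            if not cascade:
                break  # assumption: without cascading, the first match wins
    return matched

# Hypothetical mappings; ".*" matches any language identifier.
language_maps = [("fr.*", "French"), (".*", "Default")]
print(groups_for("fr-CA", language_maps))
# prints ['French', 'Default']
print(groups_for("fr-CA", language_maps, cascade=False))
# prints ['French']
print(groups_for("de", language_maps))
# prints ['Default']
```

Because the list is read from top to bottom, the order of the mappings matters, which is why the dialog box offers the Move Up and Move Down buttons.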
Add -- Click this button to add a new mapping to the list.
Edit -- Click this button to change the pattern or the mapped group for the mapping currently selected.
Remove -- Click this button to remove the mapping currently selected.
Move Up -- Click this button to move the mapping currently selected up in the list.
Move Down -- Click this button to move the mapping currently selected down in the list.
Segmentation rules are organized into groups.
Add -- Click this button to add a new group of rules, without any rules.
Clone -- Click this button to add a new group of rules that is a copy of the group currently selected.
Rename -- Click this button to rename the group currently selected.
Remove -- Click this button to remove the group currently selected. Note that you cannot delete a group if it is the only one in the list.
Special thanks to David Pooley, Rodolfo Raya, and Martin Wunderlich for their examples, help, and suggestions with SRX.