Okapi Components - Term Extraction Utility

- Overview
- Common Parameters
- Options - Options Tab
- Options - Word Lists Tab
- Credits

Overview

- The utility set identifier for this utility is: oku_set02
- The utility identifier is: termextraction

The Term Extraction utility allows you to extract a list of terms and their frequency from one or more input files.

A term is a sequence of one or more words that does not start or end with a stop-word. You can specify the maximum number of words per term, as well as the list of stop-words.

Note: This utility is currently limited to languages using scripts where words are delimited by spaces (so it is usable for English, Russian, or Arabic, but not for Chinese, Thai, or Japanese).
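
The definition above can be illustrated with a minimal Python sketch. It is not the utility's actual implementation: the function name extract_terms and its parameters are hypothetical, words are split on white space only (no punctuation handling), and stop-words are allowed inside a term because the definition constrains only the first and last word.

```python
from collections import Counter

def extract_terms(text, stop_words, max_words=3):
    """Count word sequences that neither start nor end with a stop-word.

    Hypothetical helper for illustration; assumes a space-delimited script
    and a stop-word set given in lowercase.
    """
    words = text.split()
    counts = Counter()
    for start in range(len(words)):
        if words[start].lower() in stop_words:
            continue  # a term cannot start with a stop-word
        for length in range(1, max_words + 1):
            end = start + length
            if end > len(words):
                break  # ran past the end of the text
            if words[end - 1].lower() in stop_words:
                continue  # a term cannot end with a stop-word
            counts[" ".join(words[start:end])] += 1
    return counts

counts = extract_terms("the end of file", {"the", "of"}, max_words=3)
```

With this input the sketch counts one occurrence each of "end", "end of file", and "file"; "end of" is rejected because it ends with a stop-word.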

TODO:
- break term sequence using punctuation, line-break
- number handling

Common Parameters

The common parameters are the options specified from the application calling the utility rather than in the options dialog box of the utility itself. For this utility the common parameters you need to specify are the following:

Files of the first input list	- Needed
Root for the first input list	- Not Needed
Files of the second input list	- Not Needed
Root for the second input list	- Not Needed
Files of the third input list	- Not Needed
Root for the third input list	- Not Needed
Input language	- Needed
Output language	- Not Needed
Input default encoding	- Needed
Output default encoding	- Not Needed
Location and names for output files	- Not Needed

Options - Options Tab

The available options are:

Output file to generate -- Enter the full path of the output file to generate.

Open automatically the output file when the task is done -- Set this option to open the output file automatically when the task is done.

Minimum number of words per term -- Enter the minimum number of words each term can have. The value must be between 1 and the value for Maximum number of words per term.

Maximum number of words per term -- Enter the maximum number of words each term can have. The more words per term, the longer the extraction will take. The value must be between the value for Minimum number of words per term and 7.

Minimum occurrences for output -- Enter the minimum number of occurrences a term should have to be listed in the output file. Any term with a number of occurrences less than the specified value will not be output. The value must be between 1 and 100.

Sort the terms by frequency (most frequent first) -- Set this option to generate the output terms in their frequency order (the most frequent terms being listed first). If this option is not set, the list is sorted alphabetically, using the collation for the specified input language.

Preserve case differences -- Set this option to keep the case differences. If this option is set, "Term" and "term" are treated as two different words. If it is not set, they are treated as the same word.
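
As a rough illustration of how these options could interact, the following Python sketch filters and orders a table of term counts. The function select_terms and its parameter names are hypothetical and do not come from the utility. The alphabetical branch uses plain code-point order as a stand-in, whereas the utility uses the collation of the input language.

```python
from collections import Counter

def select_terms(counts, min_words=1, max_words=3, min_occurrences=2,
                 sort_by_frequency=True, preserve_case=False):
    """Return (term, occurrences) pairs filtered and sorted per the options."""
    if not preserve_case:
        # Merge entries that differ only by case, e.g. "Term" and "term".
        merged = Counter()
        for term, n in counts.items():
            merged[term.lower()] += n
        counts = merged
    kept = [
        (term, n) for term, n in counts.items()
        if n >= min_occurrences and min_words <= len(term.split()) <= max_words
    ]
    if sort_by_frequency:
        # Most frequent first; ties broken alphabetically.
        kept.sort(key=lambda item: (-item[1], item[0]))
    else:
        # Assumption: code-point order instead of language collation.
        kept.sort(key=lambda item: item[0])
    return kept

sample = Counter({"Term": 2, "term": 1, "file name": 3, "x": 1})
by_frequency = select_terms(sample, min_occurrences=2)
```

Here "Term" and "term" are merged into a single entry with three occurrences, and "x" is dropped because it occurs only once.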

Options - Word Lists Tab

List of stop words -- Enter the full path of the file that contains the list of stop words to use for the extraction. Stop words are words that stop the creation of a term.

List of not-starting words -- Enter the full path of the file that contains the list of words that should not start a term. Not-starting words are words that do not appear at the beginning of a term (but they can appear within a term or at the end of it).

List of not-ending words -- Enter the full path of the file that contains the list of words that should not end a term. Not-ending words are words that do not appear at the end of a term (but they can appear within a term or at the beginning of it).

For each of these three files, you can select the file by using the small browse button at the end of the corresponding input field, and you can edit it by clicking the Edit button under that field.
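
One way to picture how the three lists constrain a candidate term is the Python sketch below. It is an interpretation, not the utility's code: the function is_valid_term is hypothetical, and it checks only the first and last word of the candidate. Whether a stop word in the middle of a candidate also ends the term is not spelled out here, so the sketch leaves interior words unchecked.

```python
def is_valid_term(words, stop_words, not_starting, not_ending):
    """Check a candidate term (a list of words) against the three word lists.

    Hypothetical helper; the lists are assumed to be sets of lowercase words.
    """
    if not words:
        return False
    first, last = words[0].lower(), words[-1].lower()
    if first in stop_words or last in stop_words:
        return False  # a term does not start or end with a stop word
    if first in not_starting:
        return False  # not-starting words cannot begin a term
    if last in not_ending:
        return False  # not-ending words cannot end a term
    return True

ok = is_valid_term(["end", "of", "file"], {"the"}, {"of"}, {"of"})
```

In this example "end of file" is accepted because "of" appears only inside the term, while "of file" or "end of" would be rejected.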

Words List File Format

The lists of stop words, not-starting words, and not-ending words are stored in text files in the following format:

The file must be in UTF-16 or UTF-8.
All words must be lowercase.
Each word must be on its own line.
The file must use DOS/Windows line-breaks (CR+LF).
Empty lines and lines containing only white space are ignored.
Lines where the first non-white-space character is a '#' (ASCII 0x23) are comments.

Example:

# Example of list
and
of
or
in
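
The format rules above could be read by a small parser such as the following Python sketch. The function parse_word_list is hypothetical. Choosing between UTF-16 and UTF-8 by looking for a byte-order mark is an assumption, since the way the utility detects the encoding is not described here. The sketch also accepts any line-break style and does not enforce the lowercase rule.

```python
import codecs

def parse_word_list(data: bytes) -> set:
    """Parse the raw bytes of a word list file into a set of words."""
    # Assumption: a UTF-16 file starts with a byte-order mark;
    # anything else is treated as UTF-8 (with or without a BOM).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")
    else:
        text = data.decode("utf-8-sig")
    words = set()
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # empty or white-space-only line
        if entry.startswith("#"):
            continue  # comment line
        words.add(entry)
    return words

sample = "# Example of list\r\nand\r\nof\r\n\r\nor\r\nin\r\n".encode("utf-8")
words = parse_word_list(sample)
```

Applied to the example list above, the sketch returns the four words "and", "of", "or", and "in", skipping the comment line.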

Credits

Special thanks to Jaroslaw Michalak and Frank Kuhnke for helping with improving this utility.