Okapi Components: Term Extraction Utility
Overview
- The utility set identifier for this utility is: oku_set02
- The utility identifier is: termextraction
The Term Extraction utility allows you to extract a list of terms and their frequency from one or more input files.
A term is a sequence of one or more words that does not start or end with a stop-word. You can specify the maximum number of words per term, as well as the list of stop-words.
Note: This utility is currently limited to languages using scripts where words are delimited by spaces (for example, it is usable for English, Russian, or Arabic, but not for Chinese, Thai, or Japanese).
TODO:
- break term sequence using punctuation, line-break
- number handling
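The definition of a term given above can be illustrated with a short Python sketch. This is not the utility's actual code: the function name `extract_terms` and its parameters are hypothetical, and the sketch assumes plain whitespace splitting with no punctuation or number handling.

```python
from collections import Counter


def extract_terms(text, stop_words, min_words=1, max_words=3):
    """Count word sequences that neither start nor end with a stop word.

    Hypothetical helper for illustration only; it splits on whitespace,
    so it only suits scripts where words are delimited by spaces.
    """
    words = text.lower().split()
    counts = Counter()
    for start in range(len(words)):
        if words[start] in stop_words:
            continue  # a term may not start with a stop word
        for length in range(min_words, max_words + 1):
            end = start + length
            if end > len(words):
                break  # ran past the end of the text
            if words[end - 1] in stop_words:
                continue  # a term may not end with a stop word
            counts[" ".join(words[start:end])] += 1
    return counts


if __name__ == "__main__":
    sample = "the quality of the output and the quality of the input"
    for term, count in extract_terms(sample, {"the", "of", "and"}, 1, 4).items():
        print(count, term)
```

Every additional word allowed per term adds another pass over each starting position, which is consistent with longer terms making the extraction slower.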
The common parameters are the options specified from the application calling the utility rather than in the options dialog box of the utility itself. For this utility the common parameters you need to specify are the following:
Files of the first input list | Needed
Root for the first input list | Not needed
Files of the second input list | Not needed
Root for the second input list | Not needed
Files of the third input list | Not needed
Root for the third input list | Not needed
Input language | Needed
Output language | Not needed
Input default encoding | Needed
Output default encoding | Not needed
Location and names for output files | Not needed
The available options are:
Output file to generate -- Enter the full path of the output file to generate.
Open automatically the output file when the task is done -- Set this option to open the output file automatically when the task is done.
Minimum number of words per term -- Enter the minimum number of words each term can have. The value must be between 1 and the value for Maximum number of words per term.
Maximum number of words per term -- Enter the maximum number of words each term can have. The more words per term, the longer the extraction will take. The value must be between the value for Minimum number of words per term and 7.
Minimum occurrences for output -- Enter the minimum number of occurrences a term should have to be listed in the output file. Any term with a number of occurrences less than the specified value will not be output. The value must be between 1 and 100.
Sort the terms by frequency (most frequent first) -- Set this option to list the output terms in order of frequency (the most frequent terms being listed first). If this option is not set, the list is sorted alphabetically, using the collation for the specified input language.
Preserve case differences -- Set this option to keep the case differences. If this option is set, "Term" and "term" are seen as two different words. If this option is not set, they are seen as the same word.
List of stop words -- Enter the full path of the file that contains the list of stop words to use for the extraction. Stop words are words that stop the creation of a term.
List of not-starting words -- Enter the full path of the file that contains the list of words that should not start a term. Not-starting words are words that do not appear at the beginning of a term (but they can appear within a term or at the end of it).
List of not-ending words -- Enter the full path of the file that contains the list of words that should not end a term. Not-ending words are words that do not appear at the end of a term (but they can appear within a term or at the beginning of it).
For each of these three files, you can select a file by using the small browse button at the end of the corresponding input field, and you can edit the file by clicking the Edit button under the corresponding input field.
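The way the three word lists, the case option, the minimum occurrences, and the sort option interact can be sketched in Python as follows. This is only an illustration under stated assumptions, not the utility's implementation: all function names are hypothetical, a stop word is assumed to reject a candidate wherever it occurs, and the alphabetical sort uses plain code point order rather than the collation of the input language.

```python
def normalize(word, preserve_case=False):
    """Fold case unless case differences are to be preserved."""
    return word if preserve_case else word.lower()


def is_term(words, stop_words, not_starting, not_ending):
    """Check a candidate word sequence against the three word lists.

    Assumption: a stop word anywhere in the sequence rejects it, while
    not-starting and not-ending words are only checked at the edges.
    """
    if not words or any(w in stop_words for w in words):
        return False
    return words[0] not in not_starting and words[-1] not in not_ending


def select_terms(counts, min_occurrences=2, by_frequency=True):
    """Drop rare terms, then sort by frequency or alphabetically."""
    kept = [(term, n) for term, n in counts.items() if n >= min_occurrences]
    if by_frequency:
        # Most frequent first; ties are broken alphabetically.
        kept.sort(key=lambda item: (-item[1], item[0]))
    else:
        # Plain code point order, not language-specific collation.
        kept.sort(key=lambda item: item[0])
    return kept


if __name__ == "__main__":
    print(is_term(["rate", "of", "return"], {"the"}, {"of"}, {"of"}))
    print(select_terms({"rate of return": 4, "rate": 9, "return": 1}))
```

In this sketch "rate of return" is accepted because "of" only appears inside the term, whereas "of return" and "rate of" would be rejected by the not-starting and not-ending lists respectively.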
The stop words lists, not-starting words lists, and the not-ending words lists are stored in text files in the following format:
Example:
# Example of list
and
of
or
in
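A word list in this format could be read with a few lines of Python. The sketch below is an assumption based on the example above: it treats the file as one word per line, treats lines starting with # as comments, and skips blank lines. The function names and the default UTF-8 encoding are hypothetical and not part of the utility.

```python
def parse_word_list(text):
    """Return the set of words found in the text of a word-list file.

    Assumptions: one word per line, lines starting with '#' are
    comments, and blank lines are ignored.
    """
    words = set()
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        words.add(entry)
    return words


def load_word_list(path, encoding="utf-8"):
    """Read a word-list file from disk (the encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return parse_word_list(handle.read())


if __name__ == "__main__":
    print(sorted(parse_word_list("# Example of list\nand\nof\nor\nin\n")))
```

A set is used so that checking whether a word belongs to a list stays fast even for long lists.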
Special thanks to Jaroslaw Michalak and Frank Kuhnke for their help in improving this utility.