Configuring Options
Configuring options involves creating a Training Options file that describes your model and the options to apply when training it. This file must be in XML format with UTF-8 encoding and must include the header and the required training features:
Header in the Training Options File
The header specifies the details of the model, its type, and the paths of the input and test files.
- modelName: Name of the model
- modelType: The type of model (which is TC, meaning text categorization in this case)
- modelDescription: Description of the model
- inputFilePath: Location of the input file used for training the model
- testFilePath: Location of the test file
- algorithm: The machine learning algorithm used for training the model (default is MaxEnt)
Note: The test file measures the effectiveness of a model. It determines the behavior of the custom model with various training parameters. As a best practice, use different input and test files when training or evaluating your custom models.
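As an illustration of the header fields listed above, the following minimal Python sketch builds the header and serializes it with a UTF-8 XML declaration. The function names `build_header` and `to_utf8_bytes` are hypothetical, the element order simply follows the list above, and whether the product requires the XML declaration is an assumption; the sketch covers the header only, not the training features.

```python
import xml.etree.ElementTree as ET


def build_header(model_name, description, input_path, test_path, algorithm="MaxEnt"):
    """Build a <trainingOptions> element holding only the header fields.

    Hypothetical helper: element names come from the header list above,
    and MaxEnt is used as the default algorithm as described there.
    """
    root = ET.Element("trainingOptions")
    fields = [
        ("modelName", model_name),
        ("modelType", "TC"),  # TC = text categorization
        ("modelDescription", description),
        ("inputFilePath", input_path),
        ("testFilePath", test_path),
        ("algorithm", algorithm),
    ]
    for tag, text in fields:
        ET.SubElement(root, tag).text = text
    return root


def to_utf8_bytes(root):
    """Serialize to bytes with an XML declaration that names UTF-8."""
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


header = build_header(
    "modelone",
    "modelOne",
    "C:/SpectrumIE/textclassification/train_Input.csv",
    "C:/SpectrumIE/textclassification/train_Test.txt",
)
print(to_utf8_bytes(header).decode("utf-8"))
```

Writing the returned bytes to a file opened in binary mode keeps the encoding on disk consistent with the declaration.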
Training Features
These are the training features you can use to create a new category.
- Linguistic feature: To specify the language properties
- Stemming: Reduces words to their stem, or root. For example, "insurer", "insured", and "insures" can all be reduced to the root "insure".
<trainingFeature> <featureName>Stemming</featureName> </trainingFeature>
- Keyword features: To define the list of keywords
- IgnoreWords: Also known as stop words, this feature filters out common words that have no effect on categorization, such as "the", "and", and "but". Separate the words with commas only, without spaces. You can also use the Append key with this feature; when it is set to "True", the listed words are added to the existing list of stop words.
<trainingFeature> <featureName>IgnoreWords</featureName> <featureParams> <entry> <key>WordList</key> <value> and,the,for,with,still,tri,rep,cust,keep,get,req,call </value> </entry> <entry> <key>Append</key> <value>True</value> </entry> </featureParams> </trainingFeature>
- CategoryKeywords: Identifies a category for a list of keywords belonging to multiple custom lists. For example, the Weekdays list in CategoryKeywords contains the keywords Monday, Tuesday, Wednesday, Thursday, and Friday. This feature can optionally specify whether the match should be case sensitive; when used, the default is true.
<trainingFeature> <featureName>CategoryKeywords</featureName> <featureParams> <entry> <key>Weekdays</key> <!-- List of weekdays --> <value>Monday,Tuesday,Wednesday,Thursday,Friday</value> </entry> <entry> <key>WeekendDays</key> <!-- List of weekend days --> <value>Saturday,Sunday</value> </entry> <entry> <key>CaseSensitive</key> <value>True</value> </entry> </featureParams> </trainingFeature>
- KeyWords: Searches for words that you have specified as belonging to a custom list, such as DaysOfWeek or Month. This feature can also optionally specify whether the match should be case sensitive; when used, the default is "true".
<trainingFeature> <featureName>KeyWords</featureName> <featureParams> <entry> <key>KeyWordList</key> <value>Monday,Tuesday</value> </entry> <entry> <key>CaseSensitive</key> <value>False</value> </entry> </featureParams> </trainingFeature>
- Lexical feature: To specify the lexeme properties
- NGram: Searches for a portion of a longer string, with "n" representing the number of words to look for. For example, if you are looking for the phrase "to be or not to be", you might search for a unigram of "to" or "be", a bigram of "to be" or "or not", or a trigram of "to be or" or "not to be".
<trainingFeature> <featureName>NGram</featureName> <featureParams> <entry> <key>Count</key> <value>3</value> </entry> </featureParams> </trainingFeature>
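To make the NGram description concrete, here is a minimal Python sketch that lists the word n-grams of a phrase. The function `word_ngrams` is a hypothetical helper written for illustration; it is not part of the product, and it does not show how the trainer interprets the Count parameter internally.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


phrase = "to be or not to be"
for n in (1, 2, 3):
    # Print the unigrams, bigrams, and trigrams of the example phrase
    print(n, word_ngrams(phrase, n))
```

The bigrams of the example phrase include "to be" and "or not", and the trigrams include "to be or" and "not to be", matching the examples in the NGram description.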
<trainingOptions>
<modelName>modelone</modelName>
<modelType>TC</modelType>
<modelDescription>modelOne</modelDescription>
<inputFilePath>C:/SpectrumIE/textclassification/train_Input.csv</inputFilePath>
<testFilePath>C:/SpectrumIE/textclassification/train_Test.txt</testFilePath>
<algorithm>SVM</algorithm>
<trainingFeatures>
<!-- Keyword features -->
<trainingFeature>
<featureName>IgnoreWords</featureName>
<featureParams>
<entry>
<key>WordList</key>
<value>
and,the,for,with,still,tri,rep,cust,keep,get,req,call
</value>
</entry>
<entry>
<key>Append</key>
<value>True</value>
</entry>
</featureParams>
</trainingFeature>
<trainingFeature>
<featureName>CategoryKeywords</featureName>
<featureParams>
<entry>
<key>Category1</key>
<value>CategoryKeyword1,CategoryKeyword2</value>
</entry>
<entry>
<key>Category2</key>
<value>CategoryKeyword3,CategoryKeyword4</value>
</entry>
</featureParams>
</trainingFeature>
<trainingFeature>
<featureName>KeyWords</featureName>
<featureParams>
<entry>
<key>KeyWordList</key>
<value>
jam,misfeed,install,help,mechanical,failure,jam,pc,connection
</value>
</entry>
</featureParams>
</trainingFeature>
<!-- Linguistic feature -->
<trainingFeature>
<featureName>Stemming</featureName>
</trainingFeature>
<!-- Lexical feature -->
<trainingFeature>
<featureName>NGram</featureName>
<featureParams>
<entry>
<key>Count</key>
<value>3</value>
</entry>
</featureParams>
</trainingFeature>
</trainingFeatures>
</trainingOptions>
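Before submitting a Training Options file, it can help to confirm that it is well-formed and that word lists use commas without spaces. The following Python sketch is one possible pre-check, not a product tool: `read_features` and `words_with_spaces` are hypothetical helper names, and the shortened `SAMPLE` document is an assumption made only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Shortened sample for illustration; a real file also carries the full header.
SAMPLE = """<trainingOptions>
  <modelName>modelone</modelName>
  <modelType>TC</modelType>
  <trainingFeatures>
    <trainingFeature>
      <featureName>IgnoreWords</featureName>
      <featureParams>
        <entry><key>WordList</key><value>and,the,for</value></entry>
        <entry><key>Append</key><value>True</value></entry>
      </featureParams>
    </trainingFeature>
    <trainingFeature>
      <featureName>Stemming</featureName>
    </trainingFeature>
  </trainingFeatures>
</trainingOptions>"""


def read_features(xml_text):
    """Map each featureName to its key/value parameters.

    ET.fromstring raises ET.ParseError if the XML is not well-formed,
    for example when a closing tag is mistyped.
    """
    root = ET.fromstring(xml_text)
    features = {}
    for feature in root.iter("trainingFeature"):
        name = feature.findtext("featureName")
        params = {}
        for entry in feature.iter("entry"):
            # Strip the whitespace that surrounds multi-line values
            params[entry.findtext("key")] = (entry.findtext("value") or "").strip()
        features[name] = params
    return features


def words_with_spaces(word_list):
    """Return the comma-separated items that contain a space."""
    return [word for word in word_list.split(",") if " " in word]


features = read_features(SAMPLE)
print(features)
print(words_with_spaces(features["IgnoreWords"]["WordList"]))
```

An empty list from `words_with_spaces` means the word list follows the comma-only convention described for IgnoreWords.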