Preparing Data
The first step in using text categorization is preparing your input file and your test
file. Structure the data in both files as tab-separated values. Both files must meet
these requirements:
- UTF-8 encoding
- Tab-separated data in two columns, where the first column contains the category name (for example: "Patient" or "Provider") and the second column has the data for each category (as displayed in the example below)
Your data should look like this:
Patient	John Smith dob04181963 224 Main St. Atl GA 30311
Provider	Mark Johnson M.D. NPI5489512047 412 Washington Atl GA 30301
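
The format above can be produced and checked with a short script. The following is a minimal sketch in Python, not part of the product itself: the function names format_row, write_tsv, and check_tsv are hypothetical and chosen only for illustration. It assumes each record is a (category, text) pair and that collapsing stray tabs or line breaks inside a field into single spaces is acceptable for your data.

```python
def format_row(category, text):
    """Build one tab-separated line (without the trailing newline)."""
    # A tab or line break inside a field would break the two-column layout,
    # so every run of whitespace is collapsed into a single space.
    clean_category = " ".join(category.split())
    clean_text = " ".join(text.split())
    return f"{clean_category}\t{clean_text}"


def write_tsv(path, rows):
    """Write (category, text) pairs to a UTF-8 encoded, tab-separated file."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for category, text in rows:
            f.write(format_row(category, text) + "\n")


def check_tsv(path):
    """Return the 1-based numbers of lines that are not two non-empty columns.

    Opening the file with encoding="utf-8" also raises UnicodeDecodeError
    if the file is not valid UTF-8.
    """
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2 or not all(part.strip() for part in parts):
                bad_lines.append(number)
    return bad_lines
```

For example, calling write_tsv("input.tsv", rows) with the two sample records and then check_tsv("input.tsv") should return an empty list; any line number it does return points to a row that needs fixing before the file is used. The file name input.tsv is only a placeholder.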