At Maritz Research we use Text Analytics to auto-categorize responses to open-ended survey questions. One of the unsung heroes of our Text Analytics solutions for auto-categorization is …drum roll please… Spell Checking/Correction! That is why we highlighted it in a presentation at the Text Analytics Summit in June, and will be featuring it again in a presentation at the Text Analytics World conference in October.
This unglamorous tool in the Text Analytics toolbox can make a huge difference in the results of auto-categorization, whether it is rules-based (e.g., our solutions for survey comment auto-categorization), machine-learning-based, or a hybrid (e.g., our solutions for social media auto-categorization).
For rules-based solutions, automatically correcting misspellings helps you keep your rules simple, clean, and maintainable. If you have a category, for example, that is looking for people talking about courteous employees, your rule can be something like this:
(staff OR employee) AND courteous
Instead of something like this:
(staff OR employee) AND (courteous OR courtious OR curteous OR curtious OR …)
Imagine that list going on and on, for hundreds of versions of “courteous”. That’s right: we’ve identified hundreds of different ways that survey respondents creatively spell “courteous”. And now imagine multiple categories that are looking for “courteous”, each one with rules that include that enormous list of “courteous” spellings. By automatically correcting misspellings before the text gets to your rules, you can focus your rules on the logic needed, not on all the various ways respondents like to spell the words that are important to you.
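To make the idea concrete, here is a minimal Python sketch of correcting misspellings before the text reaches a rule. It is only an illustration, not the solution described in this post: the vocabulary list, the function names, and the similarity cutoff of 0.75 are all assumptions made for the example, and the fuzzy matching comes from `difflib` in the Python standard library rather than from a dedicated spell checker.

```python
import difflib
import re

# Hypothetical vocabulary: correctly spelled words that the rules care about.
VOCABULARY = ["staff", "employee", "courteous", "helpful", "knowledgeable"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest vocabulary word, or the word unchanged if none is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Lowercase the text, split it into words, and correct each word."""
    return [correct_word(w) for w in re.findall(r"[a-z']+", text.lower())]


def matches_courteous_staff(tokens):
    """The simple rule from above: (staff OR employee) AND courteous."""
    words = set(tokens)
    return ("staff" in words or "employee" in words) and "courteous" in words


comment = "The staff was very curtious"
print(matches_courteous_staff(correct_text(comment)))  # prints True
```

Because the correction happens first, the rule itself never has to mention “curtious” or any other variant. A real solution would need a much larger dictionary and more careful handling of words that are close to a vocabulary word but mean something else.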
Along with more impressive-sounding tools and techniques like regular expressions, part-of-speech tagging, naïve Bayesian classifiers, etc., the automatic detection and correction of misspellings can play a critical part in an auto-categorization solution.
After all, you want to know if your “representives” are “knowledgable”, “corteus”, and “helful”, right? So why let misspellings get in the way?