Geir Thore Berge, Ole-Christoffer Granmo, Tor Oddbjørn Tveit, Morten Goodwin, Lei Jiao, Bernt Viggo Matheussen
Medical applications challenge today's text categorization techniques by demanding both high accuracy and ease of interpretation. Although deep learning has provided a leap forward in accuracy, this leap comes at the cost of interpretability. In this paper, we introduce a text categorization approach that leverages the recently introduced Tsetlin Machine to address this accuracy-interpretability challenge. Briefly, we represent the terms of a text as propositional variables. From these variables, we capture categories using simple propositional formulae, such as: IF “rash” AND “reaction” AND “penicillin” THEN Allergy. The Tsetlin Machine learns these formulae from labeled text, utilizing conjunctive clauses to represent the particular facets of each category. Consequently, the absence of terms (negated features) can also be used for categorization. Our empirical comparisons with Naïve Bayes classifiers, decision trees, linear support vector machines (SVMs), random forests, long short-term memory (LSTM) neural networks, and other techniques are quite conclusive. Using relatively simple propositional formulae, the Tsetlin Machine either outperforms or performs approximately on par with the best evaluated methods on both the 20 Newsgroups and IMDb datasets, as well as on a clinical dataset containing authentic electronic health records (EHRs). On average, the Tsetlin Machine delivers the best recall and precision scores across the datasets. The main merit of the proposed approach is thus its capacity for producing human-interpretable rules while at the same time achieving competitive accuracy. We believe that our novel approach can have a significant impact on a wide range of text analysis applications, providing a promising starting point for deeper natural language understanding with the Tsetlin Machine.
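To make the rule representation concrete, the following is a minimal sketch (not the authors' implementation) of how a learned conjunctive clause with positive and negated literals could be evaluated over a bag-of-words document; the clause contents are taken from the illustrative rule in the abstract, and the document text is hypothetical.

```python
def clause_matches(doc_terms, positive, negated):
    """A conjunctive clause fires iff every positive literal is present
    in the document and every negated literal is absent from it."""
    return positive <= doc_terms and not (negated & doc_terms)

# IF "rash" AND "reaction" AND "penicillin" THEN Allergy
# (no negated literals in this example clause)
allergy_clause = ({"rash", "reaction", "penicillin"}, set())

# Hypothetical clinical note, reduced to its set of terms.
doc = {"patient", "developed", "rash", "after", "penicillin", "reaction"}
print(clause_matches(doc, *allergy_clause))  # True
```

In the Tsetlin Machine, many such clauses vote per category, so each clause only needs to capture one facet of the category; negated literals let a clause fire specifically when a term is absent.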