Big Data, Decision Tree Induction, and Image Analysis for the Discovery of Decision Rules for Colon Examination

Authors: Petra Perner
DIN: IJOER-AUG-2017-21
Abstract

The aim of our research was to develop a method that automatically discovers the decision rules for classifying medical images into normal tissue images and images showing a polyp. We used a data set of images that came from an endoscope video system used for colon examination. The data set contains 283 normal tissue images and 61 polyp images. The 283 normal images contain dark regions and reflections. One must decide whether an image shows a polyp or not, which is a two-class problem. The unequal number of samples in the two classes makes this an unbalanced data set problem. The polyps in the images were identified and selected by a “well-trained” medical expert. Based on these medical images, we study the behavior of two different statistical texture descriptors: the co-occurrence matrix-texture descriptor and our novel Random-set texture descriptor. We review the theory of both texture descriptors and then apply them to our medical data set. We used a decision-tree induction method, implemented in our tool “Decision Master”, to learn the classification rules. In both cases, for the full unequally distributed data set and for the balanced data set, we achieved the best error rate with the Random-set texture descriptor. The performance of the co-occurrence matrix-texture descriptor was worse. Statistically based texture descriptors require sufficiently large textured regions, which cannot always be guaranteed for medical objects. Since the co-occurrence matrix is based on higher-order statistics, this might be the reason for its worse performance. The results show that decision tree induction and image analysis based on our novel texture descriptor is an excellent method for mining medical images for decision rules, even when the data set is unbalanced. This is not the only property that makes our Random-set texture descriptor favorable. It also gives a flexible way to describe the appearance of medical objects in symbolic terms, requires less computation time, and can be set up as a software module that can be used flexibly in different systems.

Keywords
Image Analysis, Endoscope Images, Colon Examination, Polyp Images, Decision Tree Induction, Random Set Texture Descriptor, Co-occurrence Texture Descriptor, Unbalanced Data Set Problem.
Introduction

The aim of our research was to develop a method that automatically discovers the decision rules for classifying images into normal tissue images and images containing a polyp. We used a data set of medical images that came from an endoscope video system used for colon examination. Texture seems to be a powerful tool for describing the appearance of medical objects and for distinguishing normal tissue from polyps. Flexible and powerful texture descriptors are therefore important: they should allow us both to recognize a texture and to understand what makes up the texture. Texture plays an increasingly important role in describing the appearance of different medical and biological objects in images. Patterns on cells in cell images, in images of fungi, or in polyp images can be described by texture.

Different texture descriptors have been developed in the past (Rao 1990). The most widely used one is the well-known texture descriptor based on the co-occurrence matrix (Haralick et al. 1973). Although it works well in different applications, we prefer our texture descriptor based on Random sets (Perner et al. 2002), since this descriptor gives us more freedom in describing different textures. In this paper, we compare the two texture descriptors on a medical data set. Related work on texture description is given in Section 2. The theory of the texture descriptor based on the co-occurrence matrix is reviewed in Section 3, and the texture descriptor based on Random sets is reviewed in Section 4. The material, the application of the texture descriptors, and the decision-tree induction method are explained in Section 5. The data set of polyp images is derived from colon examination. We calculated the texture features with both methods for each image of the data set and learned a decision tree classifier. Cross-validation is used to calculate the error rate. We then compare the properties of the two best decision trees, the runtime of the feature calculation, the selected features, and the semantic meaning of the texture descriptors. The results are presented in Section 6 and discussed in Section 7. Conclusions are given in Section 8.
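To make the co-occurrence idea concrete, the following is a minimal sketch of how a gray-level co-occurrence matrix and two commonly used features derived from it (contrast and energy) can be computed. It is not the implementation used in this study: the function names, the symmetric counting, the single displacement (dx, dy), and the choice of the two features are assumptions made for illustration only.

```python
from collections import Counter


def cooccurrence_matrix(image, dx=1, dy=0, levels=4):
    """Return a normalized, symmetric gray-level co-occurrence matrix.

    image  : list of rows, each a list of integer gray levels in [0, levels)
    dx, dy : pixel displacement that defines which pixel pairs are counted
    (All names and defaults are illustrative assumptions.)
    """
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                a, b = image[y][x], image[ny][nx]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count both directions (symmetric matrix)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image too small for the chosen displacement")
    # Normalize the counts so that all entries sum to 1.
    return [[counts[(i, j)] / total for j in range(levels)]
            for i in range(levels)]


def contrast(p):
    """Weight each entry by the squared gray-level difference of its pair."""
    n = len(p)
    return sum((i - j) ** 2 * p[i][j] for i in range(n) for j in range(n))


def energy(p):
    """Sum of squared entries (also called angular second moment)."""
    return sum(v * v for row in p for v in row)


if __name__ == "__main__":
    # Small 4x4 test image with four gray levels.
    img = [[0, 0, 1, 1],
           [0, 0, 1, 1],
           [0, 2, 2, 2],
           [2, 2, 3, 3]]
    p = cooccurrence_matrix(img, dx=1, dy=0, levels=4)
    print("contrast:", contrast(p))
    print("energy:  ", energy(p))
```

The sketch also illustrates why small objects are problematic for this kind of descriptor: the matrix has levels × levels entries, and a small region contributes only a few pixel pairs to estimate them.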

Conclusion

Many texture descriptors are known from the literature (Rao 1990). The most widely used one is the texture descriptor based on the co-occurrence matrix. We proposed a texture descriptor based on Random sets (Perner et al. 2002a). In this paper, we compared both texture descriptors on a medical-image data set for colon examination. The images had to be classified into normal tissue images and polyp images. We chose a medical application since the appearance of many medical objects can often be described well by texture. We learned a classifier model based on decision tree induction. Then we compared the classification results for both texture descriptors.
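The core step of decision tree induction is the choice of a split on one feature. The following minimal sketch selects a threshold on a single numeric texture feature by maximizing the information gain. The split criterion actually used by “Decision Master” is not stated in this text, so the entropy-based criterion, the function names, and the toy feature values below are assumptions for illustration only.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result


def best_threshold(values, labels):
    """Return (threshold, information_gain) of the best binary split.

    Candidate thresholds are the midpoints between consecutive distinct
    feature values; (None, 0.0) is returned if no split improves purity.
    """
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for k in range(1, len(pairs)):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        # Entropy after the split, weighted by the size of each branch.
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best


if __name__ == "__main__":
    # Hypothetical values of one texture feature for five images.
    feature = [0.1, 0.2, 0.3, 0.8, 0.9]
    classes = ["normal", "normal", "normal", "polyp", "polyp"]
    print(best_threshold(feature, classes))
```

A full induction method would apply this selection to every feature, split the data on the best one, and recurse on both branches until a stopping or pruning criterion is met.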

We have found that the texture descriptor based on Random sets outperforms the co-occurrence texture descriptor with respect to the error rate, the tree properties, and the runtime. The co-occurrence texture descriptor uses fewer features from the set of calculated texture features than the texture descriptor based on Random sets. However, this might only demonstrate that the co-occurrence texture descriptor has limited descriptive power, since its error rate is much higher than that of the texture descriptor based on Random sets. One reason might be that the medical objects are not very large, so that the higher-order statistics fail due to the limited number of pixels. The runtime of the Random-set texture descriptor is seven times lower than that of the co-occurrence texture descriptor. This is a big advantage of the Random-set texture descriptor, since the large computation time of image analysis algorithms is still a problem. The Random-set texture descriptor can form a software module that can be used for different applications and different object sizes.

In addition, the texture descriptor based on Random sets carries semantic meaning. An expert can understand the properties of the texture by looking at the slices produced during the calculation of the texture features. The different appearances in the slices can therefore be labeled with semantic terms, which would give us the capability to explain the different textures.

The unbalanced data set problem, which often appears for medical data sets, is handled in our study by sampling an equally distributed data set for the two-class problem. With this balanced data set we achieve a higher classification accuracy for both texture descriptors, but the Random-set texture descriptor still outperforms the co-occurrence matrix-texture descriptor.
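One simple way to obtain an equally distributed data set is random undersampling of the larger class. The sketch below illustrates this for the class sizes reported here (283 normal images, 61 polyp images). The exact sampling scheme of the study is not described in this text, so undersampling without replacement, the function name, and the fixed seed are assumptions for illustration only.

```python
import random


def balance_by_undersampling(samples, labels, seed=0):
    """Return a shuffled list of (sample, label) pairs with equal class sizes.

    Every class is randomly reduced, without replacement, to the size of
    the smallest class. The seed only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    smallest = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, smallest):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Image indices stand in for the feature vectors of the 344 images.
    image_ids = list(range(344))
    classes = ["normal"] * 283 + ["polyp"] * 61
    balanced = balance_by_undersampling(image_ids, classes)
    print(len(balanced), "images after balancing")
```

Undersampling discards most of the normal images, so the error rate estimated on the balanced set should be read together with the error rate on the full, unequally distributed set.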
