In this paper, a multi-modal classification method is proposed for recognizing vandalism against Automatic Teller Machines (ATMs). A model based on visual and textual information is developed to identify external threats to ATMs. The model discriminates threatening behaviors from benign ones in an image, and it provides a level of confidence in the threat recognition by coupling visual object classification with a word-vector distance measure. To achieve our goal, a real-time object detector based on a Region-based Convolutional Neural Network (R-CNN) first detects objects in the scene; a word embedding technique then measures the distance between each detected object label and a predefined set of tools assumed to be used for vandalizing ATMs. The similarity measure obtained from the word embedding not only determines whether the scene may lead to nefarious activity, but also provides a level of confidence in the occurrence of such an incident. Experimental evaluation shows that the method is effective and delivers a quantitative measure for the decisions it makes.
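The similarity step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the three-dimensional vectors are toy values standing in for real word embeddings, and the tool list, the 0.8 threshold, and the names `threat_score` and `classify_scene` are assumptions made for illustration. The sketch assumes cosine similarity as the word-vector measure, and that an object detector has already produced the labels.

```python
import math

# Toy 3-dimensional vectors standing in for real word embeddings.
# These values are illustrative only, not taken from any trained model.
TOY_EMBEDDINGS = {
    "hammer":  [0.90, 0.10, 0.00],
    "crowbar": [0.80, 0.20, 0.10],
    "mallet":  [0.85, 0.15, 0.05],
    "wallet":  [0.10, 0.90, 0.20],
    "handbag": [0.00, 0.80, 0.30],
}

# Hypothetical predefined list of tools assumed to be used for vandalism.
THREAT_TOOLS = ["hammer", "crowbar"]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def threat_score(label, embeddings=TOY_EMBEDDINGS, tools=THREAT_TOOLS):
    """Highest similarity between a detected label and any predefined tool."""
    vec = embeddings[label]
    return max(cosine_similarity(vec, embeddings[t]) for t in tools)


def classify_scene(detected_labels, threshold=0.8):
    """Return (is_threat, confidence) for the labels detected in one image.

    The confidence is the largest threat score over all detected labels;
    the threshold is an assumed value chosen for this sketch.
    """
    if not detected_labels:
        return False, 0.0
    confidence = max(threat_score(label) for label in detected_labels)
    return confidence >= threshold, confidence
```

With these toy vectors, a scene containing a `mallet` is flagged because its vector lies close to `hammer`, while a scene containing only a `wallet` and a `handbag` is not. In both cases the returned similarity serves as the quantitative confidence attached to the decision.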