Healthcare systems have recently adopted the Internet of Medical Things (IoMT) to support intelligent data collection and decision-making. However, the volume of malicious threats, particularly new variants of malware attacks on connected medical devices and the systems they are connected to, has risen significantly in recent years, posing a critical threat to patients' confidential data and to the safety of healthcare systems. To address the high complexity of conventional software-based detection techniques, Hardware-supported Malware Detection (HMD) has proved efficient at detecting malware at the processor's micro-architecture level with the aid of Machine Learning (ML) techniques applied to Hardware Performance Counter (HPC) data. Existing ML-based HMDs, while accurate in recognizing known signatures of malicious patterns, have not been shown to generalize to unknown zero-day attacks in new data streams from IoMT devices at run-time, which is a more challenging problem. In this work, we first mimic the data streams that arise in the real-world operation of IoMT, where the stream collected during each period of operation (e.g., days) can contain malware and benign types that differ substantially from those in prior streams. We then examine the suitability of various standard ML classifiers for zero-day malware detection on a new data stream and demonstrate that such methods cannot detect unknown malware signatures with a high detection rate. To address the challenge of run-time zero-day malware detection, we propose a Deep Reinforcement Learning (DRL)-based defense mechanism that dynamically selects, at run-time and for each device, the best ML-based defender from a pool of highly efficient models continuously trained on all stream data. Our method first converts tabular data to images and then leverages transfer learning to retrain the Deep Neural Network (DNN)-based models and enhance their detection performance. We train the DNN models continuously on the various data streams to form a model pool. Finally, we train a DRL-based agent built from two Multi-Layer Perceptrons (MLPs), one acting as an Actor and the other as a Critic, to guide the selection of the most suitable DNN model at run-time, which enhances system performance using a small number of micro-architectural features captured at run-time by existing HPCs.
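The Actor-Critic selection step can be illustrated with a minimal sketch. It is not the implementation used in this work: the pool is replaced by a hypothetical stand-in in which each of three synthetic stream "regimes" has exactly one pool model that detects it correctly, the state is a made-up four-dimensional feature vector rather than real HPC readings, and all names (MLP, train_selector, select_model), layer sizes, and learning rates are assumptions chosen for illustration.

```python
import numpy as np

class MLP:
    """Tiny one-hidden-layer perceptron (tanh) with manual gradients."""
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W1 = rng.normal(0.0, 0.3, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.3, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        self.x = x
        self.h = np.tanh(x @ self.W1 + self.b1)
        return self.h @ self.W2 + self.b2

    def step(self, grad_out, lr):
        """One gradient-descent step given dLoss/dOutput of the last forward pass."""
        grad_h = (self.W2 @ grad_out) * (1.0 - self.h ** 2)
        self.W2 -= lr * np.outer(self.h, grad_out)
        self.b2 -= lr * grad_out
        self.W1 -= lr * np.outer(self.x, grad_h)
        self.b1 -= lr * grad_h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_selector(n_models=3, n_features=4, episodes=5000, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    actor = MLP(n_features, 16, n_models, rng)   # outputs one logit per pool model
    critic = MLP(n_features, 16, 1, rng)         # outputs the expected reward of a state
    # Assumption: each regime has a distinct prototype state, and pool model k
    # is the only one that classifies regime k correctly.
    prototypes = np.eye(n_models, n_features)
    for _ in range(episodes):
        regime = rng.integers(n_models)
        state = prototypes[regime] + rng.normal(0.0, 0.1, n_features)
        probs = softmax(actor.forward(state))
        action = rng.choice(n_models, p=probs)   # sample a model from the pool
        reward = 1.0 if action == regime else 0.0
        value = critic.forward(state)[0]
        advantage = reward - value               # one-step episode, no bootstrapping
        # Critic minimizes 0.5 * (value - reward)^2.
        critic.step(np.array([value - reward]), lr)
        # Actor minimizes -advantage * log pi(action | state).
        grad_logits = probs.copy()
        grad_logits[action] -= 1.0
        actor.step(advantage * grad_logits, lr)
    return actor, critic, prototypes

def select_model(actor, state):
    """Greedy run-time choice: index of the pool model the Actor prefers."""
    return int(np.argmax(actor.forward(state)))

actor, critic, prototypes = train_selector()
choices = [select_model(actor, p) for p in prototypes]
print(choices)
```

In this sketch the Critic only serves as a baseline that reduces the variance of the Actor's policy-gradient update; in a real deployment the reward would come from the detection outcome of the selected DNN model on labeled stream data.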