Visual simultaneous localisation and mapping (vSLAM) has shown considerable promise in positioning and navigating across a variety of indoor and outdoor settings, significantly enhancing the mobility of robots employed in industrial and everyday services. Nonetheless, the prevalent reliance of vSLAM technology on the assumption of static environments has led to suboptimal performance in practical implementations, particularly in unstructured and dynamically noisy environments such as substations. Despite advancements in mitigating the influence of dynamic objects through the integration of geometric and semantic information, existing approaches have struggled to strike an equilibrium between performance and real-time responsiveness. This study introduces a lightweight, multi-modal semantic framework predicated on vSLAM, designed to enable intelligent robots to adeptly navigate the dynamic environments characteristic of substations. The framework notably enhances vSLAM performance by mitigating the impact of dynamic objects through a synergistic combination of object detection and instance segmentation techniques. Initially, an enhanced lightweight instance segmentation network is deployed to ensure both the real-time responsiveness and accuracy of the algorithm. Subsequently, the algorithm’s performance is further refined by amalgamating the outcomes of detection and segmentation processes. With a commitment to maximising performance, the framework also ensures the algorithm’s real-time capability. Assessments conducted on public datasets and through empirical experiments have demonstrated that the proposed method markedly improves both the accuracy and real-time performance of vSLAM in dynamic environments.