Robotic assistive and rehabilitative devices are promising aids for people with neurological disorders because they can help restore normative upper- and lower-limb function. However, accurately and non-invasively estimating human intent or residual effort during use of these devices remains challenging. In this article, we propose a deep learning approach that uses brightness-mode (B-mode) ultrasound (US) imaging of skeletal muscles to predict the net ankle plantarflexion moment during walking. The customized deep convolutional neural network (CNN) structure is designed to ensure the convergence and robustness of the deep learning approach. We investigated how the region of interest (ROI) of the US images influences net plantarflexion moment prediction performance. We also compared CNN-based moment prediction using B-mode US imaging with that using surface electromyography (sEMG) spectrum imaging at the same ROI size. Experimental results from eight young participants walking on a treadmill at multiple speeds verified that the proposed US imaging + deep learning approach improves the accuracy of net joint moment prediction. With the same CNN structure, US imaging significantly reduced the normalized root mean square error of the prediction by 37.55% ($p < .001$) and increased the coefficient of determination by 20.13% ($p < .001$) compared with sEMG spectrum imaging. These findings show that the US imaging + deep learning approach enables personalized assessment of voluntary human joint effort and can be incorporated into assistive or rehabilitative devices to improve clinical performance under an assist-as-needed control strategy.
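
The two reported metrics, normalized root mean square error and coefficient of determination, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names `nrmse` and `r_squared` are hypothetical, the sample moment values are invented for illustration only, and normalizing the RMSE by the peak absolute measured moment is an assumption, since the normalization convention is not stated here.

```python
import math
from typing import Sequence


def _check(measured: Sequence[float], predicted: Sequence[float]) -> None:
    # Both series must describe the same, non-empty set of time samples.
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def nrmse(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE divided by the peak absolute measured value.

    ASSUMPTION: peak normalization; other conventions (range, mean) exist.
    """
    _check(measured, predicted)
    mse = sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)
    peak = max(abs(m) for m in measured)
    if peak == 0:
        raise ValueError("cannot normalize by a zero peak value")
    return math.sqrt(mse) / peak


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    _check(measured, predicted)
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured signal has zero variance")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Illustrative values only (not data from the study).
    measured_moment = [0.0, 0.4, 1.2, 0.6]
    predicted_moment = [0.1, 0.5, 1.0, 0.6]
    print("NRMSE:", nrmse(measured_moment, predicted_moment))
    print("R^2:  ", r_squared(measured_moment, predicted_moment))
```

Under these assumptions, a lower `nrmse` and a higher `r_squared` for one imaging modality than for the other, with the same CNN structure, would correspond to the direction of improvement reported above.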