Over the last decade, most studies in Computer-Mediated Communication (CMC) have highlighted how online synchronous learning environments involve a new literacy related to multimodal communication. The environment used in our experiment is based on a synchronous audio-graphic conferencing tool. The study concerns false beginners in an English for Specific Purposes (ESP) course whose proficiency levels are highly heterogeneous. A coding scheme was developed to translate the video data into the user actions and speech acts that occurred in the various modalities of the system (aural, text chat, text editing, websites). The paper aims to deepen our understanding of multimodal communication structures by examining learner participation and learning practices. Drawing on evidence from an ongoing investigation into online Computer-Assisted Language Learning (CALL) literacy, we identify how learners use the different modalities to carry out a writing task collectively, and how multimodal learning interaction affects their focus and engagement in the learning process. The methodology combines a quantitative analysis of the learners' participation in the writing task, with regard to their use of the multimodal tools, and a qualitative analysis of how the multimodal dimension of communication enhances language and learning strategies. By examining the relationship between how tutors design the learning tasks and how learners implement them, and thus taking into account the overall perception of multimodal communication for language learning purposes, we provide a framework for evaluating the potential of such an environment for language learning.