AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild
Abstract
Speaker diarization in real-world videos is challenging due to varying acoustic conditions, diverse scenes, the presence of off-screen speakers, and other factors. This paper builds on a previous study (AVR-Net) and introduces a novel multi-modal speaker diarization system, AFL-Net. AFL-Net incorporates dynamic lip movement as an additional modality to sharpen the identity distinction of each segment. In addition, unlike AVR-Net, which extracts high-level representations from each modality independently, AFL-Net employs a two-step cross-attention mechanism to fuse the modalities thoroughly, yielding more comprehensive information for identity discrimination. We also incorporate a masking strategy during training, in which the face and lip modalities are randomly obscured; this strengthens the influence of the audio modality on the system outputs. An ablation study confirms the effectiveness of each contribution, and experimental results show that the proposed system outperforms state-of-the-art baselines such as AVR-Net and DyViSE.
Method
Code will be available upon acceptance.
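Until the release, a minimal sketch of the two-step cross-attention fusion described in the abstract is given below, assuming PyTorch. The fusion order (lip features attend to face features, then the fused visual stream attends to audio), the shared embedding dimension, and all module names are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class TwoStepCrossAttentionFusion(nn.Module):
    """Hypothetical two-step cross-attention fusion of lip, face,
    and audio embeddings; dimensions and ordering are assumptions."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Step 1: lip features attend to face features (visual fusion).
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Step 2: the fused visual stream attends to audio features.
        self.av_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio, face, lip):
        # audio, face, lip: (batch, seq_len, dim) high-level embeddings
        # from the per-modality encoders, assumed to share one dimension.
        visual, _ = self.visual_attn(query=lip, key=face, value=face)
        visual = self.norm1(visual + lip)   # residual over the lip stream
        fused, _ = self.av_attn(query=visual, key=audio, value=audio)
        fused = self.norm2(fused + visual)  # residual over the visual stream
        return fused  # per-segment identity representation
```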
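The masking strategy mentioned in the abstract can be sketched in the same hedged spirit. The masking probabilities below and the choice to zero out entire modality inputs are illustrative assumptions; the paper's exact scheme may differ.

```python
import torch

def mask_visual_modalities(face, lip, p_face=0.5, p_lip=0.5):
    """Randomly obscure the face and lip inputs during training so the
    model cannot over-rely on visual cues (e.g., for off-screen speakers).
    The probabilities here are placeholders, not the paper's values."""
    if torch.rand(1).item() < p_face:
        face = torch.zeros_like(face)
    if torch.rand(1).item() < p_lip:
        lip = torch.zeros_like(lip)
    return face, lip
```

Zeroing the visual streams during training forces the audio branch to carry the identity signal on its own, which is consistent with the off-screen-speaker behavior shown in the third demo below.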
Demo
| AFL-Net (ours) | AVR-Net (baseline) | Analysis |
| --- | --- | --- |
| *(video)* | *(video)* | In this video, a conversation unfolds between two individuals. Both models correctly identify the speakers of the first two sentences, but the baseline misattributes the third, likely confused by both individuals' faces appearing on screen simultaneously. AFL-Net labels it correctly, a success we attribute to the additional lip-movement modality. |
| *(video)* | *(video)* | In this segment, AVR-Net mistakes the second individual for the first throughout, likely because of their similar visual appearances. AFL-Net identifies the second individual correctly, validating the effectiveness of the fusion strategy across modalities. |
| *(video)* | *(video)* | In this video, the second speaker is off-screen for part of the clip. AVR-Net misattributes that speech to the first speaker, whereas AFL-Net recognizes the correct speaker, likely because the model is trained to rely more heavily on the audio modality, which is directly linked to the speaker's identity. |