Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models

Anonymous submission to Interspeech 2026

Abstract

End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 evaluation protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity. Layer-wise analysis shows that leakage persists across all transformer layers, with SALM-Duplex leaking most in early layers while Moshi leaks uniformly; turn-wise analysis shows that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement approach (Anon-W2F). Anon-W2F raises EER to more than 3.5 times the unprotected baseline, approaching the 50% random-chance ceiling, while Anon-W2W retains over 90% of the baseline sBERT score with sub-second response latency (FRL under 0.8 s).

Index Terms: speaker anonymization, full-duplex speech, privacy, speaker verification, speech agents

Pipeline & Method

Proposed Anonymization Setups


Figure 2: Overview of the original SALM-Duplex pipeline and proposed anonymization setups. The main diagram shows the ASR-based encoder baseline: an ECAPA-TDNN probe attached to the LLM's hidden states (red dashed path) reveals substantial speaker identity leakage. The Anon-W2W inset (upper left) prepends Stream-Voice-Anon to anonymize the waveform before the unchanged ASR encoder. The Anon-W2F inset (upper right) replaces the ASR encoder with the Stream-Voice-Anon encoder (anonymization active) and fine-tunes the LLM, eliminating the redundant waveform synthesis step.

01

Anon-W2W (wave-to-wave)

Stream-Voice-Anon transforms the raw user waveform into an anonymized waveform before it is fed to the dialogue model. The original encoder is retained. Evaluated on both SALM-Duplex and Moshi.

02

Anon-W2F (wave-to-feature)

The continuous encoder is replaced with a discrete encoder, and Stream-Voice-Anon's anonymization module operates natively on the discrete tokens, eliminating the redundant waveform synthesis step. Evaluated on SALM-Duplex.
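The difference between the two setups can be read as plain function composition. The sketch below is illustrative only: `anonymize_to_wave`, `original_encoder`, `anonymizing_encoder`, `llm`, and `finetuned_llm` are hypothetical callables standing in for Stream-Voice-Anon, the dialogue model's own encoder, and the LLM backbone. None of them is an API from the paper or from any library, and streaming (chunk-by-chunk processing) is omitted.

```python
from typing import Any, Callable


def anon_w2w(wave: Any,
             anonymize_to_wave: Callable[[Any], Any],
             original_encoder: Callable[[Any], Any],
             llm: Callable[[Any], Any]) -> Any:
    """Anon-W2W: anonymize the waveform first, then reuse the unchanged encoder."""
    anonymized_wave = anonymize_to_wave(wave)    # hypothetical Stream-Voice-Anon front-end
    features = original_encoder(anonymized_wave)  # encoder of the dialogue model, untouched
    return llm(features)


def anon_w2f(wave: Any,
             anonymizing_encoder: Callable[[Any], Any],
             finetuned_llm: Callable[[Any], Any]) -> Any:
    """Anon-W2F: the anonymizing encoder emits features directly, so no waveform is re-synthesized."""
    features = anonymizing_encoder(wave)  # hypothetical encoder with anonymization active
    return finetuned_llm(features)        # LLM fine-tuned on the new feature space
```

In this reading, Anon-W2W adds one stage in front of an otherwise unchanged model, whereas Anon-W2F removes a stage but requires the LLM to be fine-tuned on the replaced encoder's outputs.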

Privacy Probe (attacker)

An ECAPA-TDNN speaker verification model is trained from scratch on the LLM's hidden-state representations. Equal Error Rate (EER) is the primary metric; an EER of 50% corresponds to random chance, i.e. perfect anonymization.
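For readers unfamiliar with the metric, a minimal sketch of how EER can be estimated from verification trial scores follows. It assumes that higher scores mean "same speaker" and uses a simple threshold sweep; the function name `equal_error_rate` is illustrative, and the official VoicePrivacy tooling may interpolate between operating points differently.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Estimate EER as the point where false-accept and false-reject rates meet.

    target_scores:    scores of same-speaker trials.
    nontarget_scores: scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    tgt = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is a candidate threshold.
    thresholds = np.unique(np.concatenate([tgt, non]))
    far = np.array([(non >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(tgt < t).mean() for t in thresholds])   # false rejects
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)
```

When the attacker separates the two score sets perfectly the estimate is 0; when the two sets are indistinguishable it is 0.5, which is the 50% chance level referred to above.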

⚠ Identity Leakage

Turn-Length Privacy Analysis


Figure 1: Speaker privacy (1 − Linkability) vs. dialogue turn count for discrete encoders (SALM-Duplex and Moshi); higher = better privacy. Red lines: no anonymization; green lines: post-anonymization. Without anonymization, both systems drop into the red zone within a few turns; anonymization lifts privacy into the green zone, though Moshi + W2W shows gradual degradation over dialogue length.
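The text does not spell out how Linkability is computed. Assuming it is the global linkability measure commonly used in speaker anonymization work (the expected value, over same-speaker scores, of the clipped posterior difference p(mated | s) − p(non-mated | s)), a histogram-based sketch looks as follows. The function name, the bin count, and the prior ratio `omega = 1` are assumptions made for illustration.

```python
import numpy as np


def global_linkability(mated, non_mated, n_bins=100, omega=1.0):
    """Histogram estimate of global linkability in [0, 1].

    mated:     scores of same-speaker pairs.
    non_mated: scores of different-speaker pairs.
    omega:     assumed prior ratio p(mated) / p(non-mated).
    The two score sets must not all share a single value.
    """
    mated = np.asarray(mated, dtype=float)
    non_mated = np.asarray(non_mated, dtype=float)
    lo = min(mated.min(), non_mated.min())
    hi = max(mated.max(), non_mated.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p_m, _ = np.histogram(mated, bins=edges, density=True)
    p_nm, _ = np.histogram(non_mated, bins=edges, density=True)
    width = edges[1] - edges[0]
    # Posterior probability that a score in each bin comes from a mated pair.
    num = omega * p_m
    den = num + p_nm
    post = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    # Local linkability: p(mated | s) - p(non-mated | s), clipped at zero.
    d_local = np.clip(2.0 * post - 1.0, 0.0, None)
    # Average the local measure over the mated score distribution.
    return float(np.sum(d_local * p_m * width))
```

Under this definition, the privacy value plotted in Figure 1 would be `1 - global_linkability(...)`: fully separable score distributions give a privacy of 0, and identical distributions give 1.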

Results

Privacy, Quality & Efficiency

                                  Privacy       Quality                                         Efficiency
Model        User Encoder  Anon.  EER↑   Lnk↓   sBLEU↑ S2T  sBLEU↑ S2S  sBERT↑ S2T  sBERT↑ S2S  RTFx↑  FRL↓  TTSR↑  Int.L.↓  ISR↑
Moshi        discrete      -      6.4    0.90   7.18        6.85        0.50        0.48        17     0.50  0.85   1.24     0.45
Moshi        discrete      W2W    36.9   0.35   5.20        5.00        0.39        0.38        -      0.72  0.72   1.31     0.38
SALM-Duplex  discrete      -      11.2   0.79   4.18        3.15        0.45        0.29        238    0.44  0.85   1.03     0.96
SALM-Duplex  discrete      W2F    41.0   0.23   3.57        1.49        0.39        0.20        -      0.50  0.68   1.18     0.93
SALM-Duplex  continuous    -      28.5   0.29   6.91        6.35        0.59        0.50        263    0.68  0.97   0.60     0.99
SALM-Duplex  continuous    W2W    34.6   0.24   6.46        5.49        0.55        0.45        -      0.80  0.94   0.57     0.98

Table 1: Privacy, dialogue quality, and efficiency on the VPC2024 evaluation set. Higher EER and lower Linkability = stronger privacy; EER = 50% = perfect anonymization. Best privacy values in bold. Quality: higher sBLEU / sBERT = better. RTFx = 1/RTF; FRL = First Response Latency (s); TTSR = Turn-Taking Success Rate; Int.L. = Interruption Latency (s); ISR = Interruption Success Rate.
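Because the discussion of Table 1 mixes absolute changes (percentage points) with relative changes (percent of the baseline), a short worked calculation may help. The sketch below uses only values copied from Table 1; the helper names `gain_pp` and `relative_change_pct` are illustrative.

```python
def gain_pp(before, after):
    """Absolute change, in percentage points when the inputs are percentages."""
    return after - before


def relative_change_pct(before, after):
    """Change relative to the baseline value, in percent."""
    return 100.0 * (after - before) / before


# (baseline, anonymized) pairs copied from Table 1.
eer = {
    "Moshi + W2W": (6.4, 36.9),
    "SALM-Duplex discrete + W2F": (11.2, 41.0),
    "SALM-Duplex continuous + W2W": (28.5, 34.6),
}
sbert_s2t = {
    "Moshi + W2W": (0.50, 0.39),
    "SALM-Duplex discrete + W2F": (0.45, 0.39),
    "SALM-Duplex continuous + W2W": (0.59, 0.55),
}

for name, (base, anon) in eer.items():
    print(f"{name}: EER {gain_pp(base, anon):+.1f} pp, "
          f"{relative_change_pct(base, anon):+.0f}% relative, "
          f"{anon / base:.2f}x baseline")

for name, (base, anon) in sbert_s2t.items():
    print(f"{name}: sBERT S2T {relative_change_pct(base, anon):+.0f}% relative")
```

The EER rows span roughly +21% to +477% relative, and the sBERT S2T rows roughly −7% to −22%, matching the ranges quoted for this table; the SALM-Duplex discrete + W2F ratio is about 3.7 times the baseline.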

⚠ Key Findings

  • Anon-W2F achieves the strongest privacy (41.0% EER, approaching 50% random chance). The gain comes entirely from anonymization, not the encoder swap: the discrete encoder without anonymization reaches only 11.2% EER.
  • Anon-W2W provides the largest absolute gains on the most exposed systems: +30.5 pp for Moshi vs. +6.1 pp for SALM-Duplex continuous.
  • Anonymization introduces moderate quality degradation (sBERT S2T drops 7–22% relative), but privacy gains consistently outweigh the cost, with EER improving 21–477% relative across setups.

Layer Analysis

Layer-wise EER (%)

System       Encoder     Anon.  Early  Mid    Late   All
Moshi        discrete    -      7.3    5.6    6.4    6.4
Moshi        discrete    W2W    42.5   37.6   35.2   36.9
SALM-Duplex  discrete    -      7.5    14.0   20.1   11.2
SALM-Duplex  discrete    W2F    43.8   40.5   40.1   41.0
SALM-Duplex  continuous  -      24.6   28.6   32.1   28.5
SALM-Duplex  continuous  W2W    31.5   33.7   35.3   34.6

Table 2: Higher = better privacy; 50% = chance level. All = mean-pooled over all layers. Moshi exhibits uniformly low EER across all layers (5.6–7.3%), while both SALM-Duplex variants show decreasing leakage from early to late layers, consistent with deeper layers progressively abstracting away speaker features.
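One way to read the Early / Mid / Late / All columns is that the probe's input is the average of the hidden states over a group of layers. The sketch below illustrates that reading with NumPy. Splitting the layer stack into equal thirds is an assumption made here for illustration (the text does not state the layer boundaries), and `pool_layers` is a hypothetical helper.

```python
import numpy as np


def pool_layers(hidden, group="all"):
    """Average hidden states over a group of layers.

    hidden: array of shape (num_layers, num_frames, dim).
    group:  "early", "mid", "late" (assumed: equal thirds of the stack) or "all".
    Returns an array of shape (num_frames, dim) to feed to the probe.
    """
    hidden = np.asarray(hidden, dtype=float)
    layer_ids = np.arange(hidden.shape[0])
    early, mid, late = np.array_split(layer_ids, 3)
    groups = {"early": early, "mid": mid, "late": late, "all": layer_ids}
    return hidden[groups[group]].mean(axis=0)


# Toy example: 6 layers, 2 frames, 3 dimensions.
toy = np.arange(6 * 2 * 3).reshape(6, 2, 3)
probe_input = pool_layers(toy, "all")  # shape (2, 3)
```

A separate probe would then be trained and scored on each pooled representation to obtain one EER per column.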

Demo

Video Demonstration

Video coming soon
A live demonstration of Moshi in full-duplex dialogue mode, with the privacy probe attempting speaker verification before and after W2W anonymization.