Dysarthric speech recognition system using self-supervised learning
Publisher
Universidade Federal de São Carlos
Abstract
This study develops and evaluates Automatic Speech Recognition (ASR) systems tailored to individuals with dysarthric speech, a condition that compromises communication clarity and limits the use of voice-based assistive technologies. One of the main challenges in dysarthric speech recognition is the scarcity of labeled data. To address it, two complementary and interdependent approaches were examined. The first investigated pathology-oriented data augmentation techniques applied to two Transformer-based processing pipelines, FW1 and FW2. Three signal perturbation methods (additive noise, time-stretching, and the proposed Spectral Occlusion, SO) were applied individually and in combination to recordings from the UA-Speech corpus. Exploratory analysis of Word Recognition Accuracy (WRA), Word Error Rate (WER), and Character Error Rate (CER) curves showed that combining noise and time-stretching consistently reduced errors for speakers with moderate intelligibility; SO provided additional gains in specific cases; and the union of all three perturbations benefited severe cases, although it could degrade nearly typical voices due to spectral overheating. The second approach, built upon the findings of the first, employed supervised pre-training on typical speech from the LJSpeech corpus. These baseline models were then subjected to a self-supervised contrastive cycle on the typical partition of UA-Speech, in which the same transformations from Phase 1 (noise, time-stretching, and SO) were reused to generate positive pairs for contrastive training. Thus, the augmentation strategies were not only validated in isolation but also served as the foundation for the contrastive stage. Two contrastive methods were evaluated: Simple Framework for Contrastive Learning of Visual Representations (SimCLR) and Swapping Assignments between Views (SwAV). The refined weights were then transferred for fine-tuning on dysarthric datasets.
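As an illustration of the three perturbations named above, the following is a minimal NumPy sketch, not the dissertation's implementation. The function names, parameters, and value ranges are assumptions. The abstract does not define Spectral Occlusion, so `spectral_occlusion` here simply zeroes a random band of frequency bins as a hypothetical stand-in. The time-stretch is a naive resampling that also shifts pitch, unlike a phase-vocoder stretch.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def time_stretch(signal, rate):
    """Naive stretch by linear-interpolation resampling.

    rate > 1 shortens the signal, rate < 1 lengthens it.
    Assumption: this simple version changes pitch as well as duration.
    """
    n_out = int(round(len(signal) / rate))
    positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(positions, np.arange(len(signal)), signal)


def spectral_occlusion(spec, max_width, rng):
    """Zero a random band of frequency bins in a (freq, time) spectrogram.

    Hypothetical stand-in for the proposed SO method, whose exact
    definition is not given in the abstract.
    """
    n_freq = spec.shape[0]
    width = int(rng.integers(1, max_width + 1))       # band height in bins
    start = int(rng.integers(0, n_freq - width + 1))  # first occluded bin
    occluded = spec.copy()
    occluded[start:start + width, :] = 0.0
    return occluded


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    clean = np.sin(2.0 * np.pi * 220.0 * t)           # 1 s synthetic tone
    view_a = add_noise(clean, snr_db=10.0, rng=rng)
    view_b = time_stretch(clean, rate=0.9)
    print(view_a.shape, view_b.shape)
```

In a contrastive setup of the kind described, two such perturbed views of the same recording would be treated as a positive pair.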
Compared to the baseline trained solely on LJSpeech, both contrastive methods enhanced performance: SimCLR showed higher sensitivity to severe speakers, while SwAV maintained stable performance across all intelligibility levels, further reducing CER and WER and increasing WRA. In some severe cases, WER was reduced by more than 20 percentage points. In summary, the integration of targeted data augmentation and contrastive pre-training resulted in ASR models more robust to the articulatory variability of dysarthria, supporting the inclusion of dysarthric speakers in voice-based communication systems.
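For readers unfamiliar with the reported metrics, the sketch below shows how WER and CER are conventionally computed from the Levenshtein edit distance, normalized by the reference length. It is a generic illustration, not the evaluation code used in the dissertation. The `wra` helper assumes that WRA is the fraction of isolated-word utterances recognized exactly, which is an assumption about the metric's definition here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wra(references, hypotheses):
    """Assumed definition: share of isolated words recognized exactly."""
    if not references or len(references) != len(hypotheses):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(r == h for r, h in zip(references, hypotheses))
    return correct / len(references)


if __name__ == "__main__":
    print(wer("open the door", "open a door"))   # one substitution in three words
    print(cer("door", "dour"))                   # one substitution in four characters
    print(wra(["yes", "no"], ["yes", "now"]))    # one of two words correct
```

Under these definitions, a reduction of 20 percentage points in WER corresponds to, for example, a drop from 0.60 to 0.40.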
Citation
GRACELLI, Ricardo Alexandre. Sistema de reconhecimento de fala disártrica usando aprendizagem autossupervisionada. 2025. Dissertation (Master's in Computer Science) – Universidade Federal de São Carlos, Sorocaba, 2025. Available at: https://repositorio.ufscar.br/handle/20.500.14289/22760.
Creative Commons License
Except where otherwise noted, this item's license is described as Attribution-NonCommercial 3.0 Brazil
