Dysarthric speech recognition system using self-supervised learning

Gracelli, Ricardo Alexandre

Files

Dissertação de Mestrado_Ricardo_Gracelli.pdf (13.98 MB)

Date

2025-09-08

Authors

Gracelli, Ricardo Alexandre

Publisher

Universidade Federal de São Carlos

Abstract

This study develops and evaluates Automatic Speech Recognition (ASR) systems tailored to individuals with dysarthric speech, a condition that compromises communication clarity and limits the use of voice-based assistive technologies. A main challenge in dysarthric speech recognition is the scarcity of labeled data; to address it, two complementary and interdependent approaches were examined. The first investigated pathology-oriented data augmentation techniques applied to two Transformer-based processing pipelines, FW1 and FW2. Three signal perturbation methods (additive noise, time-stretching, and the proposed Spectral Occlusion, SO) were applied individually and in combination to recordings from the UA-Speech corpus. Exploratory analysis of Word Recognition Accuracy (WRA), Word Error Rate (WER), and Character Error Rate (CER) curves showed that combining noise and time-stretching consistently reduced errors for speakers with moderate intelligibility; SO provided additional gains in specific cases; and the union of all three perturbations benefited severe cases, although it could degrade nearly typical voices due to spectral overheating. The second approach, built on the findings of the first, employed supervised pre-training on typical speech from the LJSpeech corpus. These baseline models then underwent a self-supervised contrastive cycle on the typical partition of UA-Speech, in which the same Phase 1 transformations (noise, time-stretching, and SO) were reused to generate positive pairs for contrastive training. The augmentation strategies were thus not only validated in isolation but also served as the foundation for the contrastive stage. Two methods were evaluated: Simple Framework for Contrastive Learning of Visual Representations (SimCLR) and Swapping Assignments between Views (SwAV). The refined weights were then transferred for fine-tuning on dysarthric datasets.
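As a rough illustration of how the three perturbations could produce a positive pair for contrastive training, the sketch below builds two randomly augmented views of one signal with NumPy. It is not the dissertation's implementation: every function name and parameter value here is hypothetical, the time-stretch is replaced by naive resampling (which also shifts pitch), and the occlusion step is assumed to be a SpecAugment-style zeroing of a frequency band, which may differ from the proposed SO method.

```python
import numpy as np

def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def speed_perturb(wave, rate):
    """Naive stand-in for time-stretching: resample by linear interpolation.

    rate > 1 shortens the signal. Unlike a phase-vocoder stretch,
    this also shifts pitch.
    """
    n_out = int(round(len(wave) / rate))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, np.arange(len(wave)), wave)

def magnitude_spectrogram(wave, n_fft=256, hop=128):
    """Magnitude STFT with a Hann window, shape (n_fft // 2 + 1, frames)."""
    window = np.hanning(n_fft)
    starts = range(0, len(wave) - n_fft + 1, hop)
    frames = np.stack([wave[i:i + n_fft] * window for i in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def occlude_band(spec, start_bin, width):
    """Zero a contiguous band of frequency bins (assumed form of occlusion)."""
    out = spec.copy()
    out[start_bin:start_bin + width, :] = 0.0
    return out

def make_view(wave, rng, band_width=8):
    """Apply all three perturbations with randomly drawn (hypothetical) settings."""
    noisy = add_noise(wave, snr_db=rng.uniform(10, 20), rng=rng)
    stretched = speed_perturb(noisy, rate=rng.uniform(0.9, 1.1))
    spec = magnitude_spectrogram(stretched)
    start = int(rng.integers(0, spec.shape[0] - band_width))
    return occlude_band(spec, start, band_width)

# Two independent views of the same utterance form one positive pair.
rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # synthetic 1 s tone
view_a, view_b = make_view(wave, rng), make_view(wave, rng)
```

In a SimCLR-style setup, an encoder would map `view_a` and `view_b` to embeddings that the contrastive loss pulls together, while views of other utterances in the batch act as negatives.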
Compared to the baseline trained solely on LJSpeech, both contrastive methods enhanced performance: SimCLR showed higher sensitivity to severe speakers, while SwAV maintained stable performance across all intelligibility levels, further reducing CER and WER and increasing WRA. In some severe cases, WER was reduced by more than 20 percentage points. In summary, the integration of targeted data augmentation and contrastive pre-training resulted in ASR models more robust to the articulatory variability of dysarthria, supporting the inclusion of dysarthric speakers in voice-based communication systems.
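For readers unfamiliar with the three metrics, the following minimal sketch shows one common way to compute them. WER and CER are taken as Levenshtein edit distance over words and characters, normalized by reference length; WRA is assumed here to be the fraction of isolated-word utterances recognized exactly, which suits a word-level corpus but may not match the dissertation's exact definition. The function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wra(references, hypotheses):
    """Word Recognition Accuracy: share of utterances matched exactly (assumed definition)."""
    correct = sum(r == h for r, h in zip(references, hypotheses))
    return correct / len(references)
```

A reduction "by 20 percentage points" refers to the absolute difference between two such rates, not a relative change.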

Keywords

Dysarthria, Speech Recognition, Computer Vision, Deep Learning, Self-Supervised, Contrastive

Citation

GRACELLI, Ricardo Alexandre. Sistema de reconhecimento de fala disártrica usando aprendizagem autossupervisionada. 2025. Dissertation (Master's in Computer Science) – Universidade Federal de São Carlos, Sorocaba, 2025. Available at: https://repositorio.ufscar.br/handle/20.500.14289/22760.

URI

https://hdl.handle.net/20.500.14289/22760

Collections

Theses and Dissertations

Creative Commons License

Except where otherwise noted, this item's license is described as Attribution-NonCommercial 3.0 Brazil
