Source Record
| Title | 以語音情緒辨識增強深度偽造音訊檢測 = Enhancing Deepfake Audio Detection with Speech Emotion Recognition |
|---|---|
| Author(s) | 王智弘; 張明桑; 林曾祥 |
| Journal | 資訊、科技與社會學報 |
| Volume/Issue | 24, 2024 [ROC 113] |
| Pages | (2)1-(2)23 |
| Classification | 312.831 |
| Keywords | Deepfake audio detection; Speech emotion recognition; Transfer learning |
| Language | Chinese |
| Abstract | With advances in information technology, falling hardware costs and steadily improving software have lowered the barriers to deep learning and made it widely accessible. However, unscrupulous actors exploit deepfake technology to forge audio, producing and spreading fabricated audiovisual clips whose impact on people's lives makes deepfake audio detection increasingly important. At the same time, deep learning has driven rapid progress in speech emotion recognition, allowing researchers to extract emotional features from audio with neural networks and use them for emotion classification tasks. This study builds two systems: a baseline system and an enhanced system. The baseline system converts deepfake audio datasets, the English ASVSPOOF (Automatic Speaker Verification Spoofing and Countermeasures Challenge) dataset and the Chinese CFAD (A Chinese Dataset for Fake Audio Detection) dataset, into Mel-Frequency Cepstral Coefficients (MFCCs), feeds them to an XGBoost (eXtreme Gradient Boosting) classifier, and records the resulting performance. The enhanced system first trains a speech emotion recognition model by transfer learning from the EfficientNet model, fine-tuned on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS); this model is then integrated into the baseline by concatenating the MFCCs with the extracted emotional feature vectors before classification with the XGBoost classifier, and performance is again recorded. Experimental results show that the enhanced system outperforms the baseline across all evaluation metrics, indicating that adding a speech emotion recognition system strengthens deepfake audio detection systems. |
The abstract information in this system is taken from the content published in each article.
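
The record above describes two pipelines: a baseline that classifies MFCC features with XGBoost, and an enhanced variant that concatenates those MFCCs with emotion embeddings from an EfficientNet-based speech emotion recognizer before classification. The Python sketch below illustrates that flow. It is a minimal sketch under stated assumptions, not the authors' implementation: the 13-coefficient mean pooling, the EfficientNet-B0 backbone with ImageNet weights, the rendering of the log-mel spectrogram as a three-channel 224x224 image, and all helper names are assumptions, and the RAVDESS fine-tuning step is omitted.

```python
# Hedged sketch of the baseline and enhanced pipelines described in the
# abstract. Assumptions (not given in the record): mean-pooled 13-dim MFCCs,
# EfficientNet-B0 with ImageNet weights standing in for the fine-tuned SER
# model, and a 224x224 three-channel log-mel "image" as its input.
import numpy as np
import librosa
import torch
import torchvision.models as models
from xgboost import XGBClassifier

SR = 16000  # assumed sampling rate

def mfcc_vector(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Baseline features: MFCCs averaged over time into one fixed-length vector."""
    y, sr = librosa.load(path, sr=SR)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return mfcc.mean(axis=1)

# Emotion backbone: EfficientNet-B0 with its classification head replaced by
# Identity, so a forward pass yields the 1280-dim embedding used here as the
# "emotion feature vector". The paper fine-tunes on RAVDESS; omitted here.
emo_net = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
emo_net.classifier = torch.nn.Identity()
emo_net.eval()

def emotion_vector(path: str) -> np.ndarray:
    """Enhanced features: embedding of the log-mel spectrogram from the SER net."""
    y, sr = librosa.load(path, sr=SR)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
    img = torch.from_numpy(mel).float()[None, None]              # (1, 1, F, T)
    img = torch.nn.functional.interpolate(img, size=(224, 224))  # resize
    img = img.repeat(1, 3, 1, 1)                                 # grey -> 3 channels
    with torch.no_grad():
        return emo_net(img).squeeze(0).numpy()                   # (1280,)

def feature_matrix(paths: list[str], enhanced: bool) -> np.ndarray:
    """Stack per-file features; the enhanced system concatenates both vectors."""
    rows = []
    for p in paths:
        v = mfcc_vector(p)
        if enhanced:
            v = np.concatenate([v, emotion_vector(p)])
        rows.append(v)
    return np.stack(rows)

# Training and evaluation on ASVSPOOF or CFAD (paths/labels supplied by the user):
# clf = XGBClassifier().fit(feature_matrix(train_paths, enhanced=True), train_labels)
# predictions = clf.predict(feature_matrix(test_paths, enhanced=True))
```

Concatenation keeps the two feature sources independent and lets the gradient-boosted trees weigh spectral and emotional cues separately, which matches the record's description of adding the extracted emotion vectors to the MFCCs rather than replacing them.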