Dataset Card for covost2

Dataset Summary

CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English
and from English into 15 languages. The dataset is created using Mozilla's open-source Common Voice database of
crowdsourced voice recordings. There are 2,900 hours of speech represented in the corpus.

Supported Tasks and Leaderboards

speech-translation: The dataset can be used for speech-to-text translation (ST). The model is presented with an audio file in one language and asked to produce written text in another language, that is, to translate the speech rather than only transcribe it. The most common evaluation metric is the BLEU score. Examples can be found at https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/covost_example.md .
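
To make the metric concrete, here is a minimal sketch of an unsmoothed corpus-level BLEU score. It assumes whitespace tokenization and a single reference per hypothesis, and the function names `ngrams` and `corpus_bleu` are illustrative only. It is not the scoring used in the fairseq example; published results are normally computed with an established tool, which applies its own tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on whitespace tokens, scaled to 0..100."""
    matches = [0] * max_n  # clipped n-gram matches, one slot per order
    totals = [0] * max_n   # hypothesis n-gram counts, one slot per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output that is shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


reference = "Wenn Wasser knapp ist, verschwenden Sie es nicht."
print(corpus_bleu([reference], [reference]))  # 100.0 for an exact match
```

Because the sketch applies no smoothing, a short hypothesis with no matching 4-gram scores 0, which is one reason BLEU is usually reported over a whole test set rather than per sentence.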

Languages

The dataset contains audio, transcriptions, and translations in the following languages: French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, Chinese, Welsh, Catalan, Slovenian, Estonian, Indonesian, Arabic, Tamil, Portuguese, Latvian, and Japanese.

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file (file), its transcription in the source language (sentence), and its translation into the target language (translation).

{'client_id': 'd277a1f3904ae00b09b73122b87674e7c2c78e08120721f37b5577013ead08d1ea0c053ca5b5c2fb948df2c81f27179aef2c741057a17249205d251a8fe0e658',
 'file': '/home/suraj/projects/fairseq_s2t/covst/dataset/en/clips/common_voice_en_18540003.mp3',
 'audio': {'path': '/home/suraj/projects/fairseq_s2t/covst/dataset/en/clips/common_voice_en_18540003.mp3',
           'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
           'sampling_rate': 48000},
 'id': 'common_voice_en_18540003',
 'sentence': 'When water is scarce, avoid wasting it.',
 'translation': 'Wenn Wasser knapp ist, verschwenden Sie es nicht.'}
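
As a small illustration of how the audio entry can be used, the sketch below derives the clip length from the decoded array and the sampling rate. The `sample` dictionary is a hypothetical stand-in with the same keys as the instance above and a dummy array of zeros; `clip_duration_seconds` is an illustrative helper, not part of any library.

```python
def clip_duration_seconds(audio):
    """Clip length in seconds: number of samples divided by samples per second."""
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical instance: same keys as the example above, dummy audio samples.
sample = {
    "audio": {"path": "clip.mp3", "array": [0.0] * 96000, "sampling_rate": 48000},
    "sentence": "When water is scarce, avoid wasting it.",
    "translation": "Wenn Wasser knapp ist, verschwenden Sie es nicht.",
}

print(clip_duration_seconds(sample["audio"]))  # 2.0
```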

Data Fields

file: A path to the downloaded audio file in .mp3 format.
audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column, as in dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].
sentence: The transcription of the audio file in source language.
translation: The translation of the transcription into the target language.
id: unique id of the data sample.
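
The access-order advice for the audio field can be illustrated with a toy model. `ToyAudioDataset` below is a hypothetical stand-in, not the real dataset class: it only mimics the behaviour described above, where indexing a row decodes one file and selecting the whole column decodes every file, and it counts decode calls instead of reading MP3 data.

```python
class ToyAudioDataset:
    """Hypothetical stand-in for a dataset with a lazily decoded audio column."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # number of files decoded so far

    def _decode(self, path):
        # Placeholder for real MP3 decoding and resampling.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 48000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested sample.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every file is decoded before indexing is possible.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


ds = ToyAudioDataset(f"clip_{i}.mp3" for i in range(1000))

_ = ds[0]["audio"]
row_first = ds.decode_calls     # one file decoded

ds.decode_calls = 0
_ = ds["audio"][0]
column_first = ds.decode_calls  # all 1000 files decoded

print(row_first, column_first)  # 1 1000
```

Both expressions return the same sample; only the amount of decoding work differs.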

Data Splits

config	train	validation	test
en_de	289430	15531	15531
en_tr	289430	15531	15531
en_fa	289430	15531	15531
en_sv-SE	289430	15531	15531
en_mn	289430	15531	15531
en_zh-CN	289430	15531	15531
en_cy	289430	15531	15531
en_ca	289430	15531	15531
en_sl	289430	15531	15531
en_et	289430	15531	15531
en_id	289430	15531	15531
en_ar	289430	15531	15531
en_ta	289430	15531	15531
en_lv	289430	15531	15531
en_ja	289430	15531	15531
fr_en	207374	14760	14760
de_en	127834	13511	13511
es_en	79015	13221	13221
ca_en	95854	12730	12730
it_en	31698	8940	8951
ru_en	12112	6110	6300
zh-CN_en	7085	4843	4898
pt_en	9158	3318	4023
fa_en	53949	3445	3445
et_en	1782	1576	1571
mn_en	2067	1761	1759
nl_en	7108	1699	1699
tr_en	3966	1624	1629
ar_en	2283	1758	1695
sv-SE_en	2160	1349	1595
lv_en	2337	1125	1629
sl_en	1843	509	360
ta_en	1358	384	786
ja_en	1119	635	684
id_en	1243	792	844
cy_en	1241	690	690
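
The config names in the table follow a source_target pattern. As a sketch, the helper below rebuilds the full list from the language codes that appear in the table and rejects pairs that are not listed; `config_name` and the two constants are illustrative names, not part of the dataset loader.

```python
# Language codes copied from the config names in the table above.
XX_TO_EN = ["fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et", "mn",
            "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy"]
EN_TO_XX = ["de", "tr", "fa", "sv-SE", "mn", "zh-CN", "cy", "ca", "sl", "et",
            "id", "ar", "ta", "lv", "ja"]

ALL_CONFIGS = [f"en_{t}" for t in EN_TO_XX] + [f"{s}_en" for s in XX_TO_EN]


def config_name(source, target):
    """Return the config name for a language pair, or raise if it is not listed."""
    name = f"{source}_{target}"
    if name not in ALL_CONFIGS:
        raise ValueError(f"no config for {source} -> {target}")
    return name


print(len(ALL_CONFIGS))          # 36 configs: 15 from English, 21 into English
print(config_name("nl", "en"))   # nl_en
```

Note that the two directions are not symmetric: Dutch, French, Spanish, Italian, Russian, and Portuguese appear only as source languages, so a pair such as English to Dutch raises `ValueError` here.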

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

CC BY-NC 4.0

Citation Information

@misc{wang2020covost,
    title={CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus},
    author={Changhan Wang and Anne Wu and Juan Pino},
    year={2020},
    eprint={2007.10310},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Contributions

Thanks to @patil-suraj for adding this dataset.

Models trained or fine-tuned on covost2

facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur

facebook/s2t-wav2vec2-large-en-de

facebook/wav2vec2-xls-r-1b-21-to-en

facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_fr_css10

facebook/wav2vec2-xls-r-1b-en-to-15

facebook/wav2vec2-xls-r-300m-en-to-15