Dataset

A dataset comprising video words for each of the ten scripts will be provided. The dataset contains at least 1000 words (in colour) for each script, extracted from various sources (news, sports, etc.). The dataset is divided into three parts:


  1. Training set (60%)
  2. Validation set (10%)
  3. Test set (30%)
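
The 60/10/30 proportions above can be illustrated with a small sketch. It is not the organisers' procedure, which is not described here: the helper name split_by_script, the (path, script) input format, the file names, the fixed seed, and the choice to split each script separately are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_script(samples, seed=0):
    """Split (path, script) pairs 60/10/30 within each script.

    Hypothetical helper for illustration only; the official splits are
    distributed by the organisers and should be used as provided.
    """
    # Group the word images by their script label.
    by_script = defaultdict(list)
    for path, script in samples:
        by_script[script].append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "validation": [], "test": []}
    for script in sorted(by_script):
        paths = sorted(by_script[script])
        rng.shuffle(paths)
        n = len(paths)
        n_train = n * 60 // 100  # 60% for training
        n_val = n * 10 // 100    # 10% for validation
        # Whatever remains (about 30%) goes to the test set.
        splits["train"] += [(p, script) for p in paths[:n_train]]
        splits["validation"] += [(p, script) for p in paths[n_train:n_train + n_val]]
        splits["test"] += [(p, script) for p in paths[n_train + n_val:]]
    return splits


# Example with made-up file names: 1000 words of a single script.
demo = [(f"word_{i:04d}.png", "English") for i in range(1000)]
parts = split_by_script(demo)
print({name: len(items) for name, items in parts.items()})
# prints {'train': 600, 'validation': 100, 'test': 300}
```

Splitting within each script keeps the ten scripts equally represented in every part, which is one reasonable reading of the proportions above rather than a documented property of the released data.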

Sample Dataset



Train Dataset:

The link to download the training dataset will be sent to registered participants.



Validation Dataset:

The link to download the validation dataset will be sent to registered participants.



Test Dataset:

The link to download the test dataset will be sent to registered participants.



Dataset Usage:

The CVSI 2015 dataset contains video words in 10 different scripts/languages, namely English, Hindi (Devanagari), Bengali (Bangla), Oriya, Gujarati, Punjabi, Kannada, Tamil, Telugu, and Arabic.


The dataset is made publicly available ONLY for RESEARCH PURPOSES. Use of the dataset for commercial purposes or applications is not allowed. Any use of the dataset requires citing the following:


Nabin Sharma, Ranju Mandal, Rabi Sharma, Umapada Pal, and Michael Blumenstein, "ICDAR2015 Competition on Video Script Identification (CVSI 2015)", In Proc. 13th International Conference on Document Analysis and Recognition (ICDAR 2015), pp. 1196-1200, 2015.