Figure: Structure of the Published TIMIT Corpus: The CD-ROM contains doc, train, and test directories at the top level; the train and test directories both have 8 subdirectories, one per dialect region; each of these contains further subdirectories, one per speaker; the contents of the directory for female speaker …
A fourth feature of TIMIT is the hierarchical structure of the corpus.
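The hierarchy described in the caption can be made concrete with a short sketch that splits a relative file path into its levels. The path `train/dr1/fxyz0/sa1.wav`, the speaker ID `fxyz0`, and the helper name `parse_timit_path` are hypothetical, chosen only for illustration. The sketch also assumes that the first letter of a speaker directory name encodes the speaker's sex; check this against the corpus documentation before relying on it.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class UtterancePath(NamedTuple):
    split: str           # "train" or "test"
    dialect_region: str  # e.g. "dr1"
    speaker: str         # speaker directory name
    sex: str             # assumed: first letter of the speaker ID
    utterance: str       # file name without extension
    kind: str            # file extension, e.g. "wav"


def parse_timit_path(path: str) -> UtterancePath:
    """Split a relative path of the form split/region/speaker/file.ext."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected split/region/speaker/file, got {path!r}")
    split, region, speaker, filename = parts
    stem, _, ext = filename.partition(".")
    return UtterancePath(split, region, speaker, speaker[0], stem, ext)


# Hypothetical path, used only to show the four levels of the hierarchy.
example = parse_timit_path("train/dr1/fxyz0/sa1.wav")
print(example)
```

Because each level of the directory tree corresponds to one field, grouping utterances by region or by speaker reduces to grouping on a tuple field.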
The inclusion of speaker demographics brings in many more independent variables that may help to account for variation in the data, and that facilitate later uses of the corpus for purposes that were not envisaged when the corpus was created, such as sociolinguistics.
A third property is that there is a sharp division between the original linguistic event captured as an audio recording, and the annotations of that event.
The same holds true of text corpora, in the sense that the original text usually has an external source, and is considered to be an immutable artifact.
Any transformations of that artifact which involve human judgment, even something as simple as tokenization, are subject to later revision; thus it is important to retain the source material in a form that is as close to the original as possible.
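One way to keep the source intact while still recording a tokenization is to store the tokens as character offsets into the unmodified text. The following is a minimal sketch of that idea, not a method prescribed by the corpus. The regular expression is a deliberately crude tokenizer, and the sample sentence and the name `tokenize_spans` are assumptions made for illustration.

```python
import re

# The source text is never modified; annotations only point into it.
source = "Don't lose the original."


def tokenize_spans(text):
    """Return (start, end) character offsets for a crude tokenization."""
    # Runs of word characters, or any single non-space symbol.
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


spans = tokenize_spans(source)
tokens = [source[start:end] for start, end in spans]
print(spans)
print(tokens)
```

If the tokenization is later revised, only the list of offsets changes. The source string stays as it was, so earlier and later analyses can be compared against the same original.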
As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.
The TIMIT corpus of read speech was the first annotated speech database to be widely distributed, and it has an especially clear organization. Despite its complexity, the TIMIT corpus only contains two fundamental data types, namely lexicons and texts.
As we saw in 2., most lexical resources can be represented using a record structure. A lexical resource could be a conventional dictionary or comparative wordlist, as illustrated. A text may come with annotations such as part-of-speech tags, morphological analysis, discourse structure, and so forth. As we saw in the IOB tagging technique (7.), it is possible to represent higher-level constituents using tags on individual words.
Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced selection of dialects, speakers, and materials.
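The point about IOB tags can be illustrated with a small sketch that recovers higher-level constituents from tags on individual words. The function name `iob_to_chunks` and the sample sentence are assumptions for illustration; the tag scheme assumed here is `B-X` for the first word of a chunk of type `X`, `I-X` for a word inside it, and `O` for a word outside any chunk.

```python
def iob_to_chunks(tagged):
    """Group (word, IOB-tag) pairs into (chunk_type, [words]) tuples."""
    chunks = []
    current_type, current_words = None, []
    for word, tag in tagged:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open chunk, then begin a new one.
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            # Continue the open chunk of the same type.
            current_words.append(word)
        else:
            # "O": the word is outside any chunk, so close the open one.
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


tagged = [
    ("We", "B-NP"),
    ("saw", "O"),
    ("the", "B-NP"),
    ("yellow", "I-NP"),
    ("dog", "I-NP"),
]
chunks = iob_to_chunks(tagged)
print(chunks)
```

The flat list of word and tag pairs carries the same information as a shallow tree, which is why a text with constituent annotations can still be stored as a simple sequence.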