Data Management and Archiving: Beginning and Intermediate Level

Nick Thieberger

https://web.archive.org/web/20100615073452/http://logos.uoregon.edu/infield2010/workshops/data-management-archiving/index.php

Course Information

This course focuses on creating good data from linguistic fieldwork. The basic principle is to create data once, in an appropriate format, so that it can be reused many times. From the recording through analysis to the archive and community-focused outputs, how can we keep track of what we have done and what stage of processing each item has reached? How can we transform data from the output of one tool into the input of another? What tools and processes can we use, and what does each of them do? This course will contextualise some of the other courses at InField, showing how the various tools being taught fit into a workflow and stressing the importance of allowing the underlying data to flow between tools and then into an archive. Topics to be covered in this course include:

Metadata

  • File-naming conventions

  • Databases for tracking metadata

  • How much metadata is enough?

  • Available linguistic metadata sets and how to select among them
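As a rough illustration of how a file-naming convention can carry metadata, the sketch below checks file names against a made-up pattern and extracts a small metadata record from each conforming name. The pattern (a three-letter language code, a recording date, an item number, and an extension), the function name parse_filename, and the sample names are all assumptions chosen for this example; they are not a convention prescribed by the course.

```python
import re
from datetime import datetime

# Hypothetical convention for this example only: LANG-YYYYMMDD-NN.ext
FILENAME_PATTERN = re.compile(
    r"(?P<lang>[A-Z]{3})-(?P<date>\d{8})-(?P<item>\d{2})\.(?P<ext>wav|eaf|txt)"
)


def parse_filename(name):
    """Return a metadata dict for a conforming file name, or None otherwise."""
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    try:
        # Reject strings that look like dates but are not real calendar dates.
        recorded = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        return None
    return {
        "language": match.group("lang"),
        "recorded": recorded.isoformat(),
        "item": int(match.group("item")),
        "format": match.group("ext"),
    }


if __name__ == "__main__":
    for name in ["ERK-20080315-01.wav", "tape3 final.wav"]:
        print(name, "->", parse_filename(name))
```

A record like this could be loaded into whatever database is used to track metadata, so that the file name and the catalogue entry never drift apart.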

The linguistic fieldwork workflow

  • How each tool fits into the workflow (transcription, annotation, corpus development, lexicon, interlinear glossed texts, data conversion using regular expressions)
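Data conversion with regular expressions can be sketched as follows. The example assumes a lexical record in which each field sits on its own line and begins with a backslash marker, and rewrites it as simple XML. The marker names (lx, ps, ge), the placeholder values, the entry element, and the function name record_to_xml are illustrative assumptions, not the format of any particular tool taught in the course.

```python
import re
from xml.sax.saxutils import escape

# One field per line: a backslash, a marker name, whitespace, then the value.
FIELD = re.compile(r"^\\(?P<marker>\w+)\s+(?P<value>.*)$")


def record_to_xml(record):
    """Convert one backslash-coded record into a simple XML <entry> element."""
    fields = []
    for line in record.strip().splitlines():
        match = FIELD.match(line.strip())
        if match:
            # Escape &, < and > so the output stays well-formed XML.
            fields.append(
                "  <{0}>{1}</{0}>".format(
                    match.group("marker"), escape(match.group("value"))
                )
            )
    return "<entry>\n" + "\n".join(fields) + "\n</entry>"


if __name__ == "__main__":
    sample = "\\lx exampleword\n\\ps n\n\\ge example gloss"
    print(record_to_xml(sample))
```

Lines that do not match the pattern are silently skipped here; in real use it would be safer to report them, since unmatched lines usually point to inconsistencies in the source data.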

Processes and current tools

  • Transcription with time-alignment

  • Annotation

  • Lexical database

  • Interlinear text production

What is 'well-formed data'?

  • Distinguishing the form and content of data to allow multiple outputs from the same underlying data
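The distinction between form and content can be illustrated with a minimal sketch: one structured record of a glossed sentence (the content) is rendered in two different forms, aligned plain text and an HTML table. The record layout, the placeholder words and glosses, and the function names to_plain_text and to_html are assumptions made for this example.

```python
from html import escape

# Content: one glossed sentence stored without any layout information.
sentence = {
    "words": ["word1", "word2"],
    "glosses": ["GLOSS1", "GLOSS2"],
    "translation": "free translation",
}


def to_plain_text(s):
    """Render the sentence as column-aligned interlinear plain text."""
    widths = [max(len(w), len(g)) for w, g in zip(s["words"], s["glosses"])]
    top = "  ".join(w.ljust(n) for w, n in zip(s["words"], widths)).rstrip()
    bottom = "  ".join(g.ljust(n) for g, n in zip(s["glosses"], widths)).rstrip()
    return "{}\n{}\n'{}'".format(top, bottom, s["translation"])


def to_html(s):
    """Render the same sentence as an HTML table plus a translation line."""
    words = "".join("<td>{}</td>".format(escape(w)) for w in s["words"])
    glosses = "".join("<td>{}</td>".format(escape(g)) for g in s["glosses"])
    return "<table><tr>{}</tr><tr>{}</tr></table><p>{}</p>".format(
        words, glosses, escape(s["translation"])
    )


if __name__ == "__main__":
    print(to_plain_text(sentence))
    print(to_html(sentence))
```

Because the layout lives only in the two rendering functions, a correction made once in the underlying record appears in every output.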

Language documentation requires archives

  • Presentation of key linguistic archives, what they offer and how to use them

  • What is required to establish a digital archive (if there is interest)

  • Institutional repositories for data and for research publications (the Open Access movement)

Mobilization

  • Multiple possible outputs from well-formed data

Instructor Bio

Nick Thieberger works with Warnman, an Indigenous language from Western Australia, and South Efate, a language from central Vanuatu. For South Efate he developed a method for citing archival recordings created during fieldwork, presenting a DVD of playable example sentences and texts in the language together with the published grammar. In 2003 he helped establish the Pacific and Regional Archive for Digital Sources in Endangered Cultures (paradisec.org.au), and he continues as project officer with this multi-institutional archiving project, which holds 4.4 TB of data, including 2,440 hours of digitised audio files. He leads a team that is building EOPAS, an online database for presenting interlinear glossed text with media. In 2008 he established Kaipuleohone, the linguistic archive at the University of Hawai'i. He is interested in developments in e-humanities methods and their potential to improve research practice, and he is now developing methods for creating reusable data sets from fieldwork on previously unrecorded languages. He is the technology editor for the journal Language Documentation and Conservation. He is an Australian Research Council QEII Fellow at the University of Melbourne and an Assistant Professor in the Department of Linguistics at the University of Hawai'i at Mānoa.
