Data Management and Archiving: Beginning and Intermediate Level
Nick Thieberger
Course Information
This course will focus on the creation of good data from linguistic fieldwork. The basic principle advocated is to create data once, in an appropriate format, so that it can be reused many times. From recording through analysis to the archive and community-focused outputs, how can we keep track of what we have done and what stage of processing each item has reached? How can we transform data from the output of one tool to the input of another? What tools and processes can we use, and what does each of them do? This course will contextualise some of the other courses at Infield, showing how the various tools being taught fit into a workflow and stressing the importance of allowing the underlying data to flow between tools and then into an archive. Topics to be covered in this course include:
Metadata
File-naming conventions
Databases for tracking metadata
How much metadata is enough?
Available linguistic metadata sets and how to select among them
The linguistic fieldwork workflow
How each tool fits into the workflow (transcription, annotation, corpus development, lexicon, interlinear glossed texts, data conversion using regular expressions)
Processes and current tools
Transcription with time-alignment
Annotation
Lexical database
Interlinear text production
What is 'well-formed data'?
Distinguishing the form and content of data to allow multiple outputs from the same underlying data
Language documentation requires archives
Presentation of key linguistic archives: what they offer and how to use them
What is required to establish a digital archive (if there is interest)
Institutional repositories for data and for research publications (the Open Access movement)
Mobilization
Multiple possible outputs from well-formed data
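As a small illustration of two of the topics above (data conversion using regular expressions, and multiple outputs from the same underlying data), the following Python sketch may be useful. It is not taken from the course materials. It assumes a lexicon stored as backslash-coded fields, with the hypothetical markers \lx (headword), \ps (part of speech) and \ge (gloss), and it uses placeholder words rather than real language data. The function names are likewise invented for this example.

```python
import re

# Hypothetical sample lexicon: the field markers and the words are
# placeholders for illustration, not real language data.
RAW = r"""\lx tala
\ps n
\ge house

\lx niko
\ps n
\ge fish
"""

# One field per line: a backslash, a marker name, whitespace, then the value.
FIELD = re.compile(r"^\\(\w+)\s+(.*)$")


def parse_records(text):
    """Turn blank-line-separated records into dicts of marker -> value."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        record = {}
        for line in block.splitlines():
            match = FIELD.match(line)
            if match:
                record[match.group(1)] = match.group(2).strip()
        if record:
            records.append(record)
    return records


def to_tsv(records, fields=("lx", "ps", "ge")):
    """Output 1: a tab-separated table, e.g. for a spreadsheet or database."""
    rows = ["\t".join(fields)]
    for record in records:
        rows.append("\t".join(record.get(f, "") for f in fields))
    return "\n".join(rows)


def to_wordlist(records):
    """Output 2: a simple readable wordlist built from the same records."""
    return "\n".join(
        f"{r.get('lx', '')} ({r.get('ps', '')}): {r.get('ge', '')}"
        for r in records
    )


records = parse_records(RAW)
print(to_tsv(records))
print(to_wordlist(records))
```

The content is parsed once into a neutral structure, and each presentation is a separate function over that structure, so adding a further output does not require touching the source data.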
Instructor(s) Bio
Nick Thieberger works with Warnman, an Indigenous language from Western Australia, and South Efate, a language from central Vanuatu. For South Efate he developed a method for citing archival recordings created during fieldwork, presenting a DVD of playable example sentences and texts in the language together with the published grammar. In 2003 he helped establish the Pacific and Regional Archive for Digital Sources in Endangered Cultures (paradisec.org.au), and he continues as project officer for this multi-institutional archiving project, which holds 4.4 TB of data, including 2,440 hours of digitised audio files. He leads a team that is building EOPAS, an online database for presenting interlinear glossed text with media. In 2008 he established Kaipuleohone, the linguistic archive at the University of Hawai'i. He is interested in developments in e-humanities methods and their potential to improve research practice, and he is now developing methods for creating reusable data sets from fieldwork on previously unrecorded languages. He is the technology editor for the journal Language Documentation and Conservation. He is an Australian Research Council QEII Fellow at the University of Melbourne and an Assistant Professor in the Department of Linguistics at the University of Hawai'i at Mānoa.