The open archive for STFC research publications

Full Record Details

Persistent URL http://purl.org/net/epubs/work/50089
Record Status Checked
Record Id 50089
Title The use of file description languages for file format identification and validation
Abstract If an archive is to digitally preserve scientific data, it is vitally important that the archive also curates the OAIS [1] representation information for that data. The representation information can provide the archive with many additional capabilities, including the ability to identify the type of a data file received from a data producer and to validate its structure. A range of file identification and structural verification mechanisms is available to an archive to verify that the data it receives from a data producer is what it expects. These range from simple file extension checking, through command line tools such as the BSD UNIX file command or the National Archives DROID tool [4] used in conjunction with file format signature registries such as PRONOM [4] or the Global Digital Format Registry [12], to sophisticated data description languages such as EAST, DRB [8], XML Schema or DFDL [9], as used in the CASPAR project. The use of file extensions for identification suffers from well-known disadvantages. File signature checking is useful but not totally reliable, because a signature identifies a format only at coarse granularity and coincidentally equal signatures can cause false identification. This is especially a problem with bespoke data formats, where identification through signatures was not a concern at the time of conception. File signatures also provide no data structure validation. Using data description languages to describe the internal structure of the data right down to the bit level provides a more holistic solution, allowing full and reliable identification of a file format as well as validation of its structure.
Tools and generic software APIs that could solve the problems of file identification and structure validation, by combining file format signatures with data descriptions stored in an OAIS representation information registry, will be presented and discussed.
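The abstract's contrast between signature checking and structural validation can be illustrated with a minimal Python sketch. It is not the tooling presented in the paper, and it does not use EAST, DRB or DFDL. The registry `SIGNATURES`, the field list `PNG_HEADER_DESCRIPTION`, and the functions `identify_by_signature` and `validate_structure` are hypothetical names introduced here, with the start of a PNG file standing in for a described format. The point it shows is that a file can match a signature and still fail a check against a description of its internal layout.

```python
import struct

# Hypothetical signature registry (illustration only): magic bytes -> format name.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
}

def identify_by_signature(data):
    """Return the format whose magic bytes prefix the data, or None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None

# Hypothetical, much simplified "data description": an ordered list of
# (field name, struct format, validity check) covering the start of a PNG file.
# A PNG file begins with an 8-byte signature followed by an IHDR chunk
# whose data length is 13 bytes.
PNG_HEADER_DESCRIPTION = [
    ("signature", ">8s", lambda v: v == b"\x89PNG\r\n\x1a\n"),
    ("ihdr_length", ">I", lambda v: v == 13),
    ("ihdr_type", ">4s", lambda v: v == b"IHDR"),
    ("width", ">I", lambda v: v > 0),
    ("height", ">I", lambda v: v > 0),
]

def validate_structure(data, description):
    """Walk the description field by field and collect structural errors."""
    errors = []
    offset = 0
    for name, fmt, is_valid in description:
        size = struct.calcsize(fmt)
        if offset + size > len(data):
            errors.append(f"{name}: file truncated")
            break
        (value,) = struct.unpack_from(fmt, data, offset)
        if not is_valid(value):
            errors.append(f"{name}: unexpected value {value!r}")
        offset += size
    return errors

# Two byte strings with the same signature: one well formed, one not.
magic = b"\x89PNG\r\n\x1a\n"
well_formed = magic + struct.pack(">I4sII", 13, b"IHDR", 1, 1)
malformed = magic + b"garbage!"

for label, blob in (("well_formed", well_formed), ("malformed", malformed)):
    print(label, identify_by_signature(blob), validate_structure(blob, PNG_HEADER_DESCRIPTION))
```

Both inputs are identified as PNG by signature alone, but only the first passes the structural check. A real data description language would express the layout declaratively and cover the whole file rather than just its header.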
Organisation ESC , ESC-IM , STFC
Keywords XML , OAIS , file formats , digital preservation , digital curation , data description languages , EAST , DRB , DFDL , CASPAR
Funding Information
Related Research Object(s):
Licence Information:
Language English (EN)
Type Paper In Conference Proceedings
Details In PV 2007 Conference (PV 2007), DLR, Oberpfaffenhofen/Munich, Germany, 9-11 Oct 2007, (2007).
URI(s)
Local file(s) Dunckley_fileFormatIdentification.pdf
Year 2007