Christoph HOFFMANN | Martina TROGNITZ, Austria | June 4, 2018 | 10 am

Invitation to the round table “Data Mountains – The Great, the Large and the Ugly”.

The proliferation of new technologies throughout the humanities has become commonplace. Both the use of cutting-edge hardware to produce digitised representations of objects at all scales, from small shards and tools to entire landscapes and areas, and the use of innovative software for visualisation and analysis are virtually ubiquitous in research projects.

(© CC-BY Martina Trognitz)

As these technologies develop at an ever increasing velocity, however, it has become difficult to adapt and develop the structures for the management, archiving, and preservation of research data at an equal pace. Bearing in mind that, in line with the FAIR principles, research results should be both sustainably preserved for future use and easily retrievable, we are confronted with a variety of technological challenges, some of which we would like to outline:

  •  Data Integrity and Preservation of Implicit Information
    During the production and processing of data, a set of measures has to be taken to ensure its integrity. This becomes especially important in remote working areas and/or with limited technical infrastructure.
    For long-term preservation, ‘classical’ bitstream preservation of concrete files is a well-established practice. But the conservation of the increasing amount of information implicitly encoded in filenames and directory structures confronts us with new problems. In order to properly ingest and preserve particular files within a repository, files are often accrued into collections according to types and sources. In many cases this leads to a loss of contextual information. One solution may be to embed practices enforcing preservable structures already at the time the data is produced. Another may be to extract and convert this information into explicit metadata (the first sketch after this list illustrates one way to do so). Which of these is feasible, and in what scenarios and combinations? How can we ensure a minimum of information loss whilst still guaranteeing the long-term preservability of data?
  •  Compound Resources
    Connected to the previous point is an issue which is not new, but which arises on a whole new scale. Whilst bitstream preservation ensures the accurate reproduction of a file, the usability of many resources depends on multiple files being available in specific directory structures and bundles. Right now there are no specific rules and techniques for the long-term preservation of these structures, particularly within large, complex file systems that make manual analysis impossible (the second sketch below shows one packaging approach). How can we ensure the integrity of such formats, especially given the increasing pace of innovation in the software landscape?
  •  Dissemination and Derivation
    Whilst the bandwidth of most network connections and processing power have improved tremendously in recent decades, they have not kept up with the amounts of data produced in modern digitisation and research endeavours. For large resources to be sensibly retrievable and thus reusable, compression and derivation are key issues. For basic (e.g. image) resources, conversion can be done on the fly. But larger resources, such as videos, 3D models or high-quality scans, require derivatives to be rendered well in advance in order to be available within a reasonable timeframe (the third sketch below pre-renders such derivatives). Should these derivatives therefore be part of the actual preservation set? Or should they be cached elsewhere? What are other options for the dissemination of large files?
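
As a concrete illustration of the first point, the following minimal Python sketch records fixity checksums and turns information implicitly encoded in the directory layout into explicit metadata. The layout (site/trench/object/file), the directory name fieldwork_data and the metadata fields are hypothetical and would have to be adapted to an actual project.

    import hashlib
    import json
    from pathlib import Path

    ROOT = Path("fieldwork_data")  # hypothetical data directory

    def sha256sum(path: Path) -> str:
        # Checksum for bitstream-level fixity checks.
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def describe(path: Path) -> dict:
        # Turn implicit path information into explicit metadata,
        # assuming a hypothetical <site>/<trench>/<object>/<file> layout.
        site, trench, obj = path.relative_to(ROOT).parts[:3]
        return {
            "file": str(path.relative_to(ROOT)),
            "site": site,
            "trench": trench,
            "object": obj,
            "sha256": sha256sum(path),
        }

    records = [
        describe(p)
        for p in sorted(ROOT.rglob("*"))
        if p.is_file() and len(p.relative_to(ROOT).parts) == 4
    ]
    Path("manifest.json").write_text(json.dumps(records, indent=2))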
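
For compound resources, one established packaging convention in digital preservation is the BagIt format, which wraps an entire directory tree together with checksum manifests so that structure and fixity can be verified as a unit. A minimal sketch, assuming the bagit-python library and a hypothetical resource directory model_042:

    import bagit  # pip install bagit

    # Wrap an existing directory tree, e.g. a 3D model with its textures,
    # into a bag: the payload is moved into data/ and checksum manifests are written.
    bag = bagit.make_bag("model_042", {"Source-Organization": "Example Institute"})

    # Later, e.g. after a transfer, verify that structure and checksums are intact.
    bag = bagit.Bag("model_042")
    print("valid" if bag.is_valid() else "damaged")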
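
For dissemination, derivatives of large master files can be rendered ahead of time rather than on the fly. A minimal sketch, assuming the Pillow library and hypothetical masters/ and derivatives/ directories, that pre-renders downscaled JPEG access copies from high-resolution TIFF scans:

    from pathlib import Path
    from PIL import Image  # pip install Pillow

    masters = Path("masters")          # high-resolution master scans
    derivatives = Path("derivatives")  # access copies, kept apart from the preservation set
    derivatives.mkdir(exist_ok=True)

    for tiff in sorted(masters.glob("*.tif")):
        with Image.open(tiff) as image:
            image.thumbnail((2048, 2048))  # downscale in place, preserving aspect ratio
            target = derivatives / (tiff.stem + ".jpg")
            image.convert("RGB").save(target, "JPEG", quality=85)

Whether such derivatives belong in the preservation set itself or only in a separate access cache is exactly the question we would like to discuss.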

(© CC-BY Martina Trognitz)

We would like to discuss with participants their experiences with projects that produce and manage massive amounts of data, including best practices, lessons learned and what to avoid. We also want to collect recommendations for data management and dissemination.
