Advanced Storage Strategies

Storing data larger than 16 MB

Introduction

As the analysis capabilities of high-throughput computational materials science grow, more and more data is required to perform the analysis. While the atomic structure and metadata about a calculation fit comfortably within MongoDB's 16 MB document size limit, the storage requirement for field data like the charge density is often orders of magnitude larger than that limit. As such, alternative methods are required to store specific fields of the parsed task document. For these fields with larger storage requirements, the data itself is removed from the task document and replaced with an fs_id that references where the data is stored. This results in a much smaller task document, which will still be uploaded to the MongoDB specified in your DB_FILE.
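
For illustration, after insertion the task document ends up with a small reference in place of the raw data, along the lines of the following sketch (the exact field name and placement depend on the quantity being stored; this snippet is illustrative only):

{
    "task_id": 12345,
    "calcs_reversed": [
        {
            "chgcar_fs_id": "<<FS_ID>>"
        }
    ]
}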

Two approaches are currently available in atomate to handle the storage of large data chunks. The user can implement a GridFS-based chunking and storage procedure, as we have done in VaspCalcDb, but the recommended method for large object storage is to use the maggma stores implemented in the CalcDb class. Currently, only the Amazon S3 store is implemented.
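
As a rough sketch of what happens under the hood, a maggma S3Store pairs an S3 bucket with a MongoDB index collection that maps each fs_id to its object in the bucket. The collection name below is hypothetical and the exact arguments used by CalcDb may differ:

from maggma.stores import MongoStore, S3Store

# Index collection mapping each fs_id to the corresponding object in the bucket
index = MongoStore(
    database="<<DB_NAME>>",
    collection_name="atomate_chgcar_index",  # hypothetical collection name
    key="fs_id",
)
store = S3Store(
    index=index,
    bucket="<<BUCKET_NAME>>",
    s3_profile="<<S3_PROFILE_NAME>>",
    compress=True,
    endpoint_url="<<S3_URL>>",
)
store.connect()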

Please read the maggma documentation for more details: https://materialsproject.github.io/maggma

Configuration

To store the larger items in an AWS S3 bucket via the maggma API, the user needs to have the maggma_store keyword present in the db.json file used by atomate.

{
    "host": "<<HOSTNAME>>",
    "port": <<PORT>>,
    "database": "<<DB_NAME>>",
    "collection": "tasks",
    "admin_user": "<<ADMIN_USERNAME>>",
    "admin_password": "<<ADMIN_PASSWORD>>",
    "readonly_user": "<<READ_ONLY_PASSWORD>>",
    "readonly_password": "<<READ_ONLY_PASSWORD>>",
    "aliases": {}
    "maggma_store": {
            "bucket" : "<<BUCKET_NAME>>",
            "s3_profile" : "<<S3_PROFILE_NAME>>",
            "compress" : true,
            "endpoint_url" : "<<S3_URL>>"
    }
}

Here, <<BUCKET_NAME>> is the S3 bucket where the data will be stored and <<S3_PROFILE_NAME>> is the name of the S3 profile from the $HOME/.aws folder. Note that this AWS profile needs to be available wherever VaspCalcDb.insert_task is called (i.e. on the computing resource where the database upload of the tasks takes place).
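
For reference, the profile is defined in the standard AWS credentials file. A minimal example with placeholder keys might look like this:

# $HOME/.aws/credentials
[<<S3_PROFILE_NAME>>]
aws_access_key_id = <<ACCESS_KEY_ID>>
aws_secret_access_key = <<SECRET_ACCESS_KEY>>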

Usage

Example: store the charge density

To parse a completed calculation directory, we need to instantiate the drone with the parse_aeccar or parse_chgcar flag enabled.

from atomate.vasp.database import VaspCalcDb
from atomate.vasp.drones import VaspDrone

calc_dir = "<<DIR>>/launcher_2019-09-03-11-46-20-683785"
drone = VaspDrone(parse_chgcar=True, parse_aeccar=True)
doc = drone.assimilate(calc_dir)
mmdb = VaspCalcDb.from_db_file("db.json", admin=True)  # the db.json containing maggma_store
task_id = mmdb.insert_task(doc)

Some workflows, such as those built on StaticFW, will pass parsing flags like parse_chgcar to the drone directly.
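
For example, these flags can typically be forwarded through the firework's vasptodb_kwargs; a minimal sketch, assuming a structure loaded with pymatgen:

from atomate.vasp.fireworks.core import StaticFW
from pymatgen.core import Structure

structure = Structure.from_file("POSCAR")  # any pymatgen Structure
fw = StaticFW(structure, vasptodb_kwargs={"parse_chgcar": True, "parse_aeccar": True})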

To access the data using the task_id, we can call:

chgcar = mmdb.get_chgcar(task_id)
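
The returned object is a pymatgen Chgcar, so the volumetric data can be used directly. A brief sketch:

print(chgcar.structure.composition)  # structure stored alongside the density
rho = chgcar.data["total"]           # charge density on the FFT grid (numpy array)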

Similar functionality exists for the band structure and DOS. Please refer to the VaspCalcDb documentation for more details.
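
For instance, assuming the corresponding parsing options were enabled when the task was inserted, the analogous accessors look like this:

bs = mmdb.get_band_structure(task_id)  # pymatgen BandStructure object
dos = mmdb.get_dos(task_id)            # pymatgen CompleteDos object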