Advanced Storage Strategies
Storing data greater than 16 MB
Introduction
As the analysis capabilities of high-throughput computational materials science grow, more and more data is required to perform the analysis.
While the atomic structure and the calculation metadata fit within the 16 MB document limit of MongoDB, the storage requirement for field data like the charge density is often orders of magnitude larger than that limit.
As such, alternative methods are required to handle the storage of specific fields in the parsed task document.
For these fields with larger storage requirements, the data itself is removed from the task document and replaced with an fs_id that references where the data is stored.
This results in a much smaller task document, which is still uploaded to the MongoDB instance specified in your DB_FILE.
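The replacement of a large field by a reference can be sketched as follows. This is a minimal illustration of the idea only, not atomate's implementation: the in-memory `blob_store` dictionary stands in for the external storage, and the helper `offload_field` and the `<key>_fs_id` naming are hypothetical.

```python
import json
import uuid

# Hypothetical in-memory stand-in for the external object store.
blob_store = {}


def offload_field(task_doc, key, store):
    """Move task_doc[key] into `store` and leave a reference behind.

    The reference is written to task_doc["<key>_fs_id"]; this naming is an
    assumption made for the sketch.
    """
    payload = json.dumps(task_doc.pop(key)).encode("utf-8")
    fs_id = uuid.uuid4().hex  # unique identifier for the stored object
    store[fs_id] = payload
    task_doc[f"{key}_fs_id"] = fs_id
    return fs_id


task_doc = {
    "formula_pretty": "Si",
    "chgcar": {"grid": [0.0] * 100_000},  # stand-in for a large field
}

size_before = len(json.dumps(task_doc))
fs_id = offload_field(task_doc, "chgcar", blob_store)
size_after = len(json.dumps(task_doc))

# The task document now only carries the small reference.
print(size_before, "->", size_after)
```

The JSON string length is used here only as a rough proxy for the document size; the actual limit applies to the BSON-encoded document.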
Two approaches are currently available in atomate to handle the storage of large data chunks.
The user can implement a GridFS-based chunking and storage procedure, as we have done in VaspCalcDb.
However, the recommended method for large object storage is to use the maggma stores implemented in the CalcDb class.
Currently, only the Amazon S3 store is implemented.
Please read the maggma documentation for more details: https://materialsproject.github.io/maggma
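To make the idea of chunked storage concrete, the sketch below compresses a payload, splits it into fixed-size pieces, and reassembles it. It is a plain-Python illustration under stated assumptions, not the GridFS or VaspCalcDb code path: the helpers `to_chunks` and `from_chunks` are hypothetical, and the chunk size is chosen only for the example.

```python
import os
import zlib

CHUNK_SIZE = 255 * 1024  # illustrative chunk size in bytes (assumption)


def to_chunks(data, chunk_size=CHUNK_SIZE):
    """Compress `data` and split the result into pieces of at most chunk_size bytes."""
    compressed = zlib.compress(data)
    return [
        compressed[i:i + chunk_size]
        for i in range(0, len(compressed), chunk_size)
    ]


def from_chunks(chunks):
    """Concatenate the pieces in order and decompress them."""
    return zlib.decompress(b"".join(chunks))


# Random bytes barely compress, so this payload spans several chunks.
payload = os.urandom(1_000_000)
chunks = to_chunks(payload)
restored = from_chunks(chunks)

print(len(chunks), "chunks, round trip ok:", restored == payload)
```

A real store additionally has to record the chunk order and an identifier for the whole object, which is the role the fs_id plays in the task document.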
Configuration
To store the larger items in an AWS S3 bucket via the maggma API, the maggma_store keyword must be present in the db.json file used by atomate.
{
    "host": "<<HOSTNAME>>",
    "port": <<PORT>>,
    "database": "<<DB_NAME>>",
    "collection": "tasks",
    "admin_user": "<<ADMIN_USERNAME>>",
    "admin_password": "<<ADMIN_PASSWORD>>",
    "readonly_user": "<<READ_ONLY_USERNAME>>",
    "readonly_password": "<<READ_ONLY_PASSWORD>>",
    "aliases": {},
    "maggma_store": {
        "bucket": "<<BUCKET_NAME>>",
        "s3_profile": "<<S3_PROFILE_NAME>>",
        "compress": true,
        "endpoint_url": "<<S3_URL>>"
    }
}
Here <<BUCKET_NAME>> is the S3 bucket where the data will be stored, and <<S3_PROFILE_NAME>> is the name of the S3 profile from the $HOME/.aws folder.
Note that this AWS profile needs to be available wherever VaspCalcDb.insert_task is called (i.e., on the computing resource where the database upload of the tasks takes place).
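Before submitting jobs, it can help to confirm that the configuration actually contains the maggma_store entry. The check below is a minimal sketch: the helper `check_maggma_store` is hypothetical, treating bucket and s3_profile as required is an assumption based on the description above, and the example values ("my-bucket", "default", "localhost") are placeholders.

```python
import json

# Keys this sketch treats as required (assumption, not a documented rule).
REQUIRED_KEYS = {"bucket", "s3_profile"}


def check_maggma_store(db_config):
    """Return the maggma_store section, or raise KeyError if it is incomplete."""
    store_cfg = db_config.get("maggma_store")
    if store_cfg is None:
        raise KeyError("db.json has no 'maggma_store' entry")
    missing = REQUIRED_KEYS - store_cfg.keys()
    if missing:
        raise KeyError(f"maggma_store is missing keys: {sorted(missing)}")
    return store_cfg


# Inline example; in practice the text would be read from your db.json file.
example_config = json.loads("""
{
    "host": "localhost",
    "port": 27017,
    "database": "my_db",
    "collection": "tasks",
    "maggma_store": {
        "bucket": "my-bucket",
        "s3_profile": "default",
        "compress": true
    }
}
""")

cfg = check_maggma_store(example_config)
print("maggma_store bucket:", cfg["bucket"])
```

Because the text is parsed with `json.loads`, a syntax error such as a missing comma between entries is also reported at this point.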
Usage
Example: store the charge density
To parse a completed calculation directory, we need to instantiate the drone with the parse_aeccar or parse_chgcar flag.
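from atomate.vasp.database import VaspCalcDb
from atomate.vasp.drones import VaspDrone
mmdb = VaspCalcDb.from_db_file("<<DB_FILE>>")  # path to the db.json described above
calc_dir = "<<DIR>>/launcher_2019-09-03-11-46-20-683785"
drone = VaspDrone(parse_chgcar=True, parse_aeccar=True)
doc = drone.assimilate(calc_dir)
task_id = mmdb.insert_task(doc)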
Some workflows, like the StaticWF, pass parsing flags such as parse_chgcar to the drone directly.
To access the data using the task_id, we can call:
chgcar = mmdb.get_chgcar(task_id)
Similar functionality exists for the band structure and DOS.
Please refer to the documentation of VaspCalcDb for more details.