cspy-csp command — mol-CSPy 1.0 documentation (2024)

The cspy-csp command is used to perform crystal structure prediction calculations.

cspy-csp [-h] [-c CHARGES] [-m MULTIPOLES] [-g SPACEGROUPS] [-a AXIS] [-n NUMBER_STRUCTURES] [-p {fit,w99,fit_water_X,Day_halobenzenes,w99_orig_Halogens,w99_orig_H,w99rev_6311,w99rev_6311_s,w99rev_631,w99rev_pcm_6311,w99_s_cl,w99sp,w99rev_pcm_6311_and_Chloride,w99rev_pcm_6311_and_Bromide}] [--cutoff CUTOFF] [--status-file STATUS_FILE] [--nudge NUDGE] [--adaptcell] [--asi] [--clg-old] [--log-level LOG_LEVEL] [--keep-files] xyz_files [xyz_files ...]

positional arguments

  • xyz_files - Xyz files containing molecules for generation (default: None)

optional arguments

  • -h, --help - show this help message and exit

  • -c CHARGES, --charges CHARGES - Rank0 multipole file (default: None)

  • -m MULTIPOLES, --multipoles MULTIPOLES - RankN multipole file (default: None)

  • -g SPACEGROUPS, --spacegroups SPACEGROUPS - Spacegroup set for structure generation (default: fine10)

  • -a AXIS, --axis AXIS - Axis filename for structure minimization (default: None)

  • -n NUMBER_STRUCTURES, --number-structures NUMBER_STRUCTURES - Number of valid structures for structure generation (default: None)

  • -p POTENTIAL, --potential POTENTIAL - Intermolecular potential name for structure minimization (default: fit)

  • --cutoff CUTOFF - DMACRYS real space/repulsion-dispersion cutoff (default: calculate)

  • --status-file STATUS_FILE - Specify output status filename (default: status.txt)

  • --nudge NUDGE - Nudge molecules in the asymmetric unit that fail the QR step (default: 0)

  • --adaptcell - Adaptively optimise cell parameters

  • --asi - Allow molecules to have superimposed centroids (set to true for encapsulation)

  • --clg-old - Use the old version of the crystal landscape generator

  • --log-level LOG_LEVEL - Log level (default: INFO)

  • --keep-files - Keep the DMACRYS and NEIGHCRYS files; for each structure, these are stored in a new directory in the current working directory.

To use this app, there are a few prerequisites:

  1. Generate 3D geometries for the molecule(s) in question, and choose an appropriate potential type for the interactions involved.

  2. Generate multipoles and a molecular axis file in the appropriate formats using cspy-dma - see the cspy-dma command.

Workflow

The cspy-csp command uses the mpi4py package to set up a parent-worker model. The parent (MPI rank 0) acts as a controller process, setting up a work queue with jobs that are then assigned to workers. The workers perform these tasks and return their results to the parent process.

Tasks are separated into two categories:

  1. Structure generation

  2. Structure minimization

In structure generation tasks, the worker uses the built-in structure generator to create a crystal structure and returns it to the work queue. In structure minimization tasks, the worker receives a crystal structure produced by a generation task and attempts to minimize it with respect to its lattice energy. By default, the minimization task involves three structure minimization steps, which use PMIN and DMACRYS. However, the user can change any of these steps by supplying a cspy.toml file; for more information, please refer to the Advanced Configuration section. If the minimization is successful, the worker returns the optimised crystal structure, which is stored in an SQL database.
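The parent-worker flow above can be sketched in plain Python. This is a simplified, serial sketch of the queue logic only: the real code distributes tasks to MPI ranks via mpi4py, and generate_structure/minimize_structure below are invented stand-ins, not mol-CSPy functions.

```python
from collections import deque

def generate_structure(seed):
    """Stand-in for the quasi-random structure generator (illustration only)."""
    # Pretend every third trial fails the generation (QR) step.
    return {"seed": seed, "valid": seed % 3 != 0}

def minimize_structure(structure):
    """Stand-in for the PMIN/DMACRYS minimization steps (illustration only)."""
    return {**structure, "minimized": True}

def parent(n_structures):
    """Serial mock of the controller: fill a queue with generation tasks,
    feed successful generations back in as minimization tasks, and collect
    minimized structures (the real code stores them in an SQL database)."""
    queue = deque(("generate", seed) for seed in range(n_structures))
    database = []
    while queue:
        kind, payload = queue.popleft()  # in mol-CSPy, rank 0 hands this to a worker
        if kind == "generate":
            structure = generate_structure(payload)
            if structure["valid"]:       # failed QR steps are counted, not re-queued
                queue.append(("minimize", structure))
        else:
            database.append(minimize_structure(payload))
    return database

print(len(parent(10)))  # seeds 1,2,4,5,7,8 pass the mock QR step
```

In the real workflow the queue lives on rank 0 and each task is executed by whichever worker rank is idle, but the generate-then-minimize hand-off is the same.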

The status of the cspy-csp calculation can be tracked in the status.txt file, which typically looks like this:

     target  valid  qr_fail  invalid   seed  running      ttime      mtime
 61   20000  20000     5386     5618  25619        0  310916.57  248231.03
 14   20000  20000     5058     5146  25147        0  210038.67  166707.11
 19   20000  20000     1577     1626  21627        0  194878.41  153034.97
 2    20000  20000    13080    13179  33180        0  185668.93  147034.07
 4    20000  20000     1350     1417  21418        0  139030.08  119672.93
 15   20000  20000    13202    13438  33439        0  311009.15  215207.89
 33   20000  20000     1214     1244  21245        0  152953.92  136771.76
 9    20000  20000      427      609  20610        0  146190.67   118860.5
 29   20000  20000     1572     1712  21713        0  177565.81  136421.29
 5    20000  20000     9259     9573  29574        0   221863.9  128954.27
  • 1st column is the spacegroup number

  • target is the number of valid structures that was requested

  • valid is the number of valid structures that have been obtained

  • qr_fail is the number of trial crystal structures that failed the crystal structure generation step

  • invalid is the number of trial crystal structures that passed the crystal structure generation step but failed the subsequent minimization step.

  • seed is the maximum Sobol seed that has been used so far

  • running is the number of active tasks in the work queue

  • ttime is the total amount of time in seconds that the workers have spent on all tasks

  • mtime is the total amount of time in seconds that the workers have spent on structure minimization tasks
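A row of this whitespace-delimited status file can be read programmatically. This is a rough sketch assuming the column layout shown above; parse_status_row is not part of mol-CSPy.

```python
# Column layout assumed from the status.txt example above:
# spacegroup, then the eight named columns.
COLUMNS = ["spacegroup", "target", "valid", "qr_fail", "invalid",
           "seed", "running", "ttime", "mtime"]

def parse_status_row(line):
    """Parse one whitespace-delimited status row into a typed dict."""
    row = dict(zip(COLUMNS, line.split()))
    for key in COLUMNS[:7]:               # integer columns
        row[key] = int(row[key])
    row["ttime"] = float(row["ttime"])    # timings are in seconds
    row["mtime"] = float(row["mtime"])
    return row

row = parse_status_row("61 20000 20000 5386 5618 25619 0 310916.57 248231.03")
print(row["spacegroup"], row["valid"], row["invalid"])
```

Note from the example data that seed is one more than valid + invalid, i.e. it counts every trial structure attempted so far plus the next seed to be used.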

In addition, a file labelled errors.txt will be generated. This file is designed to give the user more insight into the trial crystal structures that failed the minimization step and populated the invalid column of the status file. An example of this is shown below:

          timeout  max_its  buck_cat  mol_clash  unknown  total
 61          1182       14        43         81      122   1442
 14           976        2        12        425      103   1518
 19           287        9         8         31       12    347
 2            392        0         0        920      102   1414
 4            163        1         1         79       10    254
 15          1271       12        15        263      232   1793
 33           134        0         4         20        8    166
 9            106        1         0         19       29    155
 29           412        2         8         24       25    471
 5            786        2         8        355      152   1303
 summary     5709       43        99       2217      795   8863
  • The 1st column is the spacegroup number

  • summary in the final row gives the total for each error type

  • timeout occurs when a trial structure exceeds the set timeout for the geometry optimisation

  • max_its occurs when a trial structure exceeds the maximum number of geometry optimisation iterations (steps) in DMACRYS

  • buck_cat occurs when the atoms of a trial structure move too close together and become unphysically bound. This can occur when performing a geometry optimisation using a Buckingham potential, which is unphysically attractive at small interatomic distances.

  • mol_clash occurs when there is an overlap between two molecules; this is often associated with a Buckingham catastrophe, but one where the geometry optimisation step passes.

  • unknown occurs when an error does not fall into any of the previous categories (this category is likely to be refined in the future as more cases are identified)

  • total is the total number of errors for each spacegroup
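The counts in errors.txt are self-consistent: each spacegroup's total is the sum of its error counts across the row, and the summary row is the column-wise sum. A quick check against the spacegroup 61 row from the example above:

```python
# Error counts for spacegroup 61, taken from the errors.txt example above.
errors_61 = {
    "timeout": 1182,
    "max_its": 14,
    "buck_cat": 43,
    "mol_clash": 81,
    "unknown": 122,
}

# The "total" column is the row sum of the individual error types.
total = sum(errors_61.values())
print(total)  # matches the 1442 shown for spacegroup 61
```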

Command line usage of cspy-csp

Running a local cspy-csp calculation can be done as follows,

mpiexec -np 4 cspy-csp molecule.xyz -c molecule_rank0.dma -m molecule.dma -a molecule.mols -g 33 -n 100

where molecule.xyz, molecule_rank0.dma, molecule.dma and molecule.mols are the XYZ geometry file, molecular charges, molecular multipoles and molecular axis definition, respectively. We recommend naming the files similarly to this example, with molecule replaced by the name of the molecule that you want to perform CSP on. This calculation will use 4 processors: 1 for the controller process and 3 workers. The controller process will assign structure generation and structure optimisation jobs to the workers. In addition, the -g and -n parameters are used to control the space group in which we want to generate structures, and the number of valid structures, respectively. In this example, we will quasi-randomly generate structures in spacegroup 33, and minimize them until we have 100 valid structures. We could parallelize this further with the -np flag, up to the number of cores on the local machine.

Distributed usage of cspy-csp

You can also run cspy-csp on a fixed node allocation. An example submission script for SLURM is as follows,

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --time=24:00:00

cd $SLURM_SUBMIT_DIR
module load conda/py3-latest
source activate cspy

mpiexec cspy-csp molecule.xyz -c molecule_rank0.dma -m molecule.dma -a molecule.mols -g fine10

In this example the -g parameter is a string: fine10. This refers to a customized setting inside the cspy.configuration module, in which the ten most frequent space groups (as observed in the Cambridge Structural Database) are sampled with 10,000 valid structures in each space group. Users may define their own custom sampling settings by appending them to the COMMON_SAMPLING_SETTINGS dictionary in cspy/configuration.py.

Below is an example of the dictionary input for the co-crystals_fine sampling setting. This is made up of the 10 most common spacegroups for co-crystals. The number of valid structures is typically larger for spacegroups that are observed more frequently, such as P21/c (14) and C2/c (15).

"co-crystals_fine": {
    # 10 most common spacegroups for co-crystals
    "space_group": {2, 14, 15, 4, 19, 61, 1, 33, 5, 9},
    "number_structures": {
        2: 10000,
        19: 10000,
        4: 20000,
        61: 20000,
        14: 50000,
        15: 50000,
        1: 10000,
        33: 20000,
        5: 20000,
        9: 20000,
    },
},
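Following the same pattern, a user-defined setting can be appended to the dictionary. The entry name my_coarse5 and its structure counts below are invented for illustration; COMMON_SAMPLING_SETTINGS is the dictionary named above in cspy/configuration.py.

```python
# Stand-in for the dictionary defined in cspy/configuration.py;
# in practice you would edit that module rather than redefine it.
COMMON_SAMPLING_SETTINGS = {}

# Hypothetical custom setting in the same format as the built-in ones:
# five common spacegroups, 5,000 valid structures in each.
COMMON_SAMPLING_SETTINGS["my_coarse5"] = {
    "space_group": {2, 14, 15, 19, 61},
    "number_structures": {2: 5000, 14: 5000, 15: 5000, 19: 5000, 61: 5000},
}

setting = COMMON_SAMPLING_SETTINGS["my_coarse5"]
print(sum(setting["number_structures"].values()))  # total valid structures requested
```

Once added, the setting's key can be passed to -g in place of fine10.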

Note

If the following error occurs with the above submission script, change source activate cspy to conda activate cspy:

Could not find conda environment: cspy. You can list all discoverable environments with `conda info --envs`.

Running cspy-csp on ARCHER2

Note that, as per the installation guide, it is best to avoid using conda on ARCHER2 and instead to use a Python virtual environment. Assuming you have made a virtual environment and installed mol-CSPy and its various dependencies, the following would run a distributed CSP on 2 nodes on ARCHER2’s short queue:

#!/bin/bash --login
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=0:20:00
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --account=e05-discov-day
#SBATCH --distribution=block:block
#SBATCH --hint=nomultithread

# Prevents any threaded system libraries from automatically using threading.
export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

# Prevents "STOP" message being printed from DMACRYS failures when compiled with Cray compilers
export NO_STOP_MESSAGE=1

# IMPORTANT: Prevents MPI deadlock when running CSPy across nodes on ARCHER2
# (MPI message buffer sizes in bytes)
export FI_OFI_RXM_SAR_LIMIT=524288
export FI_OFI_RXM_BUFFER_SIZE=131072

cd $SLURM_SUBMIT_DIR
module load cray-python
export PATH=$PATH:/work/e05/e05/<user>/<path_to_executables>
source /work/e05/e05/<user>/<path_to_venv>/bin/activate

srun cspy-csp <inputs>

Note in particular the additional environment variables

export FI_OFI_RXM_SAR_LIMIT=524288    # 512 kB
export FI_OFI_RXM_BUFFER_SIZE=131072  # 128 kB

which are necessary to allow ARCHER2 to run mol-CSPy across nodes withoutMPI deadlocks occurring.

By increasing the size of these buffers in line with the profiling advice in the ARCHER2 User Guide, this problem appears to be avoided (CSP jobs on up to 32 nodes have been verified).
