## Thursday, March 31, 2016

### Reviews for Towards a barrier height benchmark set for biologically relevant systems

2016.04.06: our rebuttal can be found here

The reviews our latest PeerJ submission (editor assigned February 25) came back last evening with the alway welcome verdict "minor revision".  The preprint has already received over 400 views and 85 downloads.

 Reviewer Comments Reviewer 1 (Xabier Lopez) Basic reporting No Comments Experimental design No Comments Validity of the findings No COmments Comments for the author The present papers is a first step towards the generation of a reaction database for biological systems. To do so the authors provide with highly accurate DFT and ab-initio structural and thermodynamic data for a set of five enzyme reactions. The results are compared with various PM# type semi-empirical methods and DFTB3. The work sets up an standard in terms of semi-empirical method validation, that hopefully, it will cristalize in a joint effort with the rest of the theoretical chemist community to create a database of biologically relevant reactions. In this sense, I find this paper with a high potential and should be publised. Although the set of reactions is small as to draw general conclusions, I understand that the goal of the authors is to set up an standard so that other theoretical chemistry groups worldwide will contribute to this joint effort. I think that this is a very good and necessary idea. I would suggest that it would be useful in the github.com/jensengroup/db-enzymes dataset that the authors would consider to provide with standard inputs for the programs that should be considered as standard for other theoreticians as well. In this way, the desirable growth of this dataset by contributions from other groups will not jeopardize its necessary coherency. Reviewer 2 (Anonymous) Basic reporting The authors present a compilation of model systems for five, real-life enzymatic reactions based on previously published structures, reaction barrier heights and reaction energies that had been obtained with the B3LYP density functional theory (DFT) approximation. These DFT structures and numbers are used as a reference to examine various cost-efficient semi-empirical methods. The authors consider both geometry optimisations, as well as single-point energy calculations. Furthermore, they also comment on the effect of a solvation model on the results. The main motivation of this study is clearly outlined and it is important to note that the authors point out that their study is only the first step in a long-term undertaking that aims at having a diverse set with more accurate reference energies. The language is clear and professional. Raw data is supplied through a link to figshare and can be easily accessed and examined by the readers of this paper. With regards to cited literature and the figures, I have the following suggestions to make: a) I think the cited literature should be more comprehensive and also take into consideration recent QM studies on peptides and proteins that allow a better understanding of the quality of the herein used DFT reference values; I will elaborate on this point more in detail under “validity of the findings”. I also noticed that some cited articles lack a volume and page number; I therefore recommend that the authors carefully check their list of references for any mistakes. b) While the general quality of the figures is very good, I find Figures 3 - 6 hard to read. In these figure, the authors attempt to compare geometries for two levels of theory with each other. While one does note differences, the structures lack a common reference point that would allow the reader to better gauge whether these differences are due to structural changes or due to slightly different viewing angles. Would it be possible to improve these figures and to align the geometries with respect to one of the involved molecules? Experimental design I identified some issues that would make it hard for others to exactly reproduce the results: a) page 2, line 78: The authors say that during the geometry optimisations, the position of some atoms were restrained. It is not clear which atoms these are and why that choice was made. b) In section 3.3 (and in Tables 2 and 3) the authors refer to different models for each of the five tested reactions that all vary in system size. This is the first time in the manuscript that the authors refer to this and it is hard to follow how exactly these models differ from each other and what they actually look like. It therefore makes it hard to follow this section and I recommend a revision. Validity of the findings As mentioned above, I do understand what the authors are trying to achieve with their study. It is supposed to only serve as a starting point for the future development of an accurate benchmark set. Also, it is likely that the authors’ current computational capabilities are limited and that they therefore use previously published structures and energies. Unfortunately, the chosen level of theories for the geometry optimisations [B3LYP/6-31G(d,p)] and the singlepoint energy calculations [B3LYP/6-311+G(2d,2p)] are questionable for the studied systems and properties. I will first elaborate on this statement, starting with the quality of the geometries, before I will conclude with a recommendation on how the authors could address this problem without substantially delaying publication. The structural stability of biomolecular systems - and of any sizeable molecule or molecular aggregate – is heavily influenced by important London-dispersion (van-der-Waals) effects, and it is a well-known fact among quantum chemists that B3LYP does not cover these effects. Moreover, small basis sets, such as 6-31G(d,p) induce the so-called basis-set superposition error (BSSE), which is an artificial overstabilisation of noncovalent interaction energies, including London dispersion and hydrogen bonds. BSSE exists both for inter- and intramolecular interactions and it has been demonstrated in the literature that it distorts molecular structures, see e.g. early works by van Mourik and co-workers on peptide conformers: -van Mourik, T.; Karamertzanis, P. G.; Price, S. L., J. Phys. Chem. A 2006, 110, 8. - Holroyd, L. F.; van Mourik, T., Chem. Phys. Lett. 2007, 442, 42. Later, Goerigk and Reimers showed how error compensation between the lack of a proper dispersion treatment and BSSE can artificially generate structures of seemingly acceptable quality. However, it was also demonstrated that this error compensation cannot be controlled and that in fact using dispersion and BSSE corrections led to more accurate and reliable structures. This was first discussed for gas-phase structures of peptides, and later for the crystal structure of a protein fragment: - Goerigk, L.; Reimers, J. R., J. Chem. Theory Comput. 2013, 9, 3240. - Goerigk, L.; Collyer, C. A..; Reimers, J. R., J. Phys. Chem. B 2014, 118, 14612. Efficient corrections for both London dispersion and BSSE exist that do not compromise the computational efficiency of B3LYP/6-31G(d,p) and they should also be mentioned (e.g. Grimme’s DFT-D3(BJ) and gCP corrections). This is particularly important, as this journal is directed at a readership that may not be familiar with these methodological developments and the dispersion and BSSE problems. While BSSE is negligible for the 6-311+G(2d,2p) basis set, the singlepoint energies still suffer from the London-dispersion problem. The authors already mention the large GMTKN30 benchmark set in their manuscript. An additional study on GMTKN30 with nearly 50 methods has conclusively shown how dispersion can influence reaction energies and barrier heights by several kcal/mol (Goerigk, L.; Grimme, S.; Phys. Chem. Chem. Phys. 2011, 13, 6670.). A proper treatment of dispersion is therefore crucial. That same study has also shown that B3LYP is worse than the average of more than 20 hybrid density functionals for reaction energies. Consequently, using B3LYP/6-311+G(2d,2p) as a benchmark for the study of other methods may therefore distort the findings, and statistical values such as mean absolute deviations have to be interpreted with caution. In fact, it may well be that some of the tested semi-empirical methods perform much better than what the reported MADs may suggest. In this context, note that a recent study by Karton and Goerigk has demonstrated how the quality of the benchmark reference heavily influences the interpretation of low-level methods used to calculate barrier heights of pericyclic reactions (J. Comp. Chem. 2015, 36, 622). If one analyses the results of this study carefully, one can also see the importance of dispersion; for instance, it lowers the B3LYP barrier height of the Diels-Alder reaction between 1,3-butadiene and ethene by nearly 6 kcal/mol! After this explanation, please let me emphasise that I am not against the way this study has been carried out. The studied reactions are highly interesting and the actual reaction barriers and energies provide useful insights for quantum chemists. Recalculating the reported numbers may therefore not be necessary at this stage and it can be postponed to a future study. Nevertheless, I am of the opinion that this manuscript can benefit from a discussion of the above points, particularly in the outlook section. Dispersion and BSSE should not be neglected and this manuscript is an ideal platform to convey this message to a non-expert readership that may not be aware of how the field of DFT approximations has changed over the past years. Such a discussion should also mention that also the tested semi-empirical methods lack from a proper description of noncovalent interaction energies and that others have combined them with DFT-D2 or DFT-D3-type corrections (DFTB3-D3, PM6-D3, etc. ) and with additional hydrogen-bond corrections for PM6 (see e.g. the works by Hobza and Korth). One or two paragraphs discussing the importance of these points are therefore sufficient. Comments for the author page 3, line 104: Is the quotation mark at the beginning a mistake or does it introduce a literal quote from another paper? If the latter is the case, a second quotation mark at the end of the quote is missing. Reviewer 3 (Anonymous) Basic reporting I think this paper represents a nice effort of building a set for benchmarking barrier height in modelling biochemical systems. The authors provide themselves the benchmark for several semiempirical methods, using B3LYP as a reference point. Nevertheless, as the authors acknowledge, they provide a limited number of systems studied and a reference state (B3LYP) which could not be considered the state of the art. Thus, in my opinion the manuscript should be considered more as proposing the benchmark idea than a rigorous initial step towards it. In this regards I miss some section underlying how to expand the set, with some practical example. It could be even more interesting some efforts in making that part easier (semi-automatic) from standard QM or QM/MM calculations (although this part could be expanded in the future). Experimental design Since most reports are based on the comparison with B3LYP data (it seems to me that from other sources) I wander if the authors have reevaluated some of these numbers. Being single point evaluations it should not require excessive computational time. Validity of the findings To help the inspection of the results (and differences between B3LYP and semiempirical methods), I would like to see the distances (at least the most changing ones) being added in Figures 2-5. Comments for the author minor comments: .) (line 47): "...TS structures are known to dependent significantly" .) (line 104) " ...kcal/mol. “The larger..." (a non closed ") .) (line 116) "PM7 (to 4.0 and" (non closed parenthesis)

## Sunday, March 27, 2016

### Generating protonation states and conformations

We're working on pKa prediction using semiempirical methods and need to generate coordinates for all possible protonation states of a large number of amines and perform conformer search.  Here are my current thoughts so far.

We have the amine names so we can use cactus to generate SMILES strings and then Babel to generate coordinates and do the conformer search.  However, Babel only (1) generates neutral protonation states, (2) does not consider C-NH2 bonds in primary amines rotateable, (3) does not consider conformers relates to inversion of N.  Babel also has some additional problems that I'll describe at the end of the post.

Some easy fixes
Most of the problems can be addressed fairly easily. It was easy to write a program that reads a mol2 file (where atom typing is mostly done) and protonates all appropriate N atoms.

Similarly, it also seems easy to write a program that generates all possible protonation states by removing select protons from the fully protonated coordinate file.  Furthermore, by removing all protons on primary and secondary amines one at a time one will help solve the problems with C-NH2 rotation in primary amines and the inversion of secondary amines.

The main problem is the inversion of tertiary amines
Actually, it is not heard to generate a crappy guess at the inverted structure simply by placing the H atom on the opposite side of the tertiary N.  When do this manually in Avogadro by moving the H atom and optimising the structure it works fine.  But generating such coordinates in an xyz file and then passing them to Babel doesn't work because the atom typing is screwed up.  I tried generating mol2 files instead but for some reason Babel perform a conformer search starting from a mol2 file.

Another option could be to generate a z-matrix (gzmat) and then write code that does the inversion by changing the sign of select dihedral angles.

Yet another option could be to do it directly from the SMILES string, e.g. C[N@H+](F)Cl vs C[N@@H+](F)Cl.  The challenge here for me would be to only do this for select N atoms. In general writing a SMILES parser seems like a big job.

I am open to suggestions here

Additional problems
Babel does not generate the correct stereoisomer for molecules with chiral centers involving several rings.  It does print out an error message, but it remains to be seen how many of the amines we are interesting in are affected by this.

Babel also doesn't consider C-OH bonds rotateable so we have to somehow generate the start conformers separately.  This also applies to COOH groups.

Babel generates -COOH groups, so we also have to generate the COO- form.

Other programs
Perhaps we should look at CDK, RDkit, or PubChemPy instead.

This work is licensed under a Creative Commons Attribution 4.0

## Thursday, March 17, 2016

### Some useful spreadsheet commands

Here are some very useful spreadsheet commands that I have been using a lot lately.  The syntax is for Google Sheets.  To use in Excel replace the ","s by ":"s

SUMIFS Returns a sum based on one or more criteria (if you only have one criteria you can use SUMIF).  Say you have

A C 2
B C 1
A D 4

You can use SUMIFS to add the "A-numbers", i.e. 6.  If you want the sum for  "A C" then use SUMIFS.

One of my side jobs is to keep track of teaching hours reported using a Google Form.  I use SUMIF to add all the reported hours together for a given person.

VLOOKUP Searches down the first column of a range for a key and returns the value of a specified cell in the row found.

You can use VLOOKUP to find the value corresponding to "B", i.e. 1.  If you use the same command to find the A value it will return the first one it found, i.e. 2.  VLOOKUP does not support multiple criteria, but if you want to look up the value for "A and D" you can use CONCATENATE to combine the two cells and lookup the value for "AD".

A more concrete example: let's say you have a long list of conformer names and energies.  You can use MIN to find the lowest energy and combine it with VLOOKUP to give you the conformer name.

CONTATENATE I've already mentioned this command, but here's another example.  In some of the papers I make tables with RMSD and $r$ values in the same column, e.g.: 0.2 (0.92).  This is easy to make from separate tables with RMSD and $r$ values with  CONCATENATE(TEXT(I4,"0.0")," (",W4,")")

SPLIT is the opposite of CONTATENATE.  SPLIT can be combined with INDEX to do some pretty nifty things.  For example INDEX(SPLIT(INDEX(SPLIT(B5,"("),0,2),"%"),0,1) will extract the number 92 from the text "B8FX10 Buried chi1: 23 correct out of 25 (92%)"

Finally the IF command is incredibly useful.  IF is actually an "if-then-else" statement.  For example IF(B15>3,"the number is bigger than 3","the number is smaller than 3") returns "the number is smaller than 3" if B15 contains the number 2.

Notice that IF commands can be combined: IF(B15>3,"the number is bigger than 3",if(B15>1,"the number is smaller than 3 but bigger than 1","the number is smaller than 1"))returns "the number is smaller than 3 but bigger than 1" B15 contains the number 2.

This work is licensed under a Creative Commons Attribution 4.0