Science. Shared. Data.: August 2013

Friday, August 23, 2013

Be careful when you extrapolate to the complete basis set limit!

In our soon-to-be-finished manuscript on shielding constants obtained through QM/MM with the MM region described by a high-quality polarizable embedding (PE) potential, I've done a thing I should have done a while ago. But, as is always the case with these things, you end up doing it in the wrong order.

I read the blog post on basis set extrapolation by +Jan Jensen and thought I should try it. Here is what I got by calculating the shielding constants (in ppm) for oxygen-17 in acrolein using KT3/pcS-n//B3LYP/aug-cc-pVDZ, solvated by a 12 Å sphere of explicit water molecules described by the PE potential.

pcS-0	pcS-1	pcS-2	pcS-3
-309.2	-216.9	-191.6	-188.2

How do you get from these values to an estimate of the complete basis set (CBS) limit? According to Jan's blog post, you should fit your data according to
$$
Y(l_{max}) = Y_{CBS} + a \cdot e^{b \cdot l_{max}}
$$
to get the most accurate result. I have done so using WolframAlpha and obtained the following plot (blue dashed line) with $Y_{CBS}=-176.5$ ppm.

"That is a terrible fit!", I can hear you say. And indeed it is. What is going on? It turns out (and please give me some references if you know them!) that the pcS-0 results are actually too bad to be taken seriously. The single zeta basis set is not enough to get even a qualitatively correct description of the wave function, i.e. what you get is just wrong. If you remove that point you get the solid blue curve, which is a really good fit with $Y_{CBS}=-187.7$ ppm.
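Once pcS-0 is dropped, only three points are left for a three-parameter exponential, so the curve passes through them exactly and $Y_{CBS}$ can be written in closed form. The sketch below is not how the fit was originally done (that was WolframAlpha); it only assumes that the three basis sets are equally spaced in $l_{max}$, and the function name is made up for illustration.

```python
def exp_cbs_three_point(y1, y2, y3):
    """Y_CBS for Y(l) = Y_CBS + a*exp(b*l) through three points.

    Assumes the three points are equally spaced in l (e.g. l, l+1, l+2).
    Then (y2 - C)**2 == (y1 - C)*(y3 - C), which is solved for C below.
    """
    denom = y1 + y3 - 2.0 * y2
    if denom == 0.0:
        raise ValueError("points are collinear; no exponential limit exists")
    return (y1 * y3 - y2 * y2) / denom


# Shielding constants (ppm) from the table above, without pcS-0.
shieldings = {"pcS-1": -216.9, "pcS-2": -191.6, "pcS-3": -188.2}
y_cbs = exp_cbs_three_point(*shieldings.values())
print(f"Y_CBS = {y_cbs:.1f} ppm")
```

With the numbers from the table this lands at about -187.7 ppm, in line with the solid blue curve. Keep in mind that an exact three-point solution gives no residuals, so it says nothing about how trustworthy the extrapolation is.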

If you want to use the alternative extrapolation scheme that Jan provides, i.e.
$$
Y(l_{max}) = Y_{CBS} + b\cdot l_{max}^{-3}
$$
one obtains the red solid curve with $Y_{CBS}=-182.9$ ppm, which is about 5 ppm off from the exponential fit.
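The inverse-cubic form is linear in $x = l_{max}^{-3}$, so it can be fitted by ordinary least squares without any special tooling. The sketch below assumes that pcS-n corresponds to $l_{max} = n + 1$ (an assumption on my part, not stated in the post) and uses only the pcS-1 to pcS-3 points; the function name is made up for illustration.

```python
def inverse_cubic_cbs(l_values, y_values):
    """Least-squares fit of Y = Y_CBS + b * l**-3.

    The model is a straight line in x = l**-3, so the intercept of an
    ordinary linear regression is the CBS estimate.
    """
    xs = [l ** -3 for l in l_values]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(y_values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, y_values))
    b = sxy / sxx
    y_cbs = y_mean - b * x_mean
    return y_cbs, b


# Assumed mapping: pcS-1, pcS-2, pcS-3 -> l_max = 2, 3, 4.
l_max = [2, 3, 4]
shieldings = [-216.9, -191.6, -188.2]  # ppm, from the table above
y_cbs, slope = inverse_cubic_cbs(l_max, shieldings)
print(f"Y_CBS = {y_cbs:.1f} ppm, b = {slope:.1f} ppm")
```

Under that $l_{max}$ assignment the intercept comes out close to the -182.9 ppm quoted above. Unlike the exponential form, this result does depend on which $l_{max}$ values you assign to the basis sets, so a different convention would shift the number.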

As Frank (and Grant) comment below, one should not trust numbers from SZ basis sets, and +Anders Steen Christensen noted that even the DZ results could/should be disregarded. The only problem is that my calculations are growing in size, and going beyond the TZ basis set, let alone to 5Z, is pretty much out of reach. Jan's post does mention that one should use at least a DZ quality basis set, so shame on me, I guess, for even trying with pcS-0.

Just be careful out there and remember to extrapolate!

edit1: fixed the last formula

edit2: Frank Jensen has given a lengthy comment on the matter which gives a lot of insight. Read it below. I've added some clarifying text here and there based on his comments and will likely follow up on it in a later blog post.

Saturday, August 3, 2013

What is it with this linear scaling stuff anyway?

Enormous amounts of research time have gone into computational methods that scale linearly with system size. That is, double the size of your system and you only double the computation time. If only all methods were like that, it would be easier to guess when machines on your local supercomputer cluster become available, instead of staring at a wall of 200+ hours of jobs just sitting there because people don't give a crap.

Inspired by +Jan Jensen and a recent blog post of his (which I was reminded of when I wrote another blog post on the subject of many-body expansions), I set out to actually do the calculations on timings myself albeit with a different goal in mind.

2-body calculations
Even if you use the many-body expansion of the energy, I showed that the accumulated number of calculations one would need increases dramatically as you go to higher N-body levels. If we only focus on doing one- and two-body calculations, the effect is barely visible in the previous plot. But if we calculate the computational time from Jan's linear model (only do nearest neighbors) together with one where we do all pairs, we see that even at the two-body level there is no linear scaling unless you make some approximations.

Here, I have assumed a computational scaling of $\alpha=2.8$ and uniform monomer sizes. I've assumed that a monomer calculation takes 1s and there is no overhead nor interaction at the monomer level.
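To make the assumptions concrete, here is a minimal sketch of such a timing model. A dimer is taken to be twice the size of a monomer and therefore costs $2^{\alpha}$ times as much. The nearest-neighbor count of $N-1$ pairs (a linear chain) is my own guess at what the linear model means, and the function name is hypothetical.

```python
def two_body_time(n_monomers, alpha=2.8, t_monomer=1.0, all_pairs=True):
    """Total time (s) for all one-body plus two-body calculations.

    Assumptions: uniform monomers, no overhead, a dimer is twice the size
    of a monomer so it costs t_monomer * 2**alpha.
    all_pairs=True  -> N*(N-1)/2 dimers (no approximation)
    all_pairs=False -> N-1 dimers (nearest neighbors in a chain, assumed)
    """
    t_dimer = t_monomer * 2.0 ** alpha
    if all_pairs:
        n_pairs = n_monomers * (n_monomers - 1) // 2
    else:
        n_pairs = max(n_monomers - 1, 0)
    return n_monomers * t_monomer + n_pairs * t_dimer


for n in (10, 100, 1000):
    full = two_body_time(n, all_pairs=True)
    chain = two_body_time(n, all_pairs=False)
    print(f"N={n:5d}  all pairs: {full:12.0f} s  nearest neighbors: {chain:8.0f} s")
```

The all-pairs time grows quadratically with the number of monomers, while the nearest-neighbor variant stays linear, which is the gap between the two curves discussed here.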

Admittedly, the linear model is crude, but it shows the best scaling you could hope for by including the minimum amount of two-body calculations. In a more realistic case, you would end up somewhere between the red and the black line, but that is the subject for a future post.

This is why we need linear scaling!

3-body calculations
Just for the fun of it, here is the 3-body scaling

and I dare not think of what the time would be for the calculation without approximations for higher n-body calculations.
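For a rough feel of how bad it gets without approximations, the same kind of model can be extended to arbitrary order: at order $k$ there are $\binom{N}{k}$ clusters, each $k$ monomers large and thus costing $k^{\alpha}$ monomer-times. This is only a sketch under the assumptions already stated (uniform monomers, 1 s per monomer, $\alpha=2.8$, no overhead); the function name is made up.

```python
from math import comb


def n_body_time(n_monomers, max_order, alpha=2.8, t_monomer=1.0):
    """Total time (s) for every 1-body up to max_order-body calculation.

    Assumes no screening at all: every cluster of k monomers is computed,
    and each such cluster costs t_monomer * k**alpha.
    """
    return sum(
        comb(n_monomers, k) * t_monomer * k ** alpha
        for k in range(1, max_order + 1)
    )


for order in (1, 2, 3, 4):
    hours = n_body_time(100, order) / 3600.0
    print(f"up to {order}-body, N=100: {hours:14.1f} hours")
```

Each extra order multiplies the number of clusters by roughly $N/k$, so the total time climbs steeply with the expansion order.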

I think we can all agree that approximations must be made, or else we are doomed.

We need linear scaling!

This work is licensed under a Creative Commons Attribution 3.0 Unported License.