GOTHAM molecule-a-thon
TL;DR
Find molecules that we can look for in GOTHAM spectra, and aggregate their spectroscopic parameters from the literature.
Your quest, should you choose to accept
The diagram below illustrates what the observational/experimental analysis pipeline looks like for GOTHAM.
In the spirit of collaboration and inspired by open-source development, you are more than welcome to work on any part of this pipeline to retirement. The main bottleneck for the observational (and to a certain extent experimental) sides right now, however, is collecting information on what is known and what is not known for molecules that are of interest to us.
What can I do to help?
For just a molecule a day, you can help Kelvin get access to fresh molecules for stacking in the GOTHAM data.
What we need for this is just a dedicated effort going through the literature and collecting what is (un)known in terms of rotational spectroscopy data for molecules we might be interested in. What molecules, you ask? You can choose from those identified with machine learning, or if you have some in mind, any contributions are welcome.
The main thing we ask of you is just consistency. What we (Kelvin) don't want is extra work created by inconsistent file formats and missing or untraceable data. To streamline the whole process, we need things to be:
- Machine readable
- Referenced
- Identifiable
What doesn't help is sending a catalog simulated at a temperature you don't know, without the spectroscopic parameters and dipole moment information, or where the data came from.
To help clarify what we want, we'll give an example of the typical workflow someone might do to think about, find, and organize spectroscopic information.
HC5N
Let's say we wanted to look for HC5N. It's a linear molecule with nitrogen and so we expect one rotational constant, B, maybe some centrifugal distortion terms, and a quadrupole term because of the nitrogen.
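To make these expectations concrete, here is a minimal sketch of how B and the distortion terms turn into line positions for a closed-shell linear molecule. It assumes the standard energy expression $E(J) = BJ(J+1) - DJ^2(J+1)^2 + HJ^3(J+1)^3$, ignores the nitrogen quadrupole splitting entirely, and uses the HC5N constants quoted in the YAML example in this guide. The function names are our own and not part of any GOTHAM tooling.

```python
# Constants in MHz, copied from the HC5N example in this guide.
B = 1331.332687
D = 30.1090e-6
H = 1.635e-12


def energy_mhz(j, b, d=0.0, h=0.0):
    """Rotational energy (in MHz) of level J for a closed-shell linear rotor."""
    x = j * (j + 1)
    return b * x - d * x**2 + h * x**3


def transition_mhz(j_lower, b, d=0.0, h=0.0):
    """Frequency of the J+1 <- J transition, hyperfine structure neglected."""
    return energy_mhz(j_lower + 1, b, d, h) - energy_mhz(j_lower, b, d, h)


if __name__ == "__main__":
    # Print the first few transitions to compare against a catalog.
    for j in range(0, 4):
        print(f"J = {j + 1} - {j}: {transition_mhz(j, B, D, H):.4f} MHz")
```

The distortion terms barely matter at low J, but they grow quickly with J, which is why they are worth recording even when they look tiny.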
The quickest way is to look at CDMS' entry for a molecule:
The right hand table has some of the terms we want, the principal rotational constants A, B, C, but not all (e.g. the quadrupole terms). Now we have to dig!
From there, we can look at the primary sources of data, which are given as references. Generally, take the most recent paper as your source.
Let's assume, however, that we didn't have CDMS to guide us—what do we do instead? This is what I usually do:
You then systematically go through these papers, and generally speaking look for a table that looks vaguely like this:

Aside from the nitrogen quadrupole term, we have most of what we need. To get it into a machine readable, consistent format, we use the YAML file format (the name originally stood for "Yet Another Markup Language"), which looks something like this, where all values are in MHz and Debye:

    # HC5N.yml
    B: 1331.332687
    D: 30.1090e-6
    H: 1.635e-12
    doi: "10.1016/j.jms.2004.02.019"
    u_a: 4.33
    smiles: "C#CC#CC#N"
    name: "HC5N"
    formula: "HC5N"
    notes: "No hyperfine structure."
    author: "KLKL"

The point here is to not need to think about how to write this with SPCAT; sewers has functionality to do this. All you have to do is quite literally write the parameters as key/value pairs, as shown above. What it captures is what molecule it is, where the data came from, and what each parameter value is.
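As a rough illustration of why plain key/value pairs are convenient, the sketch below reads an entry like the HC5N one above into a Python dictionary using only the standard library. This is not how sewers does it (we make no claim about its internals), and in practice a real YAML parser such as PyYAML's `yaml.safe_load` would be the sensible choice. `parse_entry` is a hypothetical helper that only understands flat `key: value` lines.

```python
ENTRY = '''
# HC5N.yml
B: 1331.332687
D: 30.1090e-6
H: 1.635e-12
doi: "10.1016/j.jms.2004.02.019"
u_a: 4.33
smiles: "C#CC#CC#N"
name: "HC5N"
formula: "HC5N"
notes: "No hyperfine structure."
author: "KLKL"
'''


def parse_entry(text):
    """Parse flat `key: value` lines; numbers become floats, the rest strings."""
    entry = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks and whole-line comments such as "# HC5N.yml".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value[:1] in ("'", '"'):
            # Quoted values stay strings (this keeps the "#" in SMILES intact).
            entry[key] = value.strip("'\"")
        else:
            try:
                entry[key] = float(value)
            except ValueError:
                entry[key] = value
    return entry


entry = parse_entry(ENTRY)
print(entry["B"], entry["doi"], entry["smiles"])
```

Because every value is tied to a named key, the same dictionary can feed a catalog simulation, a provenance check, or a table for a paper without anyone retyping numbers.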
A note regarding dipole moments: typically these are the hardest things to find because dipole moments are hard to experimentally measure. The task then becomes finding a theory paper that calculates these, and determining which level of theory to trust. The general rule of thumb is the more letters it contains the better. If you see CCSD(T), take that value.
Pyridazine
This molecule takes it to the other side of things, as an asymmetric top with many parameters. Grabbing the data from Esselman et al.:

The same treatment is given and formatted like this:

    # pyridazine.yml
    A: 6242.95041
    B: 5961.09410
    C: 3048.71363
    Delta_J: 0.75676e-3
    Delta_JK: -0.13558e-3
    Delta_K: 0.82976e-3
    del_J: 0.315715e-3
    del_K: 0.68234e-3
    Phi_J: 0.000233e-6
    ...
    Phi_K: 0.002081e-6
    rep: "Ir"
    name: "pyridazine"
    formula: "C4H4N2"
    smiles: "C1=CC=NN=C1"
    notes: "No hyperfine structure"
    author: "IRC"

I left out the in-betweens and the DOI because I can't be stuffed. The last three values in the paper's table are not actually Hamiltonian parameters, but the inertial defect, asymmetry parameter, and RMS respectively.
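The asymmetry parameter and inertial defect can be recomputed from A, B, and C, which makes a handy sanity check on a transcription. The sketch below assumes the usual definitions, Ray's asymmetry parameter $\kappa = (2B - A - C)/(A - C)$ and the inertial defect $\Delta = I_c - I_a - I_b$, together with the approximate conversion $I$ (amu Å$^2$) $\approx 505379 / X$ (MHz). The helper names are illustrative only.

```python
# Approximate conversion factor h / (8 pi^2) in MHz * amu * Angstrom^2.
CONVERSION = 505379.0

# Rotational constants in MHz from the pyridazine example in this guide.
A, B, C = 6242.95041, 5961.09410, 3048.71363


def asymmetry_parameter(a, b, c):
    """Ray's kappa: -1 for a prolate symmetric top, +1 for an oblate one."""
    return (2 * b - a - c) / (a - c)


def inertial_defect(a, b, c):
    """Inertial defect I_c - I_a - I_b in amu Angstrom^2; near zero if planar."""
    return CONVERSION / c - CONVERSION / a - CONVERSION / b


print(f"kappa = {asymmetry_parameter(A, B, C):.4f}")
print(f"inertial defect = {inertial_defect(A, B, C):.4f} amu A^2")
```

If the recomputed values disagree noticeably with the ones printed in the paper, the likeliest culprit is a typo in A, B, or C.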
Frequently asked questions
I have XYZ parameters in this paper, how do I code it up?
If it looks complicated, it probably is—ask for help from one of the resident spectroscopists.
In the near term, for simplicity's sake overly complicated Hamiltonians are not going to be captured in the scope of this hack-a-thon. This is especially true if you see cosine terms and whatnot (i.e. large amplitude motions) in the table, but feel free to ask if unsure.
In terms of writing them down in the YAML format, write the parameter names as you would in LaTeX. Here's a non-exhaustive list of parameters currently implemented:
Parameter | Coding |
---|---|
$A,B,C$ | A, B, C |
$D_J,D_{JK},D_K$ | D_J, D_JK, D_K |
$d_1,d_2$ | d_1, d_2 |
$\Delta_J, \Delta_{JK}, \Delta_K$ | Delta_J, Delta_JK, Delta_K |
$\delta_J, \delta_K$ | del_J, del_K |
$H_{J},H_{JK},H_{KJ},H_K$ | H_J, H_JK, H_KJ, H_K |
$h_1,h_2,h_3$ | h_1, h_2, h_3 |
$\Phi_J,\Phi_{JK},\Phi_{KJ},\Phi_K$ | Phi_J, Phi_JK, Phi_KJ, Phi_K |
$\phi_J,\phi_{JK},\phi_K$ | phi_J, phi_JK, phi_K |
$\chi_{aa},\chi_{bb},\chi_{cc}$ | chi_aa, chi_bb, chi_cc |
Be careful not to mix and match A-reduced and S-reduced names for parameters. In terms of the structure of a YAML format for us, all key/values are optional, and so if your molecule is linear don't bother putting down A and C. For more than one nucleus with spin, add "_1", "_2", etc. for each nucleus. For the remaining fields, be as consistent as you can:

    notes: "Write anything noteworthy about the molecule/paper here"
    doi: "Write the DOI for the paper here"
    smiles: "Really helps nail down the molecule without a name"
    name: "If you really want to, add a common name"
    formula: "Write the molecular (not structural) formula"
    author: "Initials of the person who wrote this"
    ...

where ... are just all the parameters relevant to the molecule.
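A quick automated check can catch the most common slip, mixing reductions. The sketch below is a hypothetical linter, not part of sewers: it groups the key names from the table above into A-reduced and S-reduced sets and complains when an entry uses both. We assume the usual convention that the `H`/`h` sextic terms belong to the S reduction and the `Phi`/`phi` terms to the A reduction.

```python
S_REDUCED = {"D_J", "D_JK", "D_K", "d_1", "d_2",
             "H_J", "H_JK", "H_KJ", "H_K", "h_1", "h_2", "h_3"}
A_REDUCED = {"Delta_J", "Delta_JK", "Delta_K", "del_J", "del_K",
             "Phi_J", "Phi_JK", "Phi_KJ", "Phi_K", "phi_J", "phi_JK", "phi_K"}


def reduction_of(entry):
    """Return "A", "S", or None, raising if the entry mixes both reductions."""
    keys = set(entry)
    a_keys = sorted(keys & A_REDUCED)
    s_keys = sorted(keys & S_REDUCED)
    if a_keys and s_keys:
        raise ValueError(
            f"mixed reductions: A-reduced {a_keys}, S-reduced {s_keys}"
        )
    if a_keys:
        return "A"
    if s_keys:
        return "S"
    # No distortion terms at all, e.g. an entry with only B.
    return None


pyridazine = {"A": 6242.95041, "B": 5961.09410, "C": 3048.71363,
              "Delta_J": 0.75676e-3, "del_J": 0.315715e-3}
print(reduction_of(pyridazine))
```

Running a check like this before committing a file is cheaper than discovering the mix-up after a catalog has been simulated.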
What do XYZ parameters mean?
Here's a short list of the terms you are going to look for:
Parameter | Description | Molecule types |
---|---|---|
$A,B,C$ | Principal rotational constants; by convention $A \geq B \geq C$, so the a axis has the smallest moment of inertia and the c axis the largest. | Linear molecules only have B, symmetric tops have B and C, and asymmetric tops have A, B, C. Sometimes asymmetric tops only have B and C. |
$D_J, D_{JK}, D_K, d_1, d_2$ | S-reduced, quartic centrifugal distortion constants | Linear molecules will only have D_J |
$\Delta_J, \Delta_{JK}, \Delta_K, \delta_J, \delta_K$ | A-reduced, quartic centrifugal distortion constants | Read notes for above. For the YAML file, write them as you would write them in LaTeX. |
$H_{JK}, \phi_J$ and so on | Sextic centrifugal distortion constants, probably not that common | Asymmetric tops |
$eQq, \chi_{aa}$ | Nuclear quadrupole terms; former is for linear molecules, latter is the tensor element for the a-axis which is most common. | Molecules with atoms whose nuclear spin is $I \geq 1$, like nitrogen ($^{14}$N, $I = 1$) and chlorine ($I = 3/2$) |
$\gamma, \epsilon$ | Electron spin-rotation interaction terms; former is for linear molecules, latter will have the projection axis as subscripts | This is about as complicated as we'd like to get, which are for radicals. |
If there are terms not in this list, ask a spectroscopist, but they are probably not worth the trouble. Consult this PDF for some of the more esoteric symbols. If it has cosine and whatnot, it's safe to ignore.
How do I know what kind of top molecule X is?
The type of rotor depends on the number of unique inertial axes a molecule has—basically how many unique ways you can make a molecule rotate. If a molecule has equivalent rotations, then it's not going to be an asymmetric top. On the other end, a linear molecule only has one way to really rotate, which is why only B is defined.
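One way to turn this into a rule of thumb is to look at which rotational constants are available and whether any of them coincide. The sketch below is a helper of our own with a hypothetical relative tolerance for "equal". It only works on the constants you give it, so a paper that reports just B and C for an asymmetric top comes back as "undetermined" rather than a guess.

```python
import math


def rotor_type(a=None, b=None, c=None, rel_tol=1e-6):
    """Classify a rotor from the rotational constants that are available (MHz)."""
    if b is None:
        raise ValueError("at least B is required")
    if a is None and c is None:
        return "linear"
    if a is None or c is None:
        # Only two constants reported; not enough to decide between top types.
        return "undetermined"
    if math.isclose(a, b, rel_tol=rel_tol) and math.isclose(b, c, rel_tol=rel_tol):
        return "spherical top"
    if math.isclose(b, c, rel_tol=rel_tol):
        return "prolate symmetric top"  # A > B = C
    if math.isclose(a, b, rel_tol=rel_tol):
        return "oblate symmetric top"   # A = B > C
    return "asymmetric top"


print(rotor_type(b=1331.332687))                         # HC5N
print(rotor_type(6242.95041, 5961.09410, 3048.71363))    # pyridazine
```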
For asymmetric tops, there are six different choices (Ir, Il, IIr, IIl, IIIr, IIIl) for the axis system of the molecule, corresponding to a, b, c assigned to z, y, x and all permutations. This matters! Look for this value in the paper and include it under the "rep" keyword. SPCAT only handles two cases: Ir and IIIl. If you don't specify one of these two cases, sewers will just choose one of the two based on the asymmetry of the molecule, but this is not great... If you know how to convert the constants between representations, go ahead and do that. This functionality may exist in a future version of this codebase.
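For illustration only, an asymmetry-based fallback of the kind described above could look like the sketch below. This is our guess at a reasonable rule, not the actual logic in sewers: it assumes the common convention that Ir suits near-prolate tops ($\kappa < 0$) and IIIl suits near-oblate tops ($\kappa > 0$), and it always prefers a representation stated in the paper.

```python
SUPPORTED = ("Ir", "IIIl")


def choose_representation(a, b, c, rep=None):
    """Return the stated representation, or guess one from Ray's kappa."""
    if rep is not None:
        if rep not in SUPPORTED:
            raise ValueError(f"{rep} would need converting to one of {SUPPORTED}")
        return rep
    # Ray's asymmetry parameter: -1 is the prolate limit, +1 the oblate limit.
    kappa = (2 * b - a - c) / (a - c)
    return "Ir" if kappa < 0 else "IIIl"


# Pyridazine constants (MHz) from the example in this guide.
print(choose_representation(6242.95041, 5961.09410, 3048.71363))
print(choose_representation(6242.95041, 5961.09410, 3048.71363, rep="Ir"))
```

For the pyridazine constants this guess lands on IIIl, whereas the example entry states Ir. That mismatch is exactly why the value from the paper should be recorded instead of leaving it to a heuristic.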
How do I know what terms to expect?
As with all things to do with molecules, it's good to build an internal picture of some "template molecules" to use as a basis for things like symmetry classification, and what parameters to expect you'll need.
For example, H2CS is very similar to H2CO, and so one would expect the parameter encoding to be very similar. Another example would be CH3NH2: maybe a little less obvious, but it is similar to methanol (CH3OH), and so you would expect the same kinds of terms needed, except with a nitrogen. Coincidentally, both have large amplitude motion from NH2 and OH, and so their Hamiltonians are going to be fairly complicated.
This is useful to know if you need to do SPCAT stuff, as you can simply refer to Holger's emporium of molecules and choose a molecule most similar to the one you're interested in.
Do I need hyperfine structure?
The short answer is yes. We want everything to be as systematic and future proof as possible.
Depending on the atom and how it's bonded, you should generally consider it. The CN molecules that are typical to TMC-1 have their hyperfine structure collapse as you go to higher J values, so that it becomes unresolved at higher frequencies. The splitting is also proportional to the magnitude of the hyperfine coupling term, and decreases from its atomic value with more and more bonding partners. For example, the quadrupole term ordering would be N >> CN >> NH >> -CNH, and therefore as you go along that line the hyperfine becomes less easily resolved and "important" for the spectroscopy.
How do I find the SMILES code?
Generally, searching the molecule name on PubChem and/or ChemSpider will find you something. However, not all molecules are known, and some might not have publicly available SMILES codes, which is typically true for radicals or really transient molecules. Ask Kelvin if you can't find it.
Which SMILES should I use?
Canonical SMILES—it is as unique as you can get.
Which paper should I get parameters from?
This is a lot more nuanced, and generally comes from experience. Newer is generally better, as it usually folds in data collected over many years. That said, there is a hierarchy of experimental techniques in terms of resolution. It is maybe not something you'll pick up overnight, but as you're looking into papers pay attention to the resolution and frequency range of the data source, and match it to the frequency range of your data. For GOTHAM, Fourier transform cavity spectroscopy will yield the best in class data, providing the highest resolution and accuracy typically attainable, which will typically be in papers after the '90s.
Another aspect that is relevant to us is hyperfine structure. At these low frequencies and low J values we will typically resolve them, and measurements in the mm-wave will generally not include them because they are unresolved. This may be a situation where parameters from an older, high resolution/low frequency paper are more relevant than a newer high frequency paper.
Do you care about uncertainty?
For GOTHAM, no. On a personal level, very much so.
Which level of theory for theoretical data?
Usually a pretty nuanced thing, and you can look up this paper to see if the method you're looking at is listed.
Generally speaking, the method (the thing before the slash / ) is pretty invariant, and to get dipole moments with low uncertainty you just need large basis sets (the thing after the slash). With a cc-pVQZ basis, the uncertainty will be on the order of +/-0.3 Debye.
In terms of how to judge one basis over another, basis sets are ranked by their zeta level $\zeta$ (roughly, the number of basis functions used per valence orbital), with larger being better:
- Dunning basis: cc-pVXZ, X = D, T, Q, 5, 6
- Pople basis: X-YZG, 3-21G < 6-31G < 6-311G
With Pople basis sets, dipole moments are useless without polarization functions—if it doesn't have at least a (d) at the end, their uncertainties will be wild. These basis sets can also be larger with diffuse functions added to them, which are denoted with +'s.
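These rules are easy to mechanize when comparing several theory papers. The sketch below is a hypothetical helper that splits a `method/basis` label and pulls out the Dunning cardinal number, or checks a Pople basis for polarization functions. The regular expressions cover only the naming patterns mentioned here, plus the `aug-` prefix that marks diffuse functions on Dunning sets, and will not recognize every basis set in the literature.

```python
import re

CARDINAL = {"D": 2, "T": 3, "Q": 4, "5": 5, "6": 6}


def split_level(label):
    """Split a label such as "CCSD(T)/cc-pVQZ" into (method, basis)."""
    method, _, basis = label.partition("/")
    return method.strip(), basis.strip()


def dunning_cardinal(basis):
    """Return X for (aug-)cc-pVXZ style names, or None if it is not one."""
    match = re.fullmatch(r"(?:aug-)?cc-pV([DTQ56])Z", basis, flags=re.IGNORECASE)
    return CARDINAL[match.group(1).upper()] if match else None


def pople_has_polarization(basis):
    """True if a Pople-style name carries polarization functions, e.g. (d) or *."""
    is_pople = re.match(r"\d-\d+\+{0,2}G", basis) is not None
    return is_pople and ("(" in basis or "*" in basis)


method, basis = split_level("CCSD(T)/cc-pVQZ")
print(method, basis, dunning_cardinal(basis))
print(pople_has_polarization("6-31G"), pople_has_polarization("6-31G(d)"))
```

A dipole moment from a Pople basis that fails the polarization check is a good candidate for a comment in the `notes` field, so whoever uses the entry later knows to treat it with suspicion.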