Friday, April 26, 2013

Determining the distribution of multiple samples in python.

I am currently working on a probabilistic model for assigning peaks from protein NMR spectra to the corresponding spin-systems of the protein. So far I have assumed that the error for different measurements of the same chemical shift follows a normal distribution.
Analyzing a set of already assigned peak-lists for the protein S6, it is clear that there's more outliers than a normal distribution predicts, which justifies a deeper analysis of the data. However even if I assume that the error for e.g. all amid protons in the HNCA spectrum follows the same distribution, the samples variance might be different for the HNcaCO experiment.
A full analysis leaves me in this case with 45 different samples that most likely follow the same kind of distribution with differing parameters. So the question was how to continue from here. Luckily I found this blog post.

What the author did was to brute force all 80 distributions that scipy.stats offers and list the p and D values from a Kolmogorov-Smirnov test for a single sample. I ended up improving the code to support multiple samples, as well as displaying the log likelihood of the fit. (Any distribution that fits extremely poorly with a single sample is removed from the output)

The code can be found here: distribution-check.py

If all your samples are in the folder samples/ with the extension .txt, then simply run

python distribution-check.py samples/*.txt

As an example, here's the output for running the script on my 45 samples:

Apparently a normal distribution is not a very good 2-parameter representation of my samples, even though the simplicity of it might still be beneficial, so I will probably end up testing a few others. The -inf values probably occurs due to asymptotic behaviour at the mean and I would suggest just ignoring them, and use the KS test instead to determine if its a good fit