Project Information

Name: dstats
Category: Libraries-System
Status: 3 - Alpha
Short Description: Statistics library for D, focused primarily on descriptive statistics and non-parametric hypothesis testing.
Long Description
Forum: /forums/viewforum.php?f=222
Home Page

dstats attempts to provide some basic (and some not-so-basic) statistics functionality as a plain old library on top of D, as well as a bunch of utility code useful for implementing higher-level statistical functionality. The intended use is for large datasets, or for integrating with things that require more general-purpose programming than a statistics package with a scripting language on top of it (such as R) can handle gracefully. It's geared towards programmers who need to do some statistics, not statisticians who need to do some programming.

dstats currently targets the bleeding edge of D compilers, as this allows me to push the boundaries of what APIs can be created and, frankly, makes the project more fun. Currently, it only works properly on recent versions of the DMD2 compiler, often only the most recent. For now, no effort will be made toward compatibility with older compilers. Once D2 is stable, it will probably settle into stable D2 rather than moving to D3.

So far, dstats features the following modules:

dstats.all - Convenience module that publicly imports everything else.
dstats.alloc - Custom memory allocation routines to speed up certain functions that require scratch space.
dstats.base - Relatively low-level primitives that other modules build on.
dstats.cor - Pearson, Spearman, and Kendall correlation, and covariance.
dstats.distrib - PMF/PDF, CDF and inverse CDF functions for several elementary distributions. To give credit where credit is due, large portions of this module were borrowed from Don Clugston's MathExtra, which Tango also uses. I did this because I wanted dstats to be reasonably self-contained rather than have a zillion dependencies.
dstats.gamma - The gamma function module borrowed from Tango.
dstats.infotheory - Entropy, mutual information, conditional mutual information.
dstats.kerneldensity - Computes kernel density estimates for empirical probability distributions.
dstats.random - Random number generation for several elementary probability distributions.
dstats.regress - Linear regression. Makes heavy use of D's ranges to provide an interesting and unusual API.
dstats.sort - Sorting algorithms with some added features that are useful for non-parametric statistics calculations.
dstats.summary - Summary statistics such as mean, median, standard deviation, skewness, and kurtosis.
dstats.tests - Hypothesis testing, such as t-tests, Wilcoxon tests, and Kolmogorov-Smirnov tests.
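
To make the rank-based (non-parametric) flavor of modules like dstats.cor and dstats.sort concrete, here is a minimal Python sketch of the underlying idea: Spearman correlation is Pearson correlation computed on ranks, with tied values given their average rank. This is not the dstats API. The function names `pearson`, `average_ranks`, and `spearman` are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values all receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For a monotonic but non-linear relationship such as y = x cubed, `spearman` returns 1 while `pearson` stays below 1, which is the property that makes rank-based statistics attractive when the data are not normally distributed.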

Status

dstats is currently in alpha, but getting close to beta. It's got plenty of well-tested code, but I don't want to declare it beta yet because, at least until D2 is finalized, I still reserve the right to make breaking changes to the API without notice. However, if you accept this caveat, then the level of testing, IMHO, makes it beta quality.

TODO/Help Wanted

1. File bug reports, including enhancement requests for missing functionality.
2. Several non-parametric hypothesis tests use the asymptotic approximation only and lack exact p-value calculations. The exact calculations are very hard to implement efficiently. If someone with very good skills in combinatorics and dynamic programming would like to give it a try, exact p-value calculations are needed for Spearman correlation, the runs test, and the Kolmogorov-Smirnov test. Also, a good implementation of Fisher's exact test for contingency tables larger than 2x2 would be nice. (I have a bad implementation lying around, but didn't include it because it's so slow.)
3. The current implementation of the hypergeometric distribution relies on normal and binomial approximations for large arguments, to allow for scalability. These are reasonably accurate but not great. If anyone knows of an O(1) algorithm for calculating these probabilities, it would be greatly appreciated.
4. As discussed on digitalmars.d, eventually merge this code with other math code on dsource into a comprehensive library.
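
To illustrate the trade-off behind item 3, the following Python sketch compares an exact hypergeometric CDF, computed by summing PMF terms, with a normal approximation using a continuity correction. It is not the dstats implementation, and the function and parameter names are hypothetical: `N` is the population size, `K` the number of successes in the population, `n` the number of draws, and `k` the observed number of successes. The exact version costs time proportional to the number of terms summed, which is why an O(1) alternative would be valuable.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_pmf(k, N, K, n):
    """Exact P(X = k) for k successes in n draws without replacement."""
    lo, hi = max(0, n - (N - K)), min(n, K)
    if k < lo or k > hi:
        return 0.0
    return math.exp(
        log_choose(K, k) + log_choose(N - K, n - k) - log_choose(N, n)
    )


def hypergeom_cdf_exact(k, N, K, n):
    """Exact P(X <= k) by direct summation: O(k) work, not O(1)."""
    lo = max(0, n - (N - K))
    hi = min(k, n, K)
    total = sum(hypergeom_pmf(i, N, K, n) for i in range(lo, hi + 1))
    return min(1.0, total)  # guard against rounding slightly above 1


def hypergeom_cdf_normal(k, N, K, n):
    """Normal approximation to P(X <= k) with continuity correction.

    Assumes 0 < K < N and 0 < n < N so that the variance is positive.
    """
    p = K / N
    mean = n * p
    var = n * p * (1 - p) * (N - n) / (N - 1)
    z = (k + 0.5 - mean) / math.sqrt(var)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

Calling both CDF functions with the same arguments, for example `hypergeom_cdf_exact(30, 1000, 300, 100)` and `hypergeom_cdf_normal(30, 1000, 300, 100)`, gives a quick feel for how far the approximation drifts from the exact value, especially in the tails.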