Source code (searching for statistical dependency/association rules)
The following program packages contain implementations of my
algorithms. All of them search for non-redundant statistical dependency
rules using different measure functions and two different notions of
redundancy. In practice, dependency rules are similar to
association rules, but they always express a statistical dependency and
there is no requirement for minimum frequency. For redundancy, I have used two definitions:
- Classical definition: X -> A is non-redundant with measure M if
all more general rules Y -> A (Y is a proper subset of X) have M(Y -> A) <= M(X -> A).
- Strict definition: X -> A is non-redundant with measure M if
all subsets of XA can produce only poorer rules, i.e. for all YB that
are proper subsets of XA, M(Y -> B) <= M(X -> A).
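The classical definition above can be sketched in C as follows. This is only an illustration and not code from the packages: the bitmask encoding of itemsets, the callback type measure_fn, and the toy table_measure are all assumptions made for the example. It also assumes that larger measure values are better (as in the inequality above) and that only non-empty proper subsets Y are compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns M(X -> A) for the antecedent bitmask X
 * and the consequent attribute index A. Larger values mean better rules. */
typedef double (*measure_fn)(unsigned antecedent, int consequent,
                             const void *ctx);

/* Returns 1 if X -> A is classically non-redundant, i.e. no rule Y -> A
 * with Y a non-empty proper subset of X has a larger measure value. */
int is_classically_nonredundant(unsigned x, int a, measure_fn m,
                                const void *ctx)
{
    assert(m != NULL);
    double mx = m(x, a, ctx);
    /* (y - 1) & x steps through all proper submasks of x in
     * decreasing order; the loop stops before the empty set. */
    for (unsigned y = (x - 1) & x; y != 0; y = (y - 1) & x) {
        if (m(y, a, ctx) > mx)
            return 0; /* a more general rule is better */
    }
    return 1;
}

/* Toy measure for demonstration only: looks the score up in a table
 * indexed by the antecedent bitmask and ignores the consequent. */
double table_measure(unsigned antecedent, int consequent, const void *ctx)
{
    (void)consequent;
    return ((const double *)ctx)[antecedent];
}
```

With a measure where smaller is better (such as a p-value), the comparison in the loop would have to be reversed. The strict definition would additionally require looping over every possible consequent B within XA.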
The packages contain C source code and instructions on how to use and
modify the programs (README file). All programs are written for Linux
and the gcc compiler. They may work in other environments, but nothing is
guaranteed, and I cannot help you with any Windows-specific problems.
Kingfisher version 1.2b
New useful constraints and the mutual information measure added. See instructions and tips
Kingfisher version 1.1. Searches for the best classically non-redundant
dependency rules (both positive and negative) with the p-value from
Fisher's exact test or the chi2-measure. In particular, the search with
Fisher's exact test is efficient and scales up to very high-dimensional
and dense data sets. The search with the chi2-measure is less efficient
and the results are typically less accurate (see experiments in my
searches for the best classically non-redundant statistical dependency rules with the
chi2-measure or the z-score. The resulting rules are non-redundant in
the classical sense (i.e. a rule X -> A is non-redundant if all more
general rules Y -> A, where Y is a subset of X, are poorer).
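To make the measures named above concrete, here is a minimal sketch of how the Fisher p-value and the chi2-measure of a rule X -> A could be computed from four counts. It is not taken from the packages. The function names and the parameters n (number of rows), fx (rows containing X), fa (rows containing A) and fxa (rows containing both) are assumptions for the example, and the p-value shown is the usual one-sided hypergeometric tail for a positive dependency.

```c
#include <assert.h>

/* One-sided Fisher p-value for a positive dependency X -> A: the
 * probability of seeing at least fxa co-occurrences when the margins
 * fx and fa are fixed and X and A are independent. */
double fisher_p(long n, long fx, long fa, long fxa)
{
    long lo = (fx + fa > n) ? fx + fa - n : 0; /* smallest possible overlap */
    long hi = (fx < fa) ? fx : fa;             /* largest possible overlap */
    assert(n > 0 && fx >= 0 && fx <= n && fa >= 0 && fa <= n);
    assert(fxa >= lo && fxa <= hi);

    /* Unnormalized hypergeometric weights, starting from w(lo) = 1 and
     * using the ratio P(i+1)/P(i). This is fine for small data; a
     * robust version would work in log space to avoid overflow. */
    double w = 1.0, total = 0.0, tail = 0.0;
    for (long i = lo; i <= hi; i++) {
        total += w;
        if (i >= fxa)
            tail += w;
        w *= (double)(fx - i) * (double)(fa - i)
           / ((double)(i + 1) * (double)(n - fx - fa + i + 1));
    }
    return tail / total;
}

/* Chi-squared statistic of the 2x2 contingency table of X and A. */
double chi2(long n, long fx, long fa, long fxa)
{
    assert(fx > 0 && fx < n && fa > 0 && fa < n);
    double d = (double)n * (double)fxa - (double)fx * (double)fa;
    return (double)n * d * d
         / ((double)fx * (double)fa * (double)(n - fx) * (double)(n - fa));
}
```

For example, with n = 10, fx = 5, fa = 5 and fxa = 5 (X and A always occur together), fisher_p gives 1/252 and chi2 gives 10. Note that a small p-value indicates a strong dependency, whereas for chi2 a large value does.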
searches for the best strictly non-redundant dependency rules. The
goodness measure can be either the chi2-measure or the
z-score. Note: this is more efficient than StatApriori, but it
searches for only the best Q rules. However, you can always set a
large Q together with a desired threshold for the goodness measure.
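The combined effect of Q and a threshold can be illustrated with a small post-processing sketch: sort the candidate rules by goodness and keep at most Q of them that reach the threshold. This only shows the selection criterion, not how the programs search. The type rule_t and the function keep_best are hypothetical names, and larger scores are assumed to be better.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical rule record: antecedent bitmask, consequent index, score. */
typedef struct {
    unsigned antecedent;
    int consequent;
    double score;
} rule_t;

/* qsort comparator: descending order of score. */
static int by_score_desc(const void *pa, const void *pb)
{
    double a = ((const rule_t *)pa)->score;
    double b = ((const rule_t *)pb)->score;
    return (b > a) - (b < a);
}

/* Sorts rules in place (best first) and returns how many of the leading
 * rules to keep: at most q, and only those with score >= min_score. */
size_t keep_best(rule_t *rules, size_t n, size_t q, double min_score)
{
    assert(rules != NULL || n == 0);
    qsort(rules, n, sizeof *rules, by_score_desc);
    size_t kept = 0;
    while (kept < n && kept < q && rules[kept].score >= min_score)
        kept++;
    return kept;
}
```

With a very large q the threshold alone decides how many rules are returned, which is the situation described above.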
StatApriori: a breadth-first search for all sufficiently good,
strictly non-redundant dependency rules with the z-score.
- DeepClue searches for the same rules as StatApriori, but in a depth-first
manner. It is less efficient, but saves memory, which can be more critical on
dense data sets.
Both of these can be further optimized!
- Namescodes transforms a transaction file with nominal attributes to numeric codes, and the resulting rule file back to nominal names.
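The idea behind such a name/code transformation can be sketched with a small lookup table. This is not the Namescodes source: the type name_table, the functions code_of and name_of, the size limits, and the choice to number codes from 0 are all assumptions for illustration, and the actual file formats are described in the package README.

```c
#include <assert.h>
#include <string.h>

#define MAX_NAMES 1024 /* assumed capacity of the table */
#define MAX_LEN 64     /* assumed maximum name length, including '\0' */

typedef struct {
    char names[MAX_NAMES][MAX_LEN];
    int count;
} name_table;

/* Returns the numeric code of a nominal value, assigning the next free
 * code if the name is new. Returns -1 if the table is full or the name
 * is too long. */
int code_of(name_table *t, const char *name)
{
    assert(t != NULL && name != NULL);
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return i;
    if (t->count == MAX_NAMES || strlen(name) >= MAX_LEN)
        return -1;
    strcpy(t->names[t->count], name);
    return t->count++;
}

/* Maps a code back to its nominal name, or NULL for an unknown code. */
const char *name_of(const name_table *t, int code)
{
    assert(t != NULL);
    return (code >= 0 && code < t->count) ? t->names[code] : NULL;
}
```

In this sketch, code_of would be applied to every item when writing the numeric transaction file, and name_of to every code when translating the rule file back. The linear search keeps the example short; a hash table would be a better fit for data with many distinct values.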