I am developing a new method and want to evaluate it under a range of conditions, such as favorable, random, and adversarial datasets.
However, my experience thus far has been almost entirely with real datasets that I was either given or collected myself.
Additionally, the required dataset is somewhat unusual: the features and labels have the same form for all samples, but each sample belongs to a distinct "type" which could best be described as a cluster. Thus, though a dataset may have 500 samples that all look the same to the system, we really have 10 sets of 50 samples, each of which has 25 positive and 25 negative cases (assuming binary classification).
A real world example of this might be a dataset of cell images where the labels are cancer (1) or no cancer (-1), but many distinct types of cancer exist.
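To make the structure above concrete, here is a minimal sketch (my own illustration, not from any library) of how such a dataset's labels and hidden cluster memberships could be laid out, with the cluster IDs kept separate from what the classifier sees:

```python
import numpy as np

n_clusters = 10    # distinct "types" (e.g. cancer subtypes)
per_cluster = 50   # samples per type: 25 positive, 25 negative

# Cluster membership is hidden from the classifier; it only sees X and y.
cluster_id = np.repeat(np.arange(n_clusters), per_cluster)
y = np.tile(np.array([1] * 25 + [-1] * 25), n_clusters)

assert len(y) == 500                          # 10 clusters x 50 samples
assert (y[cluster_id == 3] == 1).sum() == 25  # 25 positives per cluster
```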
My current approach is the following, shown for a 1D feature vector. Note that in each case we add some simple random noise after the samples are created:
Favorable: Separate the problem space into 10 blocks ([0,1], [1,2], ..., [9,10]), and within each block a probability density function (most likely Gaussian) produces samples.
Random: Same as favorable but without a separated problem space. In other words, each probability density function is located anywhere in [0,10], and they may even overlap.
Adversarial: The probability density functions all significantly overlap with one another.
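For reference, here is one way the three regimes could be sketched in code. This is my own illustrative implementation, with made-up parameter values (block width, noise level, the [4,6] crowding region for the adversarial case); it follows the description above literally, so the label split within each cluster is 25/25 regardless of feature value:

```python
import numpy as np

def make_1d_dataset(regime, n_clusters=10, per_cluster=50,
                    noise_sd=0.05, seed=0):
    """Generate a 1D clustered dataset under one of three regimes.
    Function name and parameters are illustrative assumptions."""
    rng = np.random.default_rng(seed)

    if regime == "favorable":
        # One Gaussian per block [k, k+1): well-separated clusters.
        centers = np.arange(n_clusters) + 0.5
        sd = 0.1
    elif regime == "random":
        # Centers placed anywhere in [0, 10]; overlap happens by chance.
        centers = rng.uniform(0, 10, size=n_clusters)
        sd = 0.1
    elif regime == "adversarial":
        # Wide Gaussians crowded into a narrow region: heavy overlap.
        centers = rng.uniform(4, 6, size=n_clusters)
        sd = 1.0
    else:
        raise ValueError(f"unknown regime: {regime}")

    x = np.concatenate([rng.normal(c, sd, size=per_cluster)
                        for c in centers])
    # Simple random noise added after the samples are created, as above.
    x += rng.normal(0, noise_sd, size=x.size)

    y = np.tile(np.array([1] * 25 + [-1] * 25), n_clusters)
    cluster_id = np.repeat(np.arange(n_clusters), per_cluster)
    return x, y, cluster_id
```

One design note: with a single PDF per cluster, the labels inside a cluster are not separable from the 1D feature alone; if the method is supposed to learn the positive/negative distinction, you would likely want two (possibly offset) PDFs per cluster, one per label.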
Does this sound correct, or am I totally off on how I am supposed to be creating this data?