I have a question about a clustering problem that I am thinking of treating with a nonparametric mixture model (I think).
The problem is the following:
I am working on explaining human behavior. Each row of my database contains: 1) the id of a person, 2) some parameters X of the environment (for example: the temperature, the wind, etc.), 3) a binary variable Y representing the person's reaction to those parameters (for example: gets sick or does not get sick because of the weather). My idea (based on intuition, not on data) is that people can be gathered into a finite number of groups so that, within a group, people react to the parameters in the same way (some get sick easily, others never get sick...). More formally, within a given group, the distribution of Y conditional on the parameters X is the same for everyone.
I have no idea what the distribution of Y conditional on X looks like. For the parameters X, I can make some assumptions if necessary.
I would like to build clusters of people who have "more or less" the same reaction to the parameters. In addition, I would like to predict the reaction of a given person to a given value of the parameters (even if that combination has never occurred in the database).
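To state the idea a bit more formally (the notation below is only illustrative: i indexes people, j indexes the rows of person i, Z_i is the unobserved group of person i, K is the number of groups, and f_k is the unknown response function of group k; I also assume that the rows of one person are independent given that person's group and the X values):

```latex
\begin{align*}
P(Z_i = k) &= \pi_k, \qquad k = 1, \dots, K, \\
P(Y_{ij} = 1 \mid X_{ij} = x,\; Z_i = k) &= f_k(x), \\
L_i &= \sum_{k=1}^{K} \pi_k \prod_{j} f_k(x_{ij})^{y_{ij}} \bigl(1 - f_k(x_{ij})\bigr)^{1 - y_{ij}}.
\end{align*}
```

Here L_i is the likelihood contribution of person i. The group is attached to the person, not to the row, so all rows of one person share the same Z_i.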
It seems to me that the problem can be treated as a nonparametric mixture model. Indeed, since I have no assumption about the conditional distribution of Y, I think I will have to estimate it, with kernel methods for instance. I have found this paper: http://biometrics.cse.msu.edu/Publications/Clustering/MallapragadaJinJain_NonparametricMixtureModels_SSSPR10.pdf Besides, it seems to me that, in my case, each observed row (X_i, Y_i) is not a simple realization of a single random variable: X_i is a realization of a random variable, and Y_i is a realization of a random variable conditional on X_i. I don't know whether that makes a difference.
Additional detail: I have around 100,000 rows, and the vector X_i has some discrete components and some continuous ones.
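To make the kind of model I mean concrete, here is a minimal sketch in Python. It is not the kernel-based method of the paper: purely for illustration, it assumes a logistic form for each group's response (a parametric stand-in) and fits a mixture of logistic regressions by EM, with one latent group per person rather than per row. All function names, the toy data, and the settings (number of groups, learning rate, iteration counts) are made up for the example.

```python
import math
import random
from collections import defaultdict

def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def fit_mixture(rows, n_groups, n_iter=50, lr=0.5, grad_steps=5, seed=0):
    """rows: list of (person_id, x, y); x is a list of floats whose first
    entry is 1.0 (intercept), y is 0 or 1. Returns (weights, mix, resp)."""
    rng = random.Random(seed)
    by_person = defaultdict(list)
    for pid, x, y in rows:
        by_person[pid].append((x, y))
    dim = len(rows[0][1])
    weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_groups)]
    mix = [1.0 / n_groups] * n_groups
    resp = {}
    for _ in range(n_iter):
        # E-step: posterior group probabilities for each person,
        # using all rows of that person at once.
        for pid, obs in by_person.items():
            logp = []
            for k in range(n_groups):
                ll = math.log(mix[k] + 1e-300)
                for x, y in obs:
                    p = min(max(sigmoid(dot(weights[k], x)), 1e-12), 1 - 1e-12)
                    ll += math.log(p) if y == 1 else math.log(1 - p)
                logp.append(ll)
            m = max(logp)
            unnorm = [math.exp(v - m) for v in logp]
            s = sum(unnorm)
            resp[pid] = [u / s for u in unnorm]
        # M-step: update mixing proportions, then take a few weighted
        # gradient-ascent steps on each group's logistic regression.
        mix = [sum(r[k] for r in resp.values()) / len(resp) for k in range(n_groups)]
        for k in range(n_groups):
            for _ in range(grad_steps):
                grad = [0.0] * dim
                total = 0.0
                for pid, obs in by_person.items():
                    r = resp[pid][k]
                    for x, y in obs:
                        err = y - sigmoid(dot(weights[k], x))
                        for j in range(dim):
                            grad[j] += r * err * x[j]
                        total += r
                if total > 0:
                    weights[k] = [w + lr * g / total for w, g in zip(weights[k], grad)]
    return weights, mix, resp

def predict(model, pid, x):
    # P(Y = 1 | x) for a person: average the group responses, weighted by
    # that person's posterior group probabilities. For a person not seen
    # in training, fall back to the population mixing proportions.
    weights, mix, resp = model
    r = resp.get(pid, mix)
    return sum(rk * sigmoid(dot(wk, x)) for rk, wk in zip(r, weights))

def toy_rows(n_people=20, n_obs=30, seed=1):
    # Synthetic data: even ids get sick when x is low, odd ids rarely do.
    rng = random.Random(seed)
    rows = []
    for pid in range(n_people):
        w = [0.0, -3.0] if pid % 2 == 0 else [-3.0, 0.0]
        for _ in range(n_obs):
            x = [1.0, rng.uniform(-2, 2)]
            y = 1 if rng.random() < sigmoid(dot(w, x)) else 0
            rows.append((pid, x, y))
    return rows

if __name__ == "__main__":
    model = fit_mixture(toy_rows(), n_groups=2)
    print("mixing proportions:", model[1])
    print("P(sick | person 0, x = -1.5):", predict(model, 0, [1.0, -1.5]))
```

The clusters would then be read off from `resp` (each person goes to the group with the highest posterior probability), and `predict` covers values of X that a person has never experienced, because the group's response function is shared by everyone in the group. Replacing the logistic form by a kernel estimate of each group's response would be the nonparametric version of the same scheme.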
So, I am wondering: is my approach correct? Would you suggest another way of looking at this problem?
I would be very interested in any references about it.
Don't hesitate to ask me to reformulate the problem statement.
Thank you very much in advance.