First, forgive me if my terminology is not correct, as this is a field in which I am certainly no expert, but I have taken a dive into it. So expect some oddities below if you have a reasonable level of expertise. The information below is fairly sparse, but some of you may still have thoughts that can point me in the right direction.
I have been working on a fruit picking project. One of the challenges has been developing an imaging system that can determine what is fruit and what is not. At a glance it seems fairly easy: I really only need to identify one thing, whether a pixel in an image is fruit, and do this for all pixels in the image. Fruit does seem obvious by colour on the tree. However, after working for some time on the problem I still have not achieved a reliability level that is, in my mind, high enough. It is important to keep in mind that a very high reliability level is required for commercial feasibility. It is “close” but not quite there yet, and thus I am hoping for some advice from people who might be interested.
So far, identifying fruit has proven fairly challenging in the sense that:
Real-world imaging can be noisy. White balance is not always consistent. Camera dynamic range is limited, and overexposure (which heavily affects the colour of a region) and sun flares are common. Lighting is not controlled, so strong shadows are evident.
Fruit is regularly obstructed by leaves. It has little shape and little structure. Camera resolution is not high enough to capture the very subtle texture on the fruit that would make it easier to identify. That said, I can readily identify the fruit in the image myself, so there should be enough data in the image.
I can see that if I convert an image to black and white, then even for me it is not obvious which parts are fruit and which parts are leaves. Yes, unobstructed fruit is obvious, but much of the fruit with partial leaf cover is not, and is easily mistaken. Thus colour is certainly important, but it has limited dependability.
The system must run in real time (preferably 50ms or less per frame), and classification is done on a per-pixel basis.
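For what it's worth, since several of the issues above (white balance drift, overexposure, shadows) mostly rescale pixel intensity, one standard trick worth considering is feeding the network intensity-normalized chromaticity instead of raw RGB, so brightness changes largely cancel out. A minimal sketch in Python/NumPy (the function name and example values are just illustrative, not from my actual pipeline):

```python
import numpy as np

def rg_chromaticity(rgb):
    """Map RGB to (r, g) chromaticity: r/(r+g+b), g/(r+g+b).

    A uniform brightness change (the same surface in shadow vs sun)
    scales r, g and b together, so it cancels out of the ratios and
    only the colour balance remains.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    s = rgb.sum(axis=-1, keepdims=True)
    s = np.where(s == 0.0, 1.0, s)  # avoid divide-by-zero on black pixels
    return rgb[..., :2] / s

# A pixel and the same pixel at half the intensity (in shadow)
# map to the same chromaticity:
bright = rg_chromaticity([200.0, 100.0, 100.0])  # -> [0.5, 0.25]
dark = rg_chromaticity([100.0, 50.0, 50.0])      # -> [0.5, 0.25]
```

This obviously does nothing for clipped (overexposed) pixels, where the colour information is already destroyed, but it removes one source of variation the network otherwise has to learn away.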
Currently I am using a self-developed, hardware-accelerated convolutional neural network. Thanks to some prior experience in hardware acceleration it runs fairly fast, and thanks to the repetitive structure of convolutional neural networks it is applied very efficiently across an entire image.
I have methods for training it in mini-batches via backpropagation with momentum, Rprop and RMSProp, as per the obvious literature suggestions. So far I have not implemented dropout (I had it in an early version, but it got lost in some optimisations). Activation functions include sigmoid, tanh and ReLU units. Typical structures I try are 5 to 8 layers deep. The kernel size is 5x5, as I can't see much point making it larger due to the lack of structure in fruit obstructed by leaves (maybe that's wrong and it is too small?). With some variability I am now trying 90+ maps per layer, as lower counts seemed to reach a certain level of reliability but still mis-detect some leaves as fruit (for example) in cases where I would expect it to do better. Thus I currently have 10 to 30 million connections and about 200k unique weights used to determine whether a pixel is fruit or not (if my serviette calculation is not wrong). To me this seems significant, as if it should be enough, specifically given the lack of structure, but perhaps I am wrong?
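To make the serviette calculation checkable: for a stack of 5x5 convolutional layers, the unique weight count is just in_maps * out_maps * 25 (plus biases), summed over layers. A quick sketch, where the assumed structure (RGB input, six layers of 90 maps) is only one plausible configuration and not necessarily the exact network:

```python
def conv_weight_count(layers):
    """Unique weights in a stack of convolutional layers.

    layers: list of (in_maps, out_maps, kernel_size) tuples.
    Each layer holds in*out*k*k kernel weights plus one bias per map.
    """
    return sum(cin * cout * k * k + cout for cin, cout, k in layers)

# Assumed structure: RGB input, then six 5x5 layers of 90 maps each.
layers = [(3, 90, 5)] + [(90, 90, 5)] * 5
print(conv_weight_count(layers))        # -> 1019790 (about 1M)

# A single 90-in, 90-out, 5x5 layer alone is already ~200k:
print(conv_weight_count([(90, 90, 5)]))  # -> 202590
```

Interestingly, 90 maps in every layer comes to about a million unique weights, so a total of ~200k would imply most layers are considerably narrower than 90 maps, which may be worth double-checking.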
Something I notice is that error levels during training initially drop very fast. I guess it is probably locking onto colour and simple gradients, but then the error rate starts to fall very slowly, slowly in the sense of days of training for small gains. So far I don't seem to be over-learning, but I am still stuck under-learning. Training-wise I am using a Radeon R9 280 and it is all running 32-bit floats, which should give an indication of the processing power being used.
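For reference, the RMSProp update I mentioned is relevant to exactly this slow phase, since it normalises each step by a running RMS of the gradient, so steps stay usefully sized even when raw gradients collapse. A sketch with the commonly cited default hyperparameters (not necessarily what I would tune to):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update.

    cache is a per-weight running average of squared gradients; the
    raw step is divided by its square root, so the effective step
    size stays roughly lr-scaled even for tiny gradients.
    """
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.zeros(3)
cache = np.zeros(3)
# Even a tiny gradient produces a roughly lr-sized step, because it
# is normalised by its own RMS:
w, cache = rmsprop_step(w, np.full(3, 1e-6), cache)
```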
One query I have is with regard to what you train to. Literature suggests training to the non-linear portions of the activation function. How about ReLU? For 0, do you train to exactly 0, or to some value above or below it? For the low side, so far I suspect slightly above 0 is best, to reduce units dropping out due to never being activated. What about secondary non-linearities in the target? Do you train to an exact target value, or to any value above or below the given target (so far I am trying both)? For example, in the non-linear case, if the target is 1 and the output is 1.5, then I don't feed back an error, as to me that is good: it is above the target. I would, however, feed back the error if it were below 1. A similar thing is done for the “off” case. Thoughts?
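To make the idea above concrete, the "don't penalise good overshoot" scheme amounts to a one-sided, hinge-like error. A sketch of how it could be written, where the on/off target convention is an assumption for illustration:

```python
import numpy as np

def one_sided_error(output, target, on_value=1.0):
    """Error that ignores overshoot past the 'good' side of the target.

    For "on" targets (>= on_value): penalise only outputs below target.
    For "off" targets (< on_value): penalise only outputs above target.
    Returns (target - output) where penalised, else 0, so it can be
    fed back through backpropagation like an ordinary error term.
    """
    output = np.asarray(output, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    err = target - output
    return np.where(target >= on_value,
                    np.maximum(0.0, err),   # "on": only undershoot counts
                    np.minimum(0.0, err))   # "off": only overshoot counts

print(one_sided_error(1.5, 1.0))   # -> 0.0 (above an "on" target: no error)
print(one_sided_error(0.75, 1.0))  # -> 0.25 (below it: error fed back)
print(one_sided_error(0.5, 0.0))   # -> -0.5 (above an "off" target)
```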
Otherwise, I have also started to look into odd areas that might help, like considering options other than simply summing inputs before the activation function. One that appeals to me from work in another field is something like a numerical “or” of the inputs: instead of, for example, Sig(A1+A2) you have Sig(A1+A2-A1*A2) (it gets more complicated for more inputs). It is interesting to me because it acts differently at different magnitudes: when inputs are much less than 1 it is close to just A1+A2, near 1 it acts more like an “or”, and above 1 more like an “xor”. I suspect that the xor-like nature when inputs get large might be useful for forcing input maps to be more unique and more orthogonal (i.e. force faster convergence on something that might resemble an autoencoder), but I am still working on the maths. Has there been any significant academic investigation into such concepts that anyone can point out?
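For two inputs, A1+A2-A1*A2 is exactly 1-(1-A1)(1-A2), which generalises cleanly to any number of inputs. This form does appear in the literature: it is the probabilistic sum (a t-conorm in fuzzy logic) and the combination used in noisy-OR models in Bayesian networks, so those may be useful search terms. A sketch showing the regimes described above (the function name is mine):

```python
import numpy as np

def soft_or(inputs):
    """Noisy-OR style combination: 1 - prod(1 - a_i).

    For two inputs this is exactly a1 + a2 - a1*a2.
    """
    a = np.asarray(inputs, dtype=np.float64)
    return 1.0 - np.prod(1.0 - a, axis=0)

# Small inputs: close to a plain sum.
print(soft_or([0.25, 0.25]))  # -> 0.4375 (plain sum would give 0.5)
# Near 1: acts like a logical OR, saturating at 1.
print(soft_or([1.0, 1.0]))    # -> 1.0
# Above 1: acts like an XOR, where two strong inputs cancel.
print(soft_or([2.0, 0.0]))    # -> 2.0
print(soft_or([2.0, 2.0]))    # -> 0.0
```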
For me the obvious question is: do you think the structure is appropriate? Otherwise, any thoughts are welcome. Perhaps I am just not training long enough? Thoughts?