I am a student in a new bioinformatics program, and I am working on classifying protein sequences with support vector machines. Some of the descriptors I am using take into account things like protein mass and the percentages of certain amino acids (physico-chemical property composition). This means many of the values in the descriptor will be between 0 and 1, while others will range from 0 up to very large values. How important is it, actually, to normalize these to between -1 and 1? My PI thinks it is essential but will not explain mechanistically why. Also, how can I account for data that falls outside the bounds of my training data? I could use hard-coded values for the max/min in my programs, but this seems to defeat the purpose of normalization. I would really like to understand the why behind what I am working on.
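To make the question concrete, here is a minimal sketch of the scaling being discussed, assuming plain min-max scaling to [-1, 1] with the min and max learned from the training set only and then reused on new data. The function names (`fit_min_max`, `scale`), the `clip` option, and the example numbers are hypothetical, not taken from any particular library. One commonly cited reason for scaling is that distance-based kernels such as RBF work on Euclidean distances, so an unscaled feature like mass (tens of thousands) would swamp a fraction in [0, 1].

```python
def fit_min_max(rows):
    """Learn per-feature min and max from the training rows only."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(rows, mins, maxs, lo=-1.0, hi=1.0, clip=False):
    """Map each feature linearly so the training range lands on [lo, hi]."""
    scaled = []
    for row in rows:
        out = []
        for x, mn, mx in zip(row, mins, maxs):
            span = mx - mn
            # A feature that is constant in training has no range; use the midpoint.
            v = (lo + hi) / 2 if span == 0 else lo + (hi - lo) * (x - mn) / span
            if clip:
                # Optional: force out-of-range values back onto the boundary.
                v = max(lo, min(hi, v))
            out.append(v)
        scaled.append(out)
    return scaled


# Hypothetical descriptor rows: [amino acid fraction, protein mass in Da]
train = [[0.10, 12000.0], [0.30, 55000.0], [0.20, 98000.0]]
test = [[0.25, 150000.0]]  # mass is larger than anything seen in training

mins, maxs = fit_min_max(train)            # parameters come from training data only
train_scaled = scale(train, mins, maxs)
test_scaled = scale(test, mins, maxs)      # may fall outside [-1, 1]
test_clipped = scale(test, mins, maxs, clip=True)
```

In this sketch a test value beyond the training range simply maps to something outside [-1, 1] unless `clip=True` is passed. Whether to clip, leave it, or switch to a different scheme (for example standardizing by mean and standard deviation) is a judgment call that depends on the data.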