I recently got interested in speech recognition and have implemented a simple dynamic time warping (DTW) system for word recognition, for my own learning. However, after testing a bit, I believe I might have made a mistake somewhere in the implementation, as the least distance value found when traversing the DTW matrix does not accurately separate different words from each other. Here are the steps I followed in the implementation.
Compute MFCCs for two WAV samples using https://github.com/jameslyons/python_speech_features (I will eventually replace it with my own MFCC algorithm).
Compute the L2 norm for the top 13 MFCCs, in order.
Traverse the DTW matrix and find the least distance.
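For reference, the steps above can be sketched roughly as follows. This is only a minimal sketch, not the code from the gist: it assumes the "L2 norm" step means the Euclidean distance between pairs of 13-coefficient MFCC frames, it uses the plain three-neighbour step pattern, and the function name `dtw_distance` and the `normalize` option (dividing by the summed sequence lengths so that long and short words are comparable) are my own assumptions.

```python
import numpy as np


def dtw_distance(a, b, normalize=False):
    """DTW cost between two feature sequences.

    a: array of shape (n, d), b: array of shape (m, d),
    e.g. n and m MFCC frames with d = 13 coefficients each.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # local[i, j] = Euclidean (L2) distance between frame i of a and frame j of b
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # acc[i, j] = cost of the cheapest warping path aligning the first i frames
    # of a with the first j frames of b; the extra row/column of inf forces
    # every path to start at the first frame of both sequences.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
                acc[i - 1, j - 1],  # advance in both
            )

    total = acc[n, m]
    # Assumption: normalising by n + m is one common convention, not the only one.
    return total / (n + m) if normalize else total
```

With python_speech_features, `mfcc(signal, samplerate)` returns one row per frame with 13 coefficients by default, so two such arrays could be passed directly as `a` and `b`. Whether normalising by length helps on these recordings is something I have not verified.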
Below are my results
Comparing one of the kiwi files to all the other files, I get the following averages:
kiwi    53.5627956541
apple   52.8226506157
banana  57.885524018
lime    48.5113003162
orange  63.9675030969
Here is my code https://gist.github.com/anonymous/1323692feea2a2bcfba4
I feel I am missing some crucial steps. Any advice will be appreciated.
Thanks
Edit:
I am using audio files from
https://dl.dropboxusercontent.com/u/15378192/audio.tar.gz
and