Let's say I have recordings of two people (A & B) reading the same block of text. Of course their voices sound different, they read at different speeds, etc. But let's assume:
- they read exactly the same text
- they have the same accent
- there's no coughing, hiccups, major background noise, etc.
- no major reading errors
Basically, we can assume both the reading and the recording are pretty 'clean'. What's a good algorithm to 'sync' up the recordings?
By 'sync' I mean you somehow break up the recordings into a sequence of 'sound bites', the two recordings share the same sequence, and the algorithm outputs the time range each speaker spent on each sound bite. For example:
- sound-bite-0: t=[0.1,0.3] for A; t=[0.11, 0.28] for B
- sound-bite-1: t=[0.32,0.39] for A; t=[0.29, 0.35] for B
Each 'sound bite' can be a word, a syllable, a group of words, etc., but let's say sound bites should be between 0.1 and 2 seconds (or any reasonable range) in length.
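To make the desired output concrete, here's a minimal sketch of the data structure I have in mind (the `SoundBite` type and the `sync` signature are hypothetical, just to illustrate what I'm asking for):

```python
from dataclasses import dataclass

@dataclass
class SoundBite:
    """One matched segment: the time range each speaker spent on it."""
    start_a: float  # seconds into recording A
    end_a: float
    start_b: float  # seconds into recording B
    end_b: float

# Hypothetical signature for the algorithm I'm after:
#   def sync(audio_a, audio_b, sample_rate) -> list[SoundBite]
# For the example above, the output would be:
bites = [
    SoundBite(0.10, 0.30, 0.11, 0.28),  # sound-bite-0
    SoundBite(0.32, 0.39, 0.29, 0.35),  # sound-bite-1
]

# Consecutive bites should tile each recording in order
# (bite k ends roughly where bite k+1 starts).
for i, b in enumerate(bites):
    print(f"sound-bite-{i}: A spent {b.end_a - b.start_a:.2f}s, "
          f"B spent {b.end_b - b.start_b:.2f}s")
```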
There's no need to do speech recognition on the recordings; in fact, if they read a list of unintelligible random syllables, the algorithm should still work. Bonus points if the algorithm works on songs too.