
Looking for algorithms to 'sync' recordings of speech

Let's say I have recordings of two people (A & B) reading the same block of text. Of course their voices sound different, they read at different speeds, etc. But let's assume:

  • they read exactly the same text
  • they have the same accent
  • there's no coughing, hiccups, major background noise, etc.
  • no major reading errors

Basically, we can assume both the reading and the recording are pretty 'clean'. What's a good algorithm to 'sync' up the recordings?

By 'sync' I mean you somehow break the recordings up into a sequence of 'sound bites'; the two recordings share the same sequence, and the algorithm should output the time each speaker spent on each sound bite. For example:

  • sound-bite-0: t=[0.10, 0.30] for A; t=[0.11, 0.28] for B
  • sound-bite-1: t=[0.32, 0.39] for A; t=[0.29, 0.35] for B

Each 'sound bite' can be a word, a syllable, a group of words, etc., but let's say sound bites should be between 0.1 and 2 seconds (or any reasonable range) in length.
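
To pin the format down, here is a hypothetical Python structure for that output (the names are purely illustrative, not from any existing library):

    # Hypothetical container for one aligned 'sound bite':
    # start/end times (in seconds) in each of the two recordings.
    from dataclasses import dataclass

    @dataclass
    class SoundBite:
        index: int
        a_start: float  # seconds into recording A
        a_end: float
        b_start: float  # seconds into recording B
        b_end: float

    # The example alignment above, expressed in this structure:
    alignment = [
        SoundBite(0, 0.10, 0.30, 0.11, 0.28),
        SoundBite(1, 0.32, 0.39, 0.29, 0.35),
    ]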

There's no need to do speech recognition on the recordings; in fact, if they read a list of unintelligible, random syllables, the algorithm should still work. Bonus points if the algorithm works on songs too.
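
For reference, one classic technique that fits these constraints is dynamic time warping (DTW) over spectral features: it needs no recognition and works on arbitrary audio. Below is a minimal sketch, assuming the librosa library is available and using placeholder file names; it produces a frame-level alignment rather than discrete sound bites:

    # Sketch: align two recordings with DTW over MFCC features.
    # Assumes librosa is installed; 'reader_a.wav'/'reader_b.wav' are placeholders.
    import librosa

    HOP = 512  # analysis hop length in samples; frame times below depend on it

    y_a, sr = librosa.load("reader_a.wav", sr=22050)
    y_b, _ = librosa.load("reader_b.wav", sr=22050)

    # MFCCs summarize the spectral envelope, so they are fairly robust
    # to the two voices sounding different.
    mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=20, hop_length=HOP)
    mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=20, hop_length=HOP)

    # DTW finds the minimum-cost monotonic alignment of the two feature
    # sequences; wp is the warping path as (frame_in_A, frame_in_B) pairs.
    D, wp = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b)
    wp = wp[::-1]  # librosa returns the path end-to-start

    # Convert aligned frame indices to seconds in each recording.
    times_a = librosa.frames_to_time(wp[:, 0], sr=sr, hop_length=HOP)
    times_b = librosa.frames_to_time(wp[:, 1], sr=sr, hop_length=HOP)
    for t_a, t_b in zip(times_a[::50], times_b[::50]):  # print every 50th pair
        print(f"A t={t_a:.2f}s  <->  B t={t_b:.2f}s")

To turn the frame-level path into sound bites of the kind described above, one option is to cut recording A at silences (e.g. with librosa.effects.split) and map each cut point through the warping path to the corresponding time in B.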

submitted by by_321
