If I have a large set (billions) of fixed-length strings, what are some fast ways of identifying each string's near neighbors under either Hamming distance or edit distance? Both exact and probabilistic methods are interesting to me, but I want to spend no more than O(n log n) time on index construction and no more than O(k log n) time finding each set of neighbors (where n is the number of strings and k is the number of neighbors returned). Any links or paper recommendations would be appreciated.
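To make the Hamming case concrete, here is a minimal sketch of one exact baseline, pigeonhole partitioning: if two equal-length strings differ in at most d positions, then splitting them into d + 1 blocks guarantees at least one block matches exactly. The names `build_index`, `neighbors`, and the threshold `d` are illustrative assumptions, not part of the question. Note that query time here depends on bucket sizes, so this sketch does not by itself guarantee the O(k log n) bound, and it does not handle edit distance.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def block_bounds(length, d):
    """Split range(length) into d + 1 contiguous blocks of near-equal size."""
    parts = d + 1
    return [(i * length // parts, (i + 1) * length // parts) for i in range(parts)]


def build_index(strings, d):
    """Build one hash table per block, mapping block contents to string ids.

    Assumes all strings share the same length (hypothetical helper).
    """
    bounds = block_bounds(len(strings[0]), d)
    tables = [defaultdict(list) for _ in bounds]
    for sid, s in enumerate(strings):
        for table, (lo, hi) in zip(tables, bounds):
            table[s[lo:hi]].append(sid)
    return bounds, tables


def neighbors(query, strings, bounds, tables, d):
    """Return sorted ids of all strings within Hamming distance d of query."""
    seen = set()
    out = []
    for table, (lo, hi) in zip(tables, bounds):
        # Any true neighbor must share at least one block exactly with the query.
        for sid in table.get(query[lo:hi], ()):
            if sid not in seen:
                seen.add(sid)
                # Verify the candidate with a full distance computation.
                if hamming(query, strings[sid]) <= d:
                    out.append(sid)
    return sorted(out)


if __name__ == "__main__":
    data = ["ACGTACGT", "ACGTACGA", "TCGTACGA", "TTTTTTTT"]
    bounds, tables = build_index(data, 2)
    print(neighbors("ACGTACGT", data, bounds, tables, 2))
```

Index construction in this sketch is n(d + 1) hash insertions, which fits inside the O(n log n) budget for small d; the open question is how to keep candidate buckets small enough that queries stay near O(k log n).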