Hey all, I'm currently working on a record linkage task that matches unstructured strings against a structured database. We're using Elasticsearch as the base tool and layering a number of ML techniques on top to improve matching. I'm curious how large our sample would need to be to be representative of the 5 billion strings that need to be processed.
My thought is that we could use scikit-learn to tokenize the strings and look at the distribution of those tokens. As more strings are loaded in, the distribution of the more common tokens should stabilize. Eventually, newly discovered tokens will amount to less than 1% of the tokens found so far, and at that point we should be able to say that however many strings have been loaded in is a representative sample of the variation in the data. A rough sketch of what I mean is below.
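Here's a minimal sketch of that vocabulary-saturation idea using scikit-learn's CountVectorizer analyzer: feed batches of strings, track how many previously unseen tokens each batch contributes, and stop once the new-token rate drops below 1%. The `batches` iterable, batch contents, and the 1% threshold are all placeholders for however you'd actually pull and tune this:

    # Sketch only: stop sampling once each new batch adds < 1% new tokens
    # relative to the vocabulary seen so far. Assumes `batches` yields
    # lists of strings from your data stream.
    from sklearn.feature_extraction.text import CountVectorizer

    def sample_until_stable(batches, threshold=0.01):
        """Consume batches of strings until the per-batch fraction of
        unseen tokens falls below `threshold`."""
        analyzer = CountVectorizer().build_analyzer()  # default word tokenizer
        vocab = set()
        n_strings = 0
        for batch in batches:
            new_tokens = set()
            for s in batch:
                new_tokens.update(t for t in analyzer(s) if t not in vocab)
            frac_new = len(new_tokens) / len(vocab) if vocab else 1.0
            vocab |= new_tokens
            n_strings += len(batch)
            if frac_new < threshold:
                return n_strings, vocab  # sample size looks "representative"
        return n_strings, vocab  # stream exhausted before stabilizing

    if __name__ == "__main__":
        # Toy batches just to show usage; real batches would be much larger.
        toy_batches = [
            ["ACME CORP 123 MAIN ST", "ACME CORPORATION MAIN STREET"],
            ["ACME CORP STE 4", "ACME CO MAIN ST"],
            ["ACME CORP 123 MAIN ST SUITE 4"],
        ]
        n, vocab = sample_until_stable(toy_batches)
        print(n, "strings sampled;", len(vocab), "distinct tokens seen")

One caveat with this approach: it measures coverage of token *types*, not whether the token frequency distribution itself has stabilized, so you might also want to compare per-batch token frequencies between batches as a second check.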
What do you guys think? Any other ideas?
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction