Hey all, I'm currently working on a record linkage task that matches unstructured strings against a structured database. We're using Elasticsearch as the base tool and layering a number of ML techniques on top to improve matching. I'm curious how large our sample would need to be to be representative of the 5 billion strings that need to be processed.
My thought is that we could use scikit-learn to tokenize the strings and look at the distribution of those tokens. As more strings are loaded in, the distribution of the more common tokens should stabilize. Eventually, newly discovered tokens will amount to less than 1% of the tokens found so far, and at that point we should be able to say that however many strings have been loaded in is a representative sample of the variation in the data. A rough sketch of what I mean is below.
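Here's a minimal sketch of that vocabulary-saturation idea using scikit-learn's CountVectorizer analyzer: feed batches of strings, track how many previously unseen tokens each batch contributes, and stop once the new-token rate drops below 1%. The `batches` iterable, batch contents, and the 1% threshold are all placeholders for however you'd actually pull and tune this:

    # Sketch only: stop sampling once each new batch adds < 1% new tokens
    # relative to the vocabulary seen so far. Assumes `batches` yields
    # lists of strings from your data stream.
    from sklearn.feature_extraction.text import CountVectorizer

    def sample_until_stable(batches, threshold=0.01):
        """Consume batches of strings until the per-batch fraction of
        unseen tokens falls below `threshold`."""
        analyzer = CountVectorizer().build_analyzer()  # default word tokenizer
        vocab = set()
        n_strings = 0
        for batch in batches:
            new_tokens = set()
            for s in batch:
                new_tokens.update(t for t in analyzer(s) if t not in vocab)
            frac_new = len(new_tokens) / len(vocab) if vocab else 1.0
            vocab |= new_tokens
            n_strings += len(batch)
            if frac_new < threshold:
                return n_strings, vocab  # sample size looks "representative"
        return n_strings, vocab  # stream exhausted before stabilizing

    if __name__ == "__main__":
        # Toy batches just to show usage; real batches would be much larger.
        toy_batches = [
            ["ACME CORP 123 MAIN ST", "ACME CORPORATION MAIN STREET"],
            ["ACME CORP STE 4", "ACME CO MAIN ST"],
            ["ACME CORP 123 MAIN ST SUITE 4"],
        ]
        n, vocab = sample_until_stable(toy_batches)
        print(n, "strings sampled;", len(vocab), "distinct tokens seen")

One caveat with this approach: it measures coverage of token *types*, not whether the token frequency distribution itself has stabilized, so you might also want to compare per-batch token frequencies between batches as a second check.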
What do you guys think? Any other ideas?
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction