I work for a large national library, in the digital arena. Almost every day I come across digital text objects that belong to sets that we are describing (in a library kind of way).
It strikes me that we could be doing things with ML, especially where we have a large set of related files and the single description we use in library land is something like 'collection of things from producer A, Jan 1993.'
Such a collection might be a hundred text files. I would love to demonstrate to my library colleagues the benefit we could get by automatically extracting key terms (dates, names, places, etc.) and producing a basic keyword-based summary of the collection to augment the chronological record we produce at the moment. An added advantage of running an ML tool over the set would be, I guess, a full-text index that could also be searched, allowing users to look for their own key terms amongst the collection. Rough sketches of both ideas are below.
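To make the extraction idea concrete, here is roughly what I imagine the first pass looking like. It's only a sketch, not something I've run at scale: it assumes spaCy and its small English model (en_core_web_sm) are installed, and the folder name is made up, but it shows the kind of "names, places, dates" pass I have in mind.

```python
# Sketch: pull named entities out of every text file in a collection folder
# and count the most common ones, as a rough keyword summary.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
from pathlib import Path

import spacy

nlp = spacy.load("en_core_web_sm")
counts = Counter()

for path in Path("collection_1993_producer_A").glob("*.txt"):
    doc = nlp(path.read_text(encoding="utf-8", errors="ignore"))
    for ent in doc.ents:
        # Keep only people, places, dates and organisations.
        if ent.label_ in {"PERSON", "GPE", "DATE", "ORG"}:
            counts[(ent.label_, ent.text)] += 1

# The 20 most frequent entities become the collection-level keyword summary.
for (label, text), n in counts.most_common(20):
    print(f"{label:8} {text:40} {n}")
```

For the "search your own keys" part, I suspect I might not even need MySQL: SQLite ships with Python and has a built-in full-text index (FTS5, available in most standard builds). Again just a sketch with made-up names:

```python
# Sketch: build a full-text index over the same collection with SQLite FTS5,
# so users can search for their own terms across the whole set.
import sqlite3
from pathlib import Path

con = sqlite3.connect("collection.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(filename, body)")

for path in Path("collection_1993_producer_A").glob("*.txt"):
    con.execute(
        "INSERT INTO docs (filename, body) VALUES (?, ?)",
        (path.name, path.read_text(encoding="utf-8", errors="ignore")),
    )
con.commit()

# Example query: which files mention either term, best matches first?
for (filename,) in con.execute(
    "SELECT filename FROM docs WHERE docs MATCH ? ORDER BY rank",
    ("flood OR drought",),
):
    print(filename)
```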
Does this sound (1) plausible and (2) achievable, especially if I do it myself, as someone with only rudimentary Python and even less MySQL at their fingertips?
Suggestions, corrections, mild abuse (where appropriate) and offers of assistance all gratefully received.