Channel: Machine Learning

Mining frequent itemsets in textual documents (storage issues)


Hi guys,

I am working on a research project that aims to scan a large number of documents and identify itemsets in the form of word sequences. Another team is working on the same task using Markov chains, and we will later compare our approaches.

The problem is that the text corpus we are mining is extremely big: about 19 GB of text files. Whenever we detect an itemset (of length k, where k <= 3), we store it in a relational DBMS together with its support count.
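A minimal sketch of the counting step described above, for concreteness. It assumes that an itemset is a contiguous run of 1 to 3 words and that support is the total number of occurrences; the post does not specify either, and `count_sequences` and the sample sentence are hypothetical names and data used only for illustration.

```python
from collections import Counter


def count_sequences(tokens, max_k=3):
    """Count every contiguous word sequence of length 1..max_k.

    Assumption: support = number of occurrences in the token stream,
    not the number of documents that contain the sequence.
    """
    counts = Counter()
    for k in range(1, max_k + 1):
        # Slide a window of width k over the tokens.
        for i in range(len(tokens) - k + 1):
            counts[tuple(tokens[i:i + k])] += 1
    return counts


# Toy input standing in for one document of the corpus.
tokens = "the cat sat on the mat".split()
counts = count_sequences(tokens)
```

Because word order matters, each sequence is kept as an ordered tuple rather than a set. On a corpus of this size the counter would not fit in memory as written, so counts would have to be flushed to storage in batches.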

However, the tables in our relational DBMS grow very quickly, and queries against them take a long time. Our queries only ever search by the first word of a sequence (the order of words matters in our case).
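A minimal sketch of the access pattern described above, using SQLite through Python's standard `sqlite3` module as a stand-in for the relational DBMS. The table name `itemsets`, the columns `w1`, `w2`, `w3`, the helper `by_first_word`, and the sample rows are all assumptions made for illustration; the actual schema is not given in the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per sequence; unused positions hold an empty string.
# The primary key starts with w1, so a lookup by first word can use
# the key's index instead of scanning the whole table.
conn.execute("""
    CREATE TABLE itemsets (
        w1      TEXT NOT NULL,
        w2      TEXT NOT NULL DEFAULT '',
        w3      TEXT NOT NULL DEFAULT '',
        support INTEGER NOT NULL,
        PRIMARY KEY (w1, w2, w3)
    )
""")

# Hypothetical sample rows: (w1, w2, w3, support).
rows = [
    ("the", "", "", 2),
    ("the", "cat", "", 1),
    ("the", "mat", "", 1),
    ("cat", "sat", "on", 1),
]
conn.executemany("INSERT INTO itemsets VALUES (?, ?, ?, ?)", rows)
conn.commit()


def by_first_word(conn, word):
    """Return all stored sequences that start with `word`."""
    cur = conn.execute(
        "SELECT w1, w2, w3, support FROM itemsets "
        "WHERE w1 = ? ORDER BY support DESC, w2, w3",
        (word,),
    )
    return cur.fetchall()


result = by_first_word(conn, "the")
```

Since every query filters on the first word only, the same layout maps naturally onto a key-value store keyed by the first word, which is one reason a NoSQL option may be worth testing.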

Does anyone have experience with similar issues? Would a NoSQL database or a graph database be a feasible alternative?

submitted by vshehu
