I am running a prototype that I developed in Java to perform LDA on a few million documents. I have personally found it very useful, as most LDA implementations in Java, R, or Python either run out of memory on a few thousand documents or slow to a crawl.
I am planning to open-source it, but I still have to add the license text to my source files and write some documentation. I was curious whether there would be any interest in such a library. Or are people who use LDA content with what is already out there in the open-source space?
Edit: Forgot to add that for 500 topics on 2 million documents, 1000 iterations take approximately 5 hours on an EC2 High-Memory instance with the Java max heap set to 10 GB.
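To put those numbers in per-iteration terms, here is a rough back-of-the-envelope sketch. The class and method names are made up for illustration and are not part of the prototype; the inputs are just the approximate figures quoted above.

```java
// Hypothetical helper, not part of the LDA prototype: it only converts the
// approximate figures from the post into per-iteration numbers.
public class LdaThroughput {

    // Average wall-clock seconds spent on one iteration.
    static double secondsPerIteration(double totalHours, int iterations) {
        return totalHours * 3600.0 / iterations;
    }

    // Documents swept per second, assuming every iteration visits every document once.
    static double docsPerSecond(long documents, double secondsPerIteration) {
        return documents / secondsPerIteration;
    }

    public static void main(String[] args) {
        double perIteration = secondsPerIteration(5.0, 1000);       // 5 hours, 1000 iterations
        double throughput = docsPerSecond(2_000_000L, perIteration); // 2 million documents
        System.out.printf("seconds per iteration: %.1f%n", perIteration);
        System.out.printf("documents per second:  %.0f%n", throughput);
    }
}
```

That works out to roughly 18 seconds per iteration, or on the order of 111,000 documents per second, assuming each iteration makes one full pass over the corpus.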