Hi everyone, sorry if this isn't the right forum for this, but I have some really newbie WEKA questions.
So I'm trying to process some raw data into the ARFF format so I can experiment on it. I figured I'd go whole-hog right from the start, so I downloaded the TREC spam dataset from 2008-2009 from the University of Waterloo and used Ubuntu Linux's htmltotext converter to convert all 75,000 messages to text files and strip the HTML tags.
My next step was to try an old tool someone wrote in 2002 called TextDirectoryToArff, whose source can be found here: http://weka.wikispaces.com/ARFF+file...xt+Collections.
So I loaded it all up in Eclipse and added the external WEKA package, and it tells me that line 59: data.add(new Instance(1.0, newInst));
isn't valid, because Instance cannot be instantiated.
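For reference, here is a rough sketch of the output I'm after, written without any WEKA classes so it compiles on its own. It is not the original tool: the class name TextToArffSketch, the relation name text_files, and the two string attributes (filename, contents) are my own assumptions about what a directory-to-ARFF converter would emit, and the quoting is simplified (newlines are flattened to spaces rather than escaped).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a directory-to-ARFF converter; not part of WEKA.
class TextToArffSketch {

    // Wrap a value in single quotes for ARFF, escaping backslashes and quotes.
    // Simplification: line breaks are replaced by spaces instead of being escaped.
    static String quote(String s) {
        String escaped = s.replace("\\", "\\\\")
                          .replace("'", "\\'")
                          .replace("\r", " ")
                          .replace("\n", " ");
        return "'" + escaped + "'";
    }

    // Build the ARFF text for rows of {filename, contents}.
    // The attribute names are assumptions chosen for illustration.
    static String toArff(String relation, List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute filename string\n");
        sb.append("@attribute contents string\n\n");
        sb.append("@data\n");
        for (String[] row : rows) {
            sb.append(quote(row[0])).append(',').append(quote(row[1])).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TextToArffSketch <directory>");
            return;
        }
        Path dir = Paths.get(args[0]);
        List<String[]> rows = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) {
                    continue; // skip subdirectories
                }
                // ISO-8859-1 maps every byte to a character, so odd bytes in mail never throw.
                String text = new String(Files.readAllBytes(f), StandardCharsets.ISO_8859_1);
                rows.add(new String[] { f.getFileName().toString(), text });
            }
        }
        System.out.print(toArff("text_files", rows));
    }
}
```

Run as `java TextToArffSketch some_directory > out.arff`. It holds every file in memory before printing, which is probably fine for a small test folder but may need to be changed to stream row by row for all 75,000 messages.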
My questions:
Is it even worth compiling TextDirectoryToArff, or am I misunderstanding how to go about converting raw text data into an ARFF file?
If this is the right tool to be using for the job, what am I doing wrong with the file?
Thanks in advance.