Hello, I have a side project where I crawl the local news websites in my country, and I want to build a crime index and a political instability index.
I have already covered the information retrieval part of the project. My plan is:
- Unsupervised topic extraction.
- Near-duplicate detection.
- Supervised classification by incident type and severity level (crime/political, high/medium/low).
I will use Python and sklearn, and I have already researched the algorithms I can use for those tasks. I think the first two steps could give me a relevance factor for a story: the more newspapers publish about a story or topic, the more relevant it is.
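To make the relevance idea concrete, here is a minimal sketch under my own assumptions, not a finished design. It uses only the standard library (Jaccard similarity over word shingles) instead of sklearn so that it stays self-contained. The names `shingles`, `jaccard` and `coverage_scores`, the shingle size and the 0.5 threshold are all hypothetical choices for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed tokenisation: whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coverage_scores(stories, threshold=0.5, k=3):
    """stories: list of (source, text) pairs.

    For each story, count the distinct sources that published a near
    duplicate of it (the story's own source is always included).
    The threshold is an assumed value that would need tuning.
    """
    sets = [shingles(text, k) for _, text in stories]
    scores = []
    for s_i in sets:
        sources = {
            stories[j][0]
            for j, s_j in enumerate(sets)
            if jaccard(s_i, s_j) >= threshold
        }
        scores.append(len(sources))
    return scores


stories = [
    ("paper_a", "police arrest three suspects after bank robbery downtown"),
    ("paper_b", "police arrest three suspects after bank robbery downtown today"),
    ("paper_c", "parliament votes on the new budget proposal this week"),
]
scores = coverage_scores(stories)
```

The pairwise comparison is quadratic in the number of stories, so for a real crawl something like MinHash or a TF-IDF index would probably be needed; this only shows how duplicate coverage could turn into a per-story number.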
My next step is to build the monthly, weekly and daily indexes (nationwide and per city) based on the features I have, and I'm a little lost here, as the "instability sensitivity" might increase over time. I mean, the index value for last year's major instability incident could be lower than the index value for this year. I'm also unsure whether to use a fixed 0-100 scale or not.
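To make the question concrete, here is a rough sketch of one possible option, assuming each day's incidents come with a severity label and a coverage count. It scores each day against a trailing window and squashes the result to 0-100. The severity weights, the 30-day window and the function names are my own assumptions, not established values.

```python
import math
from statistics import mean, pstdev

# Assumed weights per severity level; these would need calibration.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 3.0, "high": 9.0}


def raw_daily_score(incidents):
    """incidents: list of (severity, coverage) pairs for one day."""
    return sum(SEVERITY_WEIGHT[sev] * cov for sev, cov in incidents)


def rolling_index(raw_scores, window=30):
    """Map each day's raw score to 0-100 relative to the preceding `window` days.

    The z-score against the trailing window is passed through a logistic
    function, so 50 means "typical for the recent past".
    """
    index = []
    for t, score in enumerate(raw_scores):
        history = raw_scores[max(0, t - window):t]
        if len(history) < 2:
            index.append(50.0)  # not enough history yet: report neutral
            continue
        mu, sigma = mean(history), pstdev(history)
        # A flat history gives no scale to compare against, so stay neutral.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        z = max(-20.0, min(20.0, z))  # clamp to avoid overflow in exp
        index.append(100.0 / (1.0 + math.exp(-z)))
    return index


day_score = raw_daily_score([("high", 2), ("low", 1)])
idx = rolling_index([1.0, 2.0, 3.0, 10.0], window=3)
```

With a relative baseline like this, the same raw score yields a lower index after a turbulent period, which is exactly the drift in sensitivity I'm worried about. A fixed absolute scale would stay comparable across years but would need a reference maximum chosen up front.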
I would appreciate any pointers to papers, relevant readings or thoughts.
Thanks.