I have some flex time at work where I can pursue my own interests, as long as I can apply them there. Recently I obtained ~100,000 invoice records covering two years.
Can I use this as training data for a decision tree to analyze new invoices for their chance of being paid late?
I've been using scikit-learn for Python. I ran the provided iris "Hello World" sample and exported the resulting tree to a PDF. It's clear I need to learn more, but this is how I've pictured my solution so far:
1) Prepare my dataset.
CSV in the format:
Late | Vendor Name | Department | Invoice Month | Invoice Day
0 | Vendor1 | 1 | 1 | 20
1 | Vendor2 | 2 | 1 | 18
etc.
2) Load it into scikit-learn with numpy or pandas
3) Run just like the iris/digits dataset?
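For reference, the three steps above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the file is comma-separated with the column names shown above, the four sample rows are made up for illustration, and only the columns that are already numeric are used (the vendor name column is set aside here). `max_depth=3` and `random_state=0` are arbitrary choices, not recommendations.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sample rows in the layout described above (not real invoice data).
csv_text = """Late,Vendor Name,Department,Invoice Month,Invoice Day
0,Vendor1,1,1,20
1,Vendor2,2,1,18
0,Vendor1,1,2,5
1,Vendor2,2,2,27
"""

# Step 2: load with pandas. A real file path would replace the StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns that are already numeric; "Vendor Name" is left out here.
feature_cols = ["Department", "Invoice Month", "Invoice Day"]
X = df[feature_cols]
y = df["Late"]

# Step 3: fit the tree the same way as in the iris example.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Estimated chance of a new invoice being late:
# predict_proba returns one column per class, and column 1 is class "1" (late).
new_invoice = pd.DataFrame([[2, 3, 18]], columns=feature_cols)
late_probability = clf.predict_proba(new_invoice)[0, 1]
print(late_probability)
```

With the full ~100,000 records it would make sense to hold out part of the data (for example with `sklearn.model_selection.train_test_split`) so the predicted probabilities can be checked against invoices the tree has not seen.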
Is this process wrong? Will I be able to use vendor names (strings) directly in scikit-learn, the way the iris example uses its numeric values, or will I need to make further changes to account for that?
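On the string question: scikit-learn's tree estimators expect numeric feature arrays, so a column like Vendor Name has to be encoded before fitting. One common option is one-hot encoding. The sketch below assumes the same column names as above and uses made-up rows; the step names `"vendor"`, `"preprocess"` and `"tree"` are arbitrary labels chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows in the layout described above (not real invoice data).
df = pd.DataFrame({
    "Late": [0, 1, 0, 1],
    "Vendor Name": ["Vendor1", "Vendor2", "Vendor1", "Vendor2"],
    "Department": [1, 2, 1, 2],
    "Invoice Month": [1, 1, 2, 2],
    "Invoice Day": [20, 18, 5, 27],
})
X = df.drop(columns="Late")
y = df["Late"]

# One-hot encode the string column; pass the numeric columns through unchanged.
# handle_unknown="ignore" encodes vendors not seen during training as all zeros.
preprocess = ColumnTransformer(
    transformers=[
        ("vendor", OneHotEncoder(handle_unknown="ignore"), ["Vendor Name"]),
    ],
    remainder="passthrough",
)

# Chain the encoding and the tree so new invoices are encoded the same way.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
model.fit(X, y)

# Score a new invoice, including a vendor that was not in the training data.
new_invoice = pd.DataFrame([{
    "Vendor Name": "Vendor3",
    "Department": 2,
    "Invoice Month": 3,
    "Invoice Day": 18,
}])
late_probability = model.predict_proba(new_invoice)[0, 1]
print(late_probability)
```

Department is stored as an integer but is arguably also a category rather than a quantity, so it could be added to the one-hot encoded column list in the same way. With many distinct vendors the one-hot matrix gets wide, which is a trade-off of this particular encoding.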
I appreciate any feedback or references to help guide me towards a solution.