To this point, how many documents did you put eyes on compared to the total number of documents in the data set? (Your argument that this would be the logical stopping point in a real production is compelling, so the cost-benefit calculations should be based on reviews to here as well.)

If I’m following the thread correctly, you tagged 2663 documents for training and another 939 for QA, but I haven’t been able to tally the ones you read and rated but excluded from training. For the sake of argument, I’ll swag the total number of documents reviewed at 5000. That works out to 96 documents per hour, or $5.20 per document. Those figures seem low given a) the doubled rate and b) the high standard of review you conducted.

However many you put eyes-on, I think that number is likely to be close to a fixed cost for a production. That is, for any data set of similar distribution, you will have to go through about the same number of iterations and put eyes on about the same number of documents whether there are 100,000 in the total population or 100,000,000.

Okay, it’s not perfectly fixed, since population size is a factor in the original sampling, but it doesn’t scale linearly, either.

My point is that your 92% cost improvement and the extrapolated 13,444 documents per hour are artifacts of the total population size as much as anything else. If the data set had started with only 300,000 documents, your cost savings compared to a linear review would still have been positive but far lower.
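A quick back-of-the-envelope model makes the point concrete. Treat the eyes-on review as a roughly fixed cost (the ~5000 documents and $5.20/document swagged above) and compare it to a linear review of the whole population at some assumed per-document rate. Every number here is a hypothetical assumption for illustration, not data from your post, and the linear-review rate in particular is a pure placeholder:

```python
# Sketch of the "near-fixed cost" argument: if the eyes-on effort for
# predictive coding is roughly constant regardless of population size,
# then the *percentage* savings over linear review is largely an
# artifact of how big the population happens to be.
# All figures are hypothetical assumptions, not data from the post.

def tar_savings(population, linear_rate=1.00, fixed_review_docs=5000,
                tar_rate=5.20):
    """Fraction saved versus a linear review of the whole population.

    linear_rate      -- assumed cost per document of linear review ($)
    fixed_review_docs -- assumed (roughly fixed) eyes-on document count
    tar_rate         -- assumed cost per eyes-on document ($)
    """
    linear_cost = population * linear_rate
    tar_cost = fixed_review_docs * tar_rate
    return 1 - tar_cost / linear_cost

for n in (300_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} docs: {tar_savings(n):.1%} saved")
```

Whatever rates you plug in, the fixed numerator means the savings percentage climbs with population size, which is the sense in which the 92% figure reflects the size of the collection as much as the quality of the process.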

I think this is an important line of reasoning to explore because it might yield some benchmarks about cases that are too small to justify the cost/effort of predictive coding.
