DocVac Dictionary of Jargon
In describing how DocVac works, we use a lot of jargon, mostly relating to the underlying structure of how our document extraction works. In an effort to explain at least some of it, here's a mini dictionary:
Optical Character Recognition or OCR - a program that takes an image and tries to extract characters (letters, numbers, etc.) from it. High-quality images and clear text are needed for this to work, and there are always a certain number of errors, particularly misreadings such as the letter O being interpreted as a 0 (zero).
PDoc or pd - originally PolicyDoc, this is one document in our system: a PDF or some other sort of file.
ExtractionMode or ExtrMode - there are a few different modes of extracting data: 1 = extract text from a PDF, 2 = extract contents from a PDF using OCR, 3 = extract contents from a non-PDF file using OCR.
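As a rough illustration, the three modes could be modelled as an enumeration. This is only a sketch: the numeric values come from the definition above, but the member names (PDF_TEXT, PDF_OCR, NON_PDF_OCR) and the uses_ocr helper are made up for this example and are not necessarily what DocVac uses internally.

```python
from enum import IntEnum


class ExtrMode(IntEnum):
    """Extraction modes, numbered as in the dictionary entry.

    Member names are hypothetical; only the numbers come from the text.
    """
    PDF_TEXT = 1     # extract text from a PDF
    PDF_OCR = 2      # extract contents from a PDF using OCR
    NON_PDF_OCR = 3  # extract contents from a non-PDF file using OCR


def uses_ocr(mode: ExtrMode) -> bool:
    """Return True for the modes that rely on OCR (hypothetical helper)."""
    return mode in (ExtrMode.PDF_OCR, ExtrMode.NON_PDF_OCR)
```

With this sketch, ExtrMode(2) looks up the "PDF using OCR" mode by its number, and uses_ocr tells you whether OCR-style errors (such as O read as 0) are possible for a given mode.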
PDocRuntime or pdr - there are one or two of these per PDoc, each representing an attempt to extract data with a different ExtrMode. For a DocVacBasic user uploading a file, two pdrs for one PDoc are most commonly seen when text extraction from a PDF document fails and a subsequent attempt is made to extract the text using OCR.
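The relationship between a PDoc and its pdrs, and the fallback from text extraction to OCR, might be sketched like this. Everything here is an assumption made for illustration: the class fields, the extract_pdf function, the rule that empty text counts as a failure, and the extractors argument (stand-in callables keyed by mode number) are not DocVac's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Mode numbers follow the ExtrMode entry: 1 = PDF text, 2 = PDF via OCR.
PDF_TEXT, PDF_OCR = 1, 2


@dataclass
class PDocRuntime:
    """One extraction attempt (pdr) made with a single ExtrMode."""
    extr_mode: int
    text: Optional[str] = None

    @property
    def succeeded(self) -> bool:
        # Assumption: an attempt that yields no text counts as failed.
        return bool(self.text and self.text.strip())


@dataclass
class PDoc:
    """One document (pd) together with the attempts made on it."""
    filename: str
    runtimes: List[PDocRuntime] = field(default_factory=list)


def extract_pdf(
    pdoc: PDoc,
    extractors: Dict[int, Callable[[str], Optional[str]]],
) -> Optional[PDocRuntime]:
    """Try text extraction first; fall back to OCR only if it fails."""
    for mode in (PDF_TEXT, PDF_OCR):
        pdr = PDocRuntime(mode, extractors[mode](pdoc.filename))
        pdoc.runtimes.append(pdr)
        if pdr.succeeded:
            return pdr
    return None  # both attempts failed


# Usage with stand-in extractors: text extraction finds nothing, OCR works.
scanned = PDoc("scanned.pdf")
result = extract_pdf(
    scanned,
    {PDF_TEXT: lambda name: "", PDF_OCR: lambda name: "Policy no. 10"},
)
```

In this sketch a PDF with a usable text layer ends up with one pdr, while a scanned PDF ends up with two: the failed mode 1 attempt followed by the mode 2 attempt.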