DocVac Dictionary of Jargon



In describing how DocVac works, we use a lot of jargon, most of it relating to the underlying structure of our document extraction. To explain at least some of it, here is a mini dictionary:

Optical Character Recognition or OCR - a program that takes an image and tries to extract characters (letters, numbers, etc.) from it. High-quality images and clear text are needed for this to work, and there are always a certain number of errors, such as the letter O being interpreted as a 0 (zero).

PDoc or pd - originally PolicyDoc, this is one document in our system: a PDF or some other sort of file.

ExtractionMode or ExtrMode - there are a few different modes of extracting data: 1 = extract text from a PDF, 2 = extract contents from a PDF using OCR, 3 = extract contents from a non-PDF file using OCR.
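As an illustration only, the three mode numbers could be written as a Python enum. The member names (PDF_TEXT, PDF_OCR, NON_PDF_OCR) and the helper modes_for_file are made up for this sketch and are not DocVac identifiers; only the numeric values 1, 2 and 3 come from the definition above.

```python
from enum import IntEnum
from typing import List


class ExtractionMode(IntEnum):
    """Hypothetical names for the ExtrMode values described above."""
    PDF_TEXT = 1      # extract text from a PDF
    PDF_OCR = 2       # extract contents from a PDF using OCR
    NON_PDF_OCR = 3   # extract contents from a non-PDF file using OCR


def modes_for_file(filename: str) -> List[ExtractionMode]:
    """Return the modes that could apply to a file, judged only by its extension.

    This is an assumption made for illustration; how DocVac actually
    decides which mode to use is not described here.
    """
    if filename.lower().endswith(".pdf"):
        return [ExtractionMode.PDF_TEXT, ExtractionMode.PDF_OCR]
    return [ExtractionMode.NON_PDF_OCR]
```

For example, modes_for_file("policy.pdf") returns the two PDF modes, while an image file such as "scan.png" would only be a candidate for mode 3.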

PDocRuntime or pdr - there are one or two of these per PDoc, each representing an effort to extract data with a different ExtrMode. For a DocVacBasic user uploading a file, two pdrs for one PDoc are most commonly seen when text extraction from a PDF document fails and a subsequent attempt is made to extract the text using OCR.
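The relationship between a PDoc and its pdrs can be sketched roughly as follows. This is a simplified model under stated assumptions, not DocVac's actual code: the class fields, the process_pdf function, and the two extractor callables are all hypothetical, and "failure" is assumed to mean that text extraction returned nothing.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

PDF_TEXT, PDF_OCR = 1, 2  # ExtrMode values 1 and 2 from the dictionary above


@dataclass
class PDocRuntime:
    extr_mode: int
    text: Optional[str]  # None when the attempt produced nothing usable


@dataclass
class PDoc:
    filename: str
    runtimes: List[PDocRuntime] = field(default_factory=list)


def process_pdf(
    pdoc: PDoc,
    extract_text: Callable[[str], str],  # hypothetical text extractor
    extract_ocr: Callable[[str], str],   # hypothetical OCR extractor
) -> PDoc:
    """Try text extraction first; add a second pdr with OCR only if it fails."""
    text = extract_text(pdoc.filename)
    pdoc.runtimes.append(PDocRuntime(PDF_TEXT, text or None))
    if not text:
        # Text extraction came back empty, so make a second attempt with OCR.
        ocr_text = extract_ocr(pdoc.filename)
        pdoc.runtimes.append(PDocRuntime(PDF_OCR, ocr_text or None))
    return pdoc
```

In this model a PDF with an embedded text layer ends up with one pdr (ExtrMode 1), while a scanned PDF with no text layer ends up with two (ExtrMode 1, then ExtrMode 2).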

 


Last modified: 4/14/2021