DocVac Use Cases
Some common usage scenarios (use cases) where DocVac can be helpful:
1. Financial statements or other documents containing tables. These can be run in table extraction mode to extract the data and make it available both through a web interface, including the ability to download as CSV, and through web services for programmatic access.
2. Large numbers of fairly similar documents that are currently processed by humans, e.g. prior insurance policies for an auto insurance company. The documents have a fairly standard set of fields, and DocVac can extract the data with substantially less human effort, driving down cost.
3. Large numbers of very different documents that are currently processed by humans, e.g. insurance claim file documents containing time-limited demands, or voluminous documents obtained through discovery by a law firm. The combination of keywords and data elements in the documents can be used to prioritize which documents are the most important for human review and, within each document, which pages are likely the most important.
4. Large numbers of old documents that incur storage charges, may conflict with corporate document retention policies, and may be needed to support new IT systems. The combination of date and other fields can be used to determine which documents can be discarded.
5. Large numbers of new documents that are too expensive for humans to process but that contain data valuable for competitor/market research and other corporate purposes.
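To illustrate the prioritization idea in use case 3, here is a minimal Python sketch. It is not DocVac's actual method. It assumes the text of each page has already been extracted, and the keyword weights, the ScoredDocument class, and the score_page and prioritize functions are all hypothetical names introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical keyword weights; real terms and weights would be chosen per use case.
KEYWORD_WEIGHTS = {"demand": 3.0, "deadline": 2.0, "settlement": 1.5}


@dataclass
class ScoredDocument:
    name: str
    score: float          # sum of all page scores
    top_pages: list[int]  # 1-based page numbers, highest-scoring first


def score_page(text: str, weights: dict[str, float]) -> float:
    """Sum weight * occurrence count for each keyword, ignoring case."""
    lowered = text.lower()
    return sum(weight * lowered.count(keyword) for keyword, weight in weights.items())


def prioritize(documents: dict[str, list[str]],
               weights: dict[str, float] = KEYWORD_WEIGHTS,
               top_n: int = 3) -> list[ScoredDocument]:
    """Rank documents by total keyword score and list each one's best pages."""
    results = []
    for name, pages in documents.items():
        page_scores = [(score_page(text, weights), number)
                       for number, text in enumerate(pages, start=1)]
        # Highest score first; ties broken by page order.
        ranked = sorted(page_scores, key=lambda item: (-item[0], item[1]))
        top_pages = [number for score, number in ranked if score > 0][:top_n]
        total = sum(score for score, _ in page_scores)
        results.append(ScoredDocument(name, total, top_pages))
    return sorted(results, key=lambda doc: -doc.score)


if __name__ == "__main__":
    docs = {
        "claim_a": ["cover letter",
                    "Demand for settlement; deadline is 30 days",
                    "appendix"],
        "claim_b": ["routine correspondence"],
    }
    for doc in prioritize(docs):
        print(doc.name, doc.score, doc.top_pages)
```

A simple substring count like this will also match longer words that contain a keyword, so a real scoring scheme would likely need word boundaries and would combine keywords with the extracted data elements.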
Other Usage Considerations
A. Document processing time depends on many factors, particularly document complexity and server utilization in the cloud. Generally, we'd expect processing times of 5-15 minutes when incoming documents arrive at a fairly steady rate.
B. A mixture of image types is supported, including PDFs with extractable text (these will provide the best results), PDFs without extractable text, and other forms of images, e.g. a JPG file from a phone photo or a multi-page TIF file from scanning.
C. Handwritten text in documents will not be extracted.
D. Photos of text should be taken straight on, not at an angle.
E. Images with many different font types and sizes, areas of shading, etc. are more difficult to process accurately than more uniform documents.
F. Some documents are so absolutely stuffed with boxes, obscure formatting and lots of data fields in the same row that they require custom coding to extract the data.
G. Most public companies provide an earnings release natively in PDF format; this will generally work better than saving an HTML version as a PDF. HTML pages that are relatively free of extraneous junk, e.g. nasdaq.com, and that are saved as PDFs will usually work better than complex web pages bursting with content. Extracting information from documents filed with the SEC (10-K, 10-Q, etc.) mostly won't work; parsing the information out of the XBRL (XML) format documents filed is a much better approach.
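As a rough illustration of the XBRL approach mentioned in item G, the Python sketch below reads facts from a small XBRL instance document using only the standard library. The sample document is heavily simplified and its values are made up for the example; the extract_facts function is a hypothetical helper, not part of DocVac. It assumes a plain XBRL (XML) instance and does not handle inline XBRL embedded in HTML.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up sample; a real instance document has full contexts and units.
SAMPLE_XBRL = """
<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://fasb.org/us-gaap/2023">
  <context id="FY2023"/>
  <us-gaap:Revenues contextRef="FY2023" unitRef="USD">1250000000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2023" unitRef="USD">98000000</us-gaap:NetIncomeLoss>
</xbrl>
"""


def extract_facts(xbrl_text):
    """Return one dict per fact, i.e. per top-level element with a contextRef."""
    root = ET.fromstring(xbrl_text.strip())
    facts = []
    for element in root:
        context = element.get("contextRef")
        if context is None:
            continue  # skip contexts, units, and other non-fact elements
        tag = element.tag
        if tag.startswith("{"):
            # ElementTree writes qualified names as "{namespace-uri}LocalName".
            namespace, _, concept = tag[1:].partition("}")
        else:
            namespace, concept = "", tag
        facts.append({
            "concept": concept,
            "namespace": namespace,
            "context": context,
            "unit": element.get("unitRef"),
            "value": (element.text or "").strip(),
        })
    return facts


if __name__ == "__main__":
    for fact in extract_facts(SAMPLE_XBRL):
        print(fact["concept"], fact["context"], fact["unit"], fact["value"])
```

Because the figures are already tagged with a concept name, reporting period, and unit, this route avoids the layout problems that make extraction from the rendered filing unreliable.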