Retrieving information from documents and forms has long been a challenge. Even now, at the time of writing, organisations still handle large volumes of paper forms that need to be scanned, classified and mined for specific information to enable downstream automation and efficiencies. Automating this extraction and applying intelligence is a fundamental step toward digital transformation, and one that organisations still struggle to solve in an efficient and scalable manner.
One example is a bank that receives hundreds of kilograms of very diverse remittance forms a day, all of which have to be processed manually in order to extract a few key fields. Another is medical prescriptions, which need to be processed automatically to extract the prescribed medication and quantity.
Typically, organisations have built text mining and search solutions that are tailored to a single scenario, with baked-in application logic, resulting in a brittle solution that is difficult and expensive to maintain.
Thanks to breakthroughs and rapid innovation in the machine learning fields of Computer Vision and Natural Language Processing (NLP), reliable options are now available to provide data-driven solutions that generalise and provide high degrees of accuracy in extracting information from structured forms.
Coupled with Azure services, this provides rapidly deployable, cost-efficient and scalable solutions ready for production workloads.
The goal of this Playbook is to build a set of guidance, tools, examples and documentation that illustrate some known techniques for information extraction, all of which have been applied in real customer solutions.
We hope that the Playbook can significantly reduce the overall development time by simplifying the decision making process from defining the business problem to analysis and development.
The first focus of the Playbook is extraction of information from Forms.
The intended audience of this Playbook includes:
This Playbook aims to provide step-by-step guidance for each phase of a typical Forms Extraction project, alongside typical considerations, key outcomes and code accelerators per phase. To follow the guidance process, see the Walkthrough or dip into the individual code accelerators.
If this is your first foray into this Playbook, the best place to start is the Checklist, followed by the Walkthrough, to ensure that the most important points are addressed in order to build a successful solution in this space.
We refer to the Supervised version of the Form Recognizer service when the argument Use Labels is set to True when training, and to the Unsupervised version of Form Recognizer when the argument Use Labels is set to False.
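As a rough illustration of the distinction above, the sketch below builds the JSON body for a custom model training request. It assumes the v2.x Form Recognizer REST training request, in which the Use Labels option corresponds to a `useLabelFile` field; the helper name `build_train_request` and the placeholder storage URL are hypothetical and only serve the example.

```python
import json


def build_train_request(sas_url: str, use_labels: bool, prefix: str = "") -> dict:
    """Build the JSON body for a custom model training request.

    Assumption: the v2.x REST API shape, where `useLabelFile` carries
    the "Use Labels" choice described in the text.
    """
    return {
        "source": sas_url,  # SAS URL of the blob container holding the training forms
        "sourceFilter": {"prefix": prefix, "includeSubFolders": False},
        "useLabelFile": use_labels,
    }


# Placeholder container URL; substitute your own SAS URL.
container = "https://example.blob.core.windows.net/forms?<sas-token>"

supervised = build_train_request(container, use_labels=True)     # Supervised version
unsupervised = build_train_request(container, use_labels=False)  # Unsupervised version

print(json.dumps(supervised, indent=2))
```

The only difference between the two request bodies is the boolean flag, which is why the Playbook uses it to name the two versions.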
We refer to a form issuer as being the unique source of a form, for example, the vendor of an invoice, or the bank of origin of an application form.
Stage | Scenario | Description |
---|---|---|
AutoLabelling and Prediction | AutoLabelling | Chains AutoLabelling, Training and Prediction on sample invoices |
Pre-Processing Remove Boxes | RemoveBoxes | Shows how to remove boxes that cause OCR errors and find the best image transformation |
Get Values in CheckBoxes | Detect and get CheckBox value | Detects and gets the value from CheckBoxes |
The following code accelerators serve as starting points for trying approaches that are known to work for Knowledge Extraction.
Note: these accelerators are not production ready. They need to be adapted to your data, incorporated into your pipeline, and then tested and profiled.
The code accelerators are available as Jupyter notebooks, APIs and Python scripts that showcase some of the scenarios in this repository using diverse approaches.
Stage | Scenario | Description |
---|---|---|
Project preparation | Checklist | Steps to ensure success |
Project preparation | Decision Guidance | Core decision points |
Project preparation | Data Structure | Recommended training data structure |
Analysis | Understanding the data distribution | Illustrates a simple way to understand the distribution of vendor to invoice frequency |
Analysis | Understanding form variation | Illustrates how to analyse whether variation in a single form type exists |
Analysis | Form layout type labelling using clustering based on text features | Shows an approach which can be used to discover/label different layout types within a large dataset of form images |
Analysis | Form layout clustering based on text and text layout features | Shows another approach which can be used to discover different layouts within a big dataset of images, taking words and positions of words on a page into account |
Analysis | Classifying forms | Illustrates how to use an attribute based search approach to classify forms for Form Recognizer model correlation |
Analysis | Routing forms | Demonstrates how to use OCR results to find which Form Recognizer model to send an unknown form to |
Pre-Processing | Image Channel Normalisation | Illustrates interactive normalisation, binarization and greyscale conversion |
Pre-Processing Remove Boxes | RemoveBoxes | Illustrates interactively how to remove boxes that cause OCR errors and find the best image transformation |
Pre-Processing | Conversion | Converting documents between various formats such as TIF to PDF, JPG to PDF etc |
Pre-Processing | Scan skewness | Illustrates testing and correcting skewness |
Pre-Processing | Projection | Illustrates how to identify document skew and location of text lines |
Pre-Processing | Detect and get CheckBox value | Illustrates how to detect and get a CheckBox value |
Pre-Processing | Optical Mark Recognition | Illustrates some techniques to determine if a checkbox exists and how to extract it |
Training | Dataset representativeness | Illustrates how to test the train and test datasets for representativeness |
Training | Named Entity Recognition | Illustrates how NER can be trained and used to identify and extract entities on a form |
Training | Auto-labelling and training set optimisation | Illustrates how forms can be automatically labelled for the supervised version of Form Recognizer |
Training | Generating a taxonomy | Illustrates a simple approach to generating a taxonomy of known terms from the forms |
Extraction | Custom Corpus | Describes an approach to handling a custom corpus |
Extraction | Handwriting and common OCR Errors | Describes an approach to dealing with common errors |
Extraction | Predicting forms with Form Recognizer Supervised | Predicting forms with Form Recognizer Supervised |
Extraction | Predicting forms with Form Recognizer Unsupervised | Predicting forms with Form Recognizer Unsupervised |
Extraction | Using filter keys from a taxonomy | Illustrates how to filter the keys extracted from the unsupervised version of Form Recognizer using a taxonomy of known terms |
Extraction | Table Extraction | Illustrates extracting tables with Form Recognizer |
Evaluation | Scoring | Illustrates how to evaluate and score with Form Recognizer |
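To make the "Using filter keys from a taxonomy" entry in the table above more concrete, here is a minimal sketch of the idea: keys returned by an unsupervised extraction are normalised and kept only if they match a taxonomy of known terms. The function names, the normalisation rules and the sample data are all assumptions for illustration; the accelerator in the repository may take a different approach.

```python
import re


def normalise(key: str) -> str:
    """Lower-case a key, drop punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", key.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def filter_keys(pairs: dict, taxonomy: set) -> dict:
    """Keep only key/value pairs whose normalised key is a known term."""
    known = {normalise(term) for term in taxonomy}
    return {k: v for k, v in pairs.items() if normalise(k) in known}


# Hypothetical key/value pairs, as an unsupervised model might return them.
extracted = {"Invoice No.:": "12345", "Page": "1 of 2", "TOTAL DUE": "99.00"}
# Hypothetical taxonomy of terms we care about.
taxonomy = {"invoice no", "total due"}

filtered = filter_keys(extracted, taxonomy)
print(filtered)
```

Exact matching after normalisation is the simplest choice; fuzzy matching may be needed where OCR errors affect the keys themselves.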
NEW (▀̿Ĺ̯▀̿ ̿)
Stage | Scenario | Description |
---|---|---|
Invoice Automation | PowerApps | Invoice Automation using the Power Platform |
The Pipelines section contains some example patterns and pipelines for Knowledge Extraction using Azure Services.
Scenario | Description |
---|---|
Azure Cognitive Search | Sample pipeline using Azure Cognitive Search |
Azure Kubernetes Service | Sample pipeline using Azure Kubernetes Service |
Azure Machine Learning | Sample pipeline using Azure Machine Learning |
Azure Logic Apps | Sample pipeline using Azure Logic Apps |
Azure (Durable) Functions | Sample pipeline using Azure (Durable) Functions |
For tips and best practices for managing Form Recognizer models via MLOps and deployment pipelines, view MLOps Tips and Tricks for Form Recognizer.
This section contains some documented common scenarios.
Scenario | Description |
---|---|
CV or Resume Extraction | Sample extraction flow for a CV/Resume |
Email Extraction | Sample extraction from emails |
Geolocation Extraction | Sample extraction for Geolocation |
Prebuilt Receipt Model | Sample extraction for the prebuilt Receipt model |
Table extraction with Form Recognizer | Sample extraction for Tables using Form Recognizer |
Document Extraction detailed example using JFK Files | Sample extraction for Tables using Form Recognizer |
Dealing with multiple languages | Illustrates a few approaches to dealing with multiple languages |
Custom extraction from Japanese forms | Illustrates an approach to custom extraction from Japanese forms |
Informative Image Selection using OCR with Form Recognizer Extraction | Illustrates an approach to selecting the most "informative" image from a group of similar images before extracting data with the Form Recognizer |
Read API detects text content in an image using our latest recognition models and converts the identified text into a machine-readable character stream. It's optimized for text-heavy images (such as documents that have been digitally scanned) and for images with a lot of visual noise. It will determine which recognition model to use for each line of text, supporting images with both printed and handwritten text. The Read API executes asynchronously because larger documents can take several minutes to return a result.
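Because the Read API is asynchronous, a client typically submits the document and then polls for the result. The sketch below shows only the polling half, under stated assumptions: `get_status` is a hypothetical callable that performs the GET against the operation URL returned at submission and returns the parsed JSON, and the status values (`notStarted`, `running`, `succeeded`, `failed`) are those of the v3.x Read API. A fake sequence of responses stands in for the service so the example runs without network access.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed"}


def poll_read_result(
    get_status: Callable[[], dict],
    interval_s: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_status until the operation reaches a terminal state."""
    for _ in range(max_attempts):
        result = get_status()
        if result.get("status") in TERMINAL_STATES:
            return result
        sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"Read operation not finished after {max_attempts} attempts")


# Fake responses standing in for the service (illustrative only).
_fake_responses = iter(
    [
        {"status": "notStarted"},
        {"status": "running"},
        {"status": "succeeded", "analyzeResult": {"readResults": []}},
    ]
)

final = poll_read_result(
    lambda: next(_fake_responses), interval_s=0, sleep=lambda seconds: None
)
print(final["status"])
```

Injecting `get_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to test; in a real pipeline the interval and attempt limit would need tuning to document size.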
OCR API: Computer Vision's optical character recognition (OCR) API is similar to the Read API, but it executes synchronously and is not optimized for large documents. It uses an earlier recognition model but works with more languages.
Azure Cognitive Search is a fully managed search as a service to reduce complexity and scale easily including:
Form Recognizer applies advanced machine learning to accurately extract text, key/value pairs and tables from documents.
The Form Recognizer has two modes of operation:
The Custom Model requires the following for training:
Form Recognizer doesn't currently support these types of input data:
See more on training Form Recognizer here
The requirements for the prebuilt receipt model are slightly different.
Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models, all at the broad scale that the cloud provides. AzureML is presented in notebooks across different scenarios to enhance the efficiency of developing Natural Language systems at scale and for various AI model development related tasks like:
To successfully run these code accelerators, you will need an Azure subscription, or you can try Azure for free. Other Azure services or products may be used in the code; introductions and/or references to those are provided in the code itself.
See CONTRIBUTING.md for contribution guidelines.
Please refer to the following fantastic references for additional material relevant to knowledge extraction: