Deep Learning

Many large enterprises deal with unstructured data, such as PDF documents. These documents are often complex and require extensive transformation to create searchable data. In most cases, this means copying relevant information from the original documents and converting it into a structured, machine-readable format such as JSON or CSV. This is a laborious and error-prone process that can take days or even weeks.
To make the process easier, a machine-learning system first identifies the pages that contain the required data. For example, a machine-learning model trained on a dataset from Statistics Canada would identify the subset of a document with a high concentration of tables, and would then use the key features of those pages to build a classification model. The pages identified in this first step are then passed to an algorithm called SLICE, which extracts their information into a table.
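The page-selection step described above can be illustrated with a minimal sketch. It assumes the text of each page has already been extracted, and it replaces the trained classification model with a simple hand-written threshold rule, so it is a stand-in rather than the actual system. The feature names, thresholds, and function names (`table_features`, `is_table_page`, `select_table_pages`) are hypothetical, and SLICE itself is not shown.

```python
import re

# Tokens such as "14,000", "3.2" or "12%" count as numeric.
NUMERIC_TOKEN = re.compile(r"[-+]?[\d,]*\.?\d+%?")
# Cells in a text-rendered table are assumed to be separated by 2+ spaces or a tab.
CELL_SEPARATOR = re.compile(r" {2,}|\t")


def table_features(page_text: str) -> dict:
    """Compute two simple (hypothetical) features that hint at tabular content."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    tokens = page_text.split()
    if not lines or not tokens:
        return {"numeric_ratio": 0.0, "columnar_ratio": 0.0}
    numeric = sum(1 for tok in tokens if NUMERIC_TOKEN.fullmatch(tok))
    # A line is "columnar" if it splits into at least three cells.
    columnar = sum(1 for ln in lines if len(CELL_SEPARATOR.split(ln)) >= 3)
    return {
        "numeric_ratio": numeric / len(tokens),
        "columnar_ratio": columnar / len(lines),
    }


def is_table_page(page_text: str,
                  numeric_threshold: float = 0.3,
                  columnar_threshold: float = 0.5) -> bool:
    """Threshold rule standing in for a trained classifier (assumed values)."""
    feats = table_features(page_text)
    return (feats["numeric_ratio"] >= numeric_threshold
            and feats["columnar_ratio"] >= columnar_threshold)


def select_table_pages(pages: list) -> list:
    """Return the indices of pages that would be handed to the extraction step."""
    return [i for i, text in enumerate(pages) if is_table_page(text)]
```

In a real pipeline the threshold rule would be replaced by a model fitted on labeled pages, and the selected page indices would be forwarded to the extraction algorithm.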
The training data will consist of sets of PDFs paired with XML files that describe the metadata of those PDFs. For example, the metadata of a typeset PDF will be defined in JATS (Journal Article Tag Suite), while the metadata of a PDF created by an author will be less detailed. Once this process is done, the PDFs will be analyzed for their content.
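As a rough sketch of how such PDF and XML pairs might be assembled, the snippet below matches files by name and checks whether a metadata file looks like JATS. The convention that a PDF and its XML share a file stem is an assumption made for illustration, as are the function names `pair_training_files` and `looks_like_jats`. The JATS check relies only on the fact that JATS documents use `<article>` as their root element, so it is a heuristic and not a validation.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def pair_training_files(data_dir):
    """Pair each PDF with the XML file sharing its stem (assumed naming convention)."""
    data_dir = Path(data_dir)
    pairs = []
    for pdf_path in sorted(data_dir.glob("*.pdf")):
        xml_path = pdf_path.with_suffix(".xml")
        # PDFs without a metadata file are skipped rather than guessed at.
        if xml_path.exists():
            pairs.append((pdf_path, xml_path))
    return pairs


def looks_like_jats(xml_path) -> bool:
    """Heuristic check: is the root element <article>, as in JATS?"""
    root = ET.parse(xml_path).getroot()
    # Drop a namespace prefix of the form "{uri}tag" if one is present.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name == "article"
```

Pairs whose XML fails the check could be routed to a separate path that expects the less detailed author-supplied metadata.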
This automated process can analyze up to 70,000 PDF documents a year. It also reduces the time needed to capture the required information manually and, in addition, reduces data redundancy.