Mapping Process

This section describes the mapping capabilities of the Fluidsdata Digitization and OCR application. It covers how to map digitized data to structured formats, including database schemas and data models.

UI Overview

After the user selects a specific report page, the mapping process begins. This includes several UI steps:

Mapping Steps

  • Edit Data: Data tables are identified in the raw data and shaped to support the subsequent mapping processes, including the use of templates.

  • Map Columns: Table columns and units of measure (referred to as UOM) are mapped to the standard data model, either automatically or through user input.

  • Map Headers: Header data is extracted from the report data, either automatically or through user input. Units of measure are mapped to the standard data model.

  • Review Data: Final table and header data is presented to the user in normalized (values corrected to standard datatypes, component names, and enumeration values) and original formats. Table data can then be saved.

Data Transformation Flow

Report data progresses through multiple stages before the final output files are created. Each of these stages is saved as part of the report object model (fdReport).

Figure: Data Progression Flow

Report Object Data Sets

Each entry gives the data set name, its description, and where it is displayed.

  • table_data_raw: Dataframe containing the original table data extracted from the report. Where displayed: not displayed.

  • table_data_edited: Dataframe containing the table data that has been curated prior to mapping and updated by the table matching process and data editing UI. Where displayed: Edit Data, Map Columns.

  • table_column_mappings: Array of objects containing mapping data for each table column (original value, predicted value, mapped value, etc.). Where displayed: Map Columns.

  • table_header_mapping: Dataframe containing mapping data for each header. Where displayed: Review.

  • table_data_mapped: Dataframe containing the table data with columns updated to standard data model with added UOMs. Where displayed: Review.

  • header_data_mapped: Dataframe containing the header data with columns updated to standard data model with added UOMs. Where displayed: Review.

  • table_data_normalized: Dataframe containing the table data with values normalized to correct/standardized datatypes and enumerations. Where displayed: Review.

  • header_data_normalized: Dataframe containing the header data with values normalized to correct/standardized datatypes and enumerations. Where displayed: Review.

  • <report>_<page>_<table>_<table_type>.csv: The output file for the table, with header and table data combined into one CSV file. Where displayed: Saved Data, Test Data.

  • Merged_<table_type>.csv: All output files for the same test type concatenated into one CSV file. Multiple files for the same report, sample ID, and test type are merged into the same rows where appropriate (e.g. data for the same sample can be spread across multiple tables). Where displayed: Test Data.
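
The row merging described for Merged_<table_type>.csv can be illustrated with a minimal pandas sketch. The key column names (report, sample_id) and the helper merge_outputs are assumptions made for illustration; the actual column names and merge rules used by the application are not given in this section.

```python
import pandas as pd

# Hypothetical key columns; the real output files may use different names.
KEYS = ["report", "sample_id"]


def merge_outputs(frames):
    """Concatenate per-table outputs and collapse rows that share the same keys."""
    combined = pd.concat(frames, ignore_index=True)
    # first() keeps the first non-null value per column within each group,
    # so data spread across several tables ends up in one row per sample.
    return combined.groupby(KEYS, as_index=False, sort=False).first()


# Two outputs of the same test type holding different fields for one sample.
table_a = pd.DataFrame({"report": ["R1"], "sample_id": ["S1"], "pressure": [5000.0]})
table_b = pd.DataFrame({"report": ["R1"], "sample_id": ["S1"], "temperature": [212.0]})

merged = merge_outputs([table_a, table_b])
print(merged)
```

In practice the frames would come from reading the individual output CSV files (for example with pandas.read_csv) rather than being built in memory.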

Data Editing

Raw data is extracted during import of PDF and Excel report documents (other formats may be supported in the future) and stored as an unstructured table. The data editing process shapes the data into a structured table that can be mapped to the standard data model. This includes:

  • Identifying the table type (e.g. which PVT test table it represents)

  • Identifying the rows and columns that make up a table to be mapped to the standard data model, for example identifying the specific rows and columns of the CCE (Constant Composition Expansion) Test Step table.

  • Identifying which row(s) contain column heading information for the refined table and setting them as column names in the structured table.

  • Editing table data to correct, add, or delete data that was incorrectly extracted or missing from the report.
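
The shaping steps listed above (selecting the rows that belong to the table and promoting heading rows to column names) might look like the following in pandas. This is a sketch under assumptions: the function promote_header_rows and the sample values are hypothetical and are not the application's implementation.

```python
import pandas as pd


def promote_header_rows(raw, header_rows, first_data_row):
    """Build a structured table from an unstructured one.

    header_rows: positions of the row(s) holding column heading text.
    first_data_row: position of the first row of table data.
    """
    # Join multi-row headings column by column, skipping empty cells.
    names = [
        " ".join(
            str(v)
            for v in raw.iloc[header_rows, col]
            if pd.notna(v) and str(v).strip()
        )
        for col in range(raw.shape[1])
    ]
    table = raw.iloc[first_data_row:].reset_index(drop=True)
    table.columns = names
    return table


# Unstructured extraction: a title row, two heading rows, then data rows.
raw = pd.DataFrame([
    ["Constant Composition Expansion", None],
    ["Pressure", "Relative Volume"],
    ["psia", None],
    ["5000", "0.9500"],
    ["4000", "0.9700"],
])

edited = promote_header_rows(raw, header_rows=[1, 2], first_data_row=3)
print(edited)
```

The raw frame is left untouched, which mirrors the separation between table_data_raw and table_data_edited.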

A table object is created for each page in the report that contains extracted data. The OCR process for importing PDF files has the option to create multiple tables per page if it recognizes more than one table pattern.

The extracted data for each table is stored as a DataFrame in table.table_data_raw, where table is the Table object (fdReport). This is the starting point for the data editing process and, apart from the exceptions described below (replacing a table after re-extraction and splitting a table), is never changed after the initial extraction. This allows the user to always return to the original data if needed.

Re-Extracting Tables

Note

All changes described below are temporary and apply only while the current table is being worked on. If the table is saved at the end of the mapping process, the changes are kept for later sessions. Any unsaved changes are lost when the session ends.

Tables can be identified incorrectly during the OCR/table reconstruction process. For example, the number of columns may be wrong (e.g. thrown off by header or footer data being interpreted as rows), so the data is misaligned in the resulting table. While several behind-the-scenes processes minimize these issues (see Table Reconstruction), problem tables can still slip through.

In this case, the user is given the ability to re-extract the table data with more fine-tuned control.

Note

Re-extracting a table does not change the original text data; only the table structure is redefined. This saves processing time because OCR is not re-run, but it also means that errors in the text itself are not corrected, only the table layout/structure. Manual data editing is still required to correct text errors.

Process:

  • Select the button Manually select table on image.

  • A red bounding box appears on the image, marking the area to be re-extracted. Everything inside the box is treated as a table; there are no restrictions on the size or location of the box.

  • Drag and resize the box to enclose the desired content.

  • Select the Process selected area as table button to view the re-extracted table data underneath the image.

  • If more refinement is needed, the user can specify the desired number of columns in the table and adjust the row count up or down to the desired value. The reconstruction algorithm automatically adjusts text positions to fit the specified number of columns and/or rows.

  • After supplying row or column counts, re-run the extraction using the same button as before.

  • When satisfied with the new table structure, select the Replace Table button. This will replace the previous table data with the newly extracted table data. This updates the table_data_raw and table_data_edited dataframes in the Table object. (As mentioned above, these changes will not persist between sessions unless the table is saved at the end of the mapping process.)
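
To give a feel for what fitting already-recognized text to a specified column count can involve, here is a deliberately simple sketch that bins text items into equal-width columns by horizontal position. It is illustrative only: the function fit_to_columns, the (text, x, row) input format, and the equal-width binning are all assumptions, and the application's reconstruction algorithm is not described in this section.

```python
def fit_to_columns(words, n_cols):
    """Assign (text, x, row) items to n_cols equal-width column bins.

    Illustrative only: not the application's reconstruction algorithm.
    Items are assumed to be ordered left to right within each row.
    """
    xs = [x for _, x, _ in words]
    left, right = min(xs), max(xs)
    width = (right - left) / n_cols or 1.0  # fall back if all x values match
    n_rows = max(row for _, _, row in words) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for text, x, row in words:
        col = min(int((x - left) / width), n_cols - 1)
        # Text that falls into the same cell is joined with a space.
        grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


# Text is reused as-is (no OCR re-run); only the layout is recomputed.
words = [
    ("Pressure", 10, 0), ("Relative", 110, 0), ("Volume", 150, 0),
    ("5000", 12, 1), ("0.95", 120, 1),
]
grid = fit_to_columns(words, n_cols=2)
print(grid)
```

Because only positions are regrouped, any misread characters in the text carry over unchanged, which is why manual editing is still needed for text errors.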

If more than one table is present on the page, overwriting with the re-extracted table will only affect the currently selected table. If the user wishes to extract multiple tables from one image, they must first make copies of the original table object, then re-extract and replace each copy individually.

Warning

This function allows the user to draw multiple bounding boxes on the image, but only the first is used to update the current table. A potential future improvement would let the user identify multiple tables on one page and update all existing tables in one go. Alternatively, to avoid confusion, the manual selection Streamlit plugin could be changed to allow only one selection at a time.

Applying Templates

Templates are the primary way that table types are identified and data is shaped. Applying templates is the only process that happens automatically during the mapping process.

Note

The following sections describe ways the user can modify the table data before or after templates are applied. User edits affect only the current table, and are not automatically applied to other tables of the same type. If templates are created against edited tables, they will typically not work with other tables and reports unless they already match the edited format.

Editing Table Data

Any data in the table can be edited by the user to correct, add, or remove data prior to the mapping step.

The user selects the Edit Data button to open the data editing interface where they can make and save changes (updating table_data_edited).

Note

Typically, only table contents (e.g. text results) should be edited here. Re-extraction should be used to change the table structure.

Reset Table

The user can select the Reset Table button to revert to the original, unstructured table data. This copies table_data_raw into table_data_edited and can be done regardless of whether the table has been previously saved.
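
The reset behaviour can be sketched as follows. The Table class here is a hypothetical stand-in that carries only the two dataframes named in this section; the real Table object has more attributes and may implement the reset differently.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class Table:
    """Hypothetical stand-in for the application's Table object."""

    table_data_raw: pd.DataFrame
    table_data_edited: Optional[pd.DataFrame] = None

    def __post_init__(self):
        if self.table_data_edited is None:
            self.table_data_edited = self.table_data_raw.copy()

    def reset(self):
        # Copy rather than alias, so later edits can never reach the raw data.
        self.table_data_edited = self.table_data_raw.copy(deep=True)


table = Table(pd.DataFrame({"A": ["1", "2"]}))
table.table_data_edited.loc[0, "A"] = "9"   # a user edit
table.reset()                               # discard the edit
print(table.table_data_edited)
```

Copying instead of assigning the same object keeps table_data_raw safe from subsequent edits to table_data_edited.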

Multiple Tables

There can be multiple table objects associated with a single page. These can be identified automatically in the OCR process (if the table identification toggle is enabled), by matching more than one template to the data, or by user actions described below. If more than one table is defined for a page, table selection buttons will appear at the top of the page with the currently selected table highlighted in bold.

Figure: Multiple Tables Selection

Add Table

The user can add a new table of a specified type with a specified number of rows. This will be an empty table with all possible fields for the selected table type. The user can enter or paste data into the appropriate columns.

Copy Table

Sometimes there are multiple sets of desired data on a page, but they cannot be separated easily. For example, there can be two tables of data that share a common header (e.g. composition and compositionProperties). In such cases, the user can make a copy of the current table and edit both versions independently.

Note

The app tracks the origins of the copied table, and allows the user to delete the copy without affecting the original table.

Split Table

Similarly, the user may want to split a table in two to facilitate mapping. The user selects the row to split at, and the table is divided in two at that point, including both the table_data_raw and table_data_edited dataframes. The app tracks the origins of the split table and lets the user unsplit if needed, joining the two tables back together. A split table can be split further if needed; however, splits can only be undone in reverse order.

Note

This is the only editing function that changes table_data_raw. Resetting a split table only resets the split portion of the table.

Note

Template matching can find multiple candidate tables in a page and split them automatically, so manual splitting is unlikely to be needed.
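
The reverse-order rule for undoing splits follows naturally if splits are tracked on a stack. The sketch below shows this with a single DataFrame per table; the names tables, history, split_table, and undo_last_split are hypothetical, and the application applies the split to both table_data_raw and table_data_edited.

```python
import pandas as pd

raw = pd.DataFrame({"value": ["a", "b", "c", "d", "e"]})

tables = [raw]   # current tables, in page order
history = []     # list positions of split tables, oldest first (a stack)


def split_table(index, row):
    """Split tables[index] in two at the given row position."""
    df = tables[index]
    top = df.iloc[:row].reset_index(drop=True)
    bottom = df.iloc[row:].reset_index(drop=True)
    tables[index:index + 1] = [top, bottom]
    history.append(index)


def undo_last_split():
    """Rejoin the two halves of the most recent split (last in, first out)."""
    index = history.pop()
    joined = pd.concat([tables[index], tables[index + 1]], ignore_index=True)
    tables[index:index + 2] = [joined]


split_table(0, 2)   # [a, b] and [c, d, e]
split_table(1, 1)   # [a, b], [c] and [d, e]
n_after_split = len(tables)

undo_last_split()   # back to [a, b] and [c, d, e]
undo_last_split()   # back to the original table
restored = tables[0]
print(restored)
```

Undoing in any other order would leave the recorded positions pointing at the wrong tables, which is one reason to restrict undo to the most recent split.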

Row Functions

Rows can be added, deleted, duplicated, edited, and moved vertically within the table. Since template matching automatically deletes unnecessary rows, these functions are typically only required if the import process is not accurate or if the report format necessitates copying and editing a table.

Column Functions

Columns can be added, deleted, or duplicated. They can also be split (in case a single cell contains multiple values) or merged (in case a single column is incorrectly imported as two). Again, these functions are not typically required.
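
For reference, splitting and merging columns can be expressed in pandas roughly as shown below. The column names and values are invented for illustration, and the application's own column functions may handle separators and edge cases differently.

```python
import pandas as pd

df = pd.DataFrame({
    "Pressure / Temp": ["5000 / 212", "4000 / 212"],  # two values per cell
    "rv_left": ["0.", "1."],                          # one column imported as two
    "rv_right": ["95", "02"],
})

# Split: a column whose cells hold two values becomes two columns.
parts = df["Pressure / Temp"].str.split("/", n=1, expand=True)
df["Pressure"] = parts[0].str.strip()
df["Temp"] = parts[1].str.strip()

# Merge: two columns that were a single column in the report are joined.
df["Relative Volume"] = df["rv_left"] + df["rv_right"]

# Drop the source columns once the corrected ones exist.
df = df.drop(columns=["Pressure / Temp", "rv_left", "rv_right"])
print(df)
```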