Mapping Process¶
This section describes the mapping capabilities of the Fluidsdata Digitization and OCR application. It covers how to map digitized data to structured formats, including database schemas and data models.
UI Overview¶
After the user selects a specific report page, the mapping process begins. This includes several UI steps:
| Step | Description |
|---|---|
| Edit Data | Data tables are identified in the raw data and shaped to support the subsequent mapping processes, including the use of templates. |
| Map Columns | Table columns and units of measure (referred to as UOM) are mapped to the standard data model, either automatically or through user input. |
| Map Headers | Header data is extracted from the report data, either automatically or through user input. Units of measure are mapped to the standard data model. |
| Review Data | Final table and header data is presented to the user in normalized (values corrected to standard datatypes, component names, and enumeration values) and original formats. Table data can then be saved. |
Data Transformation Flow¶
Report data progresses through multiple stages before the final output files are created. Each of these stages is saved as part of the report object model (fdReport).
| Data Set | Description | Where Displayed |
|---|---|---|
| table_data_raw | Dataframe containing the original table data extracted from the report. | not displayed |
| table_data_edited | Dataframe containing the table data that has been curated prior to mapping and updated by the table matching process and data editing UI. | Edit Data, Map Columns |
| | Array of objects containing mapping data for each table column (original value, predicted value, mapped value, etc.). | Map Columns |
| | Dataframe containing mapping data for each header. | Review |
| | Dataframe containing the table data with columns updated to the standard data model with added UOMs. | Review |
| | Dataframe containing the header data with columns updated to the standard data model with added UOMs. | Review |
| | Dataframe containing the table data with values normalized to correct/standardized datatypes and enumerations. | Review |
| | Dataframe containing the header data with values normalized to correct/standardized datatypes and enumerations. | Review |
| | The output file for the table, with header and table data combined into one CSV file. | Saved Data, Test Data |
| | All output files for the same test type concatenated into one CSV file. Multiple files for the same report, sample ID, and test type are merged into the same rows where appropriate (e.g. data for the same sample can be spread across multiple tables). | Test Data |
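The row-merging behavior described for the concatenated output file can be sketched with pandas. The column and sample names below are hypothetical illustrations, not the app's actual schema:

```python
import pandas as pd

# Two hypothetical output files for the same report, sample IDs, and
# test type, each carrying different columns for the same samples.
part_a = pd.DataFrame({
    "sampleID": ["S-1", "S-2"],
    "stepPressure": [5000, 4500],
    "relativeVolume": [0.98, 1.02],
})
part_b = pd.DataFrame({
    "sampleID": ["S-1", "S-2"],
    "liquidDensity": [0.71, 0.73],
})

# Merging on the shared key puts values for the same sample into the
# same row instead of duplicating rows in the concatenated file.
combined = part_a.merge(part_b, on="sampleID", how="outer")
```

An outer merge keeps samples that appear in only one of the files, which matches the "where appropriate" caveat above.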
Data Editing¶
Raw data is extracted during import of PDF and Excel report documents (other formats in the future) and stored as an unstructured table. The data editing process shapes the data into a structured table that can be mapped to the standard data model. This includes:
- Identifying the table type (e.g. which PVT test table it represents).
- Identifying the rows and columns that make up a table to be mapped to the standard data model, for example the specific rows and columns of the CCE (Constant Composition Expansion) Test Step table.
- Identifying which row(s) contain column heading information for the refined table and setting them as column names in the structured table.
- Editing table data to correct, add, or delete data that was incorrectly extracted or missing from the report.
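The heading-promotion step can be illustrated with a small sketch (hypothetical data and function, not the app's actual code): a row in the unstructured table is chosen as the header, becomes the column names, and everything above it is dropped.

```python
import pandas as pd

# Raw extracted data: the column headings sit in an ordinary data row
# below a title line (illustrative example only).
raw = pd.DataFrame([
    ["CCE Test Results", None, None],
    ["Pressure", "Relative Volume", "Density"],
    ["5000", "0.98", "0.71"],
    ["4500", "1.02", "0.70"],
])

def promote_header_row(df, header_row):
    """Use the given row as column names and keep only the rows below it."""
    out = df.iloc[header_row + 1:].copy()
    out.columns = df.iloc[header_row].tolist()
    return out.reset_index(drop=True)

structured = promote_header_row(raw, header_row=1)
```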
A table object is created for each page in the report that contains extracted data. The OCR process for importing PDF files has the option to create multiple tables per page if it recognizes more than one table pattern.
The extracted data for each table is stored as a DataFrame in table.table_data_raw, where table is the Table object (fdReport).
This is the starting point for the data editing process and, save for one exception, is never changed after the initial extraction.
This allows the user to always return to the original data if needed.
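A minimal sketch of these two dataframes on the Table object (simplified; the real fdReport model holds more state than this):

```python
from dataclasses import dataclass
from typing import Optional
import pandas as pd

@dataclass
class Table:
    """Simplified stand-in for the fdReport Table object."""
    table_data_raw: pd.DataFrame
    table_data_edited: Optional[pd.DataFrame] = None

    def __post_init__(self):
        if self.table_data_edited is None:
            # Editing starts from a copy, so the raw extraction is preserved
            # and the user can always return to the original data.
            self.table_data_edited = self.table_data_raw.copy()

table = Table(pd.DataFrame({"col": ["a", "b"]}))
table.table_data_edited.loc[0, "col"] = "edited"
# table.table_data_raw still holds the original value "a".
```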
Re-Extracting Tables¶
Note
All changes described below are temporary only while the current table is being worked on. If the table is saved at the end of the mapping process, the changes are saved for later sessions. Any unsaved changes are lost when the session ends.
Tables can be identified incorrectly during the OCR/table reconstruction process. For example, the number of columns may be incorrect (e.g. thrown off by header/footer data being interpreted as rows) so the data is not aligned correctly in the resulting table. While there are several behind-the-scenes processes to minimize these issues (Table Reconstruction), problem tables can still slip through.
In this case, the user is given the ability to re-extract the table data with more fine-tuned control.
Note
When re-extracting a table the original text data is not changed, only the table structure is redefined. This saves on processing time as OCR is not re-run, but it should be noted that errors in the text itself will not be corrected with this method, only the table layout/structure (manual data editing is still required for correcting text errors).
Process:

1. Select the button Manually select table on image.
2. A red bounding box will appear on the image, indicating the area to be re-extracted (the area inside the box will be treated as a table; there are no restrictions on the size or location of the box).
3. Drag and resize the box to enclose the desired content.
4. Select the Process selected area as table button to view the re-extracted table data underneath the image.
5. If more refinement is needed, the user can specify the number of columns desired in the table, as well as increment the row count (up or down) to its desired value. The reconstruction algorithm will automatically adjust text positions to fit the specified number of columns and/or rows.
6. After supplying row or column counts, re-run the extraction using the same button as before.
7. When satisfied with the new table structure, select the Replace Table button. This replaces the previous table data with the newly extracted table data, updating the table_data_raw and table_data_edited dataframes in the Table object. (As mentioned above, these changes will not persist between sessions unless the table is saved at the end of the mapping process.)
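The column-fitting behavior in step 5 can be illustrated with a toy routine. This is an assumption-laden sketch of binning OCR word positions into a requested number of columns, not the app's actual reconstruction algorithm:

```python
# Words are (x, y, text) boxes already cropped to the user's bounding box.
def rebuild_table(words, n_cols, x_min, x_max):
    """Force OCR words into n_cols columns by binning their x positions."""
    col_width = (x_max - x_min) / n_cols
    rows = {}
    for x, y, text in words:
        col = min(int((x - x_min) / col_width), n_cols - 1)
        rows.setdefault(y, {})[col] = text
    # Emit one list per row, missing cells filled with empty strings.
    return [
        [cells.get(c, "") for c in range(n_cols)]
        for _, cells in sorted(rows.items())
    ]

words = [(10, 0, "Pressure"), (60, 0, "Volume"),
         (10, 1, "5000"), (60, 1, "0.98")]
table = rebuild_table(words, n_cols=2, x_min=0, x_max=100)
```

Changing `n_cols` re-bins the same word positions without re-running OCR, which is why text errors survive re-extraction while the layout changes.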
If more than one table is present on the page, overwriting with the re-extracted table will only affect the currently selected table. If the user wishes to extract multiple tables from one image, they must first make copies of the original table object, then re-extract and replace each copy individually.
Warning
This function allows the user to select multiple bounding boxes on the image, but only the first can be used to update the current table. Potential future improvements could allow the user to identify multiple tables from one page and update all existing tables in one go. Another option to avoid confusion would be to edit the manual selection streamlit plugin to only allow one selection at a time.
Applying Templates¶
Templates are the primary way that table types are identified and data is shaped. Template application is the only process that happens automatically during the mapping process.
Note
The following sections describe ways the user can modify the table data before or after templates are applied. User edits affect only the current table, and are not automatically applied to other tables of the same type. If templates are created against edited tables, they will typically not work with other tables and reports unless they already match the edited format.
Editing Table Data¶
Any data in the table can be edited by the user to correct, add, or remove data prior to the mapping step.
The user selects the Edit Data button to open the data editing interface where they can make and save changes (updating table_data_edited).
Note
Typically, only table contents (e.g. text results) should be edited here. Re-extraction should be used to change the table structure.
Reset Table¶
The user can select the Reset Table button if they want to revert to the original, unstructured table data.
This copies table_data_raw into table_data_edited and can be done regardless of whether the table has been previously saved.
Multiple Tables¶
There can be multiple table objects associated with a single page. These can be identified automatically in the OCR process (if the table identification toggle is enabled), by matching more than one template to the data, or by user actions described below. If more than one table is defined for a page, table selection buttons will appear at the top of the page with the currently selected table highlighted in bold.
Add Table¶
The user can add a new table of a specified type with a specified number of rows. This will be an empty table with all possible fields for the selected table type. The user can enter or paste data into the appropriate columns.
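Adding a table amounts to creating an empty dataframe with all possible fields for the chosen type. The table-type name and column list below are hypothetical; the app derives them from the standard data model:

```python
import pandas as pd

# Illustrative mapping of table type to its possible fields
# (not the app's actual schema).
TABLE_TYPE_COLUMNS = {
    "cceTestStep": ["stepPressure", "relativeVolume", "liquidDensity"],
}

def add_table(table_type, n_rows):
    """Create an empty table of the given type with n_rows blank rows."""
    cols = TABLE_TYPE_COLUMNS[table_type]
    return pd.DataFrame([[""] * len(cols)] * n_rows, columns=cols)

new_table = add_table("cceTestStep", n_rows=3)
```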
Copy Table¶
Sometimes there are multiple sets of desired data on a page, but they cannot be separated easily. For example, there can be two tables of data that share a common header (e.g. composition and compositionProperties). In such cases, the user can make a copy of the current table and edit both versions independently.
Note
The app tracks the origins of the copied table, and allows the user to delete the copy without affecting the original table.
Split Table¶
Similarly, the user may want to split a table in two in order to facilitate mapping. They select the row to split at, and the table is divided into two at that point, including both table_data_raw and table_data_edited dataframes. The app tracks the origins of the split table and lets the user unsplit if needed, joining the two tables back together. A split table can be further split if needed; however, the splits can only be undone in reverse order.
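The split and unsplit operations can be sketched as cutting a dataframe at the selected row and concatenating the pieces back in order (a simplified illustration; the real app applies this to both raw and edited dataframes and records the origin):

```python
import pandas as pd

def split_table(df, split_row):
    """Divide a table into two at the selected row."""
    top = df.iloc[:split_row].reset_index(drop=True)
    bottom = df.iloc[split_row:].reset_index(drop=True)
    return top, bottom

def unsplit_table(top, bottom):
    """Rejoin the two pieces in their original order."""
    return pd.concat([top, bottom], ignore_index=True)

df = pd.DataFrame({"v": [1, 2, 3, 4]})
top, bottom = split_table(df, split_row=2)
restored = unsplit_table(top, bottom)
```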
Note
This is the only editing function that changes table_data_raw. Resetting a split table only resets the split portion of the table.
Note
Template matching can find multiple candidate tables in a page and split them automatically, so manually splitting is not likely to be needed.
Row Functions¶
Rows can be added, deleted, duplicated, edited, and moved vertically within the table. Since template matching automatically deletes unnecessary rows, these functions are typically only required if the import process is not accurate or if the report format necessitates copying and editing a table.
Column Functions¶
Columns can be added, deleted, or duplicated. They can also be split (in case a single cell contains multiple values) or merged (in case a single column is incorrectly imported as two). Again, these functions are not typically required.
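The split and merge column operations can be sketched with pandas string operations. The column names, delimiter, and data below are hypothetical examples:

```python
import pandas as pd

# Split: one cell holding two values becomes two columns.
df = pd.DataFrame({"pressure_temp": ["5000 / 220", "4500 / 218"]})
df[["pressure", "temperature"]] = (
    df["pressure_temp"].str.split(" / ", expand=True)
)

# Merge: a value incorrectly imported as two columns is rejoined.
parts = pd.DataFrame({"a": ["Methane", "Eth"], "b": ["", "ane"]})
parts["component"] = parts["a"] + parts["b"]
```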