Templates¶
Templates define patterns of rows and columns that identify tables of interest within extracted data. Once defined, they become part of a library that is used for processing all subsequent reports. Template matching is done automatically during the file import process and can be manually initiated in the Edit Data step or via the Predict feature.
Template Creation¶
Table templates define patterns of rows and columns that identify specific table types. To create a template in the Edit Data view:
Select one or more contiguous columns in the data table. These are the columns that will be mapped to the data model in a later step.
Include any empty/unimportant columns between the first and last columns of interest, since the template must match the exact layout of the columns.
Empty columns will be removed automatically, and unwanted columns can be ignored later in the process.
Columns outside those selected will be ignored.
Select one or more contiguous rows in the data table that make up the complete heading of the desired data table.
The selection should start at the first row that contains heading data and continue down to include the last row before the table data starts.
Include any empty rows that may exist between the first and last heading rows, since the template must match the exact layout of the rows.
Rows above this will be ignored.
Select the table type from the dropdown menu.
Select the Create Template button.
This creates a new template for the selected table type with a unique numeric ID, and records the values for each selected column and row.
Note
The text in the selected template cells may be incorrect (e.g. spelling mistakes, improper splitting of spanning text, etc.) because of the import process, but this does not matter. In a later step, these cell values will be mapped to the correct standard model fields, and as long as the import process consistently outputs the same incorrect text, it will be mapped correctly for subsequent reports.
It may be tempting to edit the data before creating the template. This will work, but the same edits will need to be made manually for the next report that contains the same errors, and this type of error is generally systematic/repeated for similar reports.
Warning
The term dictionary can potentially help to automatically clean up the text before the templating process to avoid these improper terms. However, many of these errors involve text that is split across several columns and rows, so the term correction process could be more involved.
Warning
Templates, like all other configurations, are stored globally in the app and reused across all tenants. This may not be appropriate in the future for multiple customers, where one customer’s configuration might impact another’s.
Template Transformation¶
Template values are transformed before being stored, with unwanted characters removed/replaced and some data being generalized:
\n line breaks are replaced with spaces.
Standalone # characters are removed (these are common in some customer files).
HTML fragments are removed (e.g. <span> tags).
Leading/trailing whitespace is removed.
Blank cells and NaN values are set to None.
Numbers in parentheses (e.g. footnotes) are changed to numbers in square brackets, e.g. (-1) becomes [-1]. This prevents them from being interpreted as negative numbers if loaded into Excel.
Standalone numbers are changed to <num> so that a template can match any data that has the right pattern of words and numbers without needing a separate template for each unique set of numbers. For example, the DLComposition table contains columns for each pressure step, but the exact pressures can be different for each report.
Temperatures are identified and their numeric component is replaced with <temp> so that the template can match any test that includes a temperature in the right location without needing to match the exact temperature value. For example, 175degF would be replaced with <temp>degF. Ideally, the temperature unit should also be replaced, but this is not currently done.
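The rules above can be sketched as a single cell-level function. This is an illustrative sketch only, not the app's code: the function name `transform_value`, the regular expressions, the order in which the rules are applied, and the set of recognised temperature units (`degF`, `degC`, `degK`, `degR`) are all assumptions.

```python
import math
import re

# All patterns below are assumptions made for illustration.
TAG_RE = re.compile(r"<[^>]+>")                                   # HTML fragments such as <span>
HASH_RE = re.compile(r"(?<!\S)#(?!\S)")                           # '#' surrounded by whitespace
TEMP_RE = re.compile(r"-?\d+(?:\.\d+)?(?=\s*deg[FCKR]\b)")        # numeric part of a temperature
PAREN_NUM_RE = re.compile(r"\((-?\d+(?:\.\d+)?)\)")               # (1) or (-1)
NUM_RE = re.compile(r"(?<![\w\[<.-])-?\d+(?:[.,]\d+)*(?![\w\]>])")  # standalone numbers


def transform_value(value):
    """Hypothetical helper: apply the listed clean-up rules to one cell."""
    # Blank cells and NaN values become None.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).replace("\n", " ")      # line breaks -> spaces
    text = TAG_RE.sub("", text)               # drop HTML fragments
    text = HASH_RE.sub("", text)              # drop standalone '#'
    text = TEMP_RE.sub("<temp>", text)        # 175degF -> <temp>degF (unit kept)
    text = PAREN_NUM_RE.sub(r"[\1]", text)    # (-1) -> [-1]
    text = NUM_RE.sub("<num>", text)          # remaining standalone numbers
    text = re.sub(r"\s+", " ", text).strip()  # tidy whitespace
    return text or None
```

Temperatures and bracketed numbers are handled before the generic number rule so that they are not swallowed by `<num>`; the real implementation may order or define these steps differently.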
The transformed templates are stored in pvt_templates.csv and are loaded into the configuration cache as templates_df at app startup or when a configuration reload is requested.
When the templates are loaded into the configuration cache, a dictionary is added to the configuration as templates.
This is keyed by the template name (<table_type>_<template_number>) and contains the template keys and values for the template.
Note
Template ‘keys’ are not dictionary keys, but rather a string representation of all values that make up a row of the template. They are used for quick pattern matching during the template matching process.
Keys are created for each row in the template by concatenating each of the column values, separated by a double underscore ‘__’.
Values are an array of all the individual values that made up the key.
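As a rough illustration of the structure described above, the sketch below builds one entry of the templates dictionary from already-transformed template rows. The field names `keys` and `values`, the helper `build_template_entry`, the sample template content, and the choice to render None cells as empty strings inside a key are assumptions, not details taken from the app.

```python
SEPARATOR = "__"  # double underscore between column values


def build_template_entry(rows):
    """Hypothetical helper: derive row keys and values for one template.

    rows is a list of template rows, each a list of transformed cell values.
    """
    keys = [
        SEPARATOR.join("" if cell is None else str(cell) for cell in row)
        for row in rows
    ]
    return {"keys": keys, "values": [list(row) for row in rows]}


# Entries are keyed by <table_type>_<template_number>; the content is made up.
templates = {
    "DLComposition_1": build_template_entry(
        [
            ["Component", None, "<num>", "<num>"],
            [None, None, "psia", "psia"],
        ]
    )
}
```

Each key is a single string per template row, which is what makes a plain substring search usable during matching, while the values list keeps the individual cells available.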
Template matching¶
The output of the file import process (either from Excel or PDF) is a table of data with generic column names (col1, col2, etc.). Within these tables, some rows contain the actual column names as they appear in the original document, some rows contain the contents of tables in the original report, and others contain individual values of interest.
Template matching is the process of determining if one or more template patterns appear in a table of data. This is done by attempting to match all column values of each template row to contiguous rows and columns in the data table.
Matching Process¶
A ‘key’ column is added to the target dataframe, where all row values are transformed and concatenated into a single string using the same logic that generated the template keys.
Templates are sorted by row length (number of terms in the template row) and number of rows (during configuration loading), so that the most complex templates are matched first.
The process iterates through the templates and data table rows, attempting to find the first key of the template in the key of each table row. If a match is found, it attempts to match subsequent template rows, if any.
If all of the template row matches are found in the keys of consecutive table rows - starting from the same table column - the template is considered matched.
The table row keys are updated to remove the portions that matched the template keys, and the process continues to attempt matching for other templates. In this way, multiple templates can be matched in a single original table, with the most complex templates taking precedence and without overlap between the matched regions.
The first and last row and column numbers are recorded for use in table refinement.
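The matching steps above can be sketched as follows. To keep column offsets explicit and the example dependency-free, this sketch compares lists of transformed cell values rather than the concatenated key strings that the app searches, and it marks matched cells with a sentinel instead of trimming the key string. The function names, the data shapes, and the assumption that all rows of a template have the same width are illustrative only.

```python
CONSUMED = object()  # sentinel for cells already claimed by a template


def find_run(row, pattern, start=0):
    """Return the first column >= start where pattern occurs contiguously, else -1."""
    width = len(pattern)
    for col in range(start, len(row) - width + 1):
        if row[col:col + width] == pattern:
            return col
    return -1


def match_templates(table, templates):
    """Hypothetical matcher.

    table: list of equal-length rows of transformed cell values.
    templates: dict of template name -> list of template rows.
    Returns (name, first_row, last_row, first_col, last_col) tuples.
    """
    table = [list(row) for row in table]  # work on a copy
    # Most complex first: widest template rows, then most template rows.
    ordered = sorted(
        templates.items(),
        key=lambda item: (len(item[1][0]), len(item[1])),
        reverse=True,
    )
    matches = []
    for name, t_rows in ordered:
        width = len(t_rows[0])
        for r in range(len(table) - len(t_rows) + 1):
            col = find_run(table[r], t_rows[0])
            while col != -1:
                # Later template rows must match consecutive table rows
                # starting from the same column.
                if all(
                    table[r + i][col:col + width] == t_rows[i]
                    for i in range(1, len(t_rows))
                ):
                    break
                col = find_run(table[r], t_rows[0], col + 1)
            if col == -1:
                continue
            # Claim the matched cells so less complex templates cannot overlap.
            for i in range(len(t_rows)):
                table[r + i][col:col + width] = [CONSUMED] * width
            matches.append((name, r, r + len(t_rows) - 1, col, col + width - 1))
    return matches
```

The recorded row and column bounds correspond to the values that are later used for table refinement.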
Warning
It is possible that the same template pattern could apply to more than one test type.
E.g. CVDComposition and DLComposition share some common columns. This is not currently handled, and the first matching template will set the table type.
Elsewhere in the app, there is logic (see get_previous_primary_table in fdReport) that finds the first main table type (e.g. cvdTest, dlTest, separatorTest) prior to the current table.
If a template is matched (e.g. DLComposition) that does not align with the previous primary table type (e.g. cvdTest), then that match could be rejected.
Test-Specific Template Processing¶
Additional processing is performed depending on the table type. This may include:
Truncating the table after two consecutive empty rows.
Truncating the table if the first column transitions from numeric to non-numeric values.
This is typically used for tables that contain step pressure.
Truncating the table based on row values matching text strings in the stop text collection in the configuration cache.
Removing empty rows.
Flattening the data from matrix-like to record format, where each row contains a single test result.
Fixing ambiguous column names.
Composition tables often have columns for the symbol and name of the components, but only a single column label.
‘symbol’ and ‘component’ labels are added to the column names to disambiguate them.
Complex custom processing as needed.
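A few of the steps listed above are simple enough to sketch on plain lists of rows. The helper names, the default limit of two empty rows as a parameter, the treatment of an empty first-column cell as non-numeric, and the record layout produced by `flatten` are assumptions made for illustration; the app's own processing is table-type specific and may differ.

```python
def is_empty(row):
    """True if every cell in the row is None or blank."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def truncate_after_empty_rows(rows, limit=2):
    """Keep rows up to the first run of `limit` consecutive empty rows."""
    run = 0
    for i, row in enumerate(rows):
        run = run + 1 if is_empty(row) else 0
        if run == limit:
            return rows[: i - limit + 1]
    return rows


def is_numeric(cell):
    """True if the cell parses as a number (thousands separators allowed)."""
    try:
        float(str(cell).replace(",", ""))
        return True
    except ValueError:
        return False


def truncate_at_non_numeric(rows):
    """Cut the table where the first column turns non-numeric after numeric rows."""
    seen_numeric = False
    for i, row in enumerate(rows):
        if is_numeric(row[0]):
            seen_numeric = True
        elif seen_numeric:
            return rows[:i]
    return rows


def flatten(header, rows):
    """Turn a matrix-like table into one record per (row label, column) pair."""
    records = []
    for row in rows:
        for name, value in zip(header[1:], row[1:]):
            records.append({header[0]: row[0], "column": name, "value": value})
    return records
```

For a step-pressure table, `truncate_at_non_numeric` keeps any leading heading rows and stops at the first row whose first cell is no longer a pressure value.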
Table Refinement¶
When a template is matched, the data table is refined as follows:
Rows above the first matched template row are removed.
Columns before and after the matched template columns are removed.
Column values of the matched template rows are concatenated into single values, and promoted to be the table column names.
Test-specific processing is applied (if enabled), based on the identified table type. This typically involves truncation of the table based on specific rules and may include additional transformations to the data.
Empty rows are removed.
If multiple templates were matched to the original table:
Tables are truncated at the start of the next template if any overlap remains.
New report tables are created for each template match after the first.
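The cropping and heading-promotion steps can be sketched as below, using the row and column bounds recorded during matching. The function name `refine_table`, joining heading cells with a space, and skipping blank heading cells are assumptions; test-specific processing and the handling of multiple matched templates are left out of this sketch.

```python
def refine_table(table, first_row, last_row, first_col, last_col):
    """Hypothetical helper: crop a raw table to a matched template.

    The bounds are inclusive and refer to the matched template rows/columns.
    Returns (column_names, body_rows).
    """
    # Drop rows above the template and columns outside it.
    cropped = [row[first_col:last_col + 1] for row in table[first_row:]]
    n_head = last_row - first_row + 1
    heading_rows, body = cropped[:n_head], cropped[n_head:]
    # Concatenate the heading cells of each column into one column name.
    columns = [
        " ".join(str(cell) for cell in col_cells if cell not in (None, ""))
        for col_cells in zip(*heading_rows)
    ]
    # Remove empty rows from the body.
    body = [row for row in body if any(cell not in (None, "") for cell in row)]
    return columns, body
```

For example, a two-row heading of `Pressure` over `psia` becomes the single column name `Pressure psia` under these assumptions.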