OCR Process

The application leverages OCR (Optical Character Recognition) to extract text from PDFs and images.

Why OCR?

OCR is essential for converting scanned documents and images into machine-readable text, enabling further data processing and analysis. Many modern PDFs contain queryable text annotations, but there is no guarantee that the text is correct or complete. As such, OCR is applied to every document, providing a more reliable (albeit slower) method of extracting text that works with both modern and legacy documents.

OCR Models

The app uses the open-source OCR library docTR, which provides a flexible framework for building OCR pipelines with publicly available, pre-trained models.

The docTR library provides a PyTorch-based interface for end-to-end OCR via a two-stage process of text detection and text recognition, each with its own set of available models. The current implementation uses the db_resnet50 model for text detection and the master model for text recognition (see perform_ocr in ocr).
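
For reference, a predictor with these models is typically constructed in docTR as follows (a minimal sketch; the exact arguments used in perform_ocr may differ):

from doctr.models import ocr_predictor

# Build an end-to-end predictor from pre-trained weights:
# db_resnet50 performs text detection, master performs text recognition.
predictor = ocr_predictor(
    det_arch="db_resnet50",
    reco_arch="master",
    pretrained=True,
)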

Note

The current OCR models have not been fine-tuned for the specific use case of this app, and as such, they can struggle with certain edge cases such as super/subscripted and handwritten text. Substantial improvements to result accuracy could be made by fine-tuning the models on an annotated dataset of the specific use case.

OCR Model Call

The bulk of the OCR logic is handled under the hood by the docTR library, specifically in the ocr_predictor function call at line 82 of ocr. The only work on our end is to ensure the input is correctly formatted, the arguments are set properly, and the output is correctly parsed.

Input Formatting:

The input to the OCR model is a docTR-specific DocumentFile object, which is created from either a PDF or an image file.
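
For example, docTR builds these objects from either source type (the paths below are placeholders):

from doctr.io import DocumentFile

# From a PDF: each page is rendered to an image array.
pdf_doc = DocumentFile.from_pdf("report.pdf")

# From one or more standalone image files.
img_doc = DocumentFile.from_images(["scan_page1.png"])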

Function Call:

The OCR model is built with the ocr_predictor function; the resulting predictor takes the formatted input and produces the OCR results, which are returned as a dictionary.

Consult the docTR documentation for more details on the available parameters and their effects on the OCR process (the current values shouldn’t need to change, but feel free to experiment).
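
Putting the pieces together, a minimal end-to-end call might look like this (result.export() is the standard docTR way to obtain a plain dictionary; the actual parsing in perform_ocr may differ):

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

predictor = ocr_predictor(det_arch="db_resnet50", reco_arch="master", pretrained=True)
doc = DocumentFile.from_pdf("report.pdf")

# Run detection and recognition, then export to a nested dictionary.
result = predictor(doc)
ocr_dict = result.export()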

Note

At the end of the model initialization (line 81 of ocr), the model is set to run on the currently configured PyTorch device (CPU or GPU).

Output Parsing:

The output of the OCR model is a dictionary containing the detected text, bounding boxes, and other metadata, but we are only interested in the text and bounding boxes (the confidence scores may be useful in the future).

The extract_words_from_page function in ocr parses this output and extracts the relevant information, returning a list of dictionaries containing the text and bounding box coordinates for each detected word (see line 158 of ocr_service for the implementation).

Working with OCR Results

The OCR process generates several types of output that are stored in the application session and used for further processing:

Raw OCR Output Structure

The perform_ocr function returns a structured dictionary containing the following hierarchy:

{
    pages: {
        0: {
            blocks: {
                0: {
                    lines: {
                        0: {
                            words: {
                                0: {
                                    text: "extracted text",
                                    confidence: 0.95,
                                    geometry: {
                                        x_min: 100,
                                        y_min: 200,
                                        x_max: 150,
                                        y_max: 250
                                    },
                                },
                                ...
                            },
                            confidence: ...,
                            geometry: {...},
                        },
                        ...
                    },
                    confidence: ...,
                    geometry: {...},
                },
                ...
            },
            dimensions: {...},
            page_idx: ...,
        },
        ...
    }
}

This nested structure contains:

  • pages: Top-level dictionary keyed by page number (contains page-level data such as dimensions and index)

  • blocks: Contains detected text blocks (page segments which contain lines of text, analogous to paragraphs)

  • lines: Contains individual lines of text within each block

  • words: Contains individual words with their text, confidence scores, and bounding box coordinates

Text Extraction and Coordinate Processing

The raw OCR output undergoes several processing steps:

Word Extraction: The extract_words_from_page function processes each page to create a list of word dictionaries (a parsing sketch follows this list) containing:

  • text: String containing the recognized text content

  • confidence: OCR confidence score (0.0 to 1.0)

  • geometry: Absolute coordinates of the word bounding box

  • page_num: Page number where the word was found
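
A sketch of how such a word list can be flattened out of the dictionary shown earlier (the real extract_words_from_page may differ in naming and coordinate handling):

def extract_words_from_page(page_dict, page_num):
    # Walk blocks -> lines -> words and flatten into a single list.
    words = []
    for block in page_dict["blocks"].values():
        for line in block["lines"].values():
            for word in line["words"].values():
                words.append({
                    "text": word["text"],
                    "confidence": word["confidence"],
                    "geometry": word["geometry"],
                    "page_num": page_num,
                })
    return words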

Coordinate Transformation: Since OCR operates directly on PDF files while the table detection works with image coordinates, the CoordinateTransformer class in coord_normalization handles conversion between:

  • PDF coordinate space (original document dimensions)

  • Image coordinate space (processed image dimensions)

This ensures accurate positioning of extracted text relative to the original document layout.
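
As an illustration, the core of such a transform can be a simple rescaling between the two spaces (the actual CoordinateTransformer may also handle rotation, DPI, or origin differences; this helper is hypothetical):

def pdf_to_image(x, y, pdf_size, image_size):
    # Scale a point from PDF space to image space; assumes both
    # spaces share the same origin and axis orientation.
    pdf_w, pdf_h = pdf_size
    img_w, img_h = image_size
    return x * img_w / pdf_w, y * img_h / pdf_h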

Table Detection and Reconstruction

After text extraction, the OCR pipeline identifies and reconstructs table structures. See Table Reconstruction for details on how tables are detected, reconstructed, and stored in the session.

Session Storage and Management

OCR results are stored in multiple session dictionaries for different use cases:

Core Results Storage:

  • session.ocr_results[file_path][page_num]: List of DataFrames for each detected table

  • session.words[file_path][page_num]: All extracted words with coordinate information

  • session.page_words[file_path][page_num]: Raw page-level OCR output

Visualization Data:

  • session.ocr_bboxes[file_path][page_num]: Table bounding boxes for overlay display

  • session.table_row_lines[file_path][page_num]: Row line coordinates for table visualization

  • session.table_column_lines[file_path][page_num]: Column line coordinates for table visualization

Note

While the tools for visualizing the OCR results are implemented, they are currently not exposed in the user interface.

Processed Output:

  • session.json_results_ocr[file_path]: Structured JSON format

  • session.csv_ocr[file_path]: CSV export of all extracted tables

  • session.processed_tables[file_path][page_num]: Fully processed table objects

Statistics and Monitoring:

  • session.ocr_stats[file_path]: Processing metrics including timing, table counts, and accuracy measures
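
For orientation, a small sketch of how these nested session dictionaries are typically read (session stands in for the application's session object; the helper below is hypothetical):

def summarize_tables(session, file_path):
    # Count detected tables per page for one processed file.
    counts = {}
    for page_num, tables in session.ocr_results[file_path].items():
        counts[page_num] = len(tables)
    return counts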

Data Flow and Integration

The OCR results integrate with the broader application workflow:

Report Object Creation: OCR data populates ReportData objects (an illustrative shape is sketched after this list) containing:

  • Document metadata and page information

  • Complete word-level text extraction

  • Table objects with raw and processed data

  • Statistical information about the extraction process
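
The field names below are illustrative only, inferred from the list above rather than taken from the actual ReportData definition:

from dataclasses import dataclass, field

@dataclass
class ReportData:  # illustrative shape, not the real class
    metadata: dict = field(default_factory=dict)   # document and page info
    words: list = field(default_factory=list)      # word-level extraction
    tables: list = field(default_factory=list)     # raw and processed tables
    stats: dict = field(default_factory=dict)      # extraction statistics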

Term Processing: Extracted text undergoes terminology correction using the TermProcessor (sketched after this list) to:

  • Fix common OCR errors using domain-specific dictionaries

  • Standardize technical terminology

  • Improve data quality for downstream analysis
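
A minimal sketch of dictionary-based term correction, assuming a simple word-for-word replacement scheme (the real TermProcessor may use fuzzy matching or context rules; the example dictionary entries are made up):

def correct_terms(words, term_dict):
    # Replace known OCR misreads with their corrected forms,
    # e.g. term_dict = {"rnm": "mm", "ternperature": "temperature"}.
    return [term_dict.get(w, w) for w in words]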

Note

Term processing is currently disabled, but can be enabled if desired.

User Interface Integration

OCR results are presented to users through multiple interface components:

Processing Statistics: Summary metrics are displayed, including:

  • Total processing time and pages processed

  • Average tables per page

  • Total tables detected in document

Progress Feedback: Real-time progress updates during OCR processing show:

  • Current processing stage (OCR analysis, table detection, data extraction)

  • Page-by-page progress indicators

  • Completion status for each processing step

Performance Optimization

The OCR implementation includes several performance optimizations:

Caching Strategy: Processed images and intermediate results are cached (a minimal example follows this list) to:

  • Avoid redundant processing during user interactions

  • Speed up table editing and refinement operations

  • Reduce computational overhead for large documents
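
One common way to implement such caching in Python, keyed by file and page (the application's actual cache may be structured differently; render_pdf_page is a hypothetical renderer):

from functools import lru_cache

@lru_cache(maxsize=128)
def get_page_image(file_path: str, page_num: int):
    # Render (or load) the page image once; repeated calls with the
    # same (file_path, page_num) pair return the cached result.
    return render_pdf_page(file_path, page_num)  # hypothetical helper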

Hardware Acceleration: The system automatically detects and utilizes available hardware (see the sketch after this list):

  • GPU acceleration via CUDA (Windows/Linux) or MPS (macOS)

  • Fallback to CPU processing when GPU is unavailable

  • Optimized model loading and memory management
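
The device selection described above typically reduces to a few lines of PyTorch (a sketch, not necessarily the app's exact logic):

import torch

def pick_device() -> torch.device:
    # Prefer CUDA, then Apple MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

Since the docTR predictor is a standard PyTorch module, it can then be moved onto the selected device with predictor.to(pick_device()).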