OCR Process
===========

The application leverages OCR (Optical Character Recognition) to extract text from PDFs and images.

Why OCR?
--------

OCR is essential for converting scanned documents and images into machine-readable text, enabling further data processing and analysis. Many modern PDFs contain queryable text annotations, but there is no guarantee that the text is correct or complete. As such, OCR is used universally to provide a more reliable (albeit slower) method of extracting text that works with both modern and legacy documents.

OCR Models
----------

The app uses the open-source OCR library docTR, which provides a flexible framework for building OCR pipelines with publicly available, pre-trained models. The docTR library provides a PyTorch-based interface for end-to-end OCR via a two-stage process of text detection and recognition, each with its own set of available models. The current implementation uses the ``db_resnet50`` model for text detection and the ``master`` model for text recognition (see ``perform_ocr`` in :ref:`ocr`).

.. note::
   The current OCR models have not been fine-tuned for the specific use case of this app, and as such, they can struggle with certain edge cases such as super/subscripted and handwritten text. Substantial improvements to result accuracy could be made by fine-tuning the models on an annotated dataset for this use case.

OCR Model Call
--------------

The bulk of the OCR logic is handled under the hood by the docTR library, specifically in the ``ocr_predictor`` function call at line 82 of :ref:`ocr`. The only work on our end is to ensure the input is correctly formatted, the arguments are set properly, and the output is correctly parsed.

**Input Formatting**: The input to the OCR model is a docTR-specific ``DocumentFile`` object, which is created from either a PDF or an image file.
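The input-formatting step amounts to routing each file to the right ``DocumentFile`` constructor. The sketch below illustrates that dispatch; ``pick_loader`` is a hypothetical helper (not part of the app), and the returned names mirror docTR's ``DocumentFile.from_pdf`` and ``DocumentFile.from_images`` class methods.

```python
from pathlib import Path

def pick_loader(path: str) -> str:
    """Choose which DocumentFile constructor a given input should use.

    Hypothetical helper for illustration; the accepted image extensions
    here are an assumption, not the app's actual whitelist.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "from_pdf"
    if suffix in {".png", ".jpg", ".jpeg", ".bmp", ".tiff"}:
        return "from_images"
    raise ValueError(f"unsupported input type: {suffix}")
```

The real code would then call, e.g., ``DocumentFile.from_pdf(path)`` and pass the result to the predictor.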
**Function Call**: The OCR model is called with the ``ocr_predictor`` function, which takes the formatted input and returns a dictionary containing the OCR results. Consult the docTR documentation for more details on the available parameters and their effects on the OCR process (the current values shouldn't need to change, but feel free to experiment).

.. note::
   At the end of the model initialization (line 81 of :ref:`ocr`), the model is set to run on the currently configured PyTorch device (CPU or GPU).

**Output Parsing**: The output of the OCR model is a dictionary containing the detected text, bounding boxes, and other metadata, but we are only interested in the text and bounding boxes (the confidence score may be useful in the future). The ``extract_words_from_page`` function in :ref:`ocr` is used to parse the output and extract the relevant information, returning a list of dictionaries containing the text and bounding box coordinates for each detected word (line 158 of :ref:`ocr_service` for the implementation).

Working with OCR Results
------------------------

The OCR process generates several types of output that are stored in the application session and used for further processing:

Raw OCR Output Structure
~~~~~~~~~~~~~~~~~~~~~~~~

The ``perform_ocr`` function returns a structured dictionary containing the following hierarchy:

.. code-block:: text

   {
       pages: {
           0: {
               blocks: {
                   0: {
                       lines: {
                           0: {
                               words: {
                                   0: {
                                       text: "extracted text",
                                       confidence: 0.95,
                                       geometry: {
                                           x_min: 100,
                                           y_min: 200,
                                           x_max: 150,
                                           y_max: 250
                                       },
                                   },
                                   ...
                               },
                               confidence: ...,
                               geometry: {...},
                           },
                           ...
                       },
                       confidence: ...,
                       geometry: {...},
                   },
                   ...
               },
               dimensions: {...},
               page_idx: ...,
           },
           ...
       }
   }

This nested structure contains:

- **pages**: Top-level dictionary keyed by page number (contains page-level data such as dimensions and index)
- **blocks**: Contains detected text blocks (page segments which contain lines of text, analogous to paragraphs)
- **lines**: Contains individual lines of text within each block
- **words**: Contains individual words with their text, confidence scores, and bounding box coordinates

Text Extraction and Coordinate Processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The raw OCR output undergoes several processing steps:

**Word Extraction**: The ``extract_words_from_page`` function processes each page to create a list of word dictionaries containing:

- ``text``: String containing the recognized text content
- ``confidence``: OCR confidence score (0.0 to 1.0)
- ``geometry``: Absolute coordinates of the word bounding box
- ``page_num``: Page number where the word was found

**Coordinate Transformation**: Since OCR operates directly on PDF files while table detection works with image coordinates, the ``CoordinateTransformer`` class in :ref:`coord_normalization` handles conversion between:

- PDF coordinate space (original document dimensions)
- Image coordinate space (processed image dimensions)

This ensures accurate positioning of extracted text relative to the original document layout.

Table Detection and Reconstruction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After text extraction, the OCR pipeline identifies and reconstructs table structures. See :doc:`table_reconstruction` for details on how tables are detected, reconstructed, and stored in the session.
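The word-extraction step described earlier can be sketched by walking the nested page hierarchy. This is a hypothetical re-implementation based solely on the documented output structure; the app's actual ``extract_words_from_page`` (line 158 of the OCR service) may differ in detail.

```python
def extract_words_from_page(page: dict, page_num: int) -> list[dict]:
    """Flatten one page of the OCR output into a list of word records.

    Illustrative sketch: assumes the index-keyed dict hierarchy shown
    in the "Raw OCR Output Structure" section above.
    """
    words = []
    for block in page["blocks"].values():
        for line in block["lines"].values():
            for word in line["words"].values():
                words.append({
                    "text": word["text"],
                    "confidence": word["confidence"],
                    "geometry": word["geometry"],
                    "page_num": page_num,
                })
    return words
```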
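The coordinate transformation described earlier can be illustrated as a pure rescale between the two spaces. This is a minimal sketch assuming both spaces share an origin and differ only by scale; the real ``CoordinateTransformer`` may also handle origin flips or DPI settings, and ``pdf_to_image_coords`` is a hypothetical function name.

```python
def pdf_to_image_coords(box: dict, pdf_size: tuple, image_size: tuple) -> dict:
    """Rescale a bounding box from PDF space to image space.

    box: dict with x_min/y_min/x_max/y_max in PDF coordinates.
    pdf_size, image_size: (width, height) of each coordinate space.
    """
    sx = image_size[0] / pdf_size[0]   # horizontal scale factor
    sy = image_size[1] / pdf_size[1]   # vertical scale factor
    return {
        "x_min": box["x_min"] * sx,
        "y_min": box["y_min"] * sy,
        "x_max": box["x_max"] * sx,
        "y_max": box["y_max"] * sy,
    }
```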
Session Storage and Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OCR results are stored in multiple session dictionaries for different use cases:

**Core Results Storage**:

- ``session.ocr_results[file_path][page_num]``: List of DataFrames for each detected table
- ``session.words[file_path][page_num]``: All extracted words with coordinate information
- ``session.page_words[file_path][page_num]``: Raw page-level OCR output

**Visualization Data**:

- ``session.ocr_bboxes[file_path][page_num]``: Table bounding boxes for overlay display
- ``session.table_row_lines[file_path][page_num]``: Row line coordinates for table visualization
- ``session.table_column_lines[file_path][page_num]``: Column line coordinates for table visualization

.. note::
   While the tools for visualizing the OCR results are implemented, they are currently not exposed in the user interface.

**Processed Output**:

- ``session.json_results_ocr[file_path]``: Structured JSON format
- ``session.csv_ocr[file_path]``: CSV export of all extracted tables
- ``session.processed_tables[file_path][page_num]``: Fully processed table objects

**Statistics and Monitoring**:

- ``session.ocr_stats[file_path]``: Processing metrics including timing, table counts, and accuracy measures

Data Flow and Integration
~~~~~~~~~~~~~~~~~~~~~~~~~

The OCR results integrate with the broader application workflow:

**Report Object Creation**: OCR data populates ``ReportData`` objects containing:

- Document metadata and page information
- Complete word-level text extraction
- Table objects with raw and processed data
- Statistical information about the extraction process

**Term Processing**: Extracted text undergoes terminology correction using the ``TermProcessor`` to:

- Fix common OCR errors using domain-specific dictionaries
- Standardize technical terminology
- Improve data quality for downstream analysis

.. note::
   Term processing is currently disabled, but can be enabled if desired.
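The per-file, per-page session stores above all share one shape: an outer mapping keyed by file path with an inner mapping keyed by page number. A minimal sketch of that layout (``make_page_store`` is a hypothetical helper; the real session object may use a different container):

```python
from collections import defaultdict

def make_page_store():
    """Return a file_path -> {page_num: value} mapping whose inner dicts
    are created on first access (a common pattern for session stores)."""
    return defaultdict(dict)

# Example: populating one store the way session.words is described above.
words = make_page_store()
words["report.pdf"][0] = [{"text": "Total", "page_num": 0}]
```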
User Interface Integration
~~~~~~~~~~~~~~~~~~~~~~~~~~

OCR results are presented to users through multiple interface components:

**Processing Statistics**: Displayed processing metrics include:

- Total processing time and pages processed
- Average tables per page
- Total tables detected in document

**Progress Feedback**: Real-time progress updates during OCR processing show:

- Current processing stage (OCR analysis, table detection, data extraction)
- Page-by-page progress indicators
- Completion status for each processing step

Performance Optimization
~~~~~~~~~~~~~~~~~~~~~~~~

The OCR implementation includes several performance optimizations:

**Caching Strategy**: Processed images and intermediate results are cached to:

- Avoid redundant processing during user interactions
- Speed up table editing and refinement operations
- Reduce computational overhead for large documents

**Hardware Acceleration**: The system automatically detects and utilizes available hardware:

- GPU acceleration via CUDA (Windows/Linux) or MPS (macOS)
- Fallback to CPU processing when GPU is unavailable
- Optimized model loading and memory management
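The caching strategy can be illustrated with a memoized per-page function keyed by ``(file_path, page_num)``, so repeated user interactions reuse prior work. ``process_page`` below is a stand-in for the app's real processing routine, not its actual implementation.

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the expensive path actually runs

@lru_cache(maxsize=128)
def process_page(file_path: str, page_num: int) -> str:
    """Stand-in for an expensive OCR/table-processing step."""
    calls["count"] += 1
    return f"processed {file_path} page {page_num}"

process_page("report.pdf", 0)
process_page("report.pdf", 0)  # second call is served from the cache
```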