Metadata only
Date
2024-02-05
Type
- Conference Paper
ETH Bibliography
yes
Abstract
Information in industry, research, and the public sector is widely stored as rendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks, systems are needed that map rendered documents onto a structured hierarchical format. However, existing systems for this task are limited by heuristics and are not end-to-end trainable. In this work, we introduce the Document Structure Generator (DSG), a novel system for document parsing that is fully end-to-end trainable. DSG combines a deep neural network for parsing (i) entities in documents (e.g., figures, text blocks, headers) and (ii) relations that capture the sequence and nested structure between entities. Unlike existing systems that rely on heuristics, our DSG is trained end-to-end, making it effective and flexible for real-world applications. We further contribute a new, large-scale dataset called E-Periodica comprising real-world magazines with complex document structures for evaluation. Our results demonstrate that our DSG outperforms commercial OCR tools and, on top of that, achieves state-of-the-art performance. To the best of our knowledge, our DSG system is the first end-to-end trainable system for hierarchical document parsing.
Publication status
published
Book title
2023 IEEE International Conference on Data Mining (ICDM)
Publisher
IEEE
Subject
Information Extraction; Parsing; Data Mining; Document Analysis
Organisational unit
02150 - Dep. Informatik / Dep. of Computer Science
Funding
200021_156011 - Hierarchical carbon-fiber composites with tailored interphase obtained via electrophoretic deposition of magnetized and funtionalized carbon nanotubes (SNF)
184628 - EASEML: Toward a More Accessible and Usable Machine Learning Platform for Non-expert Users (SNF)
197485 - Governance and legal framework for managing artificial intelligence (AI) (SNF)
187132 - Machine‐based Scoring of a Neuropsychological Test: The Rey‐Osterrieth Complex Figure (SNF)
957407 - Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning (EC)
Notes
Conference lecture held on December 4, 2023.