From Pixels to Patterns: Learning the Visual Grammar of Document Layouts

Sanket Biswas defended his PhD thesis on November 7, 2025.

What is the thesis about?

Understanding the visual and structural language of documents is central to Document AI. This thesis explores the hypothesis that layout acts as a latent language — a structured grammar that governs how information is arranged and interpreted in visually rich documents. Departing from traditional OCR-centric pipelines, we investigate layout-aware approaches across three interlinked axes: Interpretation, Representation, and Generation.

In the Interpretation axis, we introduce transformer-based segmentation frameworks, including SwinDocSegmenter and its semi-supervised extension SemiDocSeg, enabling precise instance-level parsing in both high-resource and low-resource settings.

For Representation, we develop self-supervised and graph-based models such as SelfDocSeg and Doc2GraphFormer, learning robust, task-agnostic embeddings that capture visual, spatial, and relational cues without reliance on annotated data.

In the Generation axis, we propose layout-conditioned generative frameworks — DocSynth, DocSynthv2, and SketchGPT — that model documents as sequences of layout primitives and enable controllable synthesis, sketch completion, and document design.
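To make the idea of a document as a sequence of layout primitives concrete, here is a minimal sketch in Python. It is an illustration only, not the tokenization used by DocSynth, DocSynthv2, or SketchGPT: the `LayoutBox` type, the `quantize` and `layout_to_tokens` helpers, the 128-bin coordinate grid, the reading-order sort, and the `<bos>`/`<eos>` markers are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutBox:
    """Hypothetical layout primitive: a class label plus a box normalized to [0, 1]."""
    label: str
    x: float
    y: float
    w: float
    h: float


def quantize(value: float, bins: int = 128) -> int:
    """Map a normalized coordinate to a discrete bin index in [0, bins - 1]."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * bins), bins - 1)


def layout_to_tokens(boxes: list[LayoutBox], bins: int = 128) -> list[str]:
    """Flatten a page layout into a token sequence (assumed format, for illustration)."""
    tokens = ["<bos>"]
    # Sort top-to-bottom, then left-to-right, as a rough proxy for reading order.
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        tokens.append(f"<{box.label}>")
        tokens.extend(str(quantize(v, bins)) for v in (box.x, box.y, box.w, box.h))
    tokens.append("<eos>")
    return tokens


page = [
    LayoutBox("paragraph", 0.10, 0.20, 0.80, 0.50),
    LayoutBox("title", 0.10, 0.05, 0.80, 0.10),
]
tokens = layout_to_tokens(page)
print(tokens)
```

Once a layout is flattened this way, a sequence model could in principle be trained to continue or complete it, which is the sense in which layout can be treated like a language.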

The collective contributions of this thesis establish a unified perspective of layout as both signal and structure, enabling end-to-end systems that not only read but reason and generate with layout awareness. We demonstrate the practical value of these contributions through deployments in real-world document intelligence systems and by proposing new benchmarks for multimodal document reasoning. This work opens new frontiers in treating layout not as noise to be removed, but as a language to be learned.

Keywords

Computer Vision, Pattern Recognition, Document AI, Document Understanding, Layout Understanding, Document Layout Analysis, Vision-Language Models, Instance Segmentation, Self-Supervised Learning, Semi-Supervised Learning, Graph Neural Networks, Document Generation, Layout as Language, Multimodal Reasoning, Structured Document Synthesis