LayoutTransformer: Layout Generation and Completion with Self-attention (Paper Reading)

Kamal Gupta et al., University of Maryland, US

1. Introduction

We address the problem of scene layout generation in domains as varied as images, mobile applications, documents, and 3D objects. Most complex scenes, whether natural or designed, can be expressed as meaningful arrangements of simpler compositional primitives. Generating a new layout or extending an existing one requires understanding the relationships between these primitives. To this end, we propose LayoutTransformer, a novel framework that leverages self-attention to learn contextual relations between layout elements and to generate new layouts in a given domain. Our framework can generate layouts from an empty set or from an initial seed set of primitives, and easily scales to an arbitrary number of primitives per layout. Furthermore, our analysis shows that the model automatically captures semantic properties of the primitives. We propose simple improvements to the representation of layout primitives and to the training procedure, which yield competitive performance in very diverse data domains such as object bounding boxes in natural images (COCO Bounding Boxes), documents (PubLayNet), mobile applications (RICO dataset), and 3D shapes (PartNet).

2. Overall Idea

The idea is analogous to language modeling in NLP: each layout element plays the role of a word, and the model predicts the next element from the elements that came before it.

3. Method

In this section, we describe our attention network for the problem of layout generation. We first discuss how layout primitives are represented in different domains. Next, we present the LayoutTransformer framework and show how Transformers can be used to model the probability distribution of layouts. Masked multi-head self-attention enables us to learn non-local semantic relations among layout primitives, and also gives us the flexibility to handle variable-length layouts.

Given a dataset of layouts, a layout instance can be defined as a graph G with n nodes, where each node i ∈ {1, . . . , n} is a graphical primitive. We assume the graph is fully connected and let the attention network learn the relationships between nodes. Nodes can carry structural or semantic information. For each node, we project its associated information into a d-dimensional space, denoted by the feature vector s_i. Note that this information can be discrete (e.g., a part category), continuous (e.g., color), or a multidimensional vector on some manifold (e.g., the signed distance function of a part). Specifically, in our ShapeNet experiments we use a multi-layer perceptron (MLP) to project part embeddings into the d-dimensional space, while in our 2D layout experiments we use a learned d-dimensional category embedding, which is equivalent to projecting a one-hot encoded category vector into the latent space with a zero-bias linear layer.
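A minimal PyTorch sketch (my own illustration, not the authors' code) of how a discrete category field could be projected into the shared d-dimensional space; `num_categories` and `d_model` are illustrative parameters:

```python
import torch
import torch.nn as nn

class PrimitiveEmbedding(nn.Module):
    """Projects per-node category information into a shared d-dimensional space.

    A learned embedding table is equivalent to applying a zero-bias linear layer
    to a one-hot encoded category vector; continuous fields would instead pass
    through a small MLP.
    """
    def __init__(self, num_categories: int, d_model: int = 512):
        super().__init__()
        self.category_emb = nn.Embedding(num_categories, d_model)

    def forward(self, category_ids: torch.Tensor) -> torch.Tensor:
        # category_ids: (batch, n) integer class indices -> (batch, n, d_model)
        return self.category_emb(category_ids)
```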

Representing geometry with discrete variables: We apply 8-bit uniform quantization to each geometric field and model it with a categorical distribution. Discretizing continuous signals has previously been used in image generation but, to the best of our knowledge, has not been explored for layout modeling. We observe that, even though discretizing coordinates introduces approximation error, it lets us represent arbitrary distributions, which is especially important for layouts with strong symmetries such as documents and app wireframes. We independently project the geometric fields of each primitive into the same d-dimensional space, so the i-th primitive in R^2 is represented as (s_i, x_i, y_i, h_i, w_i). We then concatenate all primitives into a single flat sequence of their parameters. We also add two additional learned embeddings, s⟨bos⟩ and s⟨eos⟩, denoting the beginning and the end of the sequence. The layout can now be represented by a sequence of 5n + 2 latent vectors.
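A small sketch of what 8-bit uniform quantization of a geometric field could look like; the function names and the assumption that coordinates are normalized to [0, 1] are my own, not taken from the paper:

```python
import numpy as np

def quantize(values: np.ndarray, v_min: float, v_max: float, n_bits: int = 8) -> np.ndarray:
    """Uniformly quantize a continuous geometric field into 2**n_bits bins."""
    n_bins = 2 ** n_bits
    values = np.clip(values, v_min, v_max)
    # Map [v_min, v_max] -> integer bin indices {0, ..., n_bins - 1}.
    idx = np.floor((values - v_min) / (v_max - v_min) * (n_bins - 1) + 0.5)
    return idx.astype(np.int64)

def dequantize(idx: np.ndarray, v_min: float, v_max: float, n_bits: int = 8) -> np.ndarray:
    """Map bin indices back to approximate continuous values."""
    n_bins = 2 ** n_bits
    return v_min + idx.astype(np.float32) / (n_bins - 1) * (v_max - v_min)

# Example: quantize normalized box coordinates in [0, 1] to 256 levels.
xs = np.array([0.0, 0.37, 0.999])
print(quantize(xs, 0.0, 1.0))                         # [  0  94 255]
print(dequantize(quantize(xs, 0.0, 1.0), 0.0, 1.0))   # approximate reconstruction
```

Each quantized coordinate then becomes a discrete token whose distribution is modeled categorically, just like the element category.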

[Figure: LayoutTransformer architecture.] LayoutTransformer takes a layout element as input and predicts the next layout element as output. During training we use teacher forcing, i.e., the ground-truth layout tokens are fed as input to a multi-head decoder block. The first layer of this block is a masked self-attention layer, which ensures the model only attends to previous elements when predicting the current one. A special ⟨bos⟩ token is added at the beginning of each layout and an ⟨eos⟩ token at the end, after which shorter layouts are padded.
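The sketch below illustrates the causal (look-ahead) mask behind masked self-attention with teacher forcing. It uses a plain `nn.TransformerEncoderLayer` as a stand-in for the paper's decoder block, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where True marks positions a token may NOT attend to
    (everything after itself), matching PyTorch's attn_mask convention."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Teacher forcing: feed the ground-truth token sequence and rely on the mask
# to hide future elements from each position.
d_model, n_heads, batch = 512, 8, 4
seq_len = 5 * 10 + 2                      # 5n + 2 tokens for a layout with n = 10 primitives
decoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
tokens = torch.randn(batch, seq_len, d_model)   # embedded layout sequence
out = decoder_layer(tokens, src_mask=causal_mask(seq_len))
```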

Given K initially visible primitives (K can be 0 when generating from scratch), our attention-based model takes as input a random permutation π = (π1, . . . , πK) of the visible nodes, yielding a sequence of d-dimensional vectors (θ1, . . . , θ5K). We found the decomposition of each primitive into structural and geometric fields to be an important step, because it allows the attention module to explicitly assign weights to each coordinate dimension.
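At inference time, generation and completion reduce to sampling the flattened token sequence one element at a time. A hedged sketch is shown below; the `model(tokens) -> logits` interface, function name, and sampling scheme are assumptions for illustration, not the authors' exact API:

```python
import torch

@torch.no_grad()
def complete_layout(model, seed_tokens, max_len, eos_id, temperature=1.0):
    """Autoregressive layout completion sketch.

    `model` is assumed to map a (1, t) token sequence to (1, t, vocab) logits;
    `seed_tokens` holds <bos> plus any K visible primitives already flattened
    into 5K tokens (K = 0 generates a layout from scratch).
    """
    tokens = seed_tokens.clone()                        # (1, t)
    while tokens.size(1) < max_len:
        logits = model(tokens)[:, -1, :] / temperature  # next-token distribution
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if next_tok.item() == eos_id:                   # stop once <eos> is produced
            break
    return tokens
```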


4. Experiment



Origin blog.csdn.net/qq_43800752/article/details/131131073