# University of Chicago and Tel Aviv University researchers present “Text2Mesh”: a new framework for modifying both the color and geometry of 3D meshes based on a text target

In recent years, neural generative models have been the center of attention for their exceptional ability to create aesthetically appealing graphical content that seems to come out of nowhere. Recent solutions of this type, like VQGAN and, more generally, derivations of Generative Adversarial Networks combined with other deep learning techniques such as OpenAI's CLIP (a joint image-text embedding model), have produced amazing results using very complex and powerful generative techniques. With the advent of NFTs and the application of transformer-based techniques to computer graphics in video games, the hype built up in recent years around generative models may finally drive AI-generated art to meet the growing market demand for entertainment.

The main advantage of generative models, namely their versatility in learning latent representations of given datasets, comes at the cost of higher complexity and lower training success rates. Researchers from the University of Chicago and Tel Aviv University present the Text2Mesh model, which tries to avoid this problem by providing a non-generative method for modifying the appearance of 3D meshes. It uses a "*Neural Style Field*" (NSF), learned with neural techniques, that maps the vertices of an input mesh to an RGB color and a local displacement along their normal direction, based on a text prompt that determines the style and appearance of the result. The model is guided by CLIP's joint text-image embedding space, with appropriate regularization to avoid degenerate solutions.

Previous work has explored the manipulation of *stylistic* features of images or meshes using CLIP: StyleCLIP uses a pre-trained StyleGAN to perform CLIP-guided image editing, VQGAN-CLIP exploits CLIP's joint text-image embedding space to perform text-guided image generation, and CLIPDraw generates text-guided 2D vector graphics. Other approaches modify *stylistic* features of 3D meshes directly: 3DStyleNet, for example, edits the shape content with part-aware low-frequency deformations and synthesizes colors on a texture map, guided by a target mesh, while ALIGNet deforms a template shape into a target shape. Unlike previous models, which rely heavily on the dataset used for training, Text2Mesh handles a wide range of styles, guided by a compact text specification.

Previous attempts to synthesize a style for an input 3D mesh have faced the problem of estimating a *texture parametrization* (imagine having a 2D world map and having to fit it onto a sphere). Text2Mesh sidesteps this problem by generating fine-grained local colors and displacements directly per vertex, so there is no need to estimate a mapping from a 2D domain to the 3D surface.

Text2Mesh uses the Neural Style Field as a "*neural prior*", relying on its *inductive bias* (the tendency of a neural network to "assume" that each sample presented to it shares characteristics with the samples used for training) to steer the results away from the degenerate solutions present in the CLIP embedding space (due to the many *false positives* in the association of images with text).

The vertices, which are low-dimensional (they are represented by 3D vectors), are passed to a Multilayer Perceptron (MLP) that learns the *Neural Style Field*, which acts as a "style map" from each vertex to a color and a displacement along the normal direction. When the mesh has sharp edges or very detailed 3D features, this setup suffers from *spectral bias*, i.e., the tendency of shallow networks to be unable to learn complex or high-frequency functions. Text2Mesh overcomes this problem by using a positional encoding based on *Fourier feature mappings*.

Text2Mesh also exhibits emergent behavior in its ability to style parts of a mesh in accordance with their semantic role: given the input mesh of a human body, for example, each body part is styled appropriately and the resulting styles of the different parts blend seamlessly.

Given a 3D mesh, we can decouple its *content* (the global structure defining the overall shape surface and topology) from its *style* (determined by the color and texture of the object and its fine-grained local geometric details). In this setting, Text2Mesh maps the *content* of a mesh to the *style* that best corresponds to the descriptive text supplied as input along with the 3D mesh.

The model takes as input a mesh, represented as a list of vertex coordinates, and a text prompt describing the required target appearance. The mesh stays fixed throughout the process, and each text prompt naturally requires a separate training run.

First of all, a positional encoding is applied to each vertex of the mesh: for each point *p*, the positional encoding is given by the following Fourier feature mapping:

γ(*p*) = [cos(2π**B***p*), sin(2π**B***p*)]ᵀ

Here, the *Fourier feature mapping* enables the neural network that learns the *Neural style field* to overcome spectral bias and correctly synthesize high-frequency details. In this equation, **B** is a random Gaussian matrix whose entries are sampled from a zero-mean Gaussian distribution with a variance set as a hyperparameter. By adjusting this variance, we change the frequency range of the positional encoding and thus the amount of high-frequency detail in the result.

Typically, a standard deviation of 5.0 is used to sample the **B** matrix.
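The Fourier feature encoding can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation; the function name `fourier_encode` and the matrix size `k` are hypothetical choices for the example.

```python
import math
import random

def fourier_encode(p, B):
    """Map a 3D point p to [cos(2*pi*B@p), sin(2*pi*B@p)].

    B is a k x 3 matrix of zero-mean Gaussian entries; the standard
    deviation used to sample it controls the frequency content of
    the encoding (higher sigma -> higher-frequency detail).
    """
    proj = [2.0 * math.pi * sum(b_i * p_i for b_i, p_i in zip(row, p))
            for row in B]
    return [math.cos(v) for v in proj] + [math.sin(v) for v in proj]

# Sample a random Gaussian matrix B (k rows -> a 2k-dimensional encoding).
random.seed(0)
sigma = 5.0   # the typical standard deviation mentioned above
k = 4         # illustrative; the real model uses a much larger k
B = [[random.gauss(0.0, sigma) for _ in range(3)] for _ in range(k)]

gamma = fourier_encode([0.1, -0.2, 0.3], B)
print(len(gamma))  # 2 * k = 8
```

Each coordinate of the encoding is a sinusoid of the input point, so all values lie in [-1, 1] regardless of the sampled frequencies.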

The resulting positional encoding is then fed to a Multilayer Perceptron with a 256-dimensional input layer, followed by 3 hidden layers of 256 dimensions with ReLU activations. The network then splits into two 256-dimensional branches with ReLU activations: one branch estimates the vertex color while the other estimates the vertex displacement. After a final linear layer, a tanh activation is applied to each branch, with both layers initialized to zero so that the original *content* of the mesh is not modified at initialization. The color prediction, which lies within the output range of the *tanh* activation function (-1, 1), is divided by 2 and added to [0.5, 0.5, 0.5] (neutral gray) so that each color falls in the range (0, 1), which also avoids unwanted solutions in the first iterations of training. The displacement, also within the output range of *tanh*, is instead multiplied by 0.1 so that it falls within the range (-0.1, 0.1), to avoid content-altering deformations. Two meshes are then obtained: the first, called the *stylized mesh*, is produced by applying both colors and vertex displacements to the original mesh, while the other, called the *displacement-only* mesh, is produced by applying only the vertex displacements.
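The two output mappings described above can be written directly from their definitions. This is a sketch of the range-scaling step only (the function names are hypothetical), not of the full MLP:

```python
import math

def to_color(raw_rgb):
    """Map raw branch outputs through tanh into (-1, 1), divide by 2
    and add neutral gray [0.5, 0.5, 0.5], giving colors in (0, 1)."""
    return [math.tanh(c) / 2.0 + 0.5 for c in raw_rgb]

def to_displacement(raw):
    """Map a raw branch output through tanh and scale by 0.1 so the
    displacement stays in (-0.1, 0.1) and cannot alter the content."""
    return math.tanh(raw) * 0.1

print(to_color([0.0, 2.0, -2.0]))  # first channel is exactly 0.5 (gray)
print(to_displacement(0.0))        # zero-initialized branch -> 0.0
```

With both output layers initialized to zero, every vertex starts at neutral gray with no displacement, which is exactly why the mesh content is untouched at initialization.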

Now the mesh is rendered at uniform intervals around a sphere using a differentiable renderer, i.e. a renderer that allows gradients to be backpropagated through it so it can be integrated into a neural network training loop. For each render, the CLIP similarity to the target text prompt is computed, and the render with the best similarity is chosen as the *anchor view*.

Additional 2D renderings, in a number set as a hyperparameter (5 in the experiments), are then produced around the *anchor view*, at angles sampled from a Gaussian distribution centered on the anchor with variance π/4. This process is performed for both the *stylized mesh* and the *displacement-only* mesh.

Next, two image augmentations are drawn from a set of possible augmentation parameters: a *global* one (which applies a random perspective transformation) and a *local* one (which crops the render to 10% of its original size, then applies a random perspective transformation). The *global* augmentation is applied to renderings of the *stylized mesh*, while the *local* augmentation is applied to renderings of both the *stylized* and the *displacement-only* meshes.
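The crop step of the *local* augmentation can be illustrated on a plain 2D array. This sketch covers only the random crop (the perspective transformation is omitted), and the function name and the reading of "10% of its original size" as 10% per side are illustrative assumptions:

```python
import random

def local_crop(image, frac=0.1):
    """Randomly crop a 2D image (a list of rows) to `frac` of its
    original height and width, as in the local augmentation's crop
    step; the perspective transform that follows is not shown."""
    h, w = len(image), len(image[0])
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    return [row[left:left + cw] for row in image[top:top + ch]]

random.seed(0)
img = [[(r, c) for c in range(100)] for r in range(100)]
crop = local_crop(img)
print(len(crop), len(crop[0]))  # 10 10
```

Feeding CLIP such small local crops forces the Neural Style Field to produce fine-grained detail that survives close inspection, not just a plausible overall silhouette.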

The augmented renderings are then embedded into the CLIP space and the embeddings are averaged over all views, according to the following equations:

φ̂_full = (1/n) Σ_θ E(ψ_global(I_full^θ)), φ̂_local = (1/n) Σ_θ E(ψ_local(I_full^θ)), φ̂_displ = (1/n) Σ_θ E(ψ_local(I_displ^θ))

where the I_full^θ and I_displ^θ terms are the original 2D renderings from view θ, E is the CLIP image encoder, ψ_global and ψ_local are the augmentations, and n is the number of views: the embeddings are averaged over all views once, with the *global* augmentation, for the *stylized* view, and twice, with the *local* augmentation, for the *stylized* and *displacement-only* views. The resulting embeddings live in a 512-dimensional space. The *target text* *t* is then also embedded into the CLIP space as follows:

φ_target = E_text(*t*)

Finally, the *semantic loss* is computed according to the following equation, simply by summing the *cosine similarities* between the averaged CLIP embeddings of the augmented 2D renderings and the CLIP embedding of the *target text*:

L_sim = Σ_{φ̂ ∈ {φ̂_full, φ̂_local, φ̂_displ}} sim(φ̂, φ_target)

where:

sim(a, b) = ⟨a, b⟩ / (‖a‖ ‖b‖)
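The cosine-similarity objective is straightforward to write down. A minimal sketch with toy 4-dimensional vectors standing in for the 512-dimensional CLIP embeddings; the sign convention here (minimizing the negative sum of similarities, which is equivalent to maximizing the similarities) and the function names are illustrative:

```python
import math

def cosine_sim(a, b):
    """sim(a, b) = <a, b> / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_loss(view_embeddings, target_embedding):
    """Negative sum of cosine similarities between each averaged
    render embedding and the target text embedding; minimizing it
    pushes the renders toward the text prompt in CLIP space."""
    return -sum(cosine_sim(e, target_embedding) for e in view_embeddings)

# Toy stand-ins for the three averaged embeddings and the text target.
target = [1.0, 0.0, 0.0, 0.0]
embeds = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.0, 0.2, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(semantic_loss(embeds, target))  # negative, approaching -3 as all match
```

Because cosine similarity ignores vector magnitude, only the *direction* of each embedding in CLIP space matters, which is exactly how CLIP itself compares images to text.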

The above is repeated with newly sampled augmentations a number of times, set as a hyperparameter, for each update of the *Neural Style Field*.

The experiments were performed using a wide variety of input source meshes (taken from sources including COSEG, Thingi10K, ShapeNet, TurboSquid and ModelNet), including low-quality meshes. Training took less than 25 minutes on a single Nvidia GeForce RTX 2080 Ti GPU, using the Adam optimizer with learning-rate decay. Various ablation studies, carried out by deactivating parts of the model, show that this combination of processing steps and this choice of hyperparameters leads to the best performance.

This new technique shows very interesting results and can be considered a major improvement over the specificity of previous applications of generative models to graphic and artistic tasks. Considering the complexity of the task and the relatively low training time and computing resources required, this technology could open up many new possibilities, especially in video games and entertainment, given the interplay between the *content* and *style* of meshes and their underlying semantics. It could even help fuel the creation of digital artistic content, with interesting implications for emerging NFT markets.

Article: https://arxiv.org/pdf/2112.03221.pdf

GitHub: https://github.com/threedle/text2mesh

Project page: https://threedle.github.io/text2mesh/