Our method, CLIPDrawX, synthesizes vector sketches conditioned on an input text prompt using simple primitive shapes such as
circles, straight lines, and semi-circles; the highlighted words are used to create the cross-attention maps.

With the goal of understanding the visual concepts that CLIP associates with text prompts, we show that the latent space of CLIP can be visualized solely in terms of linear transformations on simple geometric primitives like circles and straight lines. Although existing approaches achieve this through sketch-synthesis-by-optimization, they do so on the space of Bézier curves, which can evolve into a wastefully large set of structures, most of which are non-essential for generating meaningful sketches. We present CLIPDrawX, an algorithm that provides significantly better visualizations for CLIP text embeddings using only simple primitive shapes like straight lines and circles. This constrains the set of possible outputs to linear transformations on these primitives, yielding an inherently simpler mathematical form. The synthesis process of CLIPDrawX can be tracked end-to-end, with each visual concept explained exclusively in terms of primitives.
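To make the "linear transformations on primitives" idea concrete, here is a minimal sketch (our own illustration, not the paper's code) in which a unit circle and a unit line segment are sampled as point sets; every shape the optimizer can reach is then just `scale * R @ points + translation`:

```python
import numpy as np

def unit_circle(n=16):
    """Sample n points on the unit circle."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.stack([np.cos(t), np.sin(t)], axis=0)  # shape (2, n)

def unit_line(n=16):
    """Sample n points on a horizontal unit segment centred at the origin."""
    x = np.linspace(-0.5, 0.5, n)
    return np.stack([x, np.zeros_like(x)], axis=0)

def transform(points, scale, theta, tx, ty):
    """Apply a similarity transform: rotate by theta, scale, then translate."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return scale * (R @ points) + np.array([[tx], [ty]])

# A circle of radius 20 centred on a 224x224 canvas.
circle = transform(unit_circle(), scale=20.0, theta=0.0, tx=112.0, ty=112.0)
```

Because the only free parameters are `scale`, `theta`, `tx`, and `ty` per primitive, the search space stays small and every optimized shape remains directly interpretable as a transformed circle or line.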

CLIPDrawX comprises a strategic canvas initialization, which utilizes diffusion-based cross-attention maps and a patch-wise
arrangement of primitives, along with primitive-level dropout (PLD). Coupled with pre-trained image
\( \mathcal{I} \) and text \( \mathcal{T} \) encoders from the CLIP model for similarity maximization, the proposed model positions itself as an efficient and user-friendly tool in the
realm of AI-driven explainable sketch synthesis.

The total loss, \( \mathcal{L}_\text{total} \) is the summation of two loss functions (semantic loss \( \mathcal{L}_\text{sem} \) and visual loss \( \mathcal{L}_\text{vis} \)), each weighted by their respective coefficients, \( \lambda_\text{sem} \) and \( \lambda_\text{vis} \). These two loss functions balance our sketch synthesis process: semantic loss aligns vector sketches with textual prompts, while visual loss maintains low-level spatial features and perceptual coherence. This combination effectively captures the intricate relationship between semantic fidelity and geometric accuracy.

$$\mathcal{L}_\text{total} = \lambda_\text{sem} \mathcal{L}_\text{sem} + \lambda_\text{vis} \mathcal{L}_\text{vis}$$
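The weighted combination above can be sketched as follows. This is a hedged toy illustration: the stand-in losses (a CLIP-space cosine distance for \( \mathcal{L}_\text{sem} \) and a mean-squared feature error for \( \mathcal{L}_\text{vis} \)) and the \( \lambda \) values are our own assumptions, not values from the paper:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; stands in for the CLIP-space semantic loss."""
    a, b = np.ravel(a), np.ravel(b)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_loss(a, b):
    """Mean squared error; stands in for the low-level visual loss."""
    return float(np.mean((a - b) ** 2))

def total_loss(sketch_emb, text_emb, sketch_feats, target_feats,
               lam_sem=1.0, lam_vis=0.5):
    """L_total = lam_sem * L_sem + lam_vis * L_vis (lambdas illustrative)."""
    l_sem = cosine_distance(sketch_emb, text_emb)
    l_vis = l2_loss(sketch_feats, target_feats)
    return lam_sem * l_sem + lam_vis * l_vis
```

In practice the embeddings and features would come from the CLIP encoders, and the gradient of this scalar would flow back to the primitive parameters through a differentiable rasterizer.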

Our model skillfully captures the dynamics of shape or scene evolution, displaying varying levels of flexibility based on the degrees of freedom, which are tied to the number of control points in a shape.

Placing different primitives at a single location is problematic: the resulting high point density causes clutter, uneven primitive distribution, and messy sketches. To address this, CLIPDrawX introduces primitives within fixed ranges, or *patches*, each representing all points inside it, rather than at precise attention-map locations. It divides the \(224 \times 224\) canvas into patches of \(32 \times 32\).
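A small helper (our own sketch, not the paper's code) makes the patch bookkeeping concrete: attention-map coordinates are binned into a \(7 \times 7\) grid of \(32 \times 32\) patches, and a primitive can then be initialized at the patch center rather than at the exact attention peak:

```python
CANVAS, PATCH = 224, 32
GRID = CANVAS // PATCH  # 7x7 grid of patches

def patch_index(x, y):
    """Map a canvas coordinate to its (row, col) patch index."""
    return min(int(y) // PATCH, GRID - 1), min(int(x) // PATCH, GRID - 1)

def patch_center(row, col):
    """Center of a patch, a natural spot to initialize a primitive."""
    return (col * PATCH + PATCH / 2, row * PATCH + PATCH / 2)
```

Binning nearby attention peaks into the same patch caps the local point density, which is exactly what prevents the cluttered, uneven placements described above.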

Primitive-level dropout (PLD) enhances the utilization of primitive shapes.
It steers the optimization so that every primitive is forced to independently encode a specific concept in the sketch. As a result, the number of noisy strokes that do not contribute towards depicting anything meaningful is reduced.
Moreover, PLD expedites convergence by promptly identifying relevant strokes, reducing the number of steps required for structure creation.
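The core of PLD can be sketched in a few lines. This is a hedged illustration under our own assumptions (the drop probability of 0.1 is illustrative, not a value from the paper): at each optimization step, each primitive is dropped from the rendered sketch with some probability, so no primitive can rely on its neighbours to carry a concept:

```python
import random

def apply_pld(primitives, p_drop=0.1, rng=random):
    """Return the subset of primitives kept for this rendering pass."""
    kept = [prim for prim in primitives if rng.random() >= p_drop]
    # Never render an empty canvas: keep at least one primitive.
    return kept if kept else [primitives[0]]
```

Rendering and scoring the sketch with random subsets means a primitive only survives optimization pressure if it contributes on its own, which is why noisy, redundant strokes fade out.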

In traditional human sketching, artists often begin with a light outline or faint layout that serves as a foundation for the artwork. This initial phase sets the broader structure and composition. As the artwork advances, artists intensify strokes, focusing on crucial elements to make them prominent and ensuring each stroke adds value to the overall piece. Drawing a parallel to the digital realm, CLIPDrawX adopts a similar methodology. Here, primitive shapes are initialized with a low opacity value, denoted \(\alpha\), akin to the faint layout artists create. As the optimization proceeds, the opacity of certain primitives is incrementally increased based on their relevance and significance. This mirrors the artist's method of iteratively intensifying the strokes deemed crucial to the sketch's integrity.
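An illustrative schedule (our own assumption, with illustrative constants) that mirrors the faint-layout analogy: every primitive starts at a low opacity \(\alpha\) and is brightened over optimization in proportion to a relevance score in \([0, 1]\):

```python
ALPHA_INIT = 0.1  # faint initial layout; value is illustrative

def update_opacity(alpha, relevance, step_size=0.05):
    """Increase the opacity of relevant primitives, clamped to [0, 1]."""
    return min(1.0, alpha + step_size * relevance)
```

Irrelevant primitives (relevance near zero) simply stay faint, so they contribute little to the rendered sketch even before dropout removes them.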

We explore how the number of primitives of each type placed within the selected patches affects CLIPDrawX's sketch outcomes across diverse text prompts:

- Few primitives yield abstract sketches (rows 1-2).
- Optimal detail emerges with 3-4 primitives (rows 3-4).
- Excessive counts (row 5) complicate optimization, hindering synthesis.

Our CLIPDrawX model is compared with three related methods: \(\textbf{CLIPDraw}\), which synthesizes CLIP-guided text-to-sketch results by optimizing Bézier curves; \(\textbf{BigGAN}\), which utilizes a pre-trained BigGAN generator for image production; and \(\textbf{VectorFusion}\), which employs a text-conditioned diffusion model for vector sketch creation. We maintain the original settings of all these models to ensure a fair comparison.
As demonstrated below, our CLIPDrawX produces sketches that are noticeably cleaner than those from CLIPDraw, likely due to our method's use of primitive-level dropout and the initialization of primitives at reduced opacity.
The images from BigGAN often lack proper details and appropriate semantics, and the vector sketches from VectorFusion tend to be overly abstract and detail-deficient.
In contrast, our CLIPDrawX consistently delivers clean sketches with accurate details and semantics.
The reduced noise in our sketches is attributed to the minimized number of control points, the learnable opacity of primitives, and primitive-level dropout, which together reduce the need for manual intervention and parameter adjustments.

We introduced the notion of explainable sketch synthesis through optimization via simple geometric primitives like straight lines, circles, and semicircles.
To this end, we presented CLIPDrawX, a model that synthesizes highly expressive sketches in an explainable manner via simple linear transformations (computed through optimization) on such primitives. It leverages novel techniques, including strategic sketch-canvas initialization for synthesizing clean sketches and primitive-level dropout for producing sketches with low noise, which collectively enhance the model's efficiency and output quality.
The extensive experiments and ablation studies underscore the model's superiority over existing methods, showcasing its capability to produce sketches that are not only aesthetically appealing but also semantically rich and explainable.
Combining advanced optimization with intuitive design, CLIPDrawX stands out in AI-driven art creation, guiding future progress in Explainable AI and creative computing.

```
@misc{mathur2023clipdrawx,
  title={CLIPDrawX: Primitive-based Explanations for Text Guided Sketch Synthesis},
  author={Nityanand Mathur and Shyam Marjit and Abhra Chaudhuri and Anjan Dutta},
  year={2023},
  eprint={2312.02345},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```