If a human writer used the word 'bank' to describe a seashore, it would read as carelessness or poor diction. But when the same word is misused by an AI art generator, an algorithm trained on millions of examples, it is a genuine cause for concern. Large prompt-based systems such as DALL-E 2 do not yet seem able to relate words to concepts the way humans do. In a new paper titled 'DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models', researchers from Bar-Ilan University and the Allen Institute for Artificial Intelligence conclude that DALL-E 2 is not equipped to follow a consistent symbol-to-entity mapping in language.
These models rely on OpenAI's CLIP module, which does not accurately bind objects to their respective properties. The authors demonstrate that the model is poor at interpreting a word's meaning in context, sometimes reusing a single symbol for two different purposes at once. Across 17 homonym words and 216 generated images, DALL-E 2 committed this homonym duplication in 80% of the images. The researchers also used paired control stimuli to test whether more specific or elaborate descriptions prevent the duplication: DALL-E 2's ambiguous prompts produced a shared property in 92.5% of cases, versus only 6.6% for the control prompts.
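The duplication statistic described above can be sketched as a simple tally over manually annotated generations. This is an illustrative reconstruction, not the authors' code; the record format and the toy data are our own assumptions.

```python
# Illustrative sketch (not the paper's actual code): computing a homonym
# duplication rate from annotated generations. Each annotation records
# whether an image generated from a homonym prompt depicts more than one
# sense of the word at once.

def duplication_rate(annotations):
    """Fraction of images annotated as showing multiple senses at once."""
    if not annotations:
        return 0.0
    duplicated = sum(1 for a in annotations if a["multiple_senses"])
    return duplicated / len(annotations)

# Toy annotations for the homonym 'bat' (hypothetical data, not from the paper).
toy = [
    {"word": "bat", "multiple_senses": True},
    {"word": "bat", "multiple_senses": True},
    {"word": "bat", "multiple_senses": True},
    {"word": "bat", "multiple_senses": False},
]
print(duplication_rate(toy))  # 0.75 on this toy sample
```

In the paper the same kind of tally, taken over 216 images for 17 homonyms, yields the reported 80% figure.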
In the experiments, DALL-E 2 generated images depicting different senses of a noun, or multiple senses at once, or split a single word across two different objects, clearly exposing a semantic leakage of properties between entities. For an input like 'a seal', for example, it generated an image of the living animal together with a sealed cover. For the prompt 'a bat is flying over a baseball stadium', it produced both a wooden bat and the flying mammal over the stadium. For a prompt like 'goldfish', the output showed gold and a fish as two separate entities. DALL-E 2's art, in short, goes nuts with homonyms.
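The stimulus/control setup behind these examples can be sketched roughly as follows. This is a hypothetical reconstruction of the methodology, not the paper's exact stimuli: for each homonym, the ambiguous prompt is paired with control prompts in which the word is replaced by an unambiguous description of one sense, and duplication rates are then compared between the two.

```python
# Hypothetical reconstruction of stimulus/control prompt pairs in the
# spirit of the paper's methodology. The sense descriptions and the
# prompt template below are our own examples, not the paper's stimuli.

HOMONYM_SENSES = {
    "bat": ["flying mammal", "wooden baseball bat"],
    "seal": ["marine animal", "sealed envelope"],
}

def make_pairs(word, senses, template="a {} is in the picture"):
    """Return (ambiguous, control) prompt pairs, one control per sense."""
    ambiguous = template.format(word)
    return [(ambiguous, template.format(sense)) for sense in senses]

for ambiguous, control in make_pairs("bat", HOMONYM_SENSES["bat"]):
    print(ambiguous, "|", control)
```

If the ambiguous prompt yields far more multi-sense images than its controls, as the paper reports (92.5% vs. 6.6%), the duplication can be attributed to the word itself rather than to the scene being described.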
The researchers' view is that as model complexity increases, models become more vulnerable to such semantic discrepancies, a tendency found across all CLIP-guided diffusion models. Downsized models such as DALL-E Mini commit fewer of these errors than their larger siblings. Stable Diffusion, a similar text-to-image generator, also commits fewer such errors, the paper notes, but only because it follows prompts less faithfully; the researchers assume its output, like DALL-E Mini's, is only relatively better for that reason. Only when scale, model architecture, and concept leakage are studied in relation to one another can a model's efficiency be estimated accurately.
Explaining how this inductive bias manifests in second-order effects, the paper notes that DALL-E 2 falls for duplication even when a hypernym or an implicit description of the modifier word is used in place of the word itself. The results and their implications, the researchers say, uncover the linguistic shortcomings of DALL-E 2 and open the way for research not just into text-to-image models but also into text encoders, generative models, or both.