DALL-E
| Original author(s) | OpenAI |
|---|---|
| Initial release | 5 January 2021 |
| Type | Transformer language model |
DALL-E (stylized DALL·E) is an artificial intelligence program that creates images from textual descriptions.
It uses a 12-billion-parameter[1] version of the GPT-3 Transformer model to interpret natural language inputs (such as "a green leather purse shaped like a pentagon" or "an isometric view of a sad capybara") and generate corresponding images.[2] It can create images of realistic objects ("a stained glass window with an image of a blue strawberry") as well as objects that do not exist in reality ("a cube with the texture of a porcupine").[3][4][5] Its name is a portmanteau of WALL-E and Salvador Dalí.[2][1]
Many neural nets from the 2000s onward have been able to generate realistic images.[2] DALL-E, however, is able to generate them from natural language prompts, which it "understands [...] and rarely fails in any serious way".[2]
OpenAI has not released source code for either model, although a "controlled demo" of DALL-E is available on OpenAI's website, where output from a limited selection of sample prompts can be viewed.[1] Open-source alternatives trained on smaller amounts of data, such as DALL-E Mini, have been released by others.[6]
According to MIT Technology Review, one of OpenAI's objectives was to "give language models a better grasp of the everyday concepts that humans use to make sense of things".[7]
History
DALL-E was revealed by OpenAI on January 5, 2021.[7]
In April 2022, OpenAI announced DALL-E 2, claiming that it can produce photo-realistic images from textual descriptions, along with an editor that allows simple modifications to the output. As of the announcement, the software was stated to still be in the research phase, with access limited to pre-selected beta users. The model can still make serious errors, including errors that no human would make.[8] DALL-E 2 was described as a model that "can create original, realistic images and art from a text description. It can combine concepts, attributes, and styles."[9]
CLIP
DALL-E was developed and announced to the public in conjunction with CLIP (Contrastive Language-Image Pre-training).[7] CLIP is a separate model whose role is to "understand and rank" DALL-E's output.[2] The images that DALL-E generates are curated by CLIP, which presents the highest-quality images for any given prompt.[7]
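This curation step can be thought of as re-ranking: a CLIP-style model embeds the prompt and each candidate image in a shared space and keeps the candidates whose embeddings are most similar to the prompt's. The sketch below illustrates that idea with random placeholder embeddings and an assumed embedding size of 512; it is not OpenAI's code, and in practice the vectors would come from CLIP's actual text and image encoders.

```python
import torch
import torch.nn.functional as F

def rank_candidates(prompt_emb, candidate_embs, top_k=4):
    """Score candidate images by cosine similarity to the prompt embedding
    and return the indices of the top_k best matches."""
    prompt = F.normalize(prompt_emb, dim=-1)
    candidates = F.normalize(candidate_embs, dim=-1)
    scores = candidates @ prompt              # one cosine similarity per candidate
    return scores.topk(top_k).indices, scores

# Placeholder embeddings standing in for CLIP encoder outputs (assumed 512-dim).
prompt_emb = torch.randn(512)                 # embedding of the text prompt
candidate_embs = torch.randn(32, 512)         # embeddings of 32 generated images
best, scores = rank_candidates(prompt_emb, candidate_embs)
```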
Architecture
The Generative Pre-trained Transformer (GPT) model was initially developed by OpenAI in 2018,[10] using the Transformer architecture. The first iteration, GPT, was scaled up to produce GPT-2 in 2019;[11] in 2020 it was scaled up again to produce GPT-3, with 175 billion parameters.[12][1][13]
DALL-E's model is a multimodal implementation of GPT-3[14] with 12 billion parameters[1] which "swaps text for pixels", trained on text-image pairs from the Internet.[7] It uses zero-shot learning to generate output from a description and cue without further training.[15]
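In other words, the approach treats image generation as ordinary next-token prediction: a caption's text tokens and a discretized image's tokens are concatenated into one sequence, and a decoder-only transformer learns to predict each token from the ones before it. The following is a minimal sketch of that sequence layout and training loss; the vocabulary sizes, toy sequence lengths, and the tiny stand-in model are illustrative assumptions, not DALL-E's actual configuration (which is a 12-billion-parameter transformer).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # assumed sizes of the text and image-code vocabularies
TEXT_LEN, IMAGE_LEN = 16, 64            # toy lengths; the real model uses far longer sequences
VOCAB = TEXT_VOCAB + IMAGE_VOCAB        # image codes are offset into one shared vocabulary

class TinyTextToImageLM(nn.Module):
    """Stand-in for the decoder-only transformer: embeds the joint text+image
    sequence and predicts a distribution over the next token at every position."""
    def __init__(self, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        n = tokens.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)  # no peeking ahead
        return self.head(self.blocks(self.embed(tokens), mask=causal))

# One training step: caption tokens followed by image codes form a single sequence,
# and the loss is ordinary next-token cross-entropy over that sequence.
text_tokens = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))                   # placeholder caption ids
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN)) + TEXT_VOCAB   # placeholder image codes
seq = torch.cat([text_tokens, image_tokens], dim=1)

model = TinyTextToImageLM()
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
```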
DALL-E generates multiple images in response to a prompt, and CLIP understands and ranks these images.[2] CLIP was trained on over 400 million pairs of images and text.[1] It is an image recognition system,[16] but one trained on images and captions scraped from the Internet rather than on a curated dataset of labeled images (such as ImageNet).[7] CLIP associates images with entire captions:[7] it was trained to predict which caption (out of a "random selection" of 32,768 possible captions) was most appropriate for an image, allowing it to subsequently identify objects in images outside its training set.[7]
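The training objective described above is contrastive: the image encoder and text encoder are trained so that each image's embedding is most similar to the embedding of its own caption and dissimilar to all the others in a large batch. Below is a minimal sketch of such a symmetric contrastive loss with an assumed embedding size and a small toy batch; it is an illustration of the general technique, with the batch playing the role of the 32,768 candidate captions mentioned above.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) outputs of the image and text encoders,
    where row i of each tensor comes from the same image-caption pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))            # matching pairs lie on the diagonal
    # Symmetric cross-entropy: each image must pick out its caption, and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch of random stand-in embeddings (assumed 512-dimensional).
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```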
Performance
DALL-E is capable of generating imagery in a variety of styles, from photorealistic imagery[1] to paintings and emoji. It can also "manipulate and rearrange" objects in its images.[1] One ability noted by its creators was the correct placement of design elements in novel compositions without explicit instruction: "For example, when asked to draw a daikon radish blowing its nose, sipping a latte, or riding a unicycle, DALL·E often draws the handkerchief, hands, and feet in plausible locations."[17]
While DALL-E exhibited a wide variety of skills and abilities, on the release of its public demo, most coverage focused on a small subset of "surreal"[7] or "quirky"[18] output images. Specifically, DALL-E's output for "an illustration of a baby daikon radish in a tutu walking a dog" was mentioned in pieces from Input,[19] NBC,[20] Nature,[21] and other publications.[1][22][23] Its output for "an armchair in the shape of an avocado" was similarly noted.[7][24] In contrast, DALL-E's unintentional development of visual reasoning skills sufficient to solve Raven's Matrices (visual tests often administered to humans to measure intelligence) was observed.[25]
DALL-E was described as "remarkably robust" to changes in the wording of prompts, and as reliable in producing images for a wide variety of arbitrary descriptions.[2] Sam Shead, reporting for CNBC, called its images "quirky" and quoted Neil Lawrence, a professor of machine learning at the University of Cambridge, who described it as an "inspirational demonstration of the capacity of these models to store information about our world and generalize in ways that humans find very natural". Shead also quoted Mark Riedl, an associate professor at the Georgia Tech School of Interactive Computing, as saying that DALL-E's demonstration results showed that it was able to "coherently blend concepts", a key element of human creativity, and that "the DALL-E demo is remarkable for producing illustrations that are much more coherent than other Text2Image systems I’ve seen in the past few years."[18] Riedl was also quoted by the BBC as saying that he was "impressed by what the system could do".[24]
DALL-E's ability to "fill in the blanks" and infer appropriate details without specific prompts was remarked on as well. ExtremeTech noted that a prompt to draw a penguin wearing a Christmas sweater produced images of penguins wearing not only sweaters but also thematically related Santa hats,[26] and Engadget noted that appropriately-placed shadows appeared in output for the prompt "a painting of a fox sitting in a field during winter".[15] Furthermore, DALL-E exhibits a broad understanding of visual and design trends; ExtremeTech said that "you can ask DALL-E for a picture of a phone or vacuum cleaner from a specified period of time, and it understands how those objects have changed".[26] Engadget also noted its unusual capacity for "understanding how telephones and other objects change over time".[15] DALL-E has been described, along with other examples of "narrow AI" such as AlphaGo, AlphaFold, and GPT-3, as "[generating] interest in whether and how artificial general intelligence may be achieved".[27]
Implications
OpenAI claimed that it planned to "analyze the societal impacts"[28] and "potential for bias" before releasing DALL-E for general use.[18] Earlier models had proven susceptible to misuse because they lacked safeguards against it.
DALL-E heralds "the dawn of a new AI paradigm known as multimodal AI", in which systems would be capable of combining data from and translating data between multiple information types. Further, DALL-E has been cited as an example of software that exhibits creativity.[29]
References
- ^ a b c d e f g h i Johnson, Khari (5 January 2021). "OpenAI debuts DALL-E for generating images from text". VentureBeat. Archived from the original on 5 January 2021. Retrieved 5 January 2021.
- ^ a b c d e f g Coldewey, Devin (5 January 2021). "OpenAI's DALL-E creates plausible images of literally anything you ask it to". Archived from the original on 6 January 2021. Retrieved 5 January 2021.
- ^ Grossman, Gary (16 January 2021). "OpenAI's text-to-image engine, DALL-E, is a powerful visual idea generator". VentureBeat. Archived from the original on 26 February 2021. Retrieved 2 March 2021.
- ^ Andrei, Mihai (8 January 2021). "This AI module can create stunning images out of any text input". ZME Science. Archived from the original on 29 January 2021. Retrieved 2 March 2021.
- ^ Walsh, Bryan (5 January 2021). "A new AI model draws images from text". Axios. Retrieved 2 March 2021.
- ^ Dayma, Boris; Patil, Suraj; Cuenca, Pedro; Saifullah, Khalid; Abraham, Tanishq; Lê Khắc, Phúc; Melas, Luke; Ghosh, Ritobrata (2021). DALL·E Mini. doi:10.5281/zenodo.5146400. Retrieved 29 November 2021.
- ^ a b c d e f g h i j Heaven, Will Douglas (5 January 2021). "This avocado armchair could be the future of AI". MIT Technology Review. Retrieved 5 January 2021.
- ^ Kahn, Jeremy (6 April 2022). "Move over Photoshop: OpenAI has just revolutionized digital image making". Fortune. Retrieved 10 April 2022.
- ^ "DALL·E 2". OpenAI. Retrieved 6 April 2022.
- ^ Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (11 June 2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. p. 12. Archived (PDF) from the original on 26 January 2021. Retrieved 23 January 2021.
- ^ Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (14 February 2019). "Language models are unsupervised multitask learners" (PDF). 1 (8). Archived (PDF) from the original on 6 February 2021. Retrieved 19 December 2020.
- ^ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (22 July 2020). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
- ^ Ramesh, Aditya; Pavlov, Mikhail; Goh, Gabriel; Gray, Scott; Voss, Chelsea; Radford, Alec; Chen, Mark; Sutskever, Ilya (24 February 2021). "Zero-Shot Text-to-Image Generation". arXiv:2101.12092 [cs.LG].
- ^ Tamkin, Alex; Brundage, Miles; Clark, Jack; Ganguli, Deep (2021). "Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models". arXiv:2102.02503 [cs.CL].
- ^ a b c Dent, Steve (6 January 2021). "OpenAI's DALL-E app generates images from just a description". Engadget. Archived from the original on 27 January 2021. Retrieved 2 March 2021.
- ^ "For Its Latest Trick, OpenAI's GPT-3 Generates Images From Text Captions". Synced. 5 January 2021. Archived from the original on 6 January 2021. Retrieved 2 March 2021.
- ^ Dunn, Thom (10 February 2021). "This AI neural network transforms text captions into art, like a jellyfish Pikachu". BoingBoing. Archived from the original on 22 February 2021. Retrieved 2 March 2021.
- ^ a b c Shead, Sam (8 January 2021). "Why everyone is talking about an image generator released by an Elon Musk-backed A.I. lab". CNBC. Retrieved 2 March 2021.
- ^ Kasana, Mehreen (7 January 2021). "This AI turns text into surreal, suggestion-driven art". Input. Archived from the original on 29 January 2021. Retrieved 2 March 2021.
- ^ Ehrenkranz, Melanie (27 January 2021). "Here's DALL-E: An algorithm learned to draw anything you tell it". NBC News. Archived from the original on 20 February 2021. Retrieved 2 March 2021.
- ^ Stove, Emma (5 February 2021). "Tardigrade circus and a tree of life — January's best science images". Nature. Archived from the original on 8 March 2021. Retrieved 2 March 2021.
- ^ Knight, Will (26 January 2021). "This AI Could Go From 'Art' to Steering a Self-Driving Car". Wired. Archived from the original on 21 February 2021. Retrieved 2 March 2021.
- ^ Metz, Rachel (2 February 2021). "A radish in a tutu walking a dog? This AI can draw it really well". CNN. Retrieved 2 March 2021.
- ^ a b Wakefield, Jane (6 January 2021). "AI draws dog-walking baby radish in a tutu". British Broadcasting Corporation. Archived from the original on 2 March 2021. Retrieved 3 March 2021.
- ^ Markowitz, Dale (10 January 2021). "Here's how OpenAI's magical DALL-E image generator works". TheNextWeb. Archived from the original on 23 February 2021. Retrieved 2 March 2021.
- ^ a b Whitwam, Ryan (6 January 2021). "OpenAI's 'DALL-E' Generates Images From Text Descriptions". ExtremeTech. Archived from the original on 28 January 2021. Retrieved 2 March 2021.
- ^ Nichele, Stefano (2021). "Tim Taylor and Alan Dorin: Rise of the self-replicators—early visions of machines, AI and robots that can reproduce and evolve". Genetic Programming and Evolvable Machines. 22: 141–145. doi:10.1007/s10710-021-09398-5. S2CID 231930573.
- ^ Macaulay, Thomas (6 January 2021). "Say hello to OpenAI's DALL-E, a GPT-3-powered bot that creates weird images from text". TheNextWeb. Archived from the original on 28 January 2021. Retrieved 2 March 2021.
- ^ Toews, Rob (18 January 2021). "AI And Creativity: Why OpenAI's Latest Model Matters". Forbes. Archived from the original on 12 February 2021. Retrieved 2 March 2021.