In Conversation
Alejandro Cartagena, co-founder of Fellowship, discusses alignDRAW with the artist Elman Mansimov.
Alejandro Cartagena: Tell us about how you were introduced to computer science and how it played a role in your upbringing.
Elman Mansimov: When I first encountered computer science, I was growing up in Baku, Azerbaijan. At that time, in terms of technology and computers, we were not as advanced as the West.
During my middle school years in the 2000s, computer science in Azerbaijan was referred to more as applied mathematics or informatics. I'm not certain if this is common terminology elsewhere. I recall very basic programming classes at school where we learned languages like Pascal. One of my earliest memories with computers involved creating computer-generated shapes such as triangles and circles.
It felt like a rudimentary version of Windows Paint. But I distinctly remember the challenges and frustrations I faced during debugging. So, I didn't warm up to computer science right away. I've always been a technically-inclined individual and excelled in sciences. As I neared high school graduation, I considered pursuing a career in the oil industry, given that Azerbaijan is an oil-rich country.
However, my math teacher advised me to consider computer science. She believed that while I had a technical aptitude, I might not have the patience required for pure math. Instead, she felt a field with a quicker feedback loop, like computer science, would suit me. Based on her advice, I sought and received a full scholarship to study computer science at the University of Toronto in 2011. From then on, I was wholly committed to the subject.
Cartagena: While at the University of Toronto, what were your main areas of research and experimentation before you arrived at alignDRAW?
Mansimov: During my time at the University of Toronto, particularly in the summers of my second and third years, I undertook research. However, my initial projects were not focused on machine learning.
In my second year, my research was related to computer systems, emphasizing networks. By my third year, I was working in a human-computer interaction lab, focusing on designing iPad app interfaces for the elderly. But as I engaged with these projects and attended various computer science classes, I felt increasingly drawn to machine learning and AI by the end of my third year.
I perceived it as a realm of vast opportunities with high impact. Around 2014, AI, though not as expansive as it is now, was showing significant promise. Figures like Andrew Ng from Stanford and Geoff Hinton from the University of Toronto were making remarkable advancements. The limitations of traditional algorithms became apparent when dealing with complex tasks, such as differentiating between a cat and a dog based on varying features. Machine learning appeared to be the next logical step.
Fortuitously, one of the graduate students I worked with during those summers introduced me to a supervisor. I began my focused research in AI and deep learning in the fall of 2014. This eventually led to my involvement with alignDRAW in the summer of 2015. From there, I continued my research in deep learning and began preparing for graduate school admissions.
Cartagena: What was the prevailing atmosphere surrounding AI during that time? Was there growing enthusiasm outside of the research realm, or did you sense a cultural shift?
Mansimov: In that academic year, I began immersing myself in lab work alongside a grad student, grasping the nuances of research. Around that period, as I participated in grad school seminars and discussions with other grad students, there was a palpable sense of ascension in the field.
Several of my lab colleagues at the University of Toronto were deeply involved in developing image captioning systems. An unmistakable buzz surrounded a New York Times article towards the end of 2014, which highlighted an emerging image-to-text system. Concurrently, the sequence-to-sequence model, especially in machine translation with neural networks, was just beginning to demonstrate its potential.

Technically speaking, while 2012 and 2013 witnessed a focus on image and text classification models, 2014 heralded significant advancements. There was a pivotal paper from Ilya Sutskever, Oriol Vinyals and Quoc Le on machine translation, and another from my future thesis supervisor, Kyunghyun Cho, together with his collaborators Dzmitry Bahdanau and Yoshua Bengio, which jointly learned to align and translate text. The image captioning efforts, which employed techniques from machine translation, were especially groundbreaking. This indicated a clear trajectory for AI, particularly deep learning, which was no longer limited to image classification. It was advancing towards text and structured generation. Some even ventured into generating small images of digits. Given the success stories of image captioning and machine translation, I felt that a forward leap was on the horizon, particularly in generating images from captions, a considerably challenging endeavor.
Beyond the research realm, public awareness was mounting. As I mentioned, the New York Times' coverage of image captioning in 2014 was significant, marking one of the initial instances of mainstream media spotlighting deep learning outside of pure academic circles. I also recall anecdotes of Mark Zuckerberg's appearance at a NeurIPS conference in 2013, indicating that influential figures were beginning to take note. With events like DeepMind's acquisition and the surge of attention around Atari games, it was evident that the broader world was turning its gaze towards AI.
Cartagena: What inspired you to start developing alignDRAW?
Mansimov: There were primarily two motivators for my venture into alignDRAW. First, during the academic year 2014-2015, I collaborated extensively with a grad student, Nitish Srivastava, on video understanding. We utilized sequence-to-sequence recurrent neural nets to predict future frames in videos and used these predictions to enhance video understanding tasks. This approach had striking parallels with how GPT-4 operates, predicting the next word to gain a better understanding for downstream tasks like alignment. Though our work was on a smaller scale with less sophisticated neural networks, the core idea was similar.
Secondly, there was significant advancement in image captioning around that time. Given my familiarity with generative models and newly acquired research skills, I pondered over inverting the captioning process. Instead of transitioning from image to text, I envisioned going from text to image. My core hypothesis before embarking on alignDRAW was that image generation should be a recurrent, iterative process, rather than an instantaneous creation. The model should attentively refer to relevant text sections during the image formation process. The emergence of the DRAW model, which progressively drew images patch by patch, further validated my hypothesis. I felt an impulse to extrapolate this model to encompass text-to-image generation.
Cartagena: Could you describe the process of building alignDRAW, including its duration and the number of failed attempts?
Mansimov: Building alignDRAW began in earnest around the summer of 2015. After finishing school, I stayed on for another year as a paid research assistant with Professor Ruslan Salakhutdinov. My initial endeavors included preparing the Microsoft COCO dataset and implementing the DRAW model using Theano, a deep learning framework of that period. By mid-summer, I commenced training the inaugural models for alignDRAW, starting with generating digits based on specific textual instructions.
To my astonishment, early results garnered positive feedback from academics at the ICML conference I attended in France in July 2015. Bolstered by this affirmation, I ventured into more intricate image generation tasks using the Microsoft COCO dataset. By August, the results looked promising. Admittedly, the debugging process was cumbersome given the Theano framework's long compilation times. It was tempting to rewrite the entire code, but I persisted with the original.
As for failed attempts, the initial model already showed promise with digit generation. However, generating authentic images from the Microsoft COCO dataset required several iterations. One notable enhancement involved sharpening the slightly blurry images the model produced, adding a mechanism to make them appear more photorealistic.
Cartagena: Could you explain the core concept behind alignDRAW for those who may not be familiar with it?
Mansimov: The primary idea of alignDRAW is somewhat straightforward. Initially, there's a text encoder that encodes the text into a specific representation. This encoder is a bi-directional LSTM, which, in layman's terms, reads the sentence from both left to right and right to left. Each word then has its own representation. The readings from both directions are concatenated into one unified representation, which is then consumed by the rest of the model, alignDRAW.
On the decoding side, alignDRAW is an extension of the DRAW model. Here, we begin with an image canvas from the previous step. This canvas is then encoded into a latent representation. From this representation, a new "residual" of the canvas is generated and added to the previous image. This iterative generation process begins with an empty canvas. A latent representation is sampled, from which a new residual is created and added to the canvas. This procedure continues for several steps, all while conditioning the decoding step on the text's representation. The final step involves refining the image. That sums up the core concept.
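To make the loop Mansimov describes concrete, here is a minimal sketch in PyTorch rather than the original Theano code; the class name, layer sizes, the simple dot-product attention, and the standard-normal prior are illustrative assumptions, not details from the paper. It mirrors the described flow: a bi-directional LSTM encodes the caption, and at each of several steps the decoder samples a latent, attends to the word representations, and adds a residual to the canvas.

```python
import torch
import torch.nn as nn

class AlignDRAWSketch(nn.Module):
    """Schematic text-to-image loop in the spirit of alignDRAW (illustrative only)."""

    def __init__(self, vocab_size=1000, emb_dim=128, enc_dim=128,
                 latent_dim=64, dec_dim=256, canvas_size=32 * 32 * 3, steps=16):
        super().__init__()
        self.steps = steps
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bi-directional LSTM: reads the caption left-to-right and right-to-left;
        # the two directions are concatenated to give each word a representation.
        self.encoder = nn.LSTM(emb_dim, enc_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(dec_dim, 2 * enc_dim)   # aligns the decoder state with words
        self.decoder = nn.LSTMCell(latent_dim + 2 * enc_dim, dec_dim)
        self.write = nn.Linear(dec_dim, canvas_size)  # produces the canvas "residual"

    @torch.no_grad()
    def generate(self, caption_tokens):
        words, _ = self.encoder(self.embed(caption_tokens))   # (B, T, 2*enc_dim)
        batch = caption_tokens.size(0)
        canvas = torch.zeros(batch, self.write.out_features)  # start from an empty canvas
        h = torch.zeros(batch, self.decoder.hidden_size)
        c = torch.zeros(batch, self.decoder.hidden_size)
        for _ in range(self.steps):
            # Sample a latent; the paper learns this prior, a standard normal
            # is used here purely for illustration.
            z = torch.randn(batch, self.decoder.input_size - words.size(-1))
            # Attend over the caption: weight each word by its relevance to the
            # current decoder state, then form a context vector.
            scores = torch.einsum("bd,btd->bt", self.attn(h), words)
            context = torch.einsum("bt,btd->bd", scores.softmax(dim=-1), words)
            h, c = self.decoder(torch.cat([z, context], dim=-1), (h, c))
            canvas = canvas + self.write(h)                    # add the residual
        return torch.sigmoid(canvas)                           # final (flattened) image

model = AlignDRAWSketch()
image = model.generate(torch.randint(0, 1000, (1, 7)))        # e.g. a seven-word caption
print(image.shape)                                            # torch.Size([1, 3072])
```

In the actual model the latents come from a learned prior and the final canvas is further refined, but the iterative, text-conditioned writes to the canvas are the essential idea described above.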
Cartagena: What challenges did you encounter during the development of alignDRAW, and how did you overcome them?
Mansimov: There were mainly two challenges. The first was the implementation. Creating the frameworks took time, and debugging required resources. We faced constraints at the University of Toronto, especially regarding GPU access. The second challenge was ensuring the generation of realistic images. This required iterative tweaking of the model, its hyperparameters, and sharpening details to enhance the images' clarity.
Cartagena: How did you determine that the model was successful, and what were your feelings when you made this discovery?
Mansimov: To be honest, I never felt the model was entirely successful, perhaps because my standards were quite high. Although our model could produce fairly realistic images, it wasn't at the level of what you'd see with GANs around 2018 or 2019, which had lifelike face generation capabilities. I did, however, believe our results were significant enough for a well-received paper.
It was only later, when I shared the paper with a postdoc, Roger Grosse from the University of Toronto, that I began to see its value. He was genuinely impressed by our generated images, stating he never imagined a neural network could achieve such results. This feedback made me reconsider the model's impact.
But when preparing the paper for submission, we wanted to immediately grab readers' attention. I realized there was skepticism about our model's ability to understand and generate objects based on captions rather than merely memorizing from datasets. I aimed to use strong, representative images in the paper to debunk this notion.
Cartagena: The prompts used in your work are quite unconventional. How did you come to realize that using "nonsensical" descriptions was necessary to demonstrate the model's capabilities?
Mansimov: I examined some representative objects in the dataset and considered how I might combine scenes or objects in an unusual way to convince people of the model's generalization capabilities. For instance, even if the dataset had never contained the word "helicopter", my hypothesis was that if the model had encountered the concepts of "sea" and "giraffe", a prompt like "a giraffe standing on the sea" should produce an image of a giraffe with a blue patch beneath it. This hypothesis led me to create prompts like a "stop sign in the sky" or "toilets in the grass". My initial goal wasn't to craft artistic or nonsensical prompts, but to demonstrate that the model was comprehending the scenes and objects in the data beyond mere memorization.
Cartagena: How was the reception of your discovery within the academic and research communities?
Mansimov: Regarding the reception, as I mentioned, the academic feedback was positive. Our paper was accepted to ICLR 2016 in Puerto Rico. It was even selected for an oral presentation, a distinction given to only 2% or 3% of submitted papers.
The scientific community was excited about our novel approach to realistic text-to-image synthesis. The press also took note. A journalist from Vice interviewed me, specifically highlighting the "toilet in the grass" image. At a conference in December 2015, attendees recognized me as the one behind the "toilet in the grass" paper, which I found amusing and validating.
Cartagena: Reflecting on the fact that you were only 19 years old when you began this project, does that realization impress you?
Mansimov: Indeed, I was 19 when I began this project. At the time, I wasn't dwelling on my age. I felt I was doing something significant and valuable. My primary goal was to hone my skills and enhance my academic prospects in deep learning. While I anticipated progress in generative models and image generation, I hadn't foreseen this technology becoming as mainstream and impactful as it has.
Cartagena: Considering the current state of text-to-image models, how do you perceive alignDRAW? Did you ever anticipate that this technology would become so efficient and mainstream?
Mansimov: For me, it's a delightful realization that people view this work as one of the pioneers in text-to-image generation. I never set out with the intention of being a "first" in this area. My goal was to contribute meaningful research. It's heartening to see that the work has not only achieved its primary objective but has also left a mark in the field.
Cartagena: In the realm of art, these works represent the dawn of a new approach to image creation. Did you ever anticipate that they would have such significant implications for the future of art?
Mansimov: Honestly, I hadn't fully grasped the potential impact on the art world at the outset. At that time, the technology was still in its infancy. Much of our focus was on how to refine and improve it. The broader implications, whether on art, jobs, or safety, became more pronounced around 2019, notably driven by OpenAI.
Forward thinkers like Nick Bostrom were posing philosophical arguments about AI's potential dangers and benefits. Yet, to someone deeply involved with the technology, some of these concerns felt overestimated, at least initially. However, when OpenAI, as creators of technologies like GPT-2, began emphasizing safety and the societal implications of generative models, the discourse around AI's broader consequences gained substantial traction.
Cartagena: alignDRAW explores the boundaries between human creativity and artificial intelligence. What are your thoughts on the role of AI in augmenting human creativity?
Mansimov: From a computer science perspective, AI is essentially a form of software. The primary purpose of any software, whether it's a spreadsheet application or a document editor, is to enhance human productivity. This augmentation allows us to accomplish more in less time, thereby granting us more opportunities to savor life and its experiences.
I've always envisioned AI as a tool designed to assist humans, not supplant them. The entire ecosystem, from the hardware like Nvidia GPUs to the software frameworks and the AI models themselves, is a product of human ingenuity. I firmly believe that AI is an extension of software that serves to empower humans in various capacities. In the realm of art, AI could act as a muse or catalyst, inspiring artists and providing them with foundational ideas to build upon and refine.
Cartagena: How do you envision the future of AI-driven art, and what possibilities do you see for collaboration between artists and AI systems?
Mansimov: The current state of AI-driven art, as seen in systems like alignDRAW, is somewhat standalone. Artists currently interact with these systems in a detached manner, inputting prompts and iterating based on outputs. However, I believe we're yet to perfect the interface for seamless artist-AI collaboration. The ideal future isn't about generating a complete image all at once; it's about starting with a foundational idea or sketch and working collaboratively with the AI, iterating and enhancing specific portions.
While many have focused on refining the AI models, there's been less emphasis on the user interface. As AI-art technology becomes mainstream, it's time for artists to take the lead. They should advocate for tools built around an "AI-first" approach, designed to aid collaboration and inspire creativity. Ultimately, this will help artists be more productive and enjoy their craft even more.
Cartagena: Looking back over the eight years since the alignDRAW paper was published, how has the reception of these tools within the computer science community evolved?
Mansimov: Initially, the reception was primarily technical. For the first five years or so, discussions at conferences centered on different generative models and their technical nuances. Concerns about the societal implications, especially in the context of deepfakes, were there but remained peripheral. Interestingly, while many raised alarms about fake images, I found them easy to discern with proper training.
By 2020, the conversations began to shift. With advancements like DALL-E and Stable Diffusion gaining mainstream attention, the narrative broadened from pure technicality to commercial applications and societal impact. These changes have been amplified by improvements in unsupervised learning, GPU capabilities, and neural net technologies.
However, this evolution has its downsides. Many cutting-edge techniques are now proprietary, hidden behind the closed doors of commercial entities. This has stymied the open technical discourse that once characterized the field. Moreover, as AI-art gains prominence, debates about job displacement, societal impacts, and crediting original creators are becoming paramount.
Looking ahead, while technical discussions might take a backseat to these broader concerns, I foresee enhanced AI-human collaboration interfaces, a push for proper crediting mechanisms for online contributors, and a continuous evolution of AI in art.
Cartagena: As a fun final question, could you share your favorite prompt from the paper, and is there an interesting story behind how you came up with it?
Mansimov: Certainly! My absolute favorite prompt was the one featuring a stop sign set against blue skies. The moment I saw the output, it struck me: the model wasn't just memorizing data—it was genuinely generalizing from what it had learned. My personal affinity for the color blue—shaped by my childhood summers spent swimming in the sea—added to the appeal of this prompt. What amazed me further was the model's ability to interpret the stop sign as a small red object against a vast blue backdrop, even when the specific instruction didn't spell it out.
Another prompt I adored was the green school bus. The fact that the model could adapt and understand that a typically yellow school bus could be green was evidence of its versatility. Similarly, the whimsical elephants floating in the sky were another delightful surprise.
But of all the prompts, the one that resonated most with the public was the toilet seated on green grass. Originally conceived with a touch of humor on my part, it unexpectedly became emblematic of the paper. It showcased AI's potential for creativity, its ability to blend disparate concepts, and its challenge to what we traditionally considered uniquely human artistic prowess. Years from now, I believe that when people recall our work, they'll reminisce about that imaginative toilet on green grass—a symbol of technology's unexpected foray into the realms of creativity.