A Conversational System for Interactive Image Editing

Seitaro Shinagawa


Systems with natural language interfaces, such as conversational interfaces, are useful for human users in human-system collaboration tasks. Interactive image editing is one such task, and a promising application for non-expert users. If users want to create an image they have in mind, they can ask the system to create it, just as they would ask a skilled worker. This thesis presents an interactive image editing system based on neural-network image generative models, which proactively communicates with users to create the desired image.

This thesis addresses the following two challenging problems in the interactive image editing task. The first problem is that the system has to handle various editing requests from users in natural language, including requests for slight changes to images. Handling such requests gives our approach an advantage over conventional image retrieval systems, which can only provide images that already exist in a database. Chapter 3 addresses this problem in our editing system. We propose an interactive image editing framework based on neural-network image generative models. The framework trains a model to automatically learn the relationship between changes in images and natural language editing requests. Given a previous (source) image and the user's natural language instruction (editing request), the model directly estimates and generates a new image modified as the user demanded. To evaluate models in this framework, (i) we demonstrate how our editing model works on an artificial dataset that is automatically created from a public handwritten dataset, and (ii) we present a more practical task, avatar face illustration editing, whose instructions are collected from human annotators. These tasks require the system to handle a variety of editing requests. Through these datasets, we identify a difficulty in determining the region of the source image that is mentioned in the editing request. We propose source image masking (SIM) to mitigate this difficulty; SIM explicitly identifies the region mentioned in the editing request. We demonstrate that the system with the SIM architecture (w/ SIM) outperforms the system without it (w/o SIM) on most of the editing requests in the dataset.
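The core idea of combining a generated edit with the source image through an explicit mask can be sketched as follows. This is a minimal illustration of the masking principle, not the thesis's actual architecture: the function and variable names are hypothetical, and in the real system the mask and the generated content would be produced by neural networks conditioned on the source image and the editing request.

```python
import numpy as np

def edit_with_sim(source, generated, mask):
    """Blend generated content into the source via a soft mask.

    The mask (values in [0, 1]) marks the region mentioned in the
    editing request; outside the mask, source pixels are preserved:
        output = mask * generated + (1 - mask) * source
    Names are illustrative, not taken from the thesis.
    """
    return mask * generated + (1.0 - mask) * source

# Toy example: 4x4 grayscale image; the request affects only
# the top-left 2x2 region, so only that region should change.
source = np.zeros((4, 4))
generated = np.ones((4, 4))   # what the generator proposes everywhere
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0            # region identified from the editing request

edited = edit_with_sim(source, generated, mask)
print(edited[:2, :2].sum())   # 4.0 -> masked region takes generated content
print(edited[2:, :].sum())    # 0.0 -> unmasked source pixels are untouched
```

The blend makes the division of labor explicit: the generator only needs to be right inside the masked region, which is why localizing the region mentioned in the request matters.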

The second problem is that the system has to handle uncertainty in the generated images due to the diversity of editing requests. Machine learning models are trained on a limited dataset, while users generally differ in knowledge, skills, and culture. Therefore, miscommunication between the system and users is inevitable; in other words, it is difficult for a single editing model to perform correctly on all editing requests. This forces users to learn how the system behaves through much trial and error, which is burdensome. Conventional image generation and editing systems share this problem. A naïve strategy to solve it is for the system to show the user the images generated by different editing models and confirm which is most relevant to the request every time; however, this strategy makes the interactive process redundant. Chapter 4 addresses this problem in our editing system. We propose a proactive confirmation method that enables the system to confirm with the user only when it is uncertain about selecting the image that best matches the user's editing request. We define an uncertainty score using the entropy of the generated image to decide when the system should confirm. We demonstrate that our method achieves fewer confirmations with better image quality through the dialogues.
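The confirmation decision above can be sketched as a threshold on an entropy-based score. This is an illustrative formulation, not necessarily the exact score used in the thesis: it treats each generated pixel value as a Bernoulli probability (e.g. sigmoid outputs on binarized images) and averages the per-pixel binary entropy; the threshold value is an assumption.

```python
import numpy as np

def uncertainty_score(probs, eps=1e-12):
    """Mean binary entropy of a generated image.

    `probs` holds per-pixel probabilities in (0, 1), e.g. sigmoid
    outputs of a generator on binarized images. Pixels near 0.5
    contribute high entropy, i.e. the model is unsure about them.
    Illustrative definition; the thesis's score may differ.
    """
    p = np.clip(probs, eps, 1.0 - eps)
    entropy = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return float(entropy.mean())

def should_confirm(probs, threshold=0.5):
    """Confirm with the user only when uncertainty exceeds the threshold."""
    return uncertainty_score(probs) > threshold

confident = np.full((8, 8), 0.99)  # pixels near 0 or 1 -> low entropy
unsure = np.full((8, 8), 0.5)      # pixels near 0.5 -> maximal entropy
print(should_confirm(confident))   # False: show the image directly
print(should_confirm(unsure))      # True: ask the user to confirm
```

Gating confirmations on the score is what keeps the dialogue short: the system interrupts the user only on genuinely ambiguous generations instead of after every request.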