AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts

Tongshuang Wu, Michael Terry, Carrie J. Cai

Introduction

Large language models (LLMs) have introduced new possibilities for human-AI collaboration (Bommasani et al., 2021). Pretrained on billions of inputs from the Internet (Hendrycks et al., 2020), generative models like GPT-3 can now perform a wide variety of tasks (Bommasani et al., 2021), ranging from translation (Brown et al., 2020), to question answering (Mishra et al., 2021), and even advanced story writing (Swanson et al., 2021). These successes are enabled by their ability to adapt to desired tasks purely using prompts, or natural language descriptions of the tasks (Reynolds and McDonell, 2021). For example, one could adapt an LLM to act as a translation engine, simply by providing a few examples of the desired inputs and outputs: “English: How are you? French: Comment allez-vous? English: Hello! French:” Based on this prompt, the model is likely to follow the pattern to output the correct French translation: “Bonjour!”

The relative ease of natural-language-based prompt programming suggests that LLMs may be useful assistants for real-world tasks, with users customizing the models to their own needs. In this light, recent work in Natural Language Processing (NLP) has begun to examine the algorithmic capabilities of LLMs, mostly on synthesized tasks (Wang et al., 2021; Floridi and Chiriatti, 2020; Reynolds and McDonell, 2021). However, many real-world tasks can be quite complex (e.g., outlining long essays, debugging software code), and may present challenges for current LLMs to solve from a single model run. For example, as LLMs learn the forms of language (Bender and Koller, 2020), they produce lower-quality outputs when solving tasks that require multi-step reasoning (Tan et al., 2021; Branwen, 2020; Wei et al., 2022). Likewise, they may fail to capture the subtleties of many tasks that involve multiple objectives simultaneously (e.g., identifying and fixing multiple bugs in a code snippet). Figure 1 shows a task involving multiple concurrent objectives: (1) to rewrite peer feedback to be more friendly, (2) to rewrite it with additional concrete suggestions, and (3) to ensure that each noted sub-problem (e.g., too many words on slides, presentation meanders, does not engage with audience) is addressed. While an LLM can both generate suggestions (gpt, [n. d.]a) and adjust the tone in isolation (e.g., in (sty, [n. d.])), it lacks the capability to perform both tasks together well in an end-to-end manner. As a result, it produces a mediocre paragraph that only meets a few requirements (see output of Figure 1A).

Besides being inherently limited for complex problems, LLMs are also difficult to interact and collaborate with, as they can be opaque and hard to debug. Since LLMs can take in any natural language prompts, end users may struggle to determine how to change their prompts to remedy unexpected model outputs. They may also have difficulties developing accurate mental models of an LLM’s capabilities and limitations. There are no obvious edits on the prompt that can, for instance, encourage the model to add more suggestions regarding “too much text on slides” in Figure 1A.

In this work, we introduce the notion of Chaining multiple LLM prompts together, to help users accomplish complex tasks with LLMs in a way that is more transparent and debuggable. Chaining takes advantage of LLMs’ unique ability to handle a variety of independent tasks. In a Chain, a problem is broken down into a number of smaller sub-tasks, each mapped to a distinct step with a corresponding natural language prompt; results of one or more previous steps are aggregated in the next step’s input prompt. Thus, Chaining enables users to run the same model on multiple sub-tasks, thereby granting each sub-task a higher likelihood of success (as opposed to solving the entire task in one go). In Figure 1B, while the underlying LLM remains the same, by splitting (i.e., extracting) presentation problems ( $b_{1}$ ) and ideating suggestions per problem ( $b_{2}$ ), the final composed paragraph ( $b_{3}$ ) is more comprehensive in addressing all problems, and has a more constructive tone.

In addition to potentially improving outcomes, Chaining opens up new channels for fine-grained human feedback and control. For example, thanks to the separate Ideation step in Figure 1 $b_{2}$ , Chaining allows users to customize which suggestions to include in the final paragraph, an operation that is unavailable without Chaining (Figure 1A). We develop an interactive interface to expose these additional “knobs” to end users. The interface visualizes the Chain structure, and allows users to customize a Chain at various levels: they can iterate on the local prompts in each step, edit intermediate data between steps, or modify the entire Chain. To inform the design of this tool, we surveyed 73 existing LLM use cases and summarized them into a set of LLM primitive operations, each with default prompting and data structures. They help inform what types of sub-tasks could be used within a Chain, as well as how those steps can feed into each other.

To evaluate the impact of Chaining on both task performance and user experience, we conducted a within-subject user study, in which 20 participants completed tasks using both Chaining and a standard (non-Chaining) interface, with the same underlying LLM powering all the steps in the Chaining interface, as well as the non-Chaining one. Our results show that Chaining significantly improved key dimensions of the human-AI experience: transparency, controllability, collaboration, and mental support. In addition, participants achieved higher-quality outcomes $\sim$82% of the time using Chaining. We also saw participants leveraging Chaining for purposes beyond immediate task accomplishment: they calibrated their expectations of the model using the smaller scope of sub-tasks, explored alternative prompting strategies by comparing parallel downstream effects, and debugged unexpected model output by isolating and “unit-testing” different parts of a Chain. Critically, these improvements were achieved without changing the model itself. These findings suggest that one way to improve the explainability and debuggability of an otherwise opaque, black-box LLM is to have it do less: breaking a problem up into smaller problems, having the model solve each (smaller) problem separately, showing the intermediate results, and allowing users to edit those results.

The ability to chain LLM calls using a set of Chaining building blocks, within an interactive interface, collectively represents a novel method and system for prototyping new AI-powered tasks and features using LLMs. We conclude the paper with case studies illustrating how Chaining can support more diverse applications in the future, as well as insights into challenges and opportunities that arose from our experiments. In summary, we contribute:

We introduce the notion of LLM Chaining. Through a series of chained model calls, each targeting a small and well-scoped sub-task, we adapt a single LLM to contribute to multiple sub-components of a task.

We design and implement building blocks for constructing and interacting with LLM Chains. These include a set of primitive LLM operations representing functions well-scoped for a single model run, and an interactive interface that displays the intra- and inter-step structures of a Chain. Users can run Chains step-by-step, and customize them at various granularities (editing intermediate model outputs, rewiring steps, etc.).

We report results from a 20-person evaluation that shows Chaining can increase system transparency, controllability, and task outcomes. Importantly, these gains are achieved without any changes to the underlying model. Combined with the case studies, we demonstrate the potential of improving explainability and debuggability of LLMs through task decomposition and finer-grained application of LLM models.

Taken together, our findings inform the design and research of future human-LLM collaborative systems, an area of critical importance in years to come.

Background and Related Work

A generative language model is primarily designed to continue its input with plausible output (e.g., given a prompt “I went to the”, it might auto-complete with “coffee shop”). However, when pre-trained on billions of samples from the Internet, recent transformer-based LLMs (Vaswani et al., 2017) like GPT-3 (Brown et al., 2020) and Jurassic-1 (Lieber et al., 2021) encode enough information to support additional in-context learning: they can be easily customized at run time (without any re-training needed) to handle new tasks beyond text continuation. To invoke the desired functionality, users write natural language instructions, or prompts (Lu et al., 2021; Betz et al., 2021; Liu et al., 2021), that are appropriate for the task. The most common patterns for prompting are either zero-shot or few-shot prompts. Zero-shot prompts directly describe what ought to happen in a task. For example, we can enact English-to-French translation with a prompt such as “Translate the sentence “Do you like the weather?” to French:”. In contrast, few-shot prompts show the LLM what pattern to follow by feeding it examples of desired inputs and outputs: “[English] Hello! [French] Bonjour! [English] Do you like the weather? [French]”. Given either of these prompts, the LLM may respond with the French translation “Vous aimez le temps?” (Jozefowicz et al., 2016). Importantly, such task customization happens on the fly and, as a result, a single LLM can be flexibly adapted to a wide variety of use cases like code generation, question answering, creative writing, etc. (Brown et al., 2020; Swanson et al., 2021). This flexible adaptation, together with the text-in, text-out structure, creates an intuitive natural language interface between humans and the model.
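The two prompting patterns described above differ only in how the prompt string is assembled. The following is a minimal sketch that reuses the translation wording from the text; the helper names `zero_shot_prompt` and `few_shot_prompt` are hypothetical and are not part of any LLM API.

```python
from typing import List, Tuple


def zero_shot_prompt(sentence: str) -> str:
    """Describe the task directly, with no demonstrations."""
    return f'Translate the sentence "{sentence}" to French:'


def few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Show input/output pairs, then leave the final output slot open."""
    shots = " ".join(f"[English] {en} [French] {fr}" for en, fr in examples)
    return f"{shots} [English] {query} [French]"


zero = zero_shot_prompt("Do you like the weather?")
few = few_shot_prompt([("Hello!", "Bonjour!")], "Do you like the weather?")
print(zero)
print(few)
```

Either string would be sent to the same model unchanged; only the prompt differs, which is what allows one LLM to be repurposed at run time.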

Despite their versatility, LLMs require careful prompt design. Various studies therefore focus on prompt engineering (Lu et al., 2021; Betz et al., 2021; Liu et al., 2021). As manual prompting can be sub-optimal, some work automatically mines more effective prompts. However, the mined prompts tend to be less human-readable (Shin et al., 2020) and therefore less compatible with human-AI interaction. Conversely, strategies like progressive generation (i.e., multi-round text expansion) (Tan et al., 2021) and meta-prompting (i.e., asking the model to elaborate on the problem) (Betz et al., 2021; Reynolds and McDonell, 2021) attempt to seed LLMs to generate more effective prompts before solving the task. In essence, these approaches also adopt the spirit of multi-step problem solving, but focus on expanding the context without human intervention. Our work defines Chaining more comprehensively, with primitive operations that illustrate LLM capabilities, LLM steps that can add or remove information along the Chain, and editable intermediate data points.

2.2 Human-AI Collaboration

Human-AI interaction has been explored in domains such as classification (Bansal et al., 2021; Smith-Renner et al., 2020), drawing (Oh et al., 2018; Davis et al., 2016), translation (Green et al., 2014), creative writing (Clark et al., 2018; Gero and Chilton, 2019), and design ideation (Koch et al., 2019). Prior work has noted core challenges of the interaction, such as a lack of transparency, controllability, and user agency (Amershi et al., 2019; Buschek et al., 2021; Huang et al., 2020). Through Chaining, we aim to address these user-centered concerns.

In a collaboration, AI can play various roles, such as casual creators that encourage exploration (Davis et al., 2016) or assistants that compensate for human weaknesses (Levy et al., 2021; Wu et al., 2021). For example, Gero et al. (Gero and Chilton, 2019) showed that generators could serve as cognitive offloading tools so that humans could focus their attention where it is needed most, a core motivation that we also share. Cai et al. (Cai et al., 2019) investigated how a medical AI can assist with doctors’ decision-making process during prostate cancer diagnosis, by helping them compare and contrast similar images. Most of these studies, however, use task-specific models, and therefore limit observations to human interaction with AI that primarily serves one function, or in one domain (e.g., writing, medicine, music, etc.). DuetDraw (Oh et al., 2018) may be an exception to this, as it uses several models, each of which supports a different co-drawing functionality. Rather than training multiple models for different tasks, or using a single model for a single type of task, our work explores how a single large language model (with inherently customizable capabilities) can support humans in a variety of sub-tasks. Finally, the closest work to ours might be online interfaces for users to interactively create prompts (https://gpt3demo.com/apps/openai-gpt-3-playground), or interfaces enabling users to perform natural language programming of code using a large language model (Jiang et al., 2021). These systems used prompt engineering to create a set of programming-related functionality for users. While this prior work focused on single prompts, our work looks at how Chaining multiple prompts can address a much wider range of human tasks, and evaluates its effects on user experience.

2.3 Workflows in Crowdsourcing

Though less prevalent in human-AI collaboration, the concept of Chaining is inspired by concepts of “pipelining” and “microtasking,” which have long been used in crowdsourcing (Cai et al., 2016; Teevan et al., 2016). In crowdsourcing, requesters break down complex tasks into pieces that can be performed independently, then combined (Chilton et al., 2013; Kim et al., 2017; Law and Zhang, 2011; Retelny et al., 2017). Previous research shows that decomposed tasks allow the completion process to become more structured (Chilton et al., 2016) and more resilient to interruptions (Cheng et al., 2015), something we also witness in our user study. The goal of crowd workflows is typically to address and safeguard against the limitations of a typical worker. For example, Bernstein et al. (Bernstein et al., 2010) ensured text editing quality through a Find-Fix-Verify workflow, which modulates the scope of sub-tasks to reduce variance of crowdworker effort. Meanwhile, Context Trees (Verroios and Bernstein, 2014) hierarchically summarize and trim the otherwise overwhelming global contexts, making them compact enough for a single worker to digest.

Our Chaining approach also aims to address pitfalls of a single LLM pass, but the pitfalls are somewhat distinct. While crowdsourcing focuses more on cognitive load and task duration, factors that can affect the performance of human workers (Kulkarni et al., 2011), the limitations of LLMs, which have intensive computing power, tend towards a lack of reasoning abilities, high variance of prompt effectiveness, and exposure bias. A thorough analysis of these AI issues is needed for constructing and chaining LLM steps, which we illustrate in Section 3.1, and address through the design of primitive operations in Table 2. Through user studies (Section 5) and case studies (Section 6), we demonstrate that Chaining can effectively address these issues. Finally, our work also shares challenges found in crowdsourcing workflows, such as handling cascading errors that affect later stages (Kittur et al., 2011) and staged crash-and-rerun (Little et al., 2010), all of which we take into consideration in the design of the Chaining structure. Beyond this, we advance the field by examining how core features of Chaining (e.g., cascading effects, parallel paths) are used not only to accomplish tasks, but also to aid in increasing the transparency and debuggability of AI.

Chaining LLMs

Despite the impressive capabilities of LLMs, there may be contexts in which LLM performance would suffer, such as if the data is formatted sub-optimally, if there is extraneous data in the input, if the task inherently demands solving multiple sub-parts, or if the user is asking the model to perform several tasks at once. Meanwhile, LLMs may perform highly targeted tasks well. By narrowing the scope and context of an LLM operation, for example, LLMs may themselves be useful for addressing some of their own challenges (e.g., removing extraneous data, splitting problems into sub-parts, etc.). Thus, we hypothesize that decomposing a problem into smaller, highly targeted tasks is likely to increase model performance on those sub-tasks, and by extension, the overarching task.

We define Chaining as the process of breaking up complex tasks into smaller steps, where each step can be completed by an independent run of an LLM, and where the output of one or more steps is used as input for the next. To identify tasks that are most likely to benefit from Chaining, we first surveyed existing language modeling literature, and summarized common challenges LLMs face. As described in Section 3.1, these challenges are caused by the underlying modeling structure shared by the mainstream LLMs, including but not limited to GPT-3, Jurassic-1, and the internal LLM used in Sections 5 and 6. Then, to identify promising sub-tasks that could be used as building blocks, we surveyed existing online demos of LLMs, and curated a list of primitive LLM operations, which may help overcome those challenges by scoping the inputs/outputs to be more amenable to what an LLM can handle.
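The definition of Chaining given here can be sketched in a few lines. The following is a minimal illustration and not the authors' implementation: `LLM` stands for any text-in, text-out model call, `fake_llm` is a hypothetical offline stand-in for a real model, and the prompt wording is invented for the example. The step names mirror the peer-review Chain in Figure 1B.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumption: any text-in, text-out model call


def split_points(llm: LLM, feedback: str) -> List[str]:
    """Step 1: one model run that extracts the separate problems."""
    out = llm(f"Split the feedback into separate problems.\nFeedback: {feedback}\nProblems:")
    return [line.strip() for line in out.splitlines() if line.strip()]


def ideate(llm: LLM, problem: str) -> str:
    """Step 2: one independent model run per problem."""
    return llm(f"Problem: {problem}\nSuggestion:").strip()


def compose_points(llm: LLM, problems: List[str], suggestions: List[str]) -> str:
    """Step 3: one model run that aggregates the earlier outputs."""
    notes = "\n".join(f"Problem: {p}\nSuggestion: {s}" for p, s in zip(problems, suggestions))
    return llm(f"{notes}\nFriendly paragraph:").strip()


def run_chain(llm: LLM, feedback: str) -> str:
    problems = split_points(llm, feedback)
    suggestions = [ideate(llm, p) for p in problems]
    return compose_points(llm, problems, suggestions)


calls: List[str] = []  # every prompt the stand-in model receives


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in so the sketch runs without a real model."""
    calls.append(prompt)
    if prompt.startswith("Split"):
        return "too much text\nno clear structure"
    if prompt.endswith("Suggestion:"):
        problem = prompt.split("Problem: ")[1].split("\n")[0]
        return f"address '{problem}'"
    return f"paragraph covering {prompt.count('Suggestion:')} suggestions"


result = run_chain(fake_llm, "The slides had too much text and no clear structure.")
print(result, len(calls))
```

Each call to `llm` sees only the narrow prompt for its own step, so the ideation for one problem cannot be influenced by the text of another.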

Existing literature exposes three main challenges that LLMs face:

C.1 LLMs lack multi-step reasoning capabilities. Because LLMs are designed to grasp the form of language, rather than the meaning (Bender and Koller, 2020), they can struggle on tasks like sequential arithmetic problems, multi-hop question answering, recognizing and comparing sentences, or those that require branching logic (Betz et al., 2021; Floridi and Chiriatti, 2020; Wang et al., 2021; Branwen, 2020; Wei et al., 2022).

C.2 LLMs suffer from exposure bias (Tan et al., 2021; Ranzato et al., 2016). Because LLMs generate text sequentially in an autoregressive manner (the tokens generated by the models are themselves used to predict the next word), errors or imperfections from previous runs can accumulate. Thus, LLMs are less likely to perform well when generating long bodies of text. Exposure bias can also cause LLMs to produce redundant content, in some severe cases repeating the same phrase over and over again (Holtzman et al., 2020; Welleck et al., 2020). As a result, they struggle to generate text with diverse themes or arguments (e.g., suggestions for all three problems in the peer review example in Figure 1).

C.3 LLMs are sensitive to input prompts. They tend to favor certain prompt formats, paraphrases (O’Connor and Andreas, 2021; Lu et al., 2021), or even certain information in the input. For example, prompts that are unnatural relative to the typical text distribution tend to be less efficient (Branwen, 2020), while nouns and verbs are more important than adjectives and function words (O’Connor and Andreas, 2021).
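The autoregressive generation behind exposure bias can be written out explicitly. The formulation below is the standard one for autoregressive language models rather than notation taken from this paper; the symbols $x_{t}$, $\hat{x}_{t}$, and $p_{\theta}$ are introduced here for illustration. A model with parameters $\theta$ factorizes the probability of a token sequence $x_{1}, \dots, x_{T}$ as

$$p_{\theta}(x_{1:T}) = \prod_{t=1}^{T} p_{\theta}(x_{t} \mid x_{<t}).$$

During training, each factor is conditioned on the ground-truth prefix $x_{<t}$. During generation, it is instead conditioned on the model's own earlier samples $\hat{x}_{<t}$, so an early imperfection changes the conditioning context of every later token. Shorter generations leave fewer steps over which such deviations can compound, which is one motivation for querying small chunks of content at a time.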

These challenges tend to stem from tasks being too broad. Yet, as discussed above, LLMs may be able to perform certain tasks well if they are highly targeted, with narrower contexts. Hence, with these challenges in mind, we reviewed 73 existing demos based on an extensive search of official LLM websites, social media, and published case studies (these are enumerated in Table 2, Appendix A) to identify promising LLM capabilities that may help scope the inputs/outputs, culminating in a set of primitive operations. Note that the operations we identified may not be exhaustive, but rather represent an interesting range for study, with a variety of operations addressing each LLM challenge. Pilot studies — as well as use cases we present later — suggested these were a reasonable set to pursue. Full details of our methodology can be found in Appendix A.

Table 1 shows how the derived operations fall into three categories and can address the aforementioned challenges. First, as LLMs may have difficulty applying common sense reasoning or complex inference to nuanced problems (C.1), the Classification operation can act as a validation check or triage, before more steps are carried out (Table 1a). For example, a chatbot may need to first classify the type of question a user is asking before providing adequate responses. Second, to alleviate exposure bias (C.2, the inability to generate long and diverse text), some operations can be used to query small chunks of new content (Table 1b), so as to gradually build up the generation diversity and length. Three ways to get new content include querying facts, generating hallucinations, and ideating lists of items. For example, in the peer review rewriting scenario (Figure 1B), the separate Ideation step per problem prevents suggestions for one criticism from being influenced by the other two criticisms. Finally, because LLMs may struggle with certain input prompt types, reorganizing the prompt could be helpful when its original form is convoluted. Rewriting and Compose points transform input into more parsable forms, Information Extraction elicits concise information (C.3), and Split points splits text into smaller and more manageable units (C.1)—all are summarized in Table 1c. As we will see in a case study (Section 6.1), translating JSON-formatted specifications to natural language descriptions helps LLMs parse the embedded information.

Chaining and its operations also have some parallels to crowdsourcing workflows. However, whereas sub-tasks in crowdsourcing are assumed to be feasible for a human worker (reviewed in Section 2.3), LLMs are more restricted in terms of tasks they can perform reliably, and thus the primitive operations presented are more scoped and granular. For example, Kittur et al. (Kittur et al., 2011)’s Partition-Map-Reduce workflow uses Split and Compose Points operations (in Figure 1B), but does not indicate specifically how to transform the text (Ideation), though it also targets collaborative writing.

3.2 Designing Operations for LLM Chaining

An LLM Chain consists of multiple steps. Each step is defined by an LLM operation, which takes in input data and produces output data (which we call data layers). For example, the Split points operation in Figure 1 takes in the $\bullet$ initial feedback for Alex as input, and produces a list of $\bullet$ presentation problems (“too much text”, “no clear structure”, etc.) as output. LLM Chains are constructed by connecting these steps through shared data layers. In the same example above, the Ideation operation comes after the Split points operation, taking a (previously generated) $\bullet$ problem as input and producing $\bullet$ suggestions for improvements as output.
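One way to picture steps connected through shared data layers is as records that name the layers they read and write. The sketch below is illustrative only: the `Step` class, the `connections` helper, and the short layer names are assumptions made for this example, while the operation names and their wiring follow the peer-review Chain.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    """One LLM step: an operation plus the data layers it reads and writes."""
    operation: str
    inputs: Tuple[str, ...]
    output: str


# Hypothetical encoding of the peer-review Chain.
CHAIN: List[Step] = [
    Step("Split points", ("feedback",), "problems"),
    Step("Ideation", ("problems",), "suggestions"),
    Step("Compose points", ("problems", "suggestions"), "paragraph"),
]


def connections(chain: List[Step]) -> List[Tuple[str, str, str]]:
    """Return (producer, layer, consumer) triples for every shared data layer."""
    producer: Dict[str, str] = {step.output: step.operation for step in chain}
    return [
        (producer[layer], layer, step.operation)
        for step in chain
        for layer in step.inputs
        if layer in producer
    ]


edges = connections(CHAIN)
for edge in edges:
    print(edge)
```

Two steps are linked exactly when one step's output layer appears among another step's inputs; the initial feedback has no producer, so it is treated as the Chain's external input.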

Each step of an LLM Chain (an operation and its data layers) is accomplished through a natural language prompt. While prompts are task-dependent, they can have some task-agnostic properties. For example, the prompt for the Classification operation would likely contain the verb “classify”, regardless of what is being classified. These keywords help set an LLM operation’s scope and expectations (O’Connor and Andreas, 2021). We aim to abstract these task-agnostic properties into default parameters for each operation (Figure 2A), so as to provide consistent starting points for interacting with LLM Chains across use cases. Using the Ideation operation as an example, we show how we design these parameters to satisfy the following three requirements for chaining, and how they help to build the Ideation prompt shown in Table 1 and Figure 2B.

Operations need to invoke the desired functionalities, through prompt design. To date, the most common patterns for prompting are either zero-shot or few-shot prompts, depending on how many demonstrating examples are provided in the prompt (Brown et al., 2020). Zero-shot prompts directly describe what ought to happen in a task: e.g., we can enact Ideation with a task description prompt “Given Alex’s presentation problem, the following is a list of suggestions.” In contrast, few-shot prompts show the LLM what pattern to follow by feeding it examples of the desired input and output data: “Problem: mumbles when presenting, Suggestion: enunciate each syllable, Problem: too much text, Suggestion:” (full prompt in Figure 2B). Given these prompts, the LLM might produce a reasonable suggestion, e.g., “use more graphics on the slides.” Zero-shot prompts can also be easily transformed into few-shot prompts, by appending examples to the initial zero-shot task description. In either case, prompts commonly include meaningful names as prefixes (“Problem:” and “Suggestion:”) to demarcate structure, which helps re-emphasize the desired intent (Vaswani et al., 2017). Following this convention, we build our prompts to include task descriptions followed by prefixes. Aside from the prompt itself, we also associate with each LLM operation a default temperature setting: a model parameter that influences the randomness of the LLM generation. For instance, creative operations like Ideation benefit from a higher temperature ( $t = 0.7$ ) than more factual or deterministic tasks like Classification ( $t = 0.0$ ) (gpt, [n. d.]b).

Operations should be able to take custom data layers as inputs and outputs. Though our walkthrough example takes in “Alex’s presentation problem” and generates “Suggestions”, in theory an operation should be able to handle any custom data layers. We thus create prompt templates to support a wide range of scenarios, with placeholders for input and output data. The template allows us to build LLM steps simply by filling in the placeholders with definitions on data layers, as demonstrated in Figure 2. In particular, we include key verbs and nouns (O’Connor and Andreas, 2021) in the template, to best reflect the operation objective (e.g., “a list of” for Ideation, “classify” for Classification). The template also accepts optional few-shot examples. We can build the few-shot prompt in Figure 2B if we provide those pairs of problems and suggestions, or default to just the zero-shot version in Table 1 when examples are not readily available. Though we provide this as one example of a prompt template, we do not claim it to be exhaustive as there may be other equally effective ones.

Operations should handle parsing of the expected input/output data types. Different data layers may take on different data types. For example, the Split step (Figure 1 $b_{1}$ ) produces a list of problems, but only a single problem is the input to each subsequent Ideation step ( $b_{2}$ ). To handle different formats in different steps, in each operation’s definition, we define the required data types per operation (e.g., “list” in Figure 2 for Ideation), along with the corresponding parsing necessary to produce the expected data type (e.g., split each row of the numbered list into an item).
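The three requirements above (a prompt that invokes the functionality, placeholders for custom data layers, and output parsing) can be bundled into one record per operation. The following is a hedged sketch, not the paper's implementation: `DataLayer`, `Operation`, and `parse_numbered_list` are hypothetical names, and the template wording is adapted from the Ideation task description quoted in the text. Only the temperature value for Ideation is taken from the source.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


def parse_numbered_list(text: str) -> List[str]:
    """Split each row of a numbered list ("1. foo" or "2) bar") into one item."""
    items = []
    for row in text.splitlines():
        item = re.sub(r"^\s*\d+[.)]\s*", "", row).strip()
        if item:
            items.append(item)
    return items


@dataclass(frozen=True)
class DataLayer:
    prefix: str       # shown before the colon, e.g. "Problem"
    description: str  # used in the task description


@dataclass(frozen=True)
class Operation:
    """Hypothetical container for one operation's task-agnostic defaults."""
    name: str
    template: str       # has {input} and {output} placeholders
    temperature: float  # default sampling temperature
    parse: Callable[[str], List[str]]

    def prompt(self, source: DataLayer, target: DataLayer, value: str,
               examples: Sequence[Tuple[str, str]] = ()) -> str:
        """Fill the template; few-shot examples are optional."""
        lines = [self.template.format(input=source.description, output=target.description)]
        for example_in, example_out in examples:
            lines.append(f"{source.prefix}: {example_in}")
            lines.append(f"{target.prefix}: {example_out}")
        lines.append(f"{source.prefix}: {value}")
        lines.append(f"{target.prefix}:")
        return "\n".join(lines)


IDEATION = Operation(
    name="Ideation",
    template="Given {input}, the following is a list of {output}.",
    temperature=0.7,  # value the text reports for creative operations
    parse=parse_numbered_list,
)

problem = DataLayer("Problem", "Alex's presentation problem")
suggestion = DataLayer("Suggestion", "suggestions")

zero_shot = IDEATION.prompt(problem, suggestion, "too much text")
few_shot = IDEATION.prompt(
    problem, suggestion, "too much text",
    examples=[("mumbles when presenting", "enunciate each syllable")],
)
parsed = IDEATION.parse("1. use more graphics\n2) cut the text in half\n")
print(few_shot)
print(parsed)
```

Keeping the parser next to the template means a downstream step can rely on receiving a list of items rather than raw model text.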

Empirically, we find these defaults to work reasonably well across domains (see Sections 5 and 6). Still, we note that our defaults here are just one example of possible operation implementations; in our review of existing demos, there appeared to be many diverse prompting strategies even for the same task. We hope the prompt templates provided here may serve as a starting point for Chain designers or users to modify. In the next section, we demonstrate how these designs serve as the underlying data structure for interactive Chain execution by end-users.

Interactive User Interface

We designed an interface that helps users execute and customize LLM Chains interactively.

Over the course of several weeks, we designed and iterated on the prototype with feedback from four pilot users (software engineers and designers who have experience designing LLM prompts), producing three design rationales for the final interface.

R.1 Visually reflect the underlying Chaining structure. In early prototypes, we explained the Chain structure using a static slide deck that highlighted the data produced at each step (e.g., problems, suggestions for improvement, and final paragraph in Figure 1). In reaction, users expressed a desire to understand the operations taken at each step to arrive at these data layers (split points, ideation, compose points), and wanted to visually track progress through the Chain. To achieve this, we designed the interface to reflect not only the data layers, but also the LLM details within each step.

R.2 Provide controls at different granularities. Pilot users favored flexible controls. We observed users frequently making local fixes on intermediate data points that flow between LLM steps, and therefore designed the UI to allow in-place editing, without explicitly requiring a switch to editing mode. Some users also voiced an interest in iterating on alternative Chaining structures (“Can I change this step with…”). We therefore conclude that the interface should support modification of LLM Chains both locally (e.g., changing one task description or intermediate model output) and globally (e.g., changing how the steps are connected). Because global changes have more impactful consequences (they may overwrite the underlying Chain structure), we designed the UI to require a switch to editing mode for this type of change.

R.3 The structured controls should still reflect the natural language interaction supported by LLMs. In an early prototype, we formatted the data as structured tables with each data layer being a column, but received feedback from two users that making text edits in cells felt unnatural as they lost the sense of interacting with the model through natural language. To retain a natural interaction experience, we keep these structures as in-line text fields.

4.2 Interface Design and Implementation

Following the design rationales above, we designed the interface in Figure 3, which consists of two primary views: the Chain view (Figure 3A) and the Step view (Figure 3B/C).

The Chain view (Figure 3A) depicts the high level Chaining structure through a flow chart. It contains three primary visual cues that closely reflect the underlying design (R.1) described in Section 3.2. First, we use grey glyphs to represent LLM operations, with shapes indicating 1-1 (rectangle, for operations like Rewriting in Table 1), 1-N (trapezoid, e.g., Ideation operation), and N-1 data mappings (inverted trapezoid, e.g., Compose points operation). Clicking on these glyphs allows users to choose which step to zoom into (highlighted in pink), and the Step view would change in response. Then, we use rectangles with colored stripes to represent data layers. Users can preview their data entries through white rows (e.g., Figure 3 $a_{1}$ and $a_{2}$ ), which are updated after each LLM execution, and thus track Chain execution progress. Finally, we link these elements with dotted-line arrows to highlight which data output serves as the input to which step, and use the number of arrows going out of an operation to re-emphasize the data mappings (e.g., multiple $\bullet$ problems coming out from Split points, which is approximated with three lines, and a single $\bullet$ paragraph out of Compose points).

On the right, the Step view (Figure 3B) allows users to explore each LLM step by interacting with inputs, outputs, and the underlying prompt structure. It is divided into an instruction block and several running blocks to handle parallel paths. Each of these parallel paths translates to a different LLM invocation; their prompt strings share some common parts, while other parts are distinct. We use the running blocks to hold the unique parts, and the instruction block to hold the shared sub-string, which is prepended to every running block to form the full prompt. For example, Figure 3 $b_{2}$ is the final prompt for the step that generates suggestions for the problem “too much text.” It starts with the content from the instruction block ( $b_{1}$ ), and merges the text in the running block thereafter, ignoring the other parallel running blocks.
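The way an instruction block and running blocks combine into per-path prompts can be illustrated with a short sketch. The function name `full_prompts` and the example strings are assumptions made for illustration; the interface's actual string handling is not specified in the text.

```python
from typing import List


def full_prompts(instruction_block: str, running_blocks: List[str]) -> List[str]:
    """Prepend the shared instruction block to each running block independently."""
    return [f"{instruction_block}\n{block}" for block in running_blocks]


instruction = "Given Alex's presentation problem, the following is a list of suggestions."
running = [
    "Problem: too much text\nSuggestion:",
    "Problem: no clear structure\nSuggestion:",
]

prompts = full_prompts(instruction, running)
for prompt in prompts:
    print(prompt, end="\n\n")
```

Each resulting prompt corresponds to one LLM invocation, and no prompt contains text from a sibling running block.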

Every running block visually resembles a textarea with a number of editable text fields. It shows the prefix fields before colons (e.g., $\bullet$ Short suggestions for improvement, $c_{1}$ ) in the same color as the data layer rectangles, which helps users distinguish between data layers. It also includes text fields ( $b_{4}$ , $c_{2}$ ) for the model output for that step. The number of text fields (e.g., 1 vs. N) is consistent with the data types defined for the primitive operation for that step.

This view also handles the per-step execution. Users can click the small “run” button to execute each running block individually. Alternatively, users can use the Play button on the top to run all the parallel blocks at once and compare their results. To improve natural language interaction transparency (R.3), running a block also triggers a preview of the final prompt text ( $b_{2}$ ). The output is then parsed and added to the corresponding field ( $b_{4}$ , $c_{2}$ ) for users to further iterate on.

Interactions and controls. Notably, there are three levels of control available with this interface (R.2), from local customization of prompts to global modification of the LLM Chain structure, each with clear cues on its impact. First, users can customize the prompt for a particular step, e.g., by changing its task descriptions. Since the customization only applies to the current step, all other views remain unchanged. Second, users can customize the model output for that step by adding, deleting, or editing content (e.g., editing “read outlines” to emphasize main points in $b_{4}$ ), or rename data layers (e.g., rephrasing “Alex’s presentation problems” as “Criticisms of Alex” in $a_{1}$ ). These changes impact both the current step in focus as well as other steps involving the shared data layers (e.g., Compose Points takes in both the “problems” and the “suggestion” layer), and thus they can be changed either in the colored rectangles in the Chain view, or through text fields in the Step view. Finally, users can more aggressively modify the Chaining structure itself by adding, removing and rewiring operations or data layers in the Chain view through intuitive visual programming (R.3). The change would then cause the entire Chain to re-render, with the defaults (e.g., temperature, instructions) refreshed.

User Study

To understand how Chaining affects the user experience of accomplishing tasks with LLMs, we conducted a within-subject user study comparing Chaining with a state-of-the-art baseline interface, on two user tasks.

Underlying LLM. All of our experiments (including our baseline interface introduced below) and each step of the Chaining interface rely on exactly the same underlying LLM: LaMDA (Thoppilan et al., 2022), a 137 billion parameter, general-purpose language model (we used a non-dialog version of the model). This model is roughly equivalent to the GPT-3 model in terms of size and capability: it is trained with more than 1.5T words of text data, in an auto-regressive manner using a decoder-only Transformer structure which is useful for text generation. It has comparable performance to GPT-3 on a variety of tasks, and behaves similarly in its ability to follow prompts. Note that we only use this model to represent the recent class of LLMs; essentially, the chaining interface is model agnostic, and is compatible with any LLM that has in-context learning capability.

Systems. We compared Chaining with Sandbox, an interface that looks aesthetically similar to the Chaining interface, but without the Chaining functionality. We based the Sandbox interaction on the GPT-3 playground (https://gpt3demo.com/apps/openai-gpt-3-playground), the standard online interface for LLMs. It presents a single textbox with a run button, which allows the user to enter the text prompt, run the model on that prompt, and then view the model result in the same textbox, with the ability to edit that result and then continue to iterate. Like the Chaining interface, the Sandbox also allows users to adjust the temperature setting through a knob.

Tasks. We conducted the study using two tasks: peer review writing, and personalized flashcard creation, as they reflect different types of challenges (as explained below), and are both commonly used in user-centered task scenarios (Cai et al., 2014; Edge et al., 2011; Cambre et al., 2018). In the peer review writing task (“Review,” our walk-through scenario), the user is given a paragraph (the same as in Figure 1) outlining three different problems in an imaginary person’s presentation style, and their task is to write a friendly paragraph with 1-3 suggestions for each problem. In flashcard creation (“Flashcard”), participants were asked to create at least ten English-French sentence pairs they could use while traveling in Paris, and to make them as diverse as possible while being personalized to their own travel goals.

Though both tasks are possible when using an LLM without any LLM Chains, they present different types of challenges which could potentially be improved through Chaining. The Review task implicitly involves multi-step reasoning (Challenge C.1 in Section 3): to create a thorough and constructive review, one needs to identify each problem, provide suggestions per problem, and compose all the suggestions into one paragraph. The Flashcard task, on the other hand, exposes the challenge of having sufficient diversity in light of LLM exposure bias (C.2). In the Chaining condition, we built a default Chain for each task. The Chain for Review in Figure 1 reflects the three aforementioned steps (as explained before); the Chain for Flashcard (see Figure 4) sources additional content from the LLM like $\bullet$ types of interactions in a trip, which can help the user diversify the flashcards.

Study procedure. Before the study, participants completed a 30-minute tutorial that summarized the concept of LLMs and demonstrated how both Sandbox and Chaining work (we took inspiration from OpenAI: https://beta.openai.com/docs/introduction/prompt-design-101; the task used for the tutorial is in Appendix B.3). They were told upfront that both systems rely on the same underlying LLM. Then, in an hour-long study, participants performed a randomly selected task (Flashcard or Review), once with each interface (Sandbox and Chaining), whose orders were counterbalanced. We first briefed participants on the task, and then asked them to accomplish it with the LLM’s help in each interface until they were satisfied with the final results, or until they reached 25 minutes. Since LLM Chains came with automatically generated prompts (by filling in the templates), we similarly offered several default prompts for Sandbox that we knew to work reasonably, so that both interfaces had a fair starting point for prompt engineering (detailed in Appendix B). We encouraged participants to think aloud and describe their actions as they completed the task.

In the Chaining condition, participants were asked to first stick to the default Chain so that we could make consistent observations across participants in terms of how they use Chains. In the process, they could modify any other aspect (e.g., the prompt, the intermediate model outputs, etc.). At the end, we gave participants the option to modify the default Chain, so that we could observe how they would expect the LLM to assist them beyond the default design. Finally, participants completed an exit survey and a semi-structured interview. They rated their experience using each interface along various dimensions. These dimensions were chosen to reflect the effectiveness of the human-AI collaboration (e.g., support for their thought process, quality of the final result), and core user-centered challenges in human-AI systems (Amershi et al., 2019; Buschek et al., 2021; Huang et al., 2020) (e.g., transparency, controllability, and sense of collaboration). They also verbally compared their impressions of the two interfaces, and envisioned possible use cases for them.

Collected data. We collected and analyzed three sets of data. First, to assess participants’ self-perceived experience, we used a standard seven-point Likert Scale (Likert, 1932) to collect all ratings from the exit survey, with one being “Strongly disagree” and seven being “Strongly agree” with the statement in question (e.g., for system Transparency: “The system is transparent about how it arrives at its final result”). Detailed survey questions are listed in Appendix B.1.

We also observed and recorded their entire task completion sessions, and later transcribed their comments and experience for qualitative analysis. Second, to quantify their interaction mechanisms and behaviors, we logged their interactions with the two interfaces. We were particularly interested in how participants reacted and iterated on model outputs, so we sorted their interactions with text fields by: (1) whether participants mainly relied on running the model again to get a different result (Consecutive run), or if they also edited the prompt in between (Edited); and (2) when they edited the prompt, how dependent it was on the existing model generation: whether they closely CURATED and refined the model outputs, loosely interacted around them by CREATING completely new content, or tried again by UNDOING the outputs. The detailed categorization criteria are in Appendix B.2. Third, to assess the task outcome, we logged the final reviews and flashcards participants created. Blinded to the condition, two non-participants performed anonymous, paired comparisons on results from each participant in Sandbox and Chaining, choosing the result that satisfied the task goals the best.
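To make the first distinction concrete, the sketch below computes the share of consecutive runs from a simplified event log. The log format (a list of "run" and "edit" strings) and the choice of denominator (all runs after the first) are assumptions made for illustration; the actual criteria are those in Appendix B.2 and are not reproduced here.

```python
def consecutive_run_ratio(events: list[str]) -> float:
    """Fraction of runs (after the first) made with no edit since the previous run."""
    runs_after_first = 0
    consecutive = 0
    seen_run = False
    edited_since_run = False
    for event in events:
        if event == "run":
            if seen_run:
                runs_after_first += 1
                if not edited_since_run:
                    consecutive += 1  # re-ran the model without changing anything
            seen_run = True
            edited_since_run = False
        elif event == "edit":
            edited_since_run = True
    return consecutive / runs_after_first if runs_after_first else 0.0


# Hypothetical log: two of the four follow-up runs had no edit in between.
log = ["run", "run", "edit", "run", "run", "edit", "edit", "run"]
ratio = consecutive_run_ratio(log)
print(f"Consecutive runs: {ratio:.0%}")
```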

Participants. We recruited 20 participants using email lists that reach a wide range of practitioners (e.g., UX designers, linguists, data analysts) at a large software company. Eight participants were 26-35 years old, eight aged 36-45, two aged 46-55, one 56-65, and one 18–26. As there is an initial learning curve associated with LLM capability, we required that participants had at least seen an LLM example before. Among those we recruited, half of the participants had no prompting experience but had seen online demos powered by LLM models, whereas the other half had some basic experience using default text prompts. Further, as the goal of Chaining is to use LLMs to assist with human tasks, we sought to recruit potential users of ML/LLM who would benefit from interacting with the models, rather than ML model experts or creators. Thus, our participants included technically knowledgeable but non-ML software engineers, linguists, UX designers, and data analysts who worked in a wide range of domains (e.g., health, privacy, cloud storage, etc.). Each participant spent approximately 90 minutes total in our study, and received a $40 gift certificate for their time.

2. Quantitative Results: Increased Transparency & Control, and Higher-quality Task Outcome

All the participants were able to complete the tasks in both systems within the given time: they spent $12.4\pm 4.0$ minutes in Sandbox, and $14.6\pm 5.4$ in Chaining. Student’s t-test did not show any significant difference between their completion time ( $t=-1.1,p=.278$ ). In analyzing subjective ratings from participants, the logged clickstreams, as well as the final generated results, we found:

First, Chaining led to improved user experience in human-AI interactions. We performed the non-parametric Wilcoxon signed-rank test to compare users’ ordinal Likert Scale ratings and, as shown in Figure 5, participants felt that Chaining helped them think through the task better (Chaining $6.0\pm 1.4$ vs. Sandbox $3.6\pm 1.3$ , $z=0,p<.001$ ), and gave them more control ( $6.2\pm 0.9$ vs. $4.5\pm 1.3$ , $z=3.0,p<.001$ ). They also rated Chaining as being more collaborative ( $5.7\pm 1.3$ vs. $4.6\pm 1.6$ , $z=25,p=.04$ ) and transparent ( $5.4\pm 1.3$ vs. $3.8\pm 1.8$ , $z=9.0,p=.002$ ).
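For readers unfamiliar with the paired test used here, the sketch below computes the signed-rank statistic (the smaller of the positive and negative rank sums) in plain Python. The ratings are made up for illustration and are not study data, the function name is hypothetical, and the sketch stops short of a p-value, which requires a reference distribution.

```python
def signed_rank_statistic(a: list[float], b: list[float]) -> float:
    """Smaller of the positive and negative rank sums for paired samples.

    Zero differences are dropped and tied absolute differences share
    their average rank, a common convention for this test.
    """
    diffs = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)


# Hypothetical seven-point ratings from six participants (not study data).
chaining_ratings = [6, 7, 5, 6, 7, 4]
sandbox_ratings = [4, 5, 5, 3, 6, 5]
w = signed_rank_statistic(chaining_ratings, sandbox_ratings)
print(w)
```

A statistic near zero means that almost all non-zero differences point the same way, which is the pattern one would expect when one interface is rated consistently higher.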

Second, Chaining shifted the types of edits participants made while interacting with the LLM. In Chaining, participants were more likely to make manual interventions, whereas in Sandbox, they often re-ran the model (without changing the prompt), akin to “rolling the dice again” in an attempt to get better output. As shown in Figure 6A, this tendency to perform consecutive runs without altering anything from the previous run occurred $51\%$ of the time on average in Sandbox and $36\%$ in Chaining. Student’s t-test shows the difference is significant: $t=3.5,p=.001$ (the clickstream ratios fall into the continuous range of 0%–100%, and follow a normal distribution according to a D’Agostino-Pearson test, e.g., $p=0.58$ for the ratio of consecutive runs).

The manual edits made were also finer-grained in Chaining than in Sandbox (Figure 6B). In Sandbox, people largely focused on either completely UNDOING output and rerunning the model (45% of the time on average), or manually CREATING their own content as input to the model (14%). They only CURATED or modified existing text 41% of the time. On the other hand, in Chaining people performed CURATION 77% of the time, only doing UNDO and CREATE 18% and 5% of the time, respectively. The shift to CURATION is significant, according to Student’s t-test ( $t=-6.75,p<.001$ ).

As a result, Chaining led to higher-quality generations that met the task goal. The two independent raters consistently preferred Chaining results $85\%$ and $80\%$ of the time, respectively. The results also matched participants’ own judgements in Figure 5 (see Match goal): they preferred their own final results from Chaining ( $6.0\pm 0.9$ ) to the Sandbox results ( $5.0\pm 1.1$ , Wilcoxon signed-rank test, $z=11.0,p=.002$ ).

Aside from using Chaining, many participants were also able to iterate on and customize the underlying Chaining structure. While five of them preferred the default Chains provided and didn’t want to change them, the remaining 15 people were able to identify parts they found lacking and suggested at least one change. 11 of them successfully implemented and executed one of their own solutions.

3. Qualitative results: Chaining as Guardrails and Operation Manuals

Through analyses of the transcribed think-aloud comments and semi-structured interviews, we further unpack the reasons behind the quantitative differences. Since we asked participants to explain their Likert Scale ratings, their interview responses naturally map to dimensions in Figure 5 like transparency, collaboration, etc. One author further sorted their think-aloud comments into the categories. Three researchers then conducted thematic analysis, examining relationships between categories and iteratively converging on a set of higher-level themes. In general, Chaining helped support human-LLM interaction by serving as (1) a guardrail that helped users stay on track towards the task goal (Section 5.3.2 and 5.3.5); and (2) an “operation manual” that implicitly explained how to use LLMs for less obvious objectives, and that provided channels for users to intervene (Section 5.3.1, 5.3.3 and 5.3.4). In the following sections, we present key themes on how Chaining improved the human-AI experience, as well as some additional challenges brought on by Chaining.

In Sandbox, participants tended to use the LLM for a single purpose, under-utilizing the model’s full potential in supporting various kinds of tasks. Four out of ten people in the Flashcard task only used the model as a translator in Sandbox, even though they were provided with default prompts that demonstrated how to generate English sentences using the model. In the Review task, even though nearly everyone (nine out of ten) used a two-step process of generating suggestions prior to merging them into the full paragraph (see the two-step prompt template in Appendix B.5), three people only relied on the LLM to generate suggestions, and then manually merged them into the paragraph themselves, without LLM input.

There may be two reasons for these behaviors. First, Sandbox naturally affords single-operation interactions. Given this, it is not surprising that users would gravitate toward using the model only for a part of the task that seemed most likely to yield promising results given the status-quo applications of machine learning (e.g., translation), overlooking others that may seem less likely to succeed (e.g., merging text into a paragraph). Indeed, some participants were unaware of less obvious sub-tasks (P4: “this is just a simple translation task” in Flashcard). Second, the friction of juggling multiple sub-tasks in Sandbox deterred some users from doing so. Even participants who became aware of the Chaining structure (from getting the Chaining condition first in their study condition order) struggled to replicate it using a single prompt. For example, P2 attempted to tackle both sub-tasks (generating diverse English sentences, and translating to French) simultaneously with a single prompt instruction: “Given the previous English sentence, translate it to French. Generate further English sentences relevant to travel in Paris.” However, because the instruction was too nuanced for the model to follow, they eventually resorted to manually creating their own English sentences.

Ultimately, this inability to fully utilize the model led to lower quality final results in Sandbox. For example, the flashcards had less topical diversity (P4: “I had limited diversity myself”) because the Ideation step in Figure 4A was rarely ever leveraged. As a byproduct of the inadequate support, participants also found collaboration in Sandbox to be shallow (P5: “I’m doing all the specific work [creating English sentences] and it’s just doing its one thing [translation]”). In contrast, Chaining allowed users to leverage the model in multiple ways. Seven participants particularly liked that they could accomplish multiple goals through the Chain, i.e., acquiring model-powered diversity in the Ideation step, while maintaining translation correctness in the Rewriting step. This additional support may have contributed to participants shifting from creation (manually creating text from scratch) to curation (modifying model outputs) as shown in Quantitative Results (Figure 6B). Quoting P5, “I didn’t need to give it as much, but it was giving me a lot.”

LLMs’ diverse primitive operations and capabilities also led participants to consider other ways the model might be helpful. For example, when asked to modify the Chaining structure itself, P1 in Flashcard swapped the Ideation step (which generated $\bullet$ types of interactions) with a Generation step to produce $\bullet$ a journal of my one day trip, so the model could “think about what conversations can happen across my day trip” and provide “less generic context suggestions.” The operations became inspirational here. P12 and P20 in Review both added a Classification step to determine if the paragraph is in the right voice or if a suggestion is actionable, only once they realized the classification operation existed.

3.2. The ability to isolate interventions and save progress enhanced controllability of LLM

Because each step of a Chain involves a separate run of the model, Chaining allowed users to control certain aspects of each sub-task independent of others. Four Flashcard participants in Chaining noticed that the desired model randomness should vary per sub-task, and tuned the temperature settings accordingly: they increased the temperature in Ideation steps to broaden the diversity and creativity of model responses (Figure 4A and B), and lowered it for Rewriting to increase the chances of getting correct model output (Figure 4C). However, none of them did so in the Sandbox condition (e.g., P5: “I realized my temperature was always high in sandbox. I should have had it low at translation, and high when I ask the model for English sentences.”) Many Review participants also liked iterating on each of the presentation problems individually (e.g., “Too much text on slides” vs. “No clear structure”) without affecting the others.
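One way to picture this per-step control is to attach a temperature to each step rather than to the whole task. The sketch below is illustrative only: `call_llm` is a hypothetical stand-in that records its arguments instead of querying a model, and the temperature values 0.9 and 0.1 are assumed, not values reported in the study.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    temperature: float  # sampling randomness for this step only


def call_llm(prompt: str, temperature: float) -> dict:
    """Hypothetical stand-in for a model call; it only records the request."""
    return {"prompt": prompt, "temperature": temperature}


# Assumed settings: high randomness for Ideation, low for Rewriting.
chain = [Step("Ideation", 0.9), Step("Rewriting", 0.1)]
requests = [call_llm(f"{step.name} prompt", step.temperature) for step in chain]

for request in requests:
    print(request)
```

In a single-prompt setting there is only one temperature knob, so the same trade-off would have to be made once for the whole task.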

This well-scoped impact of interventions may explain why participants felt more motivated and comfortable making manual edits in Chaining (Figure 6A). Nine people felt more compelled to enact controls on sub-tasks, knowing that they did not have to worry about unintended effects on other parts.

Four of them further noted that this clean separation would be tedious (if not impossible) in Sandbox, hence the differences in the perceived controllability in Figure 5. For example, P13 in Review attempted to replicate the exact same Chain in Sandbox. They manually divided the original paragraph into three problems, then asked the model for suggestions for each, and to compose the final paragraph. However, rather than storing suggestions externally and starting fresh for each problem, they simply stacked them together in a single prompt: “Original paragraph:…; Problem: too much text; Suggestions: 1)…; Problem: Split…” The resulting long and intertwined text became overwhelming: “I was very nervous to edit anything, because I didn’t know how that was going to impact the end task goals.”

Beyond staged interventions, staged outputs also provided participants with the opportunity to evaluate and improve individual components irrespective of previous failure (Nushi et al., 2017). Three participants praised the ability to “freeze” their preferred intermediate data points: “I reached some point of some progress in the middle of the Chain and if this works, then it’s fixed when I play with the next step. It doesn’t get lost — unlike the sandbox, where whenever I change something somewhere the result will be completely different” (P10). Their observations are also in line with the crash-and-rerun capability of crowdsourcing (Little et al., 2010), where local reruns are desirable without affecting previous stages.
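The "freeze" behavior participants described can be sketched as a chain runner that reuses a stored intermediate output instead of re-running its step. The names `run_chain` and `frozen`, and the step functions that merely wrap their input, are assumptions for illustration and do not reflect the interface's implementation.

```python
from typing import Callable, Optional

calls: list[str] = []  # records which steps actually ran


def make_step(name: str) -> Callable[[str], str]:
    """Build a stand-in step that logs its call and wraps its input."""
    def step(text: str) -> str:
        calls.append(name)
        return f"{name}({text})"
    return step


def run_chain(
    steps: list[tuple[str, Callable[[str], str]]],
    first_input: str,
    frozen: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Run steps in order, substituting frozen outputs where provided."""
    frozen = frozen or {}
    data = first_input
    outputs: dict[str, str] = {}
    for name, step in steps:
        # A frozen output is kept as-is, so later steps build on it.
        data = frozen[name] if name in frozen else step(data)
        outputs[name] = data
    return outputs


steps = [(name, make_step(name)) for name in ("split", "suggest", "compose")]

first = run_chain(steps, "feedback")
calls.clear()
# Freeze a preferred intermediate result, then iterate on the final step.
second = run_chain(steps, "feedback", frozen={"suggest": "use fewer words"})
print(second["compose"])
print(calls)
```

Because the frozen output is passed along unchanged, iterating on the last step no longer alters the preferred intermediate result.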

3.3. Surfacing the Chaining structure increased transparency.

Chaining enriched system transparency, which helped participants better calibrate their expectations of the model. As each step of the Chain had a specific role (Ideation, Rewriting, etc.), they helped narrow the scope of the model’s intended functionality, making it easier for participants to understand what to expect from a model that might otherwise seem all-encompassing. Nine participants noted this benefit of calibrated expectations. For example, P6 commented that “Chaining helped you speak the language. It lift[ed] up the hood and showed you the steps and what’s happening at different phrases,” and P15 stated that “having default settings like your templates gave me an idea of how it works.” As elaborated in Section 5.3.2, having isolated steps, each with a reduced scope, also enabled users to better anticipate the potential impact of their inputs, further increasing system transparency.

More globally, Chaining enabled users to develop a more accurate mental model of the LLM’s capabilities, by allowing them to tinker with sub-components in a modular and comparative manner. Users could, for example, compare parallel paths to deduce how the model would respond to alternative inputs. In the Flashcard task, P8 noticed during the Ideation step that the model generated more useful English sentences when the $\bullet$ types of interactions was “accommodation,” compared to “topics related to public transportation.” This hinted at the model’s better performance when presented with a useful keyword. Modifying the order of LLM steps also enabled users to learn aspects of the model’s strengths and weaknesses. When customizing the Chaining structure, five participants tried adding another Rewriting step either after the final paragraph (at the end of the Chain), or on the individual presentation problems (early in the Chain). Though initially unaware that LLMs can suffer from exposure bias (see C.2), participants quickly discovered through this comparison that the model could more effectively modify sentences than paragraphs. This comparison was rare in Sandbox, as it was not obvious to participants that they could keep the LLM functionality but shorten the input.

3.4. Surfacing the Chaining structure increased debuggability.

The increased transparency in Chaining also gave users better debugging mechanisms. When the model output was inconsistent with user intent, participants were at a loss for what to try next in Sandbox. Because users could conceivably type and modify any natural language prompt in the text box, the scope for “debugging” was too expansive. P9 remarked that “too much freedom can be a curse,” while P7 felt like “sitting down in front of the controls of an airplane, all the knobs are there but I don’t know what to do with them.” Instead, Chaining exposed intermediate knobs that helped participants draw a more direct connection between observed model deficiencies, and possible remediation. P9 found it easier to debug by modifying the inputs and outputs for each step of the Chain, rather than merely re-running the model in Sandbox repeatedly, in the hopes of more promising model results (“I had to constantly delete and rerun things.”). This may explain why the frequency of UNDO actions was reduced in Chaining (Figure 6B).

Accordingly, three interesting debugging mechanisms emerged: First, the isolated steps in Chaining acted as AI “unit tests” that enabled users to pinpoint a seemingly global error to its local cause. For example, participants in Flashcard frequently removed topics irrelevant to traveling (e.g., education), so that sub-optimal solutions would not be fed into subsequent steps. Second, the ability to create parallel paths and alternate step orders (elaborated in Section 5.3.3) enabled comparative debugging. Revisiting the case mentioned above, observing a higher-quality path (e.g., using a simple keyword in the prompt like “accommodation”) helped participants infer how to improve prompts in other parts of the Chain (e.g., changing “topics related to public transportation” to “public transportation.”)

Finally, the ability to propagate a change throughout the entire Chain gave users immediate feedback on whether a fix was successful, thereby shortening feedback and iteration cycles. For example, P3 renamed $\bullet$ types of interactions with $\bullet$ places where conversation might occur, so as to “have flashcards grouped by happening at the airport, restaurant, while walking around streets.” They were impressed by the changes propagating to the final results: “you can just change a step without affecting other steps but then your final results are reshaped based on that. I didn’t think that was going to work that simply.” This combined ability to both isolate and propagate interventions was key to increasing AI debuggability.

3.5. Scoped objectives in sub-tasks served as guardrails against LLM-inspired tangents

One challenge that hindered participants’ performance on the tasks was LLMs’ randomness and creative surprises. The model would often produce outputs that were compelling in their own right, which in turn would derail people from the intended task. For example, P5 in Flashcard was intrigued by an LLM-generated English sentence, “That man is suspicious to me,” and started tricking the model into writing a story — “I want to know what happened to the suspicious man!” Five out of twenty people wandered from their task goal in Sandbox and began exploring tangents or attempting to “break” the model. They had to be reminded several times to get back on track. Participants later recalled their habit of drifting: “I tried a lot of cool things, but it’s not the task I want to complete” (P17).

Interestingly, we found Chaining acted as a safeguard against model-inspired tangents, not only because each step of the Chain defined a clear goal, but also because the interconnected data layers motivated participants to deliberately steer outputs of each step away from cascading errors (e.g., incorrect problem extraction in the first step of Figure 1 $b_{1}$ could lead to a poor final paragraph). In the Ideation steps, participants would even manually move model output around to make sure they fit the topic (P7: “this isn’t really about asking for directions, I should put it in accommodation.”) Ultimately, participants treated the entire task more carefully (see Figure 5, think through) — “if I was trying to do it with speed, I might find the sandbox easier; but if I want to do it with precision, I prefer the Chaining structure.” (P13).

3.6. Additional challenges

Chaining brought many benefits to human-AI collaboration, but it also presented several challenges. Nine participants noted that although they found the Chains to be transparent, rich, and educational, they were also more complex, with steeper learning curves. Moreover, while Chaining enabled participants to zoom into sub-tasks in modular ways, it also occasionally made the larger picture more difficult to recall: Four participants had questions about “how my particular change to this data entry will affect the final result” in Chaining (P2), and commented that the end-to-end aspect of Sandbox enabled them to see the direct effects of their actions. These challenges may have been a side-effect of participants using pre-defined Chains, which may not necessarily reflect their own intuition of how they would have decomposed the task (Xie et al., 2017; Carroll and Olson, 1988). Most people had a much more fluent experience with the Chains they modified — “I liked creating my framework.” (P13). Though beyond the scope of this paper, this raises the question of how to support users in not just using Chains, but also authoring their own Chains, to improve user agency and intuitiveness of Chaining (Wu et al., 2022).

Moreover, while Chaining provided better guardrails for staying on task, it may come at the expense of a decreased ability to explore freely; three participants mentioned they would prefer Sandbox for “trying out random things and see if the model can cope” (P3), and “I feel more at liberty to play with language outside the Chain” (P6). They suggested they would prefer a combination of both systems: “when there’s more ambiguity I prefer the sandbox to explore first, but once I have a clear goal, I would use the Chaining to steer myself towards a fixed number of function blocks.” (P13)

Inspired by these concerns, we envision future research to focus on relaxing certain structural constraints and providing guidance on LLM Chain creation and refinement, which we detail later in Discussion (Section 7).

Case Studies

Beyond the user study tasks, LLM Chaining has the potential to enable a wide range of complex applications. We illustrate how Chaining could support more diverse applications through two case studies in the domains of software development and accessibility, using the same model in our user study.

In this case study on visualization code debugging, we uncover how intermediate data points in a Chain can become useful, especially when the end goal of the task is unclear. Unlike typical code syntax errors, when a visualization violates design constraints (Moritz et al., 2018), there are usually multiple valid solutions that cannot be objectively ranked. For example, the $\bullet$ original visualization (using VegaLite specifications (Satyanarayan et al., 2016)) in Figure 7 has a single violation, i.e., circle size is continuous and thus should not be used to represent the discrete (nominal) field “Origin.” However, there may be multiple ways to resolve the issue (Chen et al., 2021), such as using color instead of size ( $d_{1}$ ), removing size information altogether ( $d_{2}$ ), or changing the data encoded to a continuous “Acceleration” field ( $d_{3}$ ). Thus, LLMs should reason about the violated constraints for users to adjust the fixes. However, in a single run of an LLM, this reasoning can be challenging, as LLMs have trouble parsing visualization specs in JSON formats (see LLM Challenge C.3 in Section 3.1).

We thus created a Chain (see Figure 7) that (A) rewrites the JSON format in natural language, (B) classifies and validates the descriptions, and (C) rewrites the spec. To explore how the Chain performs in practice, we took examples from VizLinter (Chen et al., 2021), used five pairs of erroneous and fixed specs as few-shot prompt examples, and tested the Chain on another five cases. One author with sufficient visualization knowledge determined that the Chain correctly revealed the violated constraints for all the test cases, and provided useful fixes for two of them. We also tried running a single pass of the LLM for comparison on the same examples, using multiple prompt designs. We observed that output from the single-passes tended to be consistently worse, with at most one correct reasoning. This is possibly due to parsing difficulty (see LLM Challenge C.3), as well as the inability to disentangle the sub-tasks of validation and rewriting (C.1). In contrast, each Chain step was highly scoped, increasing the chance that the intermediate data would be correct.
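The three steps above can be sketched as a small pipeline that keeps every intermediate result, so that a user can inspect the description and the detected violations before accepting a fix. This is a minimal sketch, not the authors' implementation: the `llm` callable, the function name `debug_visualization`, and the prompt wording are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Hypothetical model interface: takes a prompt string, returns the completion.
LLM = Callable[[str], str]


def debug_visualization(spec_json: str, llm: LLM) -> Dict[str, str]:
    """Run steps (A)-(C) in order and return all intermediate data."""
    # (A) Rewrite the JSON spec in natural language (prompt text is illustrative).
    description = llm(
        f"Spec:\n{spec_json}\nDescribe the encodings in plain English:"
    )
    # (B) Classify and validate the description against design constraints.
    violations = llm(
        f"Description:\n{description}\nList any violated design constraints:"
    )
    # (C) Rewrite the spec, conditioned on the original spec and the violations.
    fixed_spec = llm(
        f"Spec:\n{spec_json}\nViolations:\n{violations}\nRewritten spec:"
    )
    return {
        "description": description,
        "violations": violations,
        "fixed_spec": fixed_spec,
    }
```

Because each step receives only the data it needs, a user could edit `violations` by hand and rerun step (C) alone to steer the fix.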

2. Case 2: Assisted Text Entry

We further demonstrate how Chaining could enable the branching logic in assisted text entry. This is based on a real industry use case that aims to speed up gaze input by requiring fewer character inputs (Majaranta and Räihä, 2007; Rough et al., 2014; Adhikary et al., 2021). Ideally, a user (e.g., a person using Alternative and Augmentative Communication technology) would express a full sentence through short abbreviations that an LLM would automatically expand. However, there are too many possible expansions to disambiguate, e.g., “LTSGCHKITOT” could mean “Let’s go check it out,” “Let’s get coffee and have a chat,” “Let’s get some chicken in the old town,” etc. Thus, the end user often needs to resolve the ambiguity or adjust the input.

With Chaining, we enable interactive disambiguation through gradual expansion and if-else logic. As shown in Figure 8, if the user input is a shorthand (e.g., “LTSG”), the LLM should expand it to possible matching phrases (“Let’s go”, “Let’s get”), which the user can select from. However, if the input is already a phrase, the LLM should instead auto-complete it (“Let’s go” may trigger “check it out”). If the desired option does not appear, the user can also insert additional shorthands for the model to expand again, e.g., “Let’s go CHKITOT”, which would exclude expansions starting with “Let’s get.” The switch between shorthand expansion and auto-completion enables better prediction on the full text, which would be nontrivial for a single prompt, given the different natures of the two branches. This case also provides a glimpse into how LLM Chains can help prototype applications with complex logic but simple interactions (elaborated in the next section).
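The if-else routing described above might look like the following minimal sketch. The rule for detecting a shorthand (an all-caps run of letters) is an assumption made purely for illustration, as are the names `is_shorthand` and `suggest` and the two callables `expand` and `complete`, which stand in for the two LLM steps.

```python
from typing import Callable, List

# Hypothetical stand-ins for the two LLM steps.
ExpandFn = Callable[[str, str], List[str]]  # (phrase so far, shorthand) -> expansions
CompleteFn = Callable[[str], List[str]]     # phrase -> possible continuations


def is_shorthand(token: str) -> bool:
    # Assumption: abbreviations are typed as all-caps letter runs, e.g. "LTSG".
    return len(token) > 1 and token.isalpha() and token.isupper()


def suggest(user_input: str, expand: ExpandFn, complete: CompleteFn) -> List[str]:
    """Route the input to shorthand expansion or to auto-completion."""
    tokens = user_input.split()
    if tokens and is_shorthand(tokens[-1]):
        # Branch 1: expand the trailing shorthand, keeping any phrase before it.
        prefix = " ".join(tokens[:-1])
        return [f"{prefix} {e}".strip() for e in expand(prefix, tokens[-1])]
    # Branch 2: the input is already a phrase, so auto-complete it.
    return [f"{user_input} {c}".strip() for c in complete(user_input)]
```

Passing the already-chosen phrase to `expand` is what lets an input such as “Let’s go CHKITOT” rule out expansions that begin with “Let’s get.”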

Discussion & Future Directions

Our work is a first step towards improving human-LLM interaction through Chaining. We found that it not only raises the ceiling of what LLMs can meaningfully support, but also boosts transparency, controllability and debuggability — key concerns when interacting with generative AI (Amershi et al., 2019; Bommasani et al., 2021). Interestingly, we achieved this purely by reshaping the interaction mechanism, without any need to retrain the model. This suggests that LLMs to date may already have the potential to support human-AI collaborations on many complex tasks, if their latent potential can be better realized through thoughtful interaction design. Below, we discuss the implications of our studies, as well as future research directions.

In contrast to recent work in human-AI interaction, which primarily examined how to increase AI controllability through exposing knobs within a model (Louie et al., 2020; Narayan et al., 2019), our work opens up the possibility of steering AI using the model itself as units to control. In other words, beyond controlling properties within a single model unit, users may be able to achieve new kinds of control through manipulating how multiple model runs interact with one another, including: how modifications to upstream model units cascade, how to isolate changes between model units, and how to improve user inputs by comparing the effectiveness of parallel model runs. As language models grow in size and capability, they may ironically allow users to treat them as smaller entities of abstraction, serving as building blocks towards larger human goals.

We envision the HCI community innovating more types of building blocks that a model can provide, as well as the ways they can be combined. In particular, model units could be used not only to accomplish sub-tasks, but also to more thoroughly aid in the task decomposition design and debugging process. To overcome users’ own systematic omissions (Wu et al., 2021), an upstream unit could be designed to help users create sub-tasks to begin with, similar to metaprompting (Reynolds and McDonell, 2021). Or, model units could serve as checkpoints along the Chain to ensure data correctness (similar to assertions in code). Moreover, while the Chains in this paper consisted of only LLM steps, alternative designs may also interleave LLM steps with human-computation steps, depending on which roles each collaborator could best fill.
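The checkpoint idea can be illustrated with a wrapper that validates a step's output before passing it downstream, much like an assertion. This is a hypothetical sketch: `with_checkpoint` and `CheckpointError` are names invented here, and the `check` callable could be a simple rule or another model unit.

```python
from typing import Callable

Step = Callable[[str], str]    # one Chain step: text in, text out
Check = Callable[[str], bool]  # returns True when the data looks correct


class CheckpointError(ValueError):
    """Raised when a step's output fails its checkpoint."""


def with_checkpoint(step: Step, check: Check, name: str) -> Step:
    """Wrap a step so that its output is validated before flowing downstream."""
    def guarded(data: str) -> str:
        output = step(data)
        if not check(output):
            # Stop early so that the error does not cascade to later steps.
            raise CheckpointError(f"{name}: output failed check: {output!r}")
        return output
    return guarded
```

Failing fast at the checkpoint localizes the problem to one step, which matches the debugging benefit attributed to Chaining in this paper.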

Chaining also opens up new possibilities for designing AI-infused applications. With LLMs’ easy adaptation to natural language prompts, users could conceivably already prototype custom ML functionality with lower effort, as they bypass the otherwise necessary but expensive process of collecting data and designing models upfront (Bommasani et al., 2021). Chaining further accelerates this design process. Taking advantage of interactions between multiple LLM steps, developers could build multiple Chains to envision possible flows of how an application may be used, and then perform A/B testing on those Chains. For example, in the case of assisted text entry (Section 6.2), developers could quickly prototype what might happen if end users were allowed to provide more context: e.g., if the user is “having a meeting in 5 minutes,” then “Let’s go” is more likely than “Let’s get” for the abbreviation “LTSG.” They could test this interaction by adding an additional layer of input to the shorthand expansion step.

One might argue that, because each run of an LLM involves some computational overhead, Chaining may introduce additional costs that need to be weighed against its benefits. However, as indicated above, a key benefit of Chaining is that it could flexibly power a wide range of prototypes and applications, without the need to train or build bespoke, single-purpose AIs. Thus, we believe the saved effort outweighs the cost.

While Chaining provided guardrails and scaffolding for helping users accomplish the task at hand, it also limited their ability to explore freely. Yet, experimenting, tinkering, and interacting are key to users forming mental models for AI (Narayan et al., 2019). One way to balance between structure and exploration is to loosen structural constraints within steps. For example, it may be useful to permit users to customize prompts within each step in a Sandbox-like environment, and to define their own input and output parsers. In other words, rather than providing a full implementation of steps, a Chain could define the API with input-output types, and ask users to fill in the implementations for each step. Or, a small Sandbox could be provided along-side the Chaining interface, for users to occasionally use when they need to experiment with a new approach.
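One way to picture a Chain that fixes only the API is a step "slot" whose input and output types are set by the Chain author while the body is supplied by the user. The sketch below is an assumption about how such a design could look; `StepSlot` and `run_chain` are hypothetical names, and plain functions stand in for user-written prompts.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class StepSlot:
    """One Chain step: a fixed input/output contract with a user-supplied body."""
    name: str
    input_type: type
    output_type: type
    impl: Optional[Callable[[Any], Any]] = None  # filled in by the user

    def run(self, value: Any) -> Any:
        if self.impl is None:
            raise NotImplementedError(f"step '{self.name}' has no implementation")
        if not isinstance(value, self.input_type):
            raise TypeError(f"step '{self.name}' expects {self.input_type.__name__}")
        result = self.impl(value)
        if not isinstance(result, self.output_type):
            raise TypeError(f"step '{self.name}' must return {self.output_type.__name__}")
        return result


def run_chain(slots: List[StepSlot], value: Any) -> Any:
    """Feed the output of each slot into the next one."""
    for slot in slots:
        value = slot.run(value)
    return value


# The Chain author fixes the contract of each step.
split = StepSlot("split problems", str, list)
compose = StepSlot("compose paragraph", list, str)
# The user fills in the bodies (plain functions here; LLM prompts in practice).
split.impl = lambda text: [s.strip() for s in text.split(".") if s.strip()]
compose.impl = lambda items: " ".join(f"Consider: {item}." for item in items)
```

The type checks keep the overall structure intact while leaving users free to experiment inside each step.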

Meanwhile, though our studies mostly explored how humans use pre-defined LLM Chains, a natural follow-up question becomes whether end users can effectively author their own LLM Chains. Indeed, one potential downside of Chaining is that it may decrease transparency if the pre-built Chain does not match the way a user would naturally break down the task (mentioned in Section 5.3.6). We believe our operations can serve as a starting point for future work on authoring. With the templates, users could instantiate an LLM step by defining the data layers and selecting the operations. In our study, most participants were able to spot deficiencies and refine the default Chains accordingly. Thus, we envision that a set of generic default Chains could help onboard end users to the idea of LLM Chaining, and inspire them to author more tailored Chains. We leave end user authoring of Chains to future work.

Our work centered mostly on moderately complex tasks that can be naturally broken down. However, decomposition might be less straightforward in some cases (Kim et al., 2017). Tasks with more complex interdependence may lose coherence and quality if they are split into independent sub-parts. For example, in the Review task (Figure 1), we treated the different problems independently. However, if the problems are interrelated (e.g., not engaging and speaking too quietly), keeping them together would promote more effective suggestions. Moreover, while users had the option of excluding specific data layers along the way (e.g., the original review in Figure 1 is not fed into the final step), the information loss may also lead to task distortion or compression (Reynolds and McDonell, 2021). In light of these issues, future work could investigate how to assist users in crafting the steps of a Chain to maximize its utility (Kittur et al., 2011). For example, users could be provided strategic guidance on iterative Chain improvements, such as using paired comparisons and version control of Chain edits to help users decide whether to keep or further decompose an existing step.

Conclusion

In this work, we introduce the notion of “Chaining” multiple LLM steps together, such that the output of one step is the input to the next. We present an interactive system where users can modify these Chains, along with their intermediate results, in a modular way. We find that Chaining not only enhanced the quality of the task outcome, but also improved user satisfaction, with an increased sense of control and collaboration, a greater perception of transparency of the LLM system, and more support of the user’s thought processes. Furthermore, we envision with case studies that LLM Chaining may be advantageous for complex AI-infused applications and in cases where intermediate reasoning is more important than the final output. We encourage future work to explore how LLMs can serve as other kinds of building blocks, how Chains can be used in rapid prototyping, and strategies that can help users build and iterate on Chains.

References

Appendix A Identifying LLM Primitive Operations

We reviewed 73 existing demos to identify promising LLM capabilities that may help overcome the challenges above by scoping the inputs/outputs to be more amenable to what an LLM can handle.

First, we collected demos from LLM official websites (e.g., GPT-3 and Jurassic), social media, and published case studies by searching for keywords including “GPT-3,” “language model,” “prompt,” etc. After removing some demos that were highly open-ended rather than targeted (e.g., generic chatbots), we iteratively sorted the demos into eight LLM primitive operations, as shown in Table 1. For example, we distinguished between operations that had different expected data mappings (one-to-many vs. many-to-one), and different application types (deterministic vs. creative). We then grouped the primitives into three high-level groups based on which LLM challenge they may help address. The groups also appear to be consistent with categories presented on the GPT-3 tutorial page (https://beta.openai.com/docs/guides/completion/introduction), which highlighted typical NLP tasks like Classification, Generation (i.e., gather additional information in Table 1b), and Transformation (i.e., re-organization). Finally, we further refined the primitive categories and names based on feedback from three pilot users (one LLM expert and two UX engineers with basic knowledge of LLM prompting).

Appendix B Additional Details for User Study

After completing the given task in both conditions, participants self-rated their experience on the following dimensions, using a seven-point Likert scale (Likert, 1932). Each question was asked twice, once for Sandbox and once for Chaining. Participants described their reasoning along with the ratings.

Match goal: I’m satisfied with my final results from [ Sandbox/Chaining ]; they met the task goal.

Think through: The [ Sandbox/Chaining ] system helped me think through what kinds of outputs I would want to complete the task goal, and how to complete the task.

Transparent: The [ Sandbox/Chaining ] system is transparent about how it arrives at its final result; I could roughly track its progress.

Controllable: I felt I had control creating with the [ Sandbox/Chaining ] system. I can steer the system towards the task goal.

Collaborative: In [ Sandbox/Chaining ], I felt I was collaborating with the system to come up with the outputs.

Additionally, participants also answered the following two free form questions:

Difference: What were the differences, if any, between the experience of completing the task using Sandbox and Chaining?

Vision: If you were using language models in your work, in what situations would you prefer to use Sandbox? Chaining? Can you think of 1-3 concrete examples?

B.2. Clickstream Categorization

We log the text status before and after each model run. Through sequence matching, we recover what the model generated in each run, and how the participants edited the text between two runs. We split the logs into: (1) RUN the model, (2) UNDO the model, where people removed the generations from the previous run, making the resulting text more similar to the text prior to that run, (3) FORMAT, where people only add or remove line breaks or formatting-related stopwords, (4) CREATE-CONTENT, where people only insert meaningful spans into the text, and (5) CURATE-CONTENT, where people make all other kinds of refinements to the existing text (in Chaining, this merges changes to the instruction, the prefix, and the data entries). We also logged (6) CHANGE-TEMPERATURE to denote when people make a non-text-based change to the model input, i.e., temperature.
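To make the categorization concrete, the following is a minimal sketch of how one human edit between two runs could be labeled using Python's `difflib.SequenceMatcher`. It is an approximation under stated assumptions, not the authors' logging code: the token-level diff, the contents of `FORMAT_TOKENS`, the rule used to detect UNDO, and the extra `NO-EDIT` label are all hypothetical choices.

```python
import difflib
import re
from typing import List, Tuple

# Hypothetical set of formatting-only tokens; the paper does not list them.
FORMAT_TOKENS = {"\n", "-", "*", "###"}


def tokenize(text: str) -> List[str]:
    # Words plus explicit newline tokens, so that line breaks show up in the diff.
    return re.findall(r"\S+|\n", text)


def diff_tokens(before: str, after: str) -> Tuple[List[str], List[str]]:
    """Return (inserted, deleted) tokens between two text states."""
    a, b = tokenize(before), tokenize(after)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    inserted: List[str] = []
    deleted: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(a[i1:i2])
        if tag in ("insert", "replace"):
            inserted.extend(b[j1:j2])
    return inserted, deleted


def categorize_edit(before: str, after: str, generated: str = "") -> str:
    """Label the edit made between the end of one run and the start of the next."""
    inserted, deleted = diff_tokens(before, after)
    if not inserted and not deleted:
        return "NO-EDIT"
    if all(t in FORMAT_TOKENS for t in inserted + deleted):
        return "FORMAT"
    # Assumed UNDO rule: nothing added, and everything removed came from the model.
    if not inserted and set(deleted) <= set(tokenize(generated)):
        return "UNDO"
    if inserted and not deleted:
        return "CREATE-CONTENT"
    return "CURATE-CONTENT"
```

RUN and CHANGE-TEMPERATURE events are not text edits, so they would be logged directly rather than inferred from a diff.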

On top of the logs, we define consecutive runs (in Figure 6A) as those in which users did not change anything after the previous run (or only added formatting through line breaks or stopwords, i.e., RUN+FORMAT). Otherwise, the logs are counted as humans making edits.
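Under this definition, consecutive runs can be counted from a sequence of event labels, as in the sketch below. The flat list-of-strings event format and the function name `count_consecutive_runs` are assumptions made for illustration.

```python
from typing import List


def count_consecutive_runs(events: List[str]) -> int:
    """Count RUN events preceded only by FORMAT events since the previous RUN."""
    consecutive = 0
    seen_run = False  # the first RUN has no previous run to follow
    edited = False    # whether a non-formatting edit occurred since the last RUN
    for event in events:
        if event == "RUN":
            if seen_run and not edited:
                consecutive += 1
            seen_run = True
            edited = False
        elif event != "FORMAT":
            # Any other label counts as the human making an edit.
            edited = True
    return consecutive
```

For the log `["RUN", "RUN", "FORMAT", "RUN", "CURATE-CONTENT", "RUN"]`, the second and third runs are consecutive while the fourth follows a human edit.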

B.3. Case 0: Metaphor Creation (Used in tutorial)

Create metaphors for the concept of crowdsourcing, so that we can explain the different aspects of crowdsourcing in a poetic way. The pipeline is as in Figure 9.

In crowdsourcing, people are like bees; they work together to make honey.

With the concept being “crowdsourcing”, the simile being “bees”, and the similar aspect being “work together.”

In the form of question answering, Question: What is a good metaphor for crowdsourcing? Answer: a swarm of bees.

In the form of command instruction, Write a metaphor for the concept of crowdsourcing. Concept: crowdsourcing Metaphor: Crowdsourcing is like a game of chess. A crowdsourcer’s skills, as in a chess player’s skills, are combined with another person’s skills to make something new.

List enumeration The following is a list of metaphors on crowdsourcing. 1. Crowdsourcing is like a beehive -- Many people (bees) contribute to a larger cause.

Few-shot example, Concept: gratitude Metaphor: gratitude is like a stream in that it’s a force that can carry you along. ### Concept: loss Metaphor: loss is like a wing in that it’s something you never wanted to lose, and it can take you away. ### Concept: crowdsourcing Metaphor: crowdsourcing is like a team sport in that it brings people to achieve one goal.
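A few-shot prompt of this shape can be assembled programmatically. The helper below is a minimal sketch: the function name `build_few_shot_prompt` is hypothetical, and the exact line layout around the “###” separator is an assumption, since the original prompt layout is not recoverable from the text above.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    in_label: str = "Concept",
    out_label: str = "Metaphor",
) -> str:
    """Join solved examples and one unsolved query with '###' separators."""
    blocks = [f"{in_label}: {x}\n{out_label}: {y}" for x, y in examples]
    # The final block leaves the output empty for the model to complete.
    blocks.append(f"{in_label}: {query}\n{out_label}:")
    return "\n###\n".join(blocks)


prompt = build_few_shot_prompt(
    [("gratitude", "gratitude is like a stream in that it's a force that can carry you along.")],
    "crowdsourcing",
)
```

Changing the two labels, for example to “English” and “French”, would adapt the same helper to the flashcard task.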

B.4. Case 1: Flashcard Creation

Suppose you will be traveling to Paris next week, and you would like to create flashcards to learn about some basic French so you can have basic conversations with local people whenever you are in a non-English speaking region. Your goal is to create flashcards that are both diverse and personalized to your travel desires.

English: Where is a good restaurant?; French: Où est un bon restaurant?

In the form of question answering, Question: What are some English and French sentence pairs useful for traveling to Paris? Answers: English: Where is a good restaurant? French: Où est un bon restaurant?

In the form of list enumeration, The following are English and French sentence pairs useful for traveling to Paris. 1. Que fais-tu ici? / What are you doing here?

Few-shot example, English: I do not speak French. French: Je ne parle pas français. ### English: Where is a good restaurant? French: Où est un bon restaurant? ### English: Where’s the bus station? French: Où est la gare routière?

B.5. Case 2: Peer Review Paragraph Rewriting

The following is a peer review paragraph on Alex’s presentation. Please rewrite it to be more friendly and constructive. The paragraph should include 1-3 concrete and actionable suggestions for each of Alex’s presentation problems. Original feedback:

Alex could improve his presentation skills. He has too much text on his slides. His presentation meanders from topic to topic without a clear structure. He also does not engage with his audience when he presents.

An example output from the language model:

Alex, you have a lot of good ideas in your presentations. However, your slides are too cluttered with text. I would recommend that you use more graphics, pictures, and white space. Your slides should be simple enough that you could read your presentation off of the slides. This will allow you to engage the audience and not have to read from your notes. Also, you need to have a clear structure to your presentation. Lastly, you need to practice speaking. This will familiarize you with the material and reduce the likelihood of you meandering.

End-to-end version, Given the feedback, rewrite it into a friendly paragraph with concrete suggestions on each of Alex’s presentation problems. Feedback: Alex could improve his presentation skills. He has too much text on his slides. His presentation meanders from topic to topic without a clear structure. He also does not engage with his audience when he presents. Friendly paragraph: [LLM generation]

Two-step version, where we query LLM for improvement suggestions first, and then ask it to integrate the problem and the suggestion. Alex could improve his presentation skills. He has too much text on his slides. His presentation meanders from topic to topic without a clear structure. He also does not engage with his audience when he presents. Give Alex some suggestions on his presentation: 1. [LLM generation] Write one friendly paragraph that covers all the presentation problems and suggestions: [LLM generation]
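The two-step version can be read as two model calls in which the second prompt extends the first with the generated suggestions. The sketch below follows the prompt wording quoted above; the `llm` callable, the function name `rewrite_two_step`, and the placement of line breaks are assumptions.

```python
from typing import Callable, Tuple

# Hypothetical model interface: takes a prompt string, returns the completion.
LLM = Callable[[str], str]


def rewrite_two_step(feedback: str, llm: LLM) -> Tuple[str, str]:
    """Ask for suggestions first, then for one paragraph that integrates them."""
    # Step 1: the prompt ends with "1." so that the model continues a numbered list.
    first_prompt = f"{feedback}\nGive Alex some suggestions on his presentation:\n1."
    suggestions = llm(first_prompt)
    # Step 2: append the generated suggestions, then request the final paragraph.
    second_prompt = (
        f"{first_prompt}{suggestions}\n"
        "Write one friendly paragraph that covers all the presentation "
        "problems and suggestions:\n"
    )
    paragraph = llm(second_prompt)
    return suggestions, paragraph
```

Returning the suggestions alongside the paragraph keeps the intermediate result available for inspection or manual editing before the second call.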