Language Models can Exploit Cross-Task In-context Learning for Data-Scarce Novel Tasks

Anwoy Chatterjee, Eshaan Tanwar, Subhabrata Dutta, Tanmoy Chakraborty

Introduction

Large Language Models (LLMs) have revolutionized the state of Natural Language Processing over the past few years. With the ability of In-Context Learning (ICL), one can adapt an LLM to almost any task without costly gradient updates (Brown et al., 2020). At the same time, automated assistants built on top of foundational LLMs are being popularized (Pahune and Chandrasekharan, 2023). A crucial challenge with this growing popularity is handling novel tasks. Very large models like GPT-4 are able to deliver up-to-par performance even in a zero-shot regime (OpenAI, 2023). However, the computational requirements of deploying such large-scale models counteract the practicality of their usage en masse. Relatively smaller LMs, on the other hand, suffer drastically in the absence of in-context examples. The availability of labeled examples usually varies across the use cases of the language model. For example, in an NLP research setup, expert users can quickly come up with a few handwritten examples. However, when we consider mass-scale usage by average users who are not experienced prompt engineers or who need quick answers, the zero-shot performance of a model becomes crucial. For instance, very few non-expert users of ChatGPT would opt to write down examples while asking it to perform a task.

This naturally raises the question of whether one can make an LLM generalize from labeled examples of a predefined set of tasks to an input defining a novel task. In the world of biological neurons, such abilities are commonplace: inculcating specific limb usage into an untrained limb while training the opposite limb Ruddy and Carson (2013), or relatively easier adoption of newer skills from the culminated experience of older skills Qin et al. (2013); Márton et al. (2021). Drawing a blunt parallel between biological neurons and LLMs would be naive. However, one can find a supporting intuition in the mechanistic interpretation of the Transformer architecture Elhage et al. (2021); Conmy et al. (2023); Wang et al. (2022a). One can argue that if information pathways necessary to solve a novel task are similar to those corresponding to some different task from a task library, an LLM may gather useful information across tasks. Earlier evidence found by Tanwar et al. (2023) also elicits intuitive motivation as they showed that LLMs can learn to infer from cross-lingual examples if proper alignment is provided.

In this work, we design a cross-task prompting setup (Section 2) using three different LLMs: LLaMA-2 7B and 13B, and GPT 3.5; we select $50$ different pairs of tasks where one serves as a source (i.e., context example task) and the other as target. Despite no examples from the target task being presented in the context, LLMs can produce a staggering improvement over the zero-shot regime; on average, cross-task prompting improves performance by $107\%$ for LLaMA-2 7B, $18.6\%$ for LLaMA-2 13B and $3.2\%$ for GPT 3.5 (Section 1). With multiple source tasks, cross-task performance is comparable to, if not better than, usual in-task prompting (Contribution #1). However, learning from examples of different tasks is heavily sensitive to the choice of source task for a given target, and the LLM is prone to copying the label space of the source task into the target. To circumvent this, we propose a pseudo-labeling based approach: in a data-scarce setup, cross-task prompting with majority voting is first employed to generate noisy, in-task examples; these are subsequently used for standard few-shot prompting (Contribution #2). Finally, we provide an introductory analysis towards interpreting cross-task signal transfer by dissecting the model activations. We find that the cross-task signal transfer is abrupt and happens at later layers, with the effective layers varying widely for different target tasks (Contribution #3). In a nutshell, this is the first exploration of LLMs’ ability to learn to solve novel tasks based on the contextual signals of different task examples.

Prompting Techniques

The task definition for a given task is a natural language instruction for the LM describing what is asked of it (see Figure 1). Since a cross-task setup results in in-context examples from different tasks (with different label spaces), such a definition is necessary to discriminate between the tasks.

Next, given two datasets $D_{s}$ and $D_{t}$ corresponding to two different tasks with task definitions $d_{s}$ and $d_{t}$ , respectively, we formalize the cross-task prompting as inferring the output $\hat{y}_{t}$ for an input $x_{t}\in D_{t}$ conditioned upon a demonstration from $D_{s}$ :

In this setup, we denote $D_{s}$ and $D_{t}$ as source and target task datasets, respectively. Note that the formalization is equivalent to 1-shot prompting where the only provided example input-output pair comes from a different dataset. Table 13 of Appendix E presents illustrative examples of prompts in this setup. Our experiments suggest that $>$ 1-shot setup in cross-task prompting does not improve (and often deteriorates) performance.
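One way to write the cross-task objective described above is sketched here. It assumes the prompt is the concatenation (written $\oplus$, our notation) of the source definition, a single source demonstration $(x_{s},y_{s})$, the target definition and the target input; the ordering of these pieces and the symbol $p_{\theta}$ for the LM's output distribution are illustrative assumptions, not a reproduction of the original equation.

```latex
\hat{y}_{t} = \operatorname*{arg\,max}_{y} \; p_{\theta}\!\left(y \mid d_{s} \oplus x_{s} \oplus y_{s} \oplus d_{t} \oplus x_{t}\right),
\qquad (x_{s}, y_{s}) \in D_{s},\quad x_{t} \in D_{t}
```

Read this way, the only labeled pair in the context comes from $D_{s}$, which matches the 1-shot interpretation given above.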

Liu et al. (2022) showed that for a target input $x_{t}$ , generating context $C$ from semantically similar examples leads to not only better results but also a more robust method of prompting. Similarly, Tanwar et al. (2023) showed that ICL could be done in a cross-lingual setup by aligning the semantic and task signals of the source and target languages.

Drawing inspiration from these works, we set up the cross-task prompting regime by sampling examples $x_{s}$ from the source task dataset $D_{s}$ that are semantically similar to the target input $x_{t}$ . To extract semantically similar examples, we first utilize Sentence-BERT Reimers and Gurevych (2019) to extract the sentence embeddings of target and source inputs (we also experimented with E5 and LLaMA-2 7B last-layer outputs; see Appendix B). Following this, based on the cosine similarity between the embeddings, we select the top source examples.
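The selection step can be illustrated with a minimal sketch. It assumes the sentence embeddings have already been computed (for instance by Sentence-BERT) and are available as plain lists of floats; the function names `cosine_similarity` and `select_top_k` are hypothetical and not taken from the authors' code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def select_top_k(target_embedding, source_embeddings, k=1):
    """Return indices of the k source examples most similar to the target input."""
    scored = [
        (cosine_similarity(target_embedding, emb), idx)
        for idx, emb in enumerate(source_embeddings)
    ]
    # Sort by descending similarity; break ties by original index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

With `k=1` this corresponds to the default single-demonstration setting; larger `k` would give the multi-shot variants.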

In-task examples combined with cross-task prompting. So far, we have assumed the unavailability of any labeled target dataset. But, if we do have a labeled target example, could prepending a source task boost its prompting performance?

To emulate such a scenario, a labeled example from the target dataset $T_{t}$ is sampled. This labeled example $(x_{lt},y_{lt}),\ \text{where}\ x_{lt}\in T_{t}$ , is then used to construct the prompt. This mixed setup can be formalized as,

Examples of prompts are provided in Table 14 of Appendix E.
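The mixed setup described above could be written along the following lines. This is a sketch under assumptions: $\oplus$ denotes concatenation and $p_{\theta}$ the LM's output distribution (both our notation), the source demonstration is placed before the labeled target example, and the target definition $d_{t}$ appears once; the original formalization may order or repeat these pieces differently.

```latex
\hat{y}_{t} = \operatorname*{arg\,max}_{y} \; p_{\theta}\!\left(y \mid d_{s} \oplus x_{s} \oplus y_{s} \oplus d_{t} \oplus x_{lt} \oplus y_{lt} \oplus x_{t}\right),
\qquad (x_{s}, y_{s}) \in D_{s},\quad x_{lt} \in T_{t},\quad x_{t} \in D_{t}
```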

Results and Analysis

Datasets and experimental setup. Our corpus of tasks consists of ten source and five target tasks. We consider ARC-Easy Clark et al. (2018), AG-news Zhang et al. (2015), BoolQ Clark et al. (2019), Commonsense-QA Talmor et al. (2019), Conll2003-POS Tjong Kim Sang and De Meulder (2003), Conll2003-NER Tjong Kim Sang and De Meulder (2003), MNLI Williams et al. (2018), QQP Sharma et al. (2019), RACE Lai et al. (2017) and SST2 Socher et al. (2013) to be our source tasks. While choosing target tasks, our motivation is to incorporate domain diversity and problem difficulty so as to emulate the “novel task” phenomenon as closely as possible, since learning from cross-task examples makes sense only when the target task is truly data-scarce. Our target tasks are Social-i-QA Sap et al. (2019), SciQ Auer et al. (2023), MedMCQA Pal et al. (2022), Financial-Phrasebank Malo et al. (2014), and ARC-Challenge Clark et al. (2018); the first four of these require domain expertise, while the last one is a more challenging version of the one present in the source tasks. In total, this gives us $50$ unique cross-task setups (see Appendix A for dataset details; Table 15 and Table 16 of Appendix E show task definitions corresponding to source and target tasks, respectively). The dataset size is standardised by sampling $10,000$ and $500$ examples for each source and target task, i.e., $|D_{s}|=10,000$ and $|D_{t}|=500$ . For our experiments, we used the 7-billion and 13-billion variants of LLaMA-2 Touvron et al. (2023), and text-davinci-003 Brown et al. (2020), referred to as GPT3.5. We experimented with greedy and force decoding setups and selected greedy decoding as our standard for all experiments (see Appendix C for more details). We set the number of examples used to create the cross-task context to one for all our experiments, unless mentioned explicitly.

Does cross-task prompting work? As evident from Table 1, cross-task prompting significantly improves performance compared to zero-shot prompting (see Table 19 in Appendix for results of significance testing). The best overall source-target pair improves performance by 162% for LLaMA-2 7B, 30% for LLaMA-2 13B and 7% for GPT3.5. We note that different models give the best performance in different source-target pairs and not all couplings of source-target tasks seem to work; e.g., Commonsense-QA as a source task decreases performance on Financial-Phrasebank, but is the best source task for ARC-Challenge and MedMCQA (for LLaMA-2 7B). Token classification tasks (POS and NER) seem to degrade performance for LLaMA-2 13B and GPT3.5; their performance for LLaMA-2 7B is also sub-par compared to other source-target pairings. RACE and ARC-Easy are robust source tasks that improve performance for all target tasks in all models. ARC-Easy, MNLI and BoolQ can also be considered to have this robustness to some degree, as they only hurt the performance in one or two cases.

On average, cross-task prompting improves performance by 107% for LLaMA-2 7B, 18.6% for LLaMA-2 13B and 3.2% for GPT3.5. This is a strong argument for prompting in a cross-task manner when we lack labeled target data, especially using small models, which have poor zero-shot abilities (Wei et al., 2022).

Importance of source task definitions. To further investigate the role of task definitions in cross-task prompting, we check the effect of removing source task definitions on performance. Table 12 (in Appendix) reports the average accuracy for all possible source-target pairs corresponding to a task; we note an average drop of 11% for LLaMA-2 7B and 8% for LLaMA-2 13B when we prompt without using source task definitions. Hence, definitions play a crucial role in cross-task prompting.

Increasing number of examples for cross-task prompting. Unlike in-task prompting Brown et al. (2020), where increasing the number of examples increases prompting performance, in cross-task prompting, performance does not improve with an increase in source examples (c.f. Fig 3). For most target tasks, increasing source examples does not affect performance, while for some, the performance decreases.

Semantically similar vs random example selection. As shown in Table 2, in a cross-task setup, choosing the examples randomly leads to substantially poorer performance than when we generate the context $C$ with semantically similar examples; more concerning is the fact that, in many cases, it leads to 0% accuracy. This may be caused by the model getting confused without any semantic alignment in the prompt. Furthermore, random labeling of in-context examples (Wei et al., 2023) results in near-random performance (see Table 11 in Appendix D).

Mixed cross-task prompting. So far, we have seen that cross-task prompting works for single source-target task pairs. Next, we experiment on using multiple source tasks to construct the prompt context. To explore such a setup, we prompt LLaMA models using three methods:

Best source cross-task: We select the best source task for every target task using Table 1 and sample four semantically similar examples from that source task.

Random mixed cross-task: To see if a diverse set of tasks is beneficial, we randomly sample four source tasks and construct the prompt using most semantically similar examples.

Best mixed cross-task: This method is a combination of the first two; we use the top four best source tasks from Table 1 and sample a semantically similar source example from each task.
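The three strategies above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `ranked_sources` is a list of source task names ordered best-first for the given target (as one might read off Table 1), and `retrieve(task, k)` is a hypothetical callable returning the `k` source examples of `task` most semantically similar to the target input.

```python
import random


def best_source_context(ranked_sources, retrieve, shots=4):
    """Best source cross-task: all demonstrations come from the single best source task."""
    task = ranked_sources[0]
    return [(task, example) for example in retrieve(task, shots)]


def random_mixed_context(all_sources, retrieve, shots=4, seed=0):
    """Random mixed cross-task: one most-similar demonstration from each of `shots` random tasks."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    tasks = rng.sample(all_sources, shots)
    return [(task, retrieve(task, 1)[0]) for task in tasks]


def best_mixed_context(ranked_sources, retrieve, shots=4):
    """Best mixed cross-task: one most-similar demonstration from each of the top `shots` tasks."""
    return [(task, retrieve(task, 1)[0]) for task in ranked_sources[:shots]]
```

Each function returns a list of `(task, example)` pairs, so the matching source task definition can be attached when the prompt is assembled.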

Table 3 shows that a mixed prompting mechanism does not perform better than the “best source cross-task” prompting method. On the contrary, it seems to hurt the performance of the model; in fact, single-source task prompting with only one example (Table 1) seems to do better than mixed prompting. Diversity seems to lead the model to get more confused, thus hurting its performance.

Combining in-task with cross-task prompting. Combining cross-task prompting with labeled target examples improves performance, as seen in Table 4; apart from Financial-Phrasebank in LLaMA-2 13B and MedMCQA in GPT3.5, for all other instances there exists a source task that improves the performance when coupled with the in-task example. However, we see that this improvement is heavily dependent on the source-target task pair chosen and, unlike in Table 1, we are unable to find robust source datasets that improve the performance throughout the setup. Nevertheless, in multiple source-target instances, there is a noteworthy improvement.

To study the interaction between heterogeneous tasks in the context, we experiment by varying the number of source tasks and target task demonstrations in the context. Figure 2 shows the variation of accuracy for 8-shots, as we gradually move from an entirely in-task context to a complete cross-task context. We observe that for all target tasks, apart from Financial-Phrasebank, the accuracy with an 8-shot in-task prompt and an 8-shot cross-task prompt (with best-mixed cross-task strategy) is almost the same in all three LLMs.

Pseudo-label generation using cross-task prompting

Thus far we observe that cross-task prompting, though sometimes capable of even outperforming standard in-task prompting, is particularly sensitive to the choice of source task. Furthermore, the performance does not scale with the number of source task examples provided. Drawing inspiration from earlier works on pseudo-labeled examples to construct prompts (He et al., 2022; Vu et al., 2021), we propose a more practical method for potential usage of cross-task prompting.

Given a small unlabeled dataset $D_{pl}\subset D_{t}$ , we assign a pseudo-label to the example using cross-task prompting. This is done using all source tasks available to us, and then a final $y_{pl}$ is assigned to $x_{pl}\in D_{pl}$ based on a majority vote from all the generated answers. Finally, this pseudo-labeled dataset $D_{pl}$ is used to construct the prompt context in an in-context prompting setup.
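The pseudo-labeling procedure above can be sketched as below. The callable `predict(source_task, x)` is a hypothetical stand-in for running one cross-task prompt with a demonstration from `source_task` and returning the generated label; breaking ties in favour of the earliest vote is our assumption, since the tie-breaking rule is not specified here.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction; ties go to the one seen first (assumption)."""
    if not predictions:
        raise ValueError("majority_vote needs at least one prediction")
    counts = Counter(predictions)
    best_count = max(counts.values())
    for prediction in predictions:
        if counts[prediction] == best_count:
            return prediction


def pseudo_label(unlabeled_inputs, source_tasks, predict):
    """Label each input by majority vote over cross-task predictions from every source task."""
    labeled = []
    for x in unlabeled_inputs:
        votes = [predict(source, x) for source in source_tasks]
        labeled.append((x, majority_vote(votes)))
    return labeled
```

The resulting `(input, pseudo-label)` pairs would then serve as ordinary in-task demonstrations when building the few-shot context.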

Cross task generated pseudo-examples vs gold label. To see the efficacy of our proposed method, we utilise a $D_{pl}$ of size 8 and have three setups to create examples for context: 1. Gold-label: We use the annotated version of $D_{pl}$ in the context; 2. Pseudo-label (ZS): We use zero-shot prompting to label $D_{pl}$ for the context examples; 3. Pseudo-label (CT): We use the cross-task method as proposed in Section 4.

The results of this experiment are shown in Table 5. For LLaMA models, pseudo labels generated via cross-task prompting are evidently substantially better than zero-shot pseudo labels; they are of higher quality and lead to performance comparable to that of Gold-label.

In the case of GPT 3.5, however, the scale of improvement is much smaller. Interestingly, for only two datasets, namely SciQ and Social-i-QA, we observe the pseudo examples from cross-task prompting to perform worse than gold-label examples in all three models. Given the comparable performance of cross-task prompt-generated pseudo examples, we expect this to become a viable alternative to traditional ICL in data-scarce scenarios.

Increasing number of pseudo-examples for in-task prompting. Figure 3 shows the relation between the number of pseudo-demonstrations and accuracy. The general trend shows a rise in performance when more demonstrations are used in creating context $C$ , except for Financial-Phrasebank, whose performance decreases with an increase in demonstrations.

Interpretability analysis

In this section, we focus on the internal workings of the LLMs using the hidden states of the model and build an understanding of why and how cross-task prompting works.

What is an ideal source task? Intuitively, a source task that is similar to the target should result in a better transfer of generalised signal from context to target inputs. To test this, we first compare the final layer outputs of the model corresponding to the source and target task definitions, $d_{s}$ and $d_{t}$ , respectively. Figure 4 shows the cosine similarity between the final layer hidden states corresponding to $d_{s}$ and $d_{t}$ . We note that 80% of the time, the source definition that is the most semantically similar to the target definition is also its best source task, leading to the best performance in Table 1.

What does internal activation indicate? To get an even more in-depth picture, we analyse the layerwise activations for the LLaMA-2 7B model. For each layer, we compute the cosine similarities between the mean activations corresponding to the source and target task definitions. For a given target task, we then compute the correlation (in terms of the Spearman correlation coefficient, the Pearson coefficient, and Kendall’s tau) between the cosine similarities and the absolute point performance change from zero-shot prediction for each source task (see Table 18 in Appendix). This gives us an approximate idea about the underlying mechanism of cross-task signal transfer. For all the tasks, we can see a U-shaped pattern in the correlation values (cf. Fig 5): at the starting layers, there is a high correlation between the cross-task activation similarity and the cross-task improvement (likely due to semantically similar example selection), which quickly drops in the middle layers and increases again in the later layers (the only exception being the MedMCQA target task). One can intuitively claim that in the initial layers of the model (Layers 2 to 5), there is more task-specific computation going on where the cross-task transfer of information is the least. For Financial-Phrasebank, these task-specific layers cover more than 80% of the layers, with a gradual increase in correlation observable only after Layer 28. Hendel et al. (2023) provided a similar finding that the influence of the context kicks in only after a certain number of layers via task vectors. We see that the exact layer after which cross-task demonstrations start to transfer signal is much more dependent on the target task and can vary widely.
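The layerwise analysis described above can be sketched in plain Python. The data layout is an assumption made for illustration: `source_acts[i][layer]` is the mean activation vector of the i-th source task definition at a given layer, `target_acts[layer]` is the same for the target definition, and `gains[i]` is the change in accuracy over zero-shot when using source task i. Only the Spearman coefficient is shown; the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def layerwise_spearman(source_acts, target_acts, gains):
    """One Spearman coefficient per layer: definition similarity vs. cross-task gain."""
    correlations = []
    for layer in range(len(target_acts)):
        sims = [cosine(acts[layer], target_acts[layer]) for acts in source_acts]
        correlations.append(spearman(sims, gains))
    return correlations
```

Plotting the returned list against the layer index would give a curve of the kind discussed above, one per target task.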

Error Analysis

We observe four types of error occurring in cross-task prompting (cf. Table 6), as follows:

Label space replication. In example #1, the source example is from the Conll2003-POS dataset, a POS tagging task. In contrast, the target task is to predict the financial sentiment analysis. Therefore, the generation should be either negative, positive or neutral; however, we observe that the output is a sequence of POS tags instead. Here, the LLM is replicating the label space of the source task and the definition of the target task is not able to guide it to the correct target label space.

Junk prediction. In some cases, we observe that the output is neither from the label space of the source task nor from that of the target task. As in example #2, the source task is to classify each token into one named entity category (NER), and the target task is to predict the financial sentiment. The output in this case is found to be junk — it is a sequence of N’s but N is not a pre-defined named entity category, neither is it a label for the target task. Further, we observe that the correct label in this example is ‘neutral’ – the LLM could not follow the target task definition provided and gets confused between the correct prediction label of the target task and the token classification task where the output is expected to be a sequence of labels, and instead outputs a sequence with the initial letter of the correct label.

Copying effect. We also notice the copying effect, which has been observed by Lyu et al. (2023), where the predicted label is the same as the label of the context example which is semantically very similar to the target input. In example #3, the target input is semantically very similar to the context example from the source task provided in the prompt, and consequently, the LLM incorrectly outputs the same label as that of the context example.

Definition not followed. We observe that for some instances the LLM does not adhere to the task definition given for the target task. In example #4, the LLM is supposed to output one among the four options A, B, C, D; instead, it outputs the text corresponding to option D. Though D is the correct answer in this case, the LLM is not able to follow the definition properly.

Related work

In-context learning without gradient updates to the model was introduced by Brown et al. (2020) using the GPT-3 model. Multiple recent works sought robust ICL setup via different techniques: selecting the examples that are semantically most similar to the input (Liu et al., 2022), choosing low perplexity examples (Gonen et al., 2022), training a two-stage retrieval system to select the best examples (Rubin et al., 2022), etc. However, these works primarily aim to construct better in-task prompts where the examples and the input come from the same task. Tanwar et al. (2023) showed that cross-lingual ICL can be elicited with proper alignment between the source and target language examples. Raffel et al. (2020) introduced the T5 transformer model which has been trained to perform different text-processing tasks. Zhang et al. (2022) proposed a task prefix guided multi-task pre-training framework to solve this problem. More recently, a new prompting method called Meta-CoT has been proposed by Zou et al. (2023) which generalizes chain-of-thought prompting to include multi-task examples in the prompt and it has shown improvement in a number of tasks.

Conclusion

In this paper, we addressed LLMs’ adaptability to novel tasks. Exploiting the inherent ICL capabilities of these models, we established that LLMs can substantially enhance their performance in novel task scenarios, even when direct examples are lacking. This encouraging outcome unveils fresh possibilities for the practical integration of LLMs across a broader spectrum of applications. Furthermore, our study demonstrated the significance of generating pseudo-labels using cross-task prompting, presenting a potential solution for situations where annotated data is scarce. The observed correlation between the impact of cross-task examples and the similarity in model activations between source and target input tokens offers valuable insights into the underlying mechanisms of this phenomenon.

Limitations

Despite presenting a potential future direction towards training-free task generalization using LLMs, this study has some important limitations. It is evident that the similarity between the source and the target tasks plays an important role in the performance. Hence, in real-world scenarios where the task novelty is extreme, such a method may fail to provide suitable performance. This is directly related to the fact that ours is the first study in this direction, and we have primarily focused on the empirical viability. A deeper understanding of generalizable task information captured inside the LLM circuits would help in devising more sophisticated solutions. Task novelty in our discussion does not presuppose access to novel knowledge. Hence, one cannot mitigate the gap if a novel task requires the model to access newer information not present in the pretraining data or the in-context examples.

References

Appendix A Dataset Details

A.1 Source datasets

We have used the following datasets as source datasets:

ARC-Easy: ARC-Easy Clark et al. (2018) is a part of the ARC (AI2 Reasoning Challenge) dataset consisting of easy natural science questions targeted for students of 3rd to 9th grade. The questions are of multiple-choice format, where one of the four given options is correct. The training set consists of 2251 questions that we use for selecting our source examples.

AG-news: AG-news Zhang et al. (2015) is a text classification dataset containing news articles categorized into four classes - sports, business, technology, and world. It has a training set size of 120K, from which we have randomly sampled 10K news articles to construct the source dataset for our experiments.

BoolQ: BoolQ Clark et al. (2019) is a reading comprehension dataset consisting of yes/no questions asked from a passage given for each question. The questions are mostly non-factoid, and considerable inference ability is required to answer them based on the passages provided. In our usage, each question is labeled as either true or false. The training set consists of 9427 labeled question-passage pairs from which we select the source examples.

Commonsense-QA: Commonsense-QA Talmor et al. (2019) is a commonsense question-answering dataset that consists of multiple-choice questions where one of the five options provided is correct. To answer the questions, logical reasoning abilities and in some cases prior knowledge are required. The training set consists of 9740 labeled questions from which source examples are chosen by us.

Conll2003-POS: Conll2003-POS Tjong Kim Sang and De Meulder (2003) is a collection of the data which is a part of the CoNLL-2003 shared task. In this dataset, each sentence is labeled as a sequence of part-of-speech (POS) tags (each token is assigned a POS tag). We construct our source dataset by sampling 10K sentences from the 14,041 sentences in the training set.

Conll2003-NER: Conll2003-NER Tjong Kim Sang and De Meulder (2003) is also a part of the CoNLL-2003 shared task. Here, each sentence is labeled as a sequence of named entity tags. The task in this case is to perform named-entity recognition (NER). IOB tagging scheme Ramshaw and Marcus (1995) is used for assigning the tags. Four types of entities are assumed to be present in the data – persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). The training set consists of 14,041 sentences of which 10K are sampled by us as our source data.

MNLI: Multi-Genre Natural Language Inference (MNLI) Williams et al. (2018) corpus is one of the largest available resources for natural language inference. The corpus consists of 393K labeled examples, where each example consists of a pair of sentences and the label is one among neutral, contradiction or entailment based on the relationship between their meanings. Our source dataset is constructed by sampling 10K examples from the 393K examples available.

QQP: Quora Question Pairs (QQP) Sharma et al. (2019) dataset is curated for the task of natural language understanding. This dataset consists of question pairs collected from the popular question-answering website Quora, and the task is to determine if the questions are duplicates of each other. The training set consists of 364K question pairs, each labeled as duplicate or not duplicate. We sample 10K labeled question pairs from the training set for our source data.

RACE: RACE Lai et al. (2017) is a reading comprehension dataset consisting of passages along with questions asked from them. The passages and questions are collected from English exams of school students aged between 12 and 18. The questions are of multiple-choice format where one of the four options is correct. Our source dataset is gathered by picking 10K passage-question pairs from 87.9K such pairs available in the training set.

SST2: SST2 Socher et al. (2013) dataset is a part of the Stanford Sentiment Treebank corpus. It contains movie review snippets collected from the Rotten Tomatoes website. Each review is labeled as positive or negative. The 10K reviews for our source dataset are sampled from the training set consisting of 67.3K labeled reviews.

A.2 Target datasets

We have used the following datasets as target datasets:

ARC-Challenge: ARC-Challenge Clark et al. (2018) is also a part of the ARC (AI2 Reasoning Challenge) dataset and it consists of hard natural science questions targeted for students of 3rd to 9th grade. Those questions that were answered incorrectly by both a retrieval-based algorithm and a word co-occurrence method are considered to be ‘hard’ enough for inclusion in this dataset. The questions are of multiple-choice format, where one of the four options is correct. The test set consists of 1172 questions out of which 500 were randomly selected for our target dataset.

Social-i-QA: Social-i-QA Sap et al. (2019) is a commonsense reasoning dataset focusing on social situations. This dataset contains examples consisting of a social situation or action given as context and a multiple-choice question asked based on the context aimed at testing emotional and social intelligence. Each question has three options, out of which one is correct. We select 500 examples for our target data from the 1954 examples available in the validation set.

SciQ: SciQ Auer et al. (2023) (or, SciQA) is a scientific question-answering dataset consisting of questions from different areas of Physics, Chemistry and Biology. The questions are of multiple-choice format, where one of the four options is correct. The test set consists of 1000 questions, of which 500 are sampled to prepare our target dataset.

MedMCQA: MedMCQA Pal et al. (2022) is a multiple-choice question-answering dataset consisting of questions from post-graduate medical entrance exams in India. Each question has four options of which one is correct. Our target dataset is constructed by selecting 500 questions from the 4183 questions in the validation set.

Financial-Phrasebank: Financial-Phrasebank Malo et al. (2014) is a financial sentiment analysis dataset containing sentences mined from a corpus of English news on all listed companies in OMX Helsinki. Each sentence is labeled as one of the three categories – positive, negative, or neutral, based on the influence the news snippet may have on the stock price. The training set consists of 2264 labeled sentences, from which 500 are sampled for our target dataset.

In each case, for preparing our target dataset we have selected 500 examples. The selection, though random, is done in such a way that our target datasets are balanced, i.e. the number of examples with each of the different possible labels is almost equal.

Appendix B Semantic Context creation

One might assume that taking internal embeddings of the models, instead of an external model like Sentence-BERT, for extracting semantically similar examples might be better suited. However, we found that doing so for LLaMA-2 7B is computationally more expensive, with no significant gain in performance, as reported in Table 7.

We also experimented with Sentence-BERT and E5 Wang et al. (2022b) to test which external model produces superior semantically similar context $C$ for the LLMs to use. Table 8 shows the results for LLaMA-2 7B and GPT3.5 performance using $C$ from the E5 model. We proceeded with Sentence-BERT instead of E5 as our external model, as the former has already been used for a similar role in prior works Liu et al. (2022); Tanwar et al. (2023), and its performance is also comparable to that of E5.

Appendix C Force Decoding

Unlike greedy decoding, where the most probable token from the entire vocabulary of the model is the output, in force decoding the vocabulary is restricted to a set of tokens and the output is assigned out of these restricted tokens. Formally, for context $C$ and a target task $T$ , the output in the case of force decoding is:

where, $L_{T}$ is the label space of target task $T$ . The label space of all five target tasks is shown in Table 10.
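A plausible way to write the force-decoding rule described above is given here as a sketch. It assumes $p_{\theta}$ denotes the LM's next-token distribution and $x_{t}$ the target input (both our notation); whether the query is treated as part of $C$ or written separately is an assumption.

```latex
\hat{y}_{t} = \operatorname*{arg\,max}_{y \,\in\, L_{T}} \; p_{\theta}\!\left(y \mid C, x_{t}\right)
```

Greedy decoding corresponds to the same expression with the maximisation taken over the entire vocabulary instead of $L_{T}$.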

Table 9 reports the performance of cross-task prompting for all source-target pairs using force decoding in LLaMA-2 7B and LLaMA-2 13B models. We noted that force decoding improves the performance of zero-shot prediction by a great margin. However, the same is not true for every source-target pair.
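The difference between the two decoding schemes can be illustrated with a small sketch. It assumes the model's next-token log-probabilities are available as a dictionary from token string to log-probability and that every label in the label space is a single token; both are simplifying assumptions, and the function names are hypothetical.

```python
def greedy_decode(token_logprobs):
    """Greedy decoding: pick the most probable token over the whole vocabulary."""
    return max(token_logprobs, key=token_logprobs.get)


def force_decode(token_logprobs, label_space):
    """Force decoding: pick the most probable token among the target labels only."""
    candidates = [label for label in label_space if label in token_logprobs]
    if not candidates:
        raise ValueError("no label from the label space has a score")
    return max(candidates, key=lambda label: token_logprobs[label])
```

For a next-token distribution such as `{"The": -0.2, "A": -1.5, "B": -1.1}` and labels `["A", "B", "C", "D"]`, greedy decoding would return a token outside the label space, whereas force decoding is constrained to one of the valid options.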

Appendix D Random Labeling

Recently, Wei et al. (2023) proposed using a random label space for pseudo-labeling examples of the (target) task and utilising these to generate the context $C$ . We experimented with this setup as a possible alternative to zero-shot prompting, but our results (Table 11) showed that the model output is random with such $C$ .

Appendix E Prompt details

We show a few examples of cross-task prompts in Table 13 and in-task combined with cross-task prompts in Table 14. Additionally, task definitions of source and target tasks are provided in Table 15 and Table 16 respectively.

Appendix F Error analysis

Table 17 contains detailed examples for erroneous predictions in cross-task prompting setup.

Appendix G Activation analysis on LLaMA-2 7B

Full correlation analysis is presented in Table 18 and Figure 5.