RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation

Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, Maosong Sun

cs.CL cs.AI

Introduction

Developers typically spend approximately 58% of their time on program comprehension, and high-quality code documentation plays a significant role in reducing this time (Xia et al., 2018; de Souza et al., 2005). However, maintaining code documentation also consumes a considerable amount of time, money, and human labor (Zhi et al., 2015), and not all projects have the resources or enthusiasm to prioritize documentation as their top concern.

To alleviate the burden of maintaining code documentation, early attempts at automatic documentation generation aimed to provide descriptive summaries for source code (Sridhara et al., 2010; Rai et al., 2022; Khan and Uddin, 2022; Zhang et al., 2022), as illustrated in Figure 1. However, they still have significant limitations, particularly in the following aspects: (1) Poor summarization. Previous methods primarily focused on summarizing isolated code snippets, overlooking the dependencies of code within the broader repository-level context. The generated code summaries are overly abstract and fragmented, making it difficult to accurately convey the semantics of the code and compile the code summaries into documentation. (2) Inadequate guidance. Good documentation not only accurately describes the code’s functionality, but also meticulously guides developers on the correct usage of the described code (Khan and Uddin, 2022; Wang et al., 2023). This includes, but is not limited to, clarifying functional boundaries, highlighting potential misuses, and presenting examples of inputs and outputs. Previous methods still fall short of offering such comprehensive guidance. (3) Passive update. Lehman’s first law of software evolution states that a program in use will continuously evolve to meet new user needs (Lehman, 1980). Consequently, it is crucial for the documentation to be updated in a timely manner to align with code changes, a capability that previous methods overlook. Recently, Large Language Models (LLMs) have made significant progress (OpenAI, 2022, 2023), especially in code understanding and generation (Nijkamp et al., 2023; Li et al., 2023; Chen et al., 2021; Rozière et al., 2023; Xu et al., 2024; Sun et al., 2023; Wang et al., 2023; Khan and Uddin, 2022). Given these advancements, it is natural to ask: Can LLMs be used to generate and maintain repository-level code documentation, addressing the aforementioned limitations?

In this study, we introduce RepoAgent, the first framework powered by LLMs, designed to proactively generate and maintain comprehensive documentation for the entire repository. A running example is demonstrated in Figure 1. RepoAgent offers the following features: (1) Repository-level documentation: RepoAgent leverages the global context to deduce the functional semantics of target code objects within the entire repository, enabling the generation of accurate and semantically coherent structured documentation. (2) Practical guidance: RepoAgent not only describes the functionality of the code but also provides practical guidance, including notes for code usage and examples of input and output, thereby facilitating developers’ swift comprehension of the code repository. (3) Maintenance automation: RepoAgent can seamlessly integrate into team software development workflows managed with Git and proactively take over documentation maintenance, ensuring that the code and documentation remain synchronized. This process is automated and does not require human intervention.

We qualitatively showcased the code documentation generated by RepoAgent for real Python repositories. The results reveal that RepoAgent is adept at producing documentation of a quality comparable to that created by humans. Quantitatively, in two blind preference tests, the documentation generated by RepoAgent was favored over human-authored documentation, achieving preference rates of 70% and 91.33% on the Transformers and LlamaIndex repositories, respectively. These evaluation results indicate the practicality of the proposed RepoAgent in automatic code documentation generation.

RepoAgent

RepoAgent consists of three key stages: global structure analysis, documentation generation, and documentation update. Figure 2 shows the overall design of RepoAgent. The global structure analysis stage involves parsing necessary meta information and global contextual relationships from the source code, laying the foundation for RepoAgent to infer the functional semantics of the target code. In the documentation generation stage, we have designed a sophisticated strategy that leverages the parsed meta information and global contextual relationships to prompt the LLM to generate fine-grained documentation that is of practical guidance. In the documentation update stage, RepoAgent utilizes Git tools to track code changes and update the documentation accordingly, ensuring that the code and documentation remain synchronized throughout the entire project lifecycle.

An essential prerequisite for generating accurate and fine-grained code documentation is a comprehensive understanding of the code structure. To achieve this goal, we proposed a project tree, a data structure that maintains all code objects in the repository while preserving their semantic hierarchical relationships. Firstly, we filter out all non-Python files within the repository. For each Python file, we apply Abstract Syntax Tree (AST) analysis (Zhang et al., 2019) to recursively parse the meta information of all Classes and Functions within the file, including their type, name, code snippets, etc. These Classes and Functions associated with their meta information are used as the atomic objects for documentation generation. It is worth noting that the file structures of most well-engineered repositories have reflected the functional semantics of code. Therefore, we first utilize it to initialize the project tree, whose root node represents the entire repository, middle nodes and leaf nodes represent directories and Python files, respectively. Then, we add the parsed Classes and Functions as new leaf nodes (or sub-trees) to the corresponding Python file nodes to form the final project tree.
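The construction described above can be sketched with Python's standard `ast` module. This is a minimal illustration, not the authors' implementation: the `Node` class, the `collect_objects` and `build_project_tree` helpers, and the in-memory `files` mapping are all hypothetical names introduced here, and only classes and functions defined directly in a module, class, or function body are collected.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the project tree (hypothetical structure for illustration)."""
    name: str
    kind: str                                   # "repo", "dir", "file", "class", or "function"
    code: str = ""                              # source snippet, set for classes and functions
    children: list["Node"] = field(default_factory=list)


def collect_objects(source: str, parent: ast.AST) -> list[Node]:
    """Recursively collect classes and functions defined directly under `parent`."""
    found = []
    for child in ast.iter_child_nodes(parent):
        if isinstance(child, ast.ClassDef):
            kind = "class"
        elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        else:
            continue
        node = Node(child.name, kind, ast.get_source_segment(source, child) or "")
        node.children = collect_objects(source, child)   # methods and nested definitions
        found.append(node)
    return found


def build_project_tree(repo_name: str, files: dict[str, str]) -> Node:
    """Build the tree from a mapping of relative path -> source text."""
    root = Node(repo_name, "repo")
    for path, source in sorted(files.items()):
        if not path.endswith(".py"):             # filter out non-Python files
            continue
        *dirs, filename = path.split("/")
        cursor = root
        for d in dirs:                           # create or reuse directory nodes
            match = next((c for c in cursor.children
                          if c.kind == "dir" and c.name == d), None)
            if match is None:
                match = Node(d, "dir")
                cursor.children.append(match)
            cursor = match
        file_node = Node(filename, "file")
        file_node.children = collect_objects(source, ast.parse(source))
        cursor.children.append(file_node)
    return root
```

Calling `build_project_tree("demo", {"pkg/a.py": "def f():\n    pass\n"})` would yield a root node with one directory child `pkg`, one file node `a.py`, and one function node `f`.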

Beyond the code structure, the reference relationships within the code, as a form of important global contextual information, can also assist the LLM in identifying the functional semantics of the code. Also, references to a target function can be considered natural in-context learning examples (Wei et al., 2022) to teach the LLM to use the target function, thereby helping generate documentation that is of practical guidance. We consider two types of reference relationships: Caller and Callee. We use the Jedi library (https://github.com/davidhalter/jedi; the approach is extensible to programming languages other than Python by replacing the code parsing tools) to extract all bi-directional reference relationships in the repository, and then ground them to the corresponding leaf nodes in the project tree. The project tree augmented with the reference relationships forms a Directed Acyclic Graph (DAG). We simply ignored circular dependencies to avoid loops, as most of these situations may have bugs.
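To make the Caller and Callee relationships concrete, the sketch below derives both directions for the top-level functions of a single module. It deliberately does not use Jedi: it is a simplified stand-in that matches call sites by name with the standard `ast` module, so it misses methods, imports, and aliasing that a real reference resolver would handle. The function name `extract_references` is hypothetical.

```python
import ast
from collections import defaultdict


def extract_references(source: str):
    """Return (callees, callers) maps between top-level functions of one module.

    Name-based matching only: a simplified stand-in for real reference resolution.
    """
    module = ast.parse(source)
    functions = {n.name: n for n in module.body if isinstance(n, ast.FunctionDef)}
    callees, callers = defaultdict(set), defaultdict(set)
    for name, fn in functions.items():
        for node in ast.walk(fn):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                target = node.func.id
                # Self-calls are skipped, in the spirit of ignoring circular references.
                if target in functions and target != name:
                    callees[name].add(target)    # `name` calls `target`
                    callers[target].add(name)    # `target` is called by `name`
    return dict(callees), dict(callers)
```

Storing both maps means each object can be asked either "what do I call?" or "who calls me?" without a second pass over the source.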

Documentation Generation

RepoAgent aims to generate fine-grained documentation that is of practical guidance, which includes detailed Functionality, Parameters, Code Description, Notes, and Examples. A backend LLM leverages the parsed meta information and reference relationships from the previous stage to generate documentation with the required structure using a carefully designed prompt template. An illustrative prompt template is shown in Figure 3, and a complete real-world prompt example is given in Appendix C.

The prompt template mainly requires the following parameters: The Project Tree helps RepoAgent perceive the repository-level context. The Code Snippet serves as the main source of information for RepoAgent to generate the documentation. The Reference Relationships provide semantic invocation relationships between code objects and assist RepoAgent in generating guiding notes and examples. The Meta Information indicates the necessary information such as type, name, relative file path of the target object, and is used for post-processing of the documentation. Additionally, we can include the previously generated Documentation of a direct child node of an object as auxiliary information to help code understanding. This is optional, as omitting it can save costs significantly.
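As an illustration of how these parameters might be combined, the following sketch fills a template with `string.Template`. The template wording, the `build_prompt` function, and the keys of the `meta` dictionary are assumptions made for this example; they are not the prompt used by RepoAgent.

```python
from string import Template

# Hypothetical template; the wording is illustrative, not the authors' prompt.
PROMPT = Template(
    "You are documenting a $obj_type named $name in $path.\n"
    "Project structure:\n$project_tree\n"
    "Source code:\n$code\n"
    "Called by: $callers\n"
    "Calls: $callees\n"
    "${child_docs}"
    "Write the sections Functionality, Parameters, Code Description, Notes, Examples."
)


def build_prompt(meta, project_tree, code, callers, callees, child_docs=()):
    """Assemble one prompt from meta information, code, and reference relationships."""
    extra = ""
    if child_docs:  # optional: documentation of direct children; omit to save tokens
        extra = "Documentation of child objects:\n" + "\n".join(child_docs) + "\n"
    return PROMPT.substitute(
        obj_type=meta["type"], name=meta["name"], path=meta["path"],
        project_tree=project_tree, code=code,
        callers=", ".join(callers) or "none",
        callees=", ".join(callees) or "none",
        child_docs=extra,
    )
```

Keeping the child documentation as an optional argument mirrors the cost trade-off noted above: leaving it empty shortens every prompt.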

RepoAgent follows a bottom-to-top topological order to generate documentation for all code objects in the DAG, ensuring that the child nodes of each node, as well as the nodes it references, have their documentation generated before it. After the documentation is generated, RepoAgent compiles it into a human-friendly Markdown format. For example, objects of different levels are associated with different Markdown headings (e.g., ##, ###). Finally, RepoAgent utilizes GitBook (https://www.gitbook.com/) to render the Markdown formatted documentation into a convenient web graphical interface, which enables easy navigation and readability for documentation readers.
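The bottom-to-top ordering can be sketched with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The `depends_on` mapping, the function names, and the depth-to-heading rule are assumptions for illustration; the sketch also assumes circular references have already been dropped, since `static_order` raises `CycleError` otherwise.

```python
from graphlib import TopologicalSorter


def generation_order(depends_on):
    """Order objects so that each one comes after its children and its callees.

    `depends_on[x]` is the set of objects that must be documented before `x`.
    """
    return list(TopologicalSorter(depends_on).static_order())


def to_markdown(name, doc, depth):
    """Render one object's documentation under a heading level derived from its depth."""
    return f"{'#' * (depth + 1)} {name}\n\n{doc}\n"


deps = {"Loader": {"Loader.read"}, "Loader.read": {"parse"}, "parse": set()}
print(generation_order(deps))   # ['parse', 'Loader.read', 'Loader']
```

With this ordering, the documentation of `parse` and `Loader.read` already exists when the prompt for `Loader` is built.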

Documentation Update

RepoAgent supports automatic tracking and updating of documentation through seamless collaboration with Git. The pre-commit hook of Git is utilized to enable RepoAgent to detect any code changes and perform documentation updates. After the update, the hook submits both the code and documentation changes, ensuring that the code and documentation remain synchronized. This process is fully automated and does not require human intervention.
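A pre-commit hook of this kind first needs to know which Python files are staged. The sketch below shows one way to obtain that list by calling `git diff --cached --name-only`; the function names are hypothetical, and how RepoAgent itself detects changes is not specified here. A complete hook would then regenerate the affected documentation and stage it with `git add` before the commit proceeds.

```python
import subprocess


def python_paths(name_only_output):
    """Keep only the Python files from `git diff --name-only` output."""
    return [line for line in name_only_output.splitlines() if line.endswith(".py")]


def staged_python_files(repo_dir="."):
    """List staged Python files that were added, copied, modified, or renamed."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACMR"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return python_paths(result.stdout)
```

Separating the parsing step from the `git` call keeps the filtering logic testable without a repository.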

Because local code changes generally do not affect other code, owing to the low-coupling principle, it is not necessary to regenerate the entire documentation with each minor code update. RepoAgent only updates the documentation of affected objects. The updates are triggered when (1) an object’s source code is modified; (2) an object’s referrers no longer reference it; or (3) an object gets new references. It is worth noting that the update is not triggered when the objects that an object references change, because we adhere to the dependency inversion principle (Martin, 1996), which states that high-level modules should not depend on the implementations of low-level modules.
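The three triggers can be expressed as a comparison between two snapshots of the repository. In this sketch the `Snapshot` class and `objects_to_update` function are hypothetical, and treating a newly added object as needing documentation is an assumption added for completeness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    code: str              # source code of the object
    referrers: frozenset   # names of the objects that reference it (its callers)


def objects_to_update(old, new):
    """Return the names of objects whose documentation should be regenerated."""
    stale = set()
    for name, now in new.items():
        before = old.get(name)
        if before is None:                        # assumption: new objects need docs
            stale.add(name)
        elif before.code != now.code:             # (1) source code modified
            stale.add(name)
        elif before.referrers != now.referrers:   # (2) lost or (3) gained referrers
            stale.add(name)
    return stale
```

Note that the comparison never looks at an object's callees, so editing the body of a low-level function leaves the documentation of its callers untouched, as the dependency inversion argument above requires.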

Experiments

We selected 9 Python repositories of varying scales for documentation generation, ranging from fewer than 1,000 to over 10,000 lines of code. These repositories are renowned for their classic status or high popularity on GitHub, and are characterized by their high-quality code and considerable project complexity. The detailed statistics of the repositories are provided in § A.1. We adopted the API-based LLMs gpt-3.5-turbo (OpenAI, 2022) and gpt-4-0125 (OpenAI, 2023), along with the open-source LLMs Llama-2-7b and Llama-2-70b (Touvron et al., 2023), as backend models for RepoAgent.

Case Study

We use the ChatDev repository (Qian et al., 2023) and the gpt-4-0125 backend for a case study. The generated documentation is illustrated in Figure 4. Documentation generated by RepoAgent is structured into several parts, starting with a clear, concise sentence that articulates the object’s functionality. Following this, the parameters section enumerates all relevant parameters along with their descriptions, aiding developers in understanding how to leverage the provided code. Moreover, the code description section comprehensively elaborates on all aspects of the code, implicitly or explicitly demonstrating the object’s role and its associations with other code within the global context. In addition, the notes section further enriches these descriptions by covering usage considerations for the object at hand. Notably, it highlights any logical errors or potential optimizations within the code, thereby prompting advanced developers to make modifications. Lastly, if the current object yields a return value, the model will generate an examples section, filled with simulated content to clearly demonstrate the expected output. This is highly advantageous for developers, facilitating efficient code reuse and unit test construction.

Once the code is changed, the documentation update will be triggered, as illustrated in Figure 5. Upon code changes in the staging area, RepoAgent identifies affected objects and their bidirectional references, updates documentation for the minimally impacted scope, and integrates these updates into a new Markdown file, which includes additions or global removals of objects’ documentation. This automation extends to integrating the pre-commit hook of Git to detect code changes and update documentation, thus seamlessly maintaining documentation alongside project development. Specifically, when code updates are staged and committed, RepoAgent is triggered, automatically refreshing the documentation and staging it for the commit. It confirms the process with a “Passed” indicator, without requiring extra commands or manual intervention, preserving developers’ usual workflows.

Human Evaluation

We adopted human evaluation to assess the quality of generated documentation due to the lack of effective automatic evaluation methods. We conducted a preference test to compare human-authored and model-generated code documentation. We randomly sampled 150 pieces of documentation content, including 100 class objects and 50 function-level objects, from each of the Transformers and LlamaIndex repositories. Three evaluators were recruited to assess the quality of both documentation sets, with the detailed evaluation protocol outlined in § A.2.2. The results, presented in § 3.3, underscore RepoAgent’s notable effectiveness in producing documentation that is preferred over human-authored content, achieving win rates of $0.70$ and $0.91$, respectively.

| Repository | Total | Human | Model | Win Rate |
|---|---|---|---|---|
| Transformers | 150 | 45 | 105 | 0.70 |
| LlamaIndex | 150 | 13 | 137 | 0.91 |
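The win rates follow directly from the preference counts: the number of pairs in which the model-generated documentation was preferred, divided by the total number of pairs. A short check of that arithmetic, with `win_rate` as a hypothetical helper name:

```python
def win_rate(model_wins, total):
    """Fraction of comparison pairs in which the model-generated documentation won."""
    return model_wins / total


print(f"{win_rate(105, 150):.2%}")   # 70.00% (Transformers)
print(f"{win_rate(137, 150):.2%}")   # 91.33% (LlamaIndex)
```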

We evaluated the models’ perception of global context by calculating the recall for identifying reference relationships of code objects. We sampled 20 objects from each of 9 repositories and compared 3 documentation generation methods for their recall in global caller and callee identification. The comparison methods were: a machine learning based method that uses LSTM for comment generation (Iyer et al., 2016); long context concatenation, which leverages LLMs with context lengths of up to 128k to process entire project codes for identifying calling relationships; and a single-object generation method that provides only code snippets to LLMs.

Figure 6 demonstrates the recall for identifying reference relationships. The machine learning based method is unable to identify reference relationships, whereas the Single-object method partially identifies callees but not callers. The Long Context method, despite offering extensive code content, achieves only partial and non-comprehensive recognition of references, with recall declining as context increases. In contrast, our approach uses the deterministic tool Jedi with bi-directional parsing to accurately convey global reference relationships, effectively overcoming the scope limitations that other methods encounter in generating repository-level code documentation.

| Repository | Llama-2-7b | Llama-2-70b | gpt-3.5-turbo | gpt-4-0125 |
|---|---|---|---|---|
| unoconv | 0.0000 | 0.5000 | 1.0000 | 1.0000 |
| simdjson | 0.4298 | 0.6336 | 1.0000 | 0.9644 |
| greenlet | 0.5000 | 0.7482 | 0.9252 | 0.9615 |
| code2flow | 0.5145 | 0.6171 | 0.9735 | 0.9803 |
| AutoGen | 0.3049 | 0.5157 | 0.8633 | 0.9545 |
| AutoGPT | 0.4243 | 0.5611 | 0.8918 | 0.9527 |
| ChatDev | 0.5387 | 0.6980 | 0.9164 | 0.9695 |
| MemGPT | 0.4582 | 0.5729 | 0.9285 | 0.9911 |
| MetaGPT | 0.3920 | 0.5819 | 0.9066 | 0.9708 |

Adherence to the format is critical in documentation generation. The generated documentation should consist of 5 basic parts, of which the Examples part is conditional, depending on whether the code object has a return value. We evaluated the ability of LLMs to adhere to the format using all 9 repositories; the results are shown in Figure 7. Large models like the GPT series and Llama-2-70b perform very well in format alignment, while the small model Llama-2-7b performs poorly, especially in terms of the examples.

We further evaluated the models’ capability to identify parameters on all 9 repositories; the results are shown in § 3.4. It is worth noting that we report accuracy instead of recall, because models may hallucinate non-existent parameters, which should be taken into account. As seen in the table, the GPT series significantly outperforms the LLaMA series in parameter identification, and gpt-4-0125 performs the best.

The field of code summarization focuses on generating succinct, human-readable code summaries. Early methods were rule-based or template-driven (Haiduc et al., 2010; Sridhara et al., 2010; Moreno et al., 2013; Rodeghero et al., 2014). With advancements in machine learning, learning-based approaches like CODE-NN, which utilizes LSTM units, emerged for summary creation (Iyer et al., 2016). The field further evolved with attention mechanisms and transformer models, significantly enhancing the ability to model long-range dependencies (Allamanis et al., 2016; Vaswani et al., 2017), indicating a shift towards more context-aware and flexible summarization techniques.

Conclusion and Discussion

The development and application of LLMs have revolutionized both the NLP and software engineering fields. Initially, the field was transformed by masked language models like BERT (Devlin et al., 2019), followed by advancements in encoder-decoder models, such as the T5 series (Raffel et al., 2020), and auto-regressive models like the GPT series (Radford et al., 2018). Auto-regressive models, notable for their sequence generation capabilities, have been effectively applied in code generation (Nijkamp et al., 2023; Li et al., 2023; Chen et al., 2021; Rozière et al., 2023; Xu et al., 2024), code summarization (Sun et al., 2023), and documentation generation (Wang et al., 2023; Khan and Uddin, 2022), highlighting their versatility in programming and documentation tasks. Concurrently, LLM-based agents have become ubiquitous (XAgent, 2023; Qin et al., 2024; Lyu et al., 2023; Ye et al., 2023; Qin et al., 2023), especially in software engineering (Chen et al., 2024; Qian et al., 2023; Hong et al., 2024), facilitating development through role-play and the automatic generation of agents (Wu et al., 2023), thereby enhancing repository-level code understanding, generation and even debugging (Tian et al., 2024). With the development of LLM-based agents, repository-level documentation generation becomes solvable as an agent task.

In this paper, we introduce RepoAgent, an open source framework designed to generate fine-grained repository-level code documentation, facilitating improved team collaboration. The experimental results suggest that RepoAgent is capable of generating and proactively maintaining high-quality documentation for the entire project. RepoAgent is expected to free developers from this tedious task, thereby improving their productivity and innovation potential.

In future work, we will consider how to use this tool effectively and explore ways to apply RepoAgent to a broader range of downstream applications. To this end, we believe that chatting can serve as a natural tool for establishing a communication bridge between code and humans. Currently, by employing our approach with retrieval-augmented generation, which combines code, documentation, and reference relationships, we have achieved preliminary results in what we call “Chat With Repo”, which marks the advent of a novel coding paradigm.

RepoAgent currently relies on the Jedi reference recognition tool, limiting its applicability exclusively to Python projects. A more versatile, open-source tool that can adapt to multiple programming languages would enable broader adoption across various codebases, which will be addressed in future iterations.

AI-generated documentation may still require human review and modification to ensure its accuracy and completeness. Technical intricacies, project-specific conventions, and domain-specific terminology may necessitate manual intervention to enhance the quality of generated documentation.

The performance of RepoAgent significantly depends on the backend LLMs and associated technologies. Although current results have shown promising progress with API-based LLMs like the GPT series, the long-term stability and sustainability of using open-source models still require further validation and research.

Broader Impact

It is difficult to establish a unified quantitative evaluation method for the professionalism, accuracy, and standardization of generated documentation. Furthermore, it is worth noting that the academic community currently lacks benchmarks and datasets of exemplary human documentation. Additionally, the subjective nature of documentation further limits current methods in terms of quality assessment.

RepoAgent automates the generation, update and maintenance of code documentation, which is traditionally a time-consuming task for developers. By freeing developers from this burden, our tool not only enhances productivity but also allows more time for creative and innovative work in software development.

High-quality documentation is crucial for understanding, using, and contributing to software projects, facilitating developers’ swift comprehension of projects. RepoAgent’s automated maintenance helps ensure long-term consistency in code documentation. We posit that integrating RepoAgent closely with the project development process can introduce a new paradigm for standardizing repositories and making them more readable. This, in turn, is expected to stimulate active community contributions and rapid development with higher overall quality of software projects.

RepoAgent can serve as an educational tool by providing clear and consistent documentation for codebases, making it easier for students and novice programmers to learn software development practices and understand complex codebases.

While RepoAgent aims to generate high-quality documentation, there’s a potential risk of generating biased or inaccurate content due to model hallucination.

Acknowledgments

Currently, RepoAgent mainly relies on remote API-based LLMs, which will have the opportunity to access users’ code data. This may raise security and privacy concerns, especially for proprietary software. Ensuring data protection and secure handling of the code is crucial.

We appreciate the suggestions and assistance from all the fellow students and friends in the community, including Arno (Bangsheng Feng), Guo Zhang, Qiang Guo, Yang Li, Yang Jiao, and others.

Appendix A Appendix: Experimental Details

Table A.2.2 presents the detailed statistics of the selected repositories and the token costs associated with the production of initial documentation. These repositories are sourced from both well-established, highly-starred projects and newly emerged, top-performing projects on GitHub. Repositories are characterized by their numbers of lines of code, classes and functions. Including global information like the project’s directory structure and bidirectional references results in very long prompts (as detailed in Appendix C). Despite this, the resulting documentation is thorough yet concise, typically ranging between 0.4k and 1k tokens in length.

During the actual generation process, we addressed the issue of varying text lengths across different models. When using models with shorter context lengths (e.g., gpt-3.5-turbo and the LLaMA series), RepoAgent adaptively switches to models with larger context lengths (e.g., gpt-3.5-16k or gpt-4-32k) based on the current prompt’s length, to cope with the token overhead of incorporating global perspectives. In cases where even these models’ limits are exceeded, RepoAgent truncates the content by simplifying the project’s directory structure and removing bidirectional reference code before reinitiating the documentation generation task. Such measures are infrequent when employing models with the longest contexts (128k), such as gpt-4-1106 or gpt-4-0125. This dynamic scheduling strategy, combined with variable network conditions, may influence token consumption. Nevertheless, RepoAgent ensures the integrity of the documentation while striving for cost-effectiveness to the greatest extent.
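The adaptive switching described above can be sketched as a lookup over models ordered by context size. Everything numeric here is an assumption for illustration: the context limits, the four-characters-per-token estimate, and the reply reserve are not values reported in the paper, and `choose_model` is a hypothetical name.

```python
# Illustrative context limits in tokens; assumptions, not values from the paper.
CONTEXT_LIMITS = [
    ("gpt-3.5-turbo", 4_096),
    ("gpt-3.5-16k", 16_384),
    ("gpt-4-32k", 32_768),
]


def estimate_tokens(text):
    """Rough heuristic: about four characters per token (an assumption)."""
    return len(text) // 4 + 1


def choose_model(prompt, reserve=1_000):
    """Return the first model whose context fits the prompt plus the reply, else None."""
    needed = estimate_tokens(prompt) + reserve
    for model, limit in CONTEXT_LIMITS:
        if needed <= limit:
            return model
    return None   # caller should truncate the prompt and retry
```

A `None` result corresponds to the fallback case above, in which the directory structure is simplified and the bidirectional reference code is removed before the task is retried.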

A.2 Settings

All experiments were conducted within a Python 3.11.4 environment. The system had CUDA 11.7 installed and was equipped with 8 NVIDIA A100 40GB GPUs.

A.2.2 Human Evaluation Protocol

We recruited three human evaluators to assess the code documentation generated by RepoAgent, and instructed all human evaluators to give an overall evaluation considering a set of evaluation criteria shown in Table 4. We randomly sampled 150 pieces of documentation from the repository. Subsequently, each human evaluator was assigned 50 pairs of documentation, each containing one human-authored and one model-generated documentation. The human evaluators were required to select the better documentation for each pair.

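| Repository | Model | Prompt Tokens | Completion Tokens | Class Numbers | Function Numbers | Code Lines |
|---|---|---|---|---|---|---|
| unoconv | gpt-4-0125 | 4020 | 2550 | 0 | 1 | ≤ 1k |
| | gpt-3.5-turbo | | 2743 | | | |
| | Llama-2-7b | 1180 | 2916 | | | |
| | Llama-2-70b | | 437 | | | |
| simdjson | gpt-4-0125 | 45344 | 35068 | 6 | 55 | ≤ 1k |
| | gpt-3.5-turbo | | 29736 | | | |
| | Llama-2-7b | 49615 | 27562 | | | |
| | Llama-2-70b | | 32961 | | | |
| greenlet | gpt-4-0125 | 86587 | 79113 | 59 | 319 | 1k to 10k |
| | gpt-3.5-turbo | | 260464 | | | |
| | Llama-2-7b | 33177 | 31561 | | | |
| | Llama-2-70b | | 225595 | | | |
| code2flow | gpt-4-0125 | 185511 | 134462 | 51 | 257 | 1k to 10k |
| | gpt-3.5-turbo | | 234101 | | | |
| | Llama-2-7b | 354574 | 431761 | | | |
| | Llama-2-70b | | 187835 | | | |
| AutoGen | gpt-4-0125 | 4939388 | 516975 | 64 | 590 | 1k to 10k |
| | gpt-3.5-turbo | | 288609 | | | |
| | Llama-2-7b | 889050 | 630139 | | | |
| | Llama-2-70b | | 410256 | | | |
| AutoGPT | gpt-4-0125 | 4116296 | 888223 | 318 | 1170 | ≥ 10k |
| | gpt-3.5-turbo | | 799380 | | | |
| | Llama-2-7b | 1838425 | 1893041 | | | |
| | Llama-2-70b | | 927946 | | | |
| ChatDev | gpt-4-0125 | 2021168 | 602474 | 183 | 729 | ≥ 10k |
| | gpt-3.5-turbo | | 519226 | | | |
| | Llama-2-7b | 1122400 | 946131 | | | |
| | Llama-2-70b | | 531838 | | | |
| MemGPT | gpt-4-0125 | 628482 | 345109 | 74 | 478 | ≥ 10k |
| | gpt-3.5-turbo | | 234101 | | | |
| | Llama-2-7b | 742591 | 740783 | | | |
| | Llama-2-70b | | 352940 | | | |
| MetaGPT | gpt-4-0125 | 154364 | 111159 | 291 | 885 | ≥ 10k |
| | gpt-3.5-turbo | | 134101 | | | |
| | Llama-2-7b | 1904244 | 2265991 | | | |
| | Llama-2-70b | | 1009996 | | | |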

The experiment aims to evaluate the model’s ability to perceive global context, which is reflected by the recall for identifying reference relationships. The comparison methods are:

ML-based method. Iyer et al. (2016) utilized traditional machine learning and deep learning methods for generating comments describing the functionality of code objects.

Long context concatenation. The method directly concatenates the code snippets until the context length reaches 128k to let the model discover reference relationships.

Single-object generation. Sun et al. (2023) used the GPT-3.5 series to generate documentation by directly feeding code snippets of the target object. We modified the prompt on this basis, adding requirements for outputting the callers and callees.

Notably, among these methods, only the ML-based approach failed to explicitly or implicitly manifest call relationships in the final document. While it is inherently challenging for a code snippet to discern its invocation throughout the entire repository, the code typically elucidates the current object’s calls explicitly. To measure the recall of callers and callees, we enhanced the original documentation by adding information about the calling functions (callers) and the called functions (callees). Then we compared the enriched documentation with our bidirectional reference data from MetaInfo.
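Recall in this comparison is the fraction of ground-truth references that the generated documentation mentions. A minimal sketch, assuming references are represented as sets of object names; the helper name and the convention of returning 1.0 when there is nothing to find are choices made for this example.

```python
def recall(predicted, reference):
    """Fraction of ground-truth references that appear among the predicted ones."""
    if not reference:
        return 1.0   # assumed convention: nothing to recall counts as perfect
    return len(predicted & reference) / len(reference)


# One of two true callers is found; the spurious "x" does not affect recall.
print(recall({"a", "x"}, {"a", "b"}))   # 0.5
```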

For long context concatenation, we randomly selected 20 objects from each of the 9 repositories, culminating in a total of 180 objects. Given the intricate nature of defining context construction criteria for repository-level documentation generation tasks, we circumvented direct concatenation of adjacent and file-adjacent context content. Instead, we formulated negative samples by extracting all objects with reference relationships to fulfill the context length. Leveraging the content of objects and negative sample content, we devised context lengths for the 180 objects, spanning from 29 to 6.0k Code Lines. This approach aimed to optimize the distribution of context lengths while maximizing the utilization of the model’s context length. In the case of single-object generation method, we utilized the same pool of 180 objects, providing the model with object source code snippets to generate documentation and elucidate reference relationships.

During the evaluation of both the Long Context Concatenation and Single-object Generation methods, we provided the model with tree-structured hierarchical position information for target objects and their related counterparts. This additional information was intended to help the model better identify callers and delineate them in a path form. Despite this assistance, the model’s misinterpretations worsened as the context length increased, and the Single-object Generation method yielded a substantial amount of speculative information, resulting in unstable and inaccurate caller relationship recognition.

The experiment evaluates whether the model-generated documentation follows the defined format. LLMs generally excel in instruction following, but the complexity of our task requires models to grasp core intents within lengthy prompts, posing a challenge. We use a one-shot approach with strict output examples, enabling evaluation of model answers through format matching algorithms. Specifically, we mandate that section titles be enclosed in bold symbols, ensure clear divisions between sections, and require contents within sections to be extractable and meaningful.
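A format-matching check along these lines can be sketched with a regular expression. The exact matching algorithm is not given in the text, so the pattern below is an assumption: it expects each section title on its own line, wrapped in `**`, optionally followed by a colon, and it treats the Examples section as required only when the object returns a value.

```python
import re

REQUIRED = ["Functionality", "Parameters", "Code Description", "Notes"]
OPTIONAL = ["Examples"]   # expected only when the object returns a value

# A bold title at the start of a line, with any text that follows on the same line.
SECTION = re.compile(r"^\*\*(?P<title>[^*]+)\*\*:?[ \t]*(?P<inline>.*)$", re.MULTILINE)


def split_sections(doc):
    """Map each bold section title to the text that follows it."""
    matches = list(SECTION.finditer(doc))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(doc)
        body = (m.group("inline") + doc[m.end():end]).strip()
        sections[m.group("title").strip()] = body
    return sections


def follows_format(doc, has_return):
    """True if every expected section is present and non-empty."""
    sections = split_sections(doc)
    wanted = REQUIRED + (OPTIONAL if has_return else [])
    return all(sections.get(title) for title in wanted)
```

Requiring non-empty bodies reflects the condition that section contents be extractable and meaningful, although judging meaningfulness itself is beyond a pattern match.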

We observed the shortcomings of open-source models (LLaMA series) in their ability to adhere to formatting. In contrast, the GPT-4 series models excellently achieve format integrity and stability. We also observed behavioral differences between the gpt-4-0125 and gpt-4-1106 models: the former appeared to produce more redundant information.

Format alignment can also be achieved with perfect accuracy using hierarchical or modular generation methods. However, this approach introduces a significant token overhead since each independent module must encompass complete global information and invocation relationships. The current method has demonstrated satisfactory performance on format alignment, meeting human readability standards effectively.

Accurately identifying and describing parameters or attributes (depending on whether the current object is a function or a class) in code is crucial as it helps readers quickly understand the design logic and usage. We extracted recognized parameters from the Parameters section using a matching pattern: parameters follow a uniform and fixed format, with the parameter name enclosed in code identifiers followed by the parameter’s descriptive text.

We organized the extracted parameters into arrays and calculated accuracy by comparing them with the values in the params field (also an array) of the Repository’s MetaInfo. It is important to note that we were calculating accuracy here, not recall. This is because some models may hallucinate many nonexistent parameters based on the code snippets. These errors must be taken into consideration, otherwise they will result in biased evaluations.
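The extraction and scoring steps can be sketched as follows. Two points are assumptions rather than details from the paper: the line pattern (an optional bullet, a name in backticks, then descriptive text), and the definition of accuracy, taken here as correctly identified names divided by all names that appear in either the prediction or the reference, so that hallucinated parameters lower the score.

```python
import re

# Optional bullet, parameter name in backticks, optional colon, then a description.
PARAM = re.compile(r"^[ \t]*[-*]?[ \t]*`(?P<name>[^`]+)`[ \t]*:?[ \t]*(?P<desc>.+)$",
                   re.MULTILINE)


def extract_params(parameters_section):
    """Pull parameter names out of the Parameters section of generated documentation."""
    return [m.group("name") for m in PARAM.finditer(parameters_section)]


def param_accuracy(predicted, reference):
    """Correct names over all names seen in either list (an assumed definition)."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)


section = "- `path`: file to read\n- `mode`: open mode\n- `verbose`: print progress\n"
names = extract_params(section)                  # ['path', 'mode', 'verbose']
score = param_accuracy(names, ["path", "mode"])  # 2 correct out of 3 names seen
```

In this example recall would be 1.0, since both true parameters are found, while the hallucinated `verbose` pulls the accuracy down to two thirds; this is the bias the choice of metric is meant to avoid.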

In this section, we showcase additional generated documentation to validate the practical application of RepoAgent. The included images are direct screenshots from the documentation of two open-source projects, ChatDev and AutoGen. Our intent is to provide readers with a detailed and panoramic view of how our method is utilized in real-world scenarios, thereby offering a deeper understanding of its effectiveness and versatility.

Appendix C Appendix: Full Prompts

Appendix D Appendix: Chat With Repo

Moving beyond documentation generation, we are actively exploring how best to use RepoAgent and examining its potential for a broader range of downstream applications in the future. We categorize these applications as:

Automatic Q&A for Issues and Source Codes

Generation of Public Tutorial Documentation

We conceptualize “Chat With Repo” as a unified gateway for these downstream applications, acting as a connector that links RepoAgent to human users and other AI agents. Our future research will focus on adapting the interface to various downstream applications and customizing it to meet their unique characteristics and implementation requirements.

Here we demonstrate a preliminary prototype of Automatic Q&A for Issues and Code Explanation. A running example is shown in Figure 13. The program begins by pre-vectorizing code documentation and storing it in a vector database. When a query request is received, it is transformed into an embedding vector for fetching relevant documentation information from the database. This is followed by using the documentation’s MetaInfo to locate the pertinent source code, effectively retrieving relevant sections of both documentation text and source code. Moreover, beyond embedding search, a multi-way recall mechanism has been developed, incorporating entity recognition with keyword search. This involves extracting code entities from the user’s question using an LLM, and conducting searches across documentation and code repositories to match the top K returned documentation and code blocks. A weighting module has been developed for recalling the most relevant information. Additionally, we input directory tree information to help the model better understand the entire repository. The final step is to concatenate documentation and code blocks retrieved through both mechanisms, along with the target object’s parent code, referencing code, and directory tree information, into a prompt for the LLM to generate answers. This RAG-based retrieval system bridges human natural language with code language, enabling precise recall at the repository level and paving the way for downstream applications.
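The multi-way recall idea can be illustrated with a toy example that mixes an embedding score with a keyword score. This is only a sketch under strong simplifications: the bag-of-words `embed` function stands in for a real embedding model and vector database, the `entities` argument stands in for the code entities an LLM would extract from the question, and the equal weights are arbitrary. All names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, entities, docs, k=2, w_embed=0.5, w_keyword=0.5):
    """Rank documented objects by a weighted mix of embedding and keyword scores."""
    q = embed(query)
    scores = {}
    for name, text in docs.items():
        keyword_hit = 1.0 if any(e in name or e in text for e in entities) else 0.0
        scores[name] = w_embed * cosine(q, embed(text)) + w_keyword * keyword_hit
    return sorted(scores, key=scores.get, reverse=True)[:k]


docs = {
    "load_config": "Reads the YAML configuration file and returns a dict.",
    "save_model": "Writes model weights to disk.",
    "train": "Runs the training loop over the dataset.",
}
print(retrieve("How is the configuration file read?", ["load_config"], docs, k=1))
# ['load_config']
```

The retrieved names would then be used to look up the matching documentation and source code, which are concatenated with the directory tree into the final prompt.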

A real world “Chat With Repo” example with input and output is shown as follows.