Using ChatGPT as an editor

I decided to submit a talk about GitHub code search to the final Strange Loop. It was close to the deadline, but I thought I had the nucleus of an idea that was worth submitting. I wrote an outline, but I was having trouble coming up with a short, punchy abstract that would sell the talk. As the CFP put it:

[The description] should sell your talk to attendees (and reviewers). The abstract should answer the attendee’s question “Why should I attend this talk? What will I learn?”

Since I had been playing around with LLMs, I decided to enlist ChatGPT to help me out.

Here is the original abstract I wrote on my own:

GitHub code search is the world’s largest publicly available code search engine, with over 60 million repositories indexed, including forks. It supports exact match searches, regular expressions, autocomplete, and go-to-definition. Building GitHub code search required entirely rethinking how GitHub did code search and some innovative software engineering. In this talk, I’ll share some lessons from building and scaling code search to bring these advanced features to all GitHub users.

And here is the final version, produced by going back and forth with ChatGPT:

In this talk, I’ll share some lessons we learned building a high-performance code search engine, designed to meet GitHub’s large scale. GitHub code search is the world’s largest publicly available code search engine, with more than 60 million repositories and over 160 TB of content indexed. To build it, we had to turn the unique content-addressable nature of Git repositories to our advantage. I’ll cover the key strategies we used, including using deduplication and repository similarity to reduce indexing workload, full index compaction to remove deleted documents, multiple levels of sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.

I like this version better for sure! And apparently the reviewers did, too, because it was accepted.

Working with ChatGPT was interesting, though. When I work with a person to edit something, I feel like we iteratively narrow in on a product we both like. ChatGPT, on the other hand, always had another suggestion, including ones that contradicted suggestions it had previously made.

For example, in one suggested edit, ChatGPT wrote:

We’ll share the lessons we learned while building a custom search engine from scratch, designed to meet GitHub’s unique scale and feature requirements.

I incorporated that language into my next draft and asked for a critique. ChatGPT replied:

Overall, the abstract is clear and concise, but there are some areas where it can be improved:

“designed to meet GitHub’s unique scale and feature requirements” - This phrase could be reworded to sound less like marketing jargon. Instead, you could say “designed to handle GitHub’s massive scale and specific feature needs”.

Haha, thanks buddy!

In other areas, it helped clarify my thinking. I had a vague idea about using “data shape” to do less work. ChatGPT was having none of it:

“I’ll cover the key strategies we used, including understanding data shape to do less work” - This phrase could be clearer. What does “understanding data shape” mean? Perhaps you could say “leveraging the structure of Git repositories to reduce indexing workload”.

I fixed that by being more specific about what I was trying to get across: that Git repositories are content-addressable, and we used that property to make indexing efficient.
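
To make that concrete, here's a rough sketch of why content addressability helps. This is not the real indexer; the Indexer type and the toy object IDs are made up for illustration. The idea is just that identical file contents always have the same Git object ID, so an indexer can skip duplicates (forks, vendored copies, identical files) with a cheap set lookup instead of re-tokenizing the content.

  use std::collections::HashSet;

  // Hypothetical indexer, for illustration only: it deduplicates by Git blob
  // object ID (OID). Because Git is content-addressable, two files with the
  // same contents share the same OID, so each unique blob is processed once.
  struct Indexer {
      seen_blobs: HashSet<String>, // OIDs of blobs already in the index
      documents_indexed: usize,
  }

  impl Indexer {
      fn new() -> Self {
          Indexer { seen_blobs: HashSet::new(), documents_indexed: 0 }
      }

      // Index a blob only if we have not seen its OID before.
      fn index_blob(&mut self, oid: &str, content: &str) {
          if self.seen_blobs.insert(oid.to_string()) {
              // First time we see this content: tokenize and add to the index.
              let _tokens: Vec<&str> = content.split_whitespace().collect();
              self.documents_indexed += 1;
          }
          // Duplicate blobs cost only a set lookup, not a full indexing pass.
      }
  }

  fn main() {
      let mut indexer = Indexer::new();
      // The same blob appearing in a fork has the same OID, so it is skipped.
      indexer.index_blob("ce01362", "fn main() { println!(\"hello\"); }");
      indexer.index_blob("ce01362", "fn main() { println!(\"hello\"); }");
      println!("unique documents indexed: {}", indexer.documents_indexed);
  }

That's the gist of "leveraging the structure of Git repositories to reduce indexing workload", stated far more plainly than my original "understanding data shape to do less work".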

If you’re planning to incorporate ChatGPT into your workflow, be aware of its tendency towards self-aggrandizement. Ben Hoyt recently pointed this out in his blog post about using LLMs when applying for a job. In the first abstract I asked it to write, ChatGPT started with “Discover the secrets behind GitHub’s world-renowned code search engine…”. I told it to stop doing that and it toned down the BS for the rest of the session.

One thing that editing with ChatGPT had in common with self-editing was that it took forever. I went through several attempts at rewriting my abstract, pasting each one back in for feedback, and then trying to incorporate that feedback. It had a lot in common with the typical writing workflow of staring at the page and sweating blood until the words come out.

I’m not ready to ditch a human reader yet, but I did find this useful, especially so close to the submission deadline.

If you’re interested, the full transcript can be found below.


Luke: I’m submitting a talk titled “Lessons from scaling GitHub code search” to the Strange Loop conference in the “Database and Distributed Systems” track. I need to write a description for the talk based on these instructions: “The abstract should answer the attendee’s question “Why should I attend this talk? What will I learn?” The sweet spot is 3-5 sentences. An attendee may be comparing 5 concurrent talks and they want to know whether it’s relevant to them”.

Please help me write a compelling 5 sentence description for the talk that will engage potential attendees based on the following outline:

Introduction

  • If you’ve used GitHub code search over the past 10 years, you may have noticed that it wasn’t very good
  • Even back when code search launched 10 years ago, people wanted to fix that, but despite several attempts, we couldn’t deliver a working solution
  • To solve that, we had to go back to the drawing board and build a custom code search engine from scratch, engineered to meet GitHub’s scale and feature requirements
  • Today, I’m going to share some of the lessons we learned building this system

Make indexing FAST

  • Indexing speed matters a lot, because testing changes often requires reindexing
  • This was a major reason why the old GitHub code search never got better: It was too expensive in time and money to try things
  • Scale example (billions of documents, TB of content in the index, etc.)
  • We made it fast by rewriting it in Rust. Just kidding. Sort of.
  • While the core search index is written in Rust and can process over 150,000 documents per second, some key insights into our data are what really allow us to scale

Index less data

  • The old code search had to do a lot of work indexing duplicate content
  • Example of “flattening” a Git tree to documents
  • In the new project, we use the structure of Git repositories to our advantage by deduplicating content
  • Indexing deltas between repositories also reduces index size

Clean up after yourself

  • Search indexes are typically implemented as append only structures. Deletions are marked, and then removed during a “compaction” phase
  • The old code search had major problems with this, because Lucene will only compact up to a maximum segment size. This left deleted documents in our index, hurting performance.
  • Our solution implements full compaction, so we completely remove deleted documents regularly and move

Share the load

  • We use sharding at multiple levels to get the best utilization out of our servers for search
  • Sharding by blob SHA (mentioned above) spreads load across the cluster (though it does mean we have to contact every shard for every search request)
  • On a host, the index’s posting lists are sharded as well
  • Compaction responsibilities are also shared within the cluster using a dynamic shard assignment algorithm
  • Each host holds two full shards, and we use the Ford-Fulkerson maximum flow algorithm to distribute requests

Conclusion

By taking a fresh look at our data, we were able to build a search solution that works for GitHub’s scale, and have finally brought powerful code search features to our users.

ChatGPT: In this talk, you’ll learn about the lessons we learned while building a custom code search engine for GitHub. We’ll cover key strategies for scaling search, such as indexing only necessary data, fast indexing, and efficient indexing cleanup. We’ll also discuss our use of sharding at multiple levels to distribute the load across our servers. Come discover how we turned our code search from a frustrating experience to a powerful feature for our users.

Luke: Not bad! Here’s the version I came up with by myself:

GitHub code search is the world’s largest publicly available code search engine, with over 60 million repositories indexed, including forks. It supports exact match searches, regular expressions, autocomplete, and go-to-definition. Building GitHub code search required entirely rethinking how GitHub did code search and some innovative software engineering. In this talk, I’ll share some lessons from building and scaling code search to bring these advanced features to all GitHub users.

Please edit my version.

ChatGPT: Your version is a good start! Here’s a slightly revised version that emphasizes the specific strategies you used to scale and improve GitHub’s code search:

Discover the secrets behind GitHub’s world-renowned code search engine, which indexes over 60 million repositories, including forks, and supports advanced search features like exact matches, regular expressions, and go-to-definition. In this talk, we’ll share the lessons we learned while rebuilding GitHub’s code search from scratch, using innovative software engineering techniques to achieve lightning-fast indexing speeds and efficient data storage. You’ll discover how we optimized our indexing process to handle billions of documents and terabytes of content, and how we use sharding to share the load across our servers. By the end of the talk, you’ll understand the key strategies for scaling a code search engine and how to deliver powerful search capabilities to your users.

Luke: Rewrite the above response but use less buzzwords like “innovative” and “world-renowned”.

ChatGPT: Learn about the inner workings of GitHub’s code search engine, which indexes over 60 million repositories and supports advanced search features such as exact matches, regular expressions, and go-to-definition. Join us as we share the lessons we learned from rebuilding the code search engine from scratch, focusing on strategies for lightning-fast indexing speeds and efficient data storage. We’ll discuss how we optimized our indexing process to handle billions of documents and terabytes of content, and how we use sharding to distribute the load across our servers. Discover key approaches for scaling a code search engine and delivering powerful search capabilities to your users.

Luke: Critique this abstract:

In this talk, I’ll share the lessons we learned building a custom code search engine for GitHub. GitHub code search is the world’s largest publicly available code search engine, indexing more than 60 million repositories and over 160 TB of code. To build GitHub code search, we had to entirely rethink our indexing to leverage the unique nature of Git. We’ll cover the key strategies we used for building a performant search engine, including indexing less data, index compaction, sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.

ChatGPT: This abstract is informative and covers the main points of the talk, but it could benefit from more specific details to make it more compelling. Here’s a revised version that expands on some of the topics and adds more concrete examples:

Join us as we delve into the world of GitHub code search, the largest publicly available code search engine, with over 60 million repositories and 160 TB of code indexed. We’ll share the lessons we learned while building a custom search engine from scratch, designed to meet GitHub’s unique scale and feature requirements. Our journey required us to rethink indexing entirely and leverage the unique nature of Git repositories to our advantage. Learn about our key strategies for building a high-performance search engine, such as indexing only necessary data, implementing full index compaction to remove deleted documents, and utilizing sharding to balance the load across our servers. We’ll also explore how we optimized indexing speed by rewriting parts of the search engine in Rust and leveraging our understanding of our data to scale. Come discover how we transformed code search into a powerful feature that empowers GitHub users every day.

Luke: Critique this version:

In this talk, I’ll share some lessons we learned building a high-performance code search engine, designed to meet GitHub’s unique scale and feature requirements. GitHub code search is the world’s largest publicly available code search engine, with more than 60 million repositories and over 160 TB of content indexed. To build it, we had to turn the unique nature of Git repositories to our advantage. I’ll cover the key strategies we used, including understanding data shape to do less work, full index compaction to remove deleted documents, multiple levels of sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.

If you rewrite it, keep it the same length.

ChatGPT: Overall, the abstract is clear and concise, but there are some areas where it can be improved:

  • “designed to meet GitHub’s unique scale and feature requirements” - This phrase could be reworded to sound less like marketing jargon. Instead, you could say “designed to handle GitHub’s massive scale and specific feature needs”.
  • “To build it, we had to turn the unique nature of Git repositories to our advantage.” - This sentence could benefit from more detail. What was the unique nature of Git repositories that you used to your advantage? This could be explained more clearly or elaborated upon in the talk.
  • “I’ll cover the key strategies we used, including understanding data shape to do less work” - This phrase could be clearer. What does “understanding data shape” mean? Perhaps you could say “leveraging the structure of Git repositories to reduce indexing workload”.
  • “Come discover how we turned code search from a frustrating experience to a powerful feature for our users.” - This sentence is a good call to action, but it could be more specific about the benefits attendees will gain from the talk.

Here’s a possible rewrite that addresses these points:

In this talk, I’ll share some lessons we learned building a high-performance code search engine that handles GitHub’s massive scale and specific feature needs. With over 60 million repositories and 160 TB of content indexed, we had to leverage the structure of Git repositories to reduce indexing workload and create a faster, more efficient search engine. I’ll cover key strategies such as index compaction to remove deleted documents, multiple levels of sharding, and load balancing to optimize search performance. By attending, you’ll discover how these techniques turned code search from a frustrating experience to a powerful feature, improving productivity for GitHub users.