Molly White covers some of the code that was entered into evidence at the FTX trial, in the form of GitHub screenshots. Seeing the fraud spelled out in code was weird for me, because I’ve never before seen a criminal case full of screenshots of software I work on.
Code can be hard to understand, but this code was strong evidence for the prosecution because it was more or less:
if is_alameda_research:
    do_fraud()
New life goal: Never have my code entered into evidence at a criminal trial.
Doug Turnbull created a new Python package called SearchArray for experimenting with search relevance tuning.
It can supercharge any dataframe into a BM25-powered term/phrase index. Under the hood it’s a Pandas extension array backed by a traditional inverted index. Its tokenizers are just Python functions that turn strings into lists of tokens. Its stemmers are just… boring Python packages.
Previously, to run a search relevance experiment, I’d have to stand up a bunch of systems. But now, with SearchArray, everything can just run in a single Colab notebook.
I like this idea a lot. A Pandas-based search backend makes a lot of sense for small scale relevance experiments. When you’ve narrowed in on something that seems promising, then you can translate it to Lucene or whatever.
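For intuition about what the BM25 scoring inside an index like this actually computes, here’s a minimal sketch in plain Python. To be clear, this is my own toy scorer, not SearchArray’s actual API, and the parameter defaults are just the conventional Okapi BM25 values:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against a query (also tokens)
    using the classic Okapi BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many documents contain each term
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            n = df[term]
            if n == 0:
                continue  # term appears in no document
            idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
            f = tf[term]
            # Term frequency saturation (k1) and length normalization (b)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

A real inverted index avoids the per-document loop by walking posting lists, but the scoring math is the same, which is why a small Pandas-backed version is good enough for relevance experiments.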
Classicist Emily Wilson surveys translations of one of the most famous scenes in The Iliad:
In one of the most moving and memorable scenes from the “Iliad,” the great Trojan warrior Hector says farewell to his wife, Andromache, who has urged him not to risk his life by fighting on the plain. He gives their baby back to her, tells her to go home, and reiterates his decision to advance on the enemy….
Each of these translations — along with dozens more — suggests a different understanding of the central themes of courage, marriage, fate and death.
She includes translations spanning four centuries, including George Chapman (1611), Alexander Pope (1715), Samuel Butler (1898), Robert Fagles (1990), and her own recently published translation:
With these words,
he gave his son to his beloved wife.
She let him snuggle in her perfumed dress,
and tearfully she smiled. Her husband noticed
and pitied her. He took her by the hand
and said to her,
“Strange woman! Come on now,
you must not be too sad on my account.
No man can send me to the house of Hades
before my time. No man can get away
from destiny, first set for us at birth,
however cowardly or brave he is.
Go home and do the things you have to do.
Work on your loom and spindle and instruct
the slaves to do their household work as well.
War is a task for men — for every man
born here in Troy, but most especially, me.”
When he had finished speaking, glorious Hector
picked up his helmet with its horsehair plume.
His loving wife set off for home, but kept
twisting and turning back to look at him.
More and more tears kept flooding down her face.
Here are a couple more translations of the same passage. Literature and History’s episode used Caroline Alexander’s 2015 translation:
So speaking he placed in the hands of his beloved wife
his son; and she took him to her perfumed breast,
laughing as she cried. And her husband took pity, watching,
and with his hand he caressed her and spoke to her and said her name;
“Foolish one, do not, I beg you, distress your heart too much.
No man against fate will hurl me to Hades;
for no man, I think, escapes destiny,
not the cowardly, nor the brave, once he is born.
But go to the house and tend to your work,
to your loom and distaff, and direct your handmaids
to ply their work; war is the concern of men,
all men, and me most of all, who live in Ilion.”
So speaking, shining Hector took up his helmet
crested with horsehair; and his beloved wife went home,
turning to look back all the while, letting the full tears fall.
Inspired by Literature and History, I’m in the middle of reading Stanley Lombardo’s 1997 translation. He renders the scene as:
And he put his son in the arms of his wife,
And she enfolded him in her fragrant bosom
Laughing through her tears. Hector pitied her
And stroked her with his hand and said to her:
“You worry too much about me, Andromache.
No one is going to send me to Hades before my time,
And no man has ever escaped his fate, rich or poor,
Coward or hero, once born into this world.
Go back to the house now and take care of your work,
The loom and shuttle, and tell the servants
To get on with their jobs. War is the work of men,
Of all the Trojan men, and mine especially.”
With these words, Hector picked up
His plumed helmet, and his wife went back home,
Turning around often, her cheeks flowered with tears.
Using ChatGPT as an editor
I decided to submit a talk about GitHub code search to the final Strange Loop. It was close to the deadline, but I thought I had the nucleus of an idea that was worth submitting. I wrote an outline, but I was having trouble coming up with a short, punchy abstract that would sell the talk. As the CFP put it:
[The description] should sell your talk to attendees (and reviewers). The abstract should answer the attendee’s question “Why should I attend this talk? What will I learn?”
Since I had been playing around with LLMs, I decided to enlist ChatGPT to help me out.
Here is the original abstract I wrote on my own:
GitHub code search is the world’s largest publicly available code search engine, with over 60 million repositories indexed, including forks. It supports exact match searches, regular expressions, autocomplete, and go-to-definition. Building GitHub code search required entirely rethinking how GitHub did code search and some innovative software engineering. In this talk, I’ll share some lessons from building and scaling code search to bring these advanced features to all GitHub users.
And here is the final version, produced by going back and forth with ChatGPT:
In this talk, I’ll share some lessons we learned building a high-performance code search engine, designed to meet GitHub’s large scale. GitHub code search is the world’s largest publicly available code search engine, with more than 60 million repositories and over 160 TB of content indexed. To build it, we had to turn the unique content-addressable nature of Git repositories to our advantage. I’ll cover the key strategies we used, including using deduplication and repository similarity to reduce indexing workload, full index compaction to remove deleted documents, multiple levels of sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.
I like this version better for sure! And apparently the reviewers did, too, because it was accepted.
Working with ChatGPT was interesting, though. When I work with a person to edit something, I feel like we iteratively narrow in on a product we both like. ChatGPT, on the other hand, always suggested something – including suggestions that contradicted ones it had previously made.
For example, in one suggested edit, ChatGPT wrote:
We’ll share the lessons we learned while building a custom search engine from scratch, designed to meet GitHub’s unique scale and feature requirements.
I incorporated that language into my next edit and asked it to critique it. ChatGPT replied:
Overall, the abstract is clear and concise, but there are some areas where it can be improved:
“designed to meet GitHub’s unique scale and feature requirements” - This phrase could be reworded to sound less like marketing jargon. Instead, you could say “designed to handle GitHub’s massive scale and specific feature needs”.
Haha, thanks buddy!
In other areas, it helped clarify my thinking. I had a vague idea about using “data shape” to do less work. ChatGPT was having none of it:
“I’ll cover the key strategies we used, including understanding data shape to do less work” - This phrase could be clearer. What does “understanding data shape” mean? Perhaps you could say “leveraging the structure of Git repositories to reduce indexing workload”.
I fixed that by being more specific about what I was trying to get across: that Git repositories are content addressable, and we used that property to make indexing efficient.
If you’re planning to incorporate ChatGPT into your workflow, be aware of its tendency towards self-aggrandizement. Ben Hoyt recently pointed this out in his blog post about using LLMs when applying for a job. In the first abstract I asked it to write, ChatGPT started with “Discover the secrets behind GitHub’s world-renowned code search engine…”. I told it to stop doing that and it toned down the BS for the rest of the session.
One thing that editing with ChatGPT had in common with self-editing is that it took forever. I went through several attempts at rewriting my abstract, pasting it back in to get feedback, and then trying to incorporate the feedback. It had a lot in common with the typical writing workflow of staring at the page and sweating blood until the words come out.
I’m not ready to ditch a human reader yet, but I did find this useful, especially so close to the submission deadline.
If you’re interested, the full transcript can be found below.
Luke: I’m submitting a talk titled “Lessons from scaling GitHub code search” to the Strange Loop conference in the “Database and Distributed Systems” track. I need to write a description for the talk based on these instructions: “The abstract should answer the attendee’s question “Why should I attend this talk? What will I learn?” The sweet spot is 3-5 sentences. An attendee may be comparing 5 concurrent talks and they want to know whether it’s relevant to them”.
Please help me write a compelling 5 sentence description for the talk that will engage potential attendees based on the following outline:
Introduction
- If you’ve used GitHub code search over the past 10 years, you may have noticed that it wasn’t very good
- Even back when code search launched 10 years ago, people wanted to fix that, but despite several attempts, we couldn’t deliver a working solution
- To solve that, we had to go back to the drawing board and build a custom code search engine from scratch, engineered to meet GitHub’s scale and feature requirements
- Today, I’m going to share some of the lessons we learned building this system
Make indexing FAST
- Indexing speed matters a lot, because testing changes often requires reindexing
- This was a major reason why the old GitHub code search never got better: It was too expensive in time and money to try things
- Scale example (billions of documents, TB of content in the index, etc.)
- We made it fast by rewriting it in Rust. Just kidding. Sort of.
- While the core search index is written in Rust and can process over 150,000 documents per second, some key insights into our data are what really allow us to scale
Index less data
- The old code search had to do a lot of work indexing duplicate content
- Example of “flattening” a Git tree to documents
- In the new project, we use the structure of Git repositories to our advantage by deduplicating content
- Indexing deltas between repositories also reduces index size
Clean up after yourself
- Search indexes are typically implemented as append-only structures. Deletions are marked, and then removed during a “compaction” phase
- The old code search had major problems with this, because Lucene will only compact up to a maximum segment size. This left deleted documents in our index, hurting performance.
- Our solution implements full compaction, so we completely remove deleted documents regularly
Share the load
- We use sharding at multiple levels to get the best utilization out of our servers for search
- Sharding by blob SHA (mentioned above) spreads load across the cluster (though it does mean we have to contact every shard for every search request)
- On a host, the index’s posting lists are sharded as well
- Compaction responsibilities are also shared within the cluster using a dynamic shard assignment algorithm
- Each host holds two full shards, and we use the Ford-Fulkerson maximum flow algorithm to distribute requests
Conclusion
By taking a fresh look at our data, we were able to build a search solution that works for GitHub’s scale, and have finally brought powerful code search features to our users.
ChatGPT: In this talk, you’ll learn about the lessons we learned while building a custom code search engine for GitHub. We’ll cover key strategies for scaling search, such as indexing only necessary data, fast indexing, and efficient indexing cleanup. We’ll also discuss our use of sharding at multiple levels to distribute the load across our servers. Come discover how we turned our code search from a frustrating experience to a powerful feature for our users.
Luke: Not bad! Here’s the version I came up with by myself:
GitHub code search is the world’s largest publicly available code search engine, with over 60 million repositories indexed, including forks. It supports exact match searches, regular expressions, autocomplete, and go-to-definition. Building GitHub code search required entirely rethinking how GitHub did code search and some innovative software engineering. In this talk, I’ll share some lessons from building and scaling code search to bring these advanced features to all GitHub users.
Please edit my version.
ChatGPT: Your version is a good start! Here’s a slightly revised version that emphasizes the specific strategies you used to scale and improve GitHub’s code search:
Discover the secrets behind GitHub’s world-renowned code search engine, which indexes over 60 million repositories, including forks, and supports advanced search features like exact matches, regular expressions, and go-to-definition. In this talk, we’ll share the lessons we learned while rebuilding GitHub’s code search from scratch, using innovative software engineering techniques to achieve lightning-fast indexing speeds and efficient data storage. You’ll discover how we optimized our indexing process to handle billions of documents and terabytes of content, and how we use sharding to share the load across our servers. By the end of the talk, you’ll understand the key strategies for scaling a code search engine and how to deliver powerful search capabilities to your users.
Luke: Rewrite the above response but use less buzzwords like “innovative” and “world-renowned”.
ChatGPT: Learn about the inner workings of GitHub’s code search engine, which indexes over 60 million repositories and supports advanced search features such as exact matches, regular expressions, and go-to-definition. Join us as we share the lessons we learned from rebuilding the code search engine from scratch, focusing on strategies for lightning-fast indexing speeds and efficient data storage. We’ll discuss how we optimized our indexing process to handle billions of documents and terabytes of content, and how we use sharding to distribute the load across our servers. Discover key approaches for scaling a code search engine and delivering powerful search capabilities to your users.
Luke: Critique this abstract:
In this talk, I’ll share the lessons we learned building a custom code search engine for GitHub. GitHub code search is the world’s largest publicly available code search engine, indexing more than 60 million repositories and over 160 TB of code. To build GitHub code search, we had to entirely rethink our indexing to leverage the unique nature of Git. We’ll cover the key strategies we used for building a performant search engine, including indexing less data, index compaction, sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.
ChatGPT: This abstract is informative and covers the main points of the talk, but it could benefit from more specific details to make it more compelling. Here’s a revised version that expands on some of the topics and adds more concrete examples:
Join us as we delve into the world of GitHub code search, the largest publicly available code search engine, with over 60 million repositories and 160 TB of code indexed. We’ll share the lessons we learned while building a custom search engine from scratch, designed to meet GitHub’s unique scale and feature requirements. Our journey required us to rethink indexing entirely and leverage the unique nature of Git repositories to our advantage. Learn about our key strategies for building a high-performance search engine, such as indexing only necessary data, implementing full index compaction to remove deleted documents, and utilizing sharding to balance the load across our servers. We’ll also explore how we optimized indexing speed by rewriting parts of the search engine in Rust and leveraging our understanding of our data to scale. Come discover how we transformed code search into a powerful feature that empowers GitHub users every day.
Luke: Critique this version:
In this talk, I’ll share some lessons we learned building a high-performance code search engine, designed to meet GitHub’s unique scale and feature requirements. GitHub code search is the world’s largest publicly available code search engine, with more than 60 million repositories and over 160 TB of content indexed. To build it, we had to turn the unique nature of Git repositories to our advantage. I’ll cover the key strategies we used, including understanding data shape to do less work, full index compaction to remove deleted documents, multiple levels of sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.
If you rewrite it, keep it the same length.
ChatGPT: Overall, the abstract is clear and concise, but there are some areas where it can be improved:
- “designed to meet GitHub’s unique scale and feature requirements” - This phrase could be reworded to sound less like marketing jargon. Instead, you could say “designed to handle GitHub’s massive scale and specific feature needs”.
- “To build it, we had to turn the unique nature of Git repositories to our advantage.” - This sentence could benefit from more detail. What was the unique nature of Git repositories that you used to your advantage? This could be explained more clearly or elaborated upon in the talk.
- “I’ll cover the key strategies we used, including understanding data shape to do less work” - This phrase could be clearer. What does “understanding data shape” mean? Perhaps you could say “leveraging the structure of Git repositories to reduce indexing workload”.
- “Come discover how we turned code search from a frustrating experience to a powerful feature for our users.” - This sentence is a good call to action, but it could be more specific about the benefits attendees will gain from the talk.
Here’s a possible rewrite that addresses these points:
In this talk, I’ll share some lessons we learned building a high-performance code search engine that handles GitHub’s massive scale and specific feature needs. With over 60 million repositories and 160 TB of content indexed, we had to leverage the structure of Git repositories to reduce indexing workload and create a faster, more efficient search engine. I’ll cover key strategies such as index compaction to remove deleted documents, multiple levels of sharding, and load balancing to optimize search performance. By attending, you’ll discover how these techniques turned code search from a frustrating experience to a powerful feature, improving productivity for GitHub users.