Integrating Rust and Go: Lessons from GitHub Code Search

It took a while, but the recording of my talk from RustConf 2023 has now been published. This was the first conference talk I’d given in several years and there are some things I wish I could change. The contrast on the slides was not good enough for a projector (they turned off the lights halfway through the talk, which was a little startling for me) and I think the code examples were too involved. But I learned from the experience and I’m glad that it’s now available for others to watch. If you’re interested in how to make Rust and Go work together, check it out!

— February 10, 2024

Life is easy to chronicle, but bewildering to practice, and we welcome “nerves” or any other shibboleth that will cloak our personal desire.

— E.M. Forster, A Room with a View
Dark hill covered in colorful lights

Field of Light, Sensorio, Paso Robles, January 2024

— January 6, 2024

My favorite books of 2022

I read 32 books in 2022. Here are some standouts:

The First World War

The First World War, Michael Howard

I wanted to learn more about the First World War and this short book seemed like a good place to start. It’s a brisk overview of the causes and course of the war, with a great bibliography for further reading.

The Republic for Which It Stands

The Republic for Which It Stands: The United States During Reconstruction and the Gilded Age, 1865-1896, Richard White

I started reading this very long book in 2021 and set it down for several months. But at some point, I needed something to read and so I picked it back up. I’m not sure what changed, but I was engrossed. This volume (part of The Oxford History of the United States) covers the time from Lincoln’s funeral to the close of the Gilded Age. Reading it, I could start to see how American society became what it is today. It’s a fascinating book, and incredibly depressing, especially the failure of Reconstruction and attempts to coerce Native Americans into the post-Civil War ideal of “home” and contract labor. The New York Times review of The Republic for Which It Stands gives a good summary.

Reading this book led me to another of Richard White’s books, Who Killed Jane Stanford? which is his true crime investigation into the murder of Jane Stanford, founder of Stanford University and noted wack-job. The story of Stanford and her murder is completely crazy, and probably led to Stanford’s current preeminence because it ended her eccentric governance of the university.

Circe

Circe, Madeline Miller

During the pandemic, I found a podcast that combined two things I enjoy: literature and history. I remember taking a deep breath before hitting play on the first episode. It was so long, almost an hour and a half. But from the beginning, I was hooked. The conceit of exploring Anglophone literature starting at the invention of writing was too good, and the early episodes about the (quite recently recovered) ancient literature of the Middle East fascinated me. The show soon moved into Greek literature, which gave me a strong base to appreciate Circe.

You don’t need to listen to 15 hours on Greek gods and myths or get a degree in Classics to appreciate this book. It’s wonderful as a novel. But I loved getting so many of the references — not to mention seeing things from another angle not presented in the myths or The Odyssey.

I also deeply enjoyed playing Hades for the same reason. That game has some deep cut mythological and literary references!

Calhoun: American Heretic

Calhoun: American Heretic, Robert Elder

I picked this up on a recommendation from Jamelle Bouie on Twitter. My old neighborhood in Minneapolis abutted a lake named after Calhoun, and my main connection with his legacy was the controversy over renaming the lake and a glancing knowledge of his role in secession and as a slave owner. This book fixed that. It covers Calhoun’s rise to prominence as part of the second generation of leaders in the young United States, and his turn towards protecting slave power, which ultimately led to secession and the Civil War. Calhoun always said he wanted to preserve the union, and proposed constitutional changes to protect minority (that is, slave owner) interests. But these suggestions always had an “or else”, and after he died secessionists took up his banner. He was treated as a founding father of the Confederacy.

In the epilogue, Elder examines Calhoun’s legacy. Controversial in his own day and condemned after the war, Calhoun’s reputation enjoyed a revival after Reconstruction failed. His theory of the “concurrent majority” has drawn interest from modern political theorists dealing with deeply divided populaces. For me, it was hard not to see echoes of the concurrent majority in the current vetocracy of the United States government. Lately, monuments to Calhoun have been coming down: he is again an American heretic.

[I]t is not difficult to draw a straight line that runs from Calhoun through the Confederacy and Massive Resistance in the civil rights era to Dylann Roof’s evil act and the most extreme elements of our society today. But by the time that line reaches us, it is suspiciously faint, as if the forces in our history that Calhoun represents have attenuated to the point that they only exercise their force on radical extremists, instead of continuing to operate powerfully and silently in the structures of our society, our governments, and ourselves. In excising Calhoun’s name from buildings, toppling monuments to his memory, and associating him with fringe elements of our society, we should be careful not to forget his central role in our past, and unavoidably, our present. If we excommunicate Calhoun by casting him as the defender of a rejected path in American history, the antidemocratic defender of an antiquated brutality, and the father of the failed Confederacy, we will be unable to see the lines running from Calhoun’s America to our own. And if we reduce him to his defense of slavery, which set him apart even in his own day, we may miss the fact that when Calhoun proclaimed the United States, not the Confederacy, “the government of the white man,” it was possibly one of the least controversial things that he ever said.

Nostromo by Joseph Conrad

Nostromo, Joseph Conrad

Conrad is one of my favorite authors, but I haven’t actually read that many of his books. Reading a review of Maya Jasanoff’s biography The Dawn Watch: Joseph Conrad in a Global World made me want to read Nostromo, the only book she covers in detail that I hadn’t read.

Nostromo is the story of a revolution in the imaginary Latin American country of Costaguana which is home to a gigantic silver mine. The panicked mine owner entrusts the shipment of silver to the indispensable man, head longshoreman Giovanni Fidanza, known as Nostromo. The counter-revolutionaries win the war and independence for their province, but the silver disappears without a trace. Or so it seems.

The structure of the novel is unique and challenging to read, with major action recounted in flashbacks. Yale Modernism Lab has an essay on Nostromo that discusses the structure of the novel and how it treats imperialism.

What stood out to me was the corrupting influence of the silver. The owner of the mine becomes so obsessed with it that he ignores his wife and ends his line. The revolutionaries seize power and go to war in order to control the mine. Nostromo becomes cynical and consumed by greed as he slowly disposes of the hidden silver, ultimately leading to his destruction. The new nation seems set to prosper under the aegis of American global capitalism, but all the main characters are destroyed in the process and the people will be exploited just like before. The protagonists are all on the Blanco side of the war, and the revolutionaries are portrayed with racist caricatures. But the protagonists are also all portrayed negatively (with partial exception for the female characters). Imperialism wins, but neither side is good.

Similar to my appreciation for Circe being enhanced by learning more about the source material, I listened to the Revolutions podcast prior to reading the novel. The seasons on Latin American revolutions and the failed 1848 revolutions provided a lot of context for Nostromo.

Finally, this book is in the public domain. Rather than buying it or checking it out from the library, I got it from Standard Ebooks. I compliment them on the quality of their product.

My full list of books from 2022 is below. You can also review lists from previous years: 2006, 2007, 2008 (retroactive favorites), 2009 (retroactive favorites), 2010 (retroactive favorites), 2011 (favorites), 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, and 2021.

Hero of Two Worlds: The Marquis de Lafayette in the Age of Revolution, Mike Duncan

Jonathan Strange & Mr Norrell, Susanna Clarke

Star Maker, Olaf Stapledon

Cosmonaut Keep, Ken MacLeod

Dark Light, Ken MacLeod

Engine City, Ken MacLeod

The First World War, Michael Howard

If You Lived Here You’d Be Home By Now: Why We Traded the Commuting Life for a Little House on the Prairie, Christopher Ingraham

Lily the Thief, Janne Kukkonen

Why I Write, George Orwell

The Psychopath Test: A Journey Through the Madness Industry, Jon Ronson

Burning Chrome, William Gibson

Piranesi, Susanna Clarke

Archangel, William Gibson and Michael St. John Smith

This Is How You Lose the Time War, Amal El-Mohtar and Max Gladstone

The Republic for Which It Stands: The United States During Reconstruction and the Gilded Age, 1865-1896, Richard White

World War Z: An Oral History of the Zombie War, Max Brooks

God Is Not One: The Eight Rival Religions that Run the World and Why Their Differences Matter, Stephen Prothero

Circe, Madeline Miller

The Vanished Birds, Simon Jimenez

Wizzywig, Ed Piskor

The Underground Railroad, Colson Whitehead

Inhibitor Phase, Alastair Reynolds

A River Runs Through It and Other Stories, Norman Maclean

Who Killed Jane Stanford? A Gilded Age Tale of Murder, Deceit, Spirits, and the Birth of a University, Richard White

Calhoun: American Heretic, Robert Elder

Beetle & the Hollowbones, Aliza Layne

Betsy-Tacy, Maud Hart Lovelace

Ghostopolis, Doug TenNapel

Nostromo, Joseph Conrad

The Dawn Watch: Joseph Conrad in a Global World, Maya Jasanoff

A Master of Djinn, P. Djèlí Clark

— December 30, 2023

The fraud was in the code

Molly White covers some of the code that was entered into evidence at the FTX trial, in the form of GitHub screenshots. Reading The fraud was in the code was weird for me because I’ve never before seen a criminal case full of screenshots of software I work on.

Code can be hard to understand, but this code was strong evidence for the prosecution because it was more or less:

if is_alameda_research:
    do_fraud()

New life goal: Never have my code entered into evidence at a criminal trial.

— November 28, 2023

SearchArray

Doug Turnbull created a new Python package called SearchArray for experimenting with search relevance tuning.

It can supercharge any dataframe into a BM25-powered term/phrase index. Under the hood it’s a Pandas extension array backed by a traditional inverted index. Its tokenizers are just python functions that turn strings into lists of tokens. Its stemmers are just… boring python packages.

Previously, to run a search relevance experiment, I’d have to standup a bunch of systems. But now, with SearchArray, everything can just run in a single colab notebook.

I like this idea a lot. A Pandas-based search backend makes a lot of sense for small scale relevance experiments. When you’ve narrowed in on something that seems promising, then you can translate it to Lucene or whatever.
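
To make the idea concrete, here is a minimal sketch of a BM25-scored inverted index in plain Python. It is not SearchArray’s API or implementation: the `TinyIndex` class, the `tokenize` function, and the `k1` and `b` defaults are all assumptions made for illustration. It only shows the shape of the technique described above, where the tokenizer is an ordinary function and scoring returns one number per row.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """A tokenizer is just a function from a string to a list of tokens."""
    return text.lower().split()


class TinyIndex:
    """Toy inverted index with BM25 scoring (hypothetical, not SearchArray)."""

    def __init__(self, docs, tokenizer=tokenize, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = []
        for doc_id, text in enumerate(docs):
            tokens = tokenizer(text)
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.doc_len) / max(len(self.doc_len), 1)

    def score(self, term):
        """Return one BM25 score per document for a single query term."""
        scores = [0.0] * len(self.doc_len)
        matches = self.postings.get(term, {})
        n, df = len(self.doc_len), len(matches)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for doc_id, tf in matches.items():
            # Longer-than-average documents are penalized via `b`.
            norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
            scores[doc_id] = idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return scores


docs = ["the cat sat", "the dog barked", "cat and cat again"]
index = TinyIndex(docs)
scores = index.score("cat")
```

Because `score` returns a list aligned with the input rows, it could be assigned straight to a dataframe column and sorted, which is roughly the workflow that makes a notebook-sized experiment convenient.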

— November 28, 2023
San Francisco Public Library checkout card for Talleyrand by Duff Cooper

You know the book’s going to be good when you find one of these.

— October 15, 2023

Exit Hector, Again and Again

Classicist Emily Wilson surveys translations of one of the most famous scenes in The Iliad:

In one of the most moving and memorable scenes from the “Iliad,” the great Trojan warrior Hector says farewell to his wife, Andromache, who has urged him not to risk his life by fighting on the plain. He gives their baby back to her, tells her to go home, and reiterates his decision to advance on the enemy….

Each of these translations — along with dozens more — suggests a different understanding of the central themes of courage, marriage, fate and death.

She includes translations spanning four centuries, including George Chapman (1611), Alexander Pope (1715), Samuel Butler (1898), Robert Fagles (1990), and her own recently published translation:

With these words,
he gave his son to his beloved wife.
She let him snuggle in her perfumed dress,
and tearfully she smiled. Her husband noticed
and pitied her. He took her by the hand
and said to her,
“Strange woman! Come on now,
you must not be too sad on my account.
No man can send me to the house of Hades
before my time. No man can get away
from destiny, first set for us at birth,
however cowardly or brave he is.
Go home and do the things you have to do.
Work on your loom and spindle and instruct
the slaves to do their household work as well.
War is a task for men — for every man
born here in Troy, but most especially, me.”
When he had finished speaking, glorious Hector
picked up his helmet with its horsehair plume.
His loving wife set off for home, but kept
twisting and turning back to look at him.
More and more tears kept flooding down her face.

Here are a couple more translations of the same passage. Literature and History’s episode used Caroline Alexander’s 2015 translation:

So speaking he placed in the hands of his beloved wife
his son; and she took him to her perfumed breast,
laughing as she cried. And her husband took pity, watching,
and with his hand he caressed her and spoke to her and said her name;
“Foolish one, do not, I beg you, distress your heart too much.
No man against fate will hurl me to Hades;
for no man, I think, escapes destiny,
not the cowardly, nor the brave, once he is born.
But go to the house and tend to your work,
to your loom and distaff, and direct your handmaids
to ply their work; war is the concern of men,
all men, and me most of all, who live in Ilion.”

So speaking, shining Hector took up his helmet
crested with horsehair; and his beloved wife went home,
turning to look back all the while, letting the full tears fall.

Inspired by Literature and History, I’m in the middle of reading Stanley Lombardo’s 1997 translation. He renders the scene as:

And he put his son in the arms of his wife,
And she enfolded him in her fragrant bosom
Laughing through her tears. Hector pitied her
And stroked her with his hand and said to her:

“You worry too much about me, Andromache.
No one is going to send me to Hades before my time,
And no man has ever escaped his fate, rich or poor,
Coward or hero, once born into this world.
Go back to the house now and take care of your work,
The loom and shuttle, and tell the servants
To get on with their jobs. War is the work of men,
Of all the Trojan men, and mine especially.”

With these words, Hector picked up
His plumed helmet, and his wife went back home,
Turning around often, her cheeks flowered with tears.

— October 7, 2023

Lessons from Building GitHub Code Search

The video for my talk Lessons from Building GitHub Code Search is now available on YouTube.

Previously: Sketchnotes from the talk and thoughts on using ChatGPT as an editor for the abstract.

— October 5, 2023

Using ChatGPT as an editor

I decided to submit a talk about GitHub code search to the final Strange Loop. It was close to the deadline, but I thought I had the nucleus of an idea that was worth submitting. I wrote an outline, but I was having trouble coming up with a short, punchy abstract that would sell the talk. As the CFP put it:

[The description] should sell your talk to attendees (and reviewers). The abstract should answer the attendee’s question “Why should I attend this talk? What will I learn?”

Since I had been playing around with LLMs, I decided to enlist ChatGPT to help me out.

Here is the original abstract I wrote on my own:

GitHub code search is the world’s largest publicly available code search engine, with over 60 million repositories indexed, including forks. It supports exact match searches, regular expressions, autocomplete, and go-to-definition. Building GitHub code search required entirely rethinking how GitHub did code search and some innovative software engineering. In this talk, I’ll share some lessons from building and scaling code search to bring these advanced features to all GitHub users.

And here is the final version, produced by going back and forth with ChatGPT:

In this talk, I’ll share some lessons we learned building a high-performance code search engine, designed to meet GitHub’s large scale. GitHub code search is the world’s largest publicly available code search engine, with more than 60 million repositories and over 160 TB of content indexed. To build it, we had to turn the unique content-addressable nature of Git repositories to our advantage. I’ll cover the key strategies we used, including using deduplication and repository similarity to reduce indexing workload, full index compaction to remove deleted documents, multiple levels of sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.

I like this version better for sure! And apparently the reviewers did, too, because it was accepted.

Working with ChatGPT was interesting, though. When I work with a person to edit something, I feel like we iteratively narrow in on a product we both like. ChatGPT, on the other hand, always suggested something, including suggestions that contradicted ones it had previously made.

For example, in one suggested edit, ChatGPT wrote:

We’ll share the lessons we learned while building a custom search engine from scratch, designed to meet GitHub’s unique scale and feature requirements.

I incorporated that language into my next edit and asked it to critique it. ChatGPT replied:

Overall, the abstract is clear and concise, but there are some areas where it can be improved:

“designed to meet GitHub’s unique scale and feature requirements” - This phrase could be reworded to sound less like marketing jargon. Instead, you could say “designed to handle GitHub’s massive scale and specific feature needs”.

Haha, thanks buddy!

In other areas, it helped clarify my thinking. I had a vague idea about using “data shape” to do less work. ChatGPT was having none of it:

“I’ll cover the key strategies we used, including understanding data shape to do less work” - This phrase could be clearer. What does “understanding data shape” mean? Perhaps you could say “leveraging the structure of Git repositories to reduce indexing workload”.

I fixed that by being more specific about what I was trying to get across: that Git repositories are content addressable, and we used that property to make indexing efficient.
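
As a rough illustration of what “content addressable” buys you, here is a small Python sketch. The blob ID calculation follows Git’s default SHA-1 object format (a `blob <size>` header, a NUL byte, then the file bytes). The `dedupe` function and its `(repo, path, content)` input are hypothetical and are not how GitHub code search is implemented; they only show why identical files across repositories need to be indexed once.

```python
import hashlib


def git_blob_sha(content: bytes) -> str:
    """Name content the way Git names a blob: SHA-1 over a header plus the bytes."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def dedupe(files):
    """Keep one copy of each unique blob and remember where it appears.

    `files` is a hypothetical iterable of (repo, path, content) tuples.
    """
    blobs = {}      # blob SHA -> content, stored (and indexed) once
    locations = {}  # (repo, path) -> blob SHA
    for repo, path, content in files:
        sha = git_blob_sha(content)
        blobs.setdefault(sha, content)
        locations[(repo, path)] = sha
    return blobs, locations


files = [
    ("octo/app", "LICENSE", b"MIT License\n"),
    ("octo/app-fork", "LICENSE", b"MIT License\n"),  # identical content in a fork
    ("octo/app", "main.py", b"print('hi')\n"),
]
blobs, locations = dedupe(files)
```

Three file paths collapse to two unique blobs here, and the two LICENSE files share one ID, so the indexing work scales with unique content rather than with the number of repositories.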

If you’re planning to incorporate ChatGPT into your workflow, be aware of its tendency towards self-aggrandizement. Ben Hoyt recently pointed this out in his blog post about using LLMs when applying for a job. In the first abstract I asked it to write, ChatGPT started with “Discover the secrets behind GitHub’s world-renowned code search engine…”. I told it to stop doing that and it toned down the BS for the rest of the session.

One thing that editing with ChatGPT had in common with self-editing is that it took forever. I went through several attempts at rewriting my abstract, pasting it back in to get feedback, and then trying to incorporate the feedback. It had a lot in common with the typical writing workflow of staring at the page and sweating blood until the words come out.

I’m not ready to ditch a human reader yet, but I did find this useful, especially so close to the submission deadline.

If you’re interested, the full transcript can be found below.


Luke: I’m submitting a talk titled “Lessons from scaling GitHub code search” to the Strange Loop conference in the “Database and Distributed Systems” track. I need to write a description for the talk based on these instructions: “The abstract should answer the attendee’s question “Why should I attend this talk? What will I learn?” The sweet spot is 3-5 sentences. An attendee may be comparing 5 concurrent talks and they want to know whether it’s relevant to them”.

Please help me write a compelling 5 sentence description for the talk that will engage potential attendees based on the following outline:

Introduction

  • If you’ve used GitHub code search over the past 10 years, you may have noticed that it wasn’t very good
  • Even back when code search launched 10 years ago, people wanted to fix that, but despite several attempts, we couldn’t deliver a working solution
  • To solve that, we had to go back to the drawing board and build a custom code search engine from scratch, engineered to meet GitHub’s scale and feature requirements
  • Today, I’m going to share some of the lessons we learned building this system

Make indexing FAST

  • Indexing speed matters a lot, because testing changes often requires reindexing
  • This was a major reason why the old GitHub code search never got better: It was too expensive in time and money to try things
  • Scale example (billions of documents, TB of content in the index, etc.)
  • We made it fast by rewriting it in Rust. Just kidding. Sort of.
  • While the core search index is written in Rust and can process over 150,000 documents per second, some key insights into our data are what really allow us to scale

Index less data

  • The old code search had to do a lot of work indexing duplicate content
  • Example of “flattening” a Git tree to documents
  • In the new project, we use the structure of Git repositories to our advantage by deduplicating content
  • Indexing deltas between repositories also reduces index size

Clean up after yourself

  • Search indexes are typically implemented as append only structures. Deletions are marked, and then removed during a “compaction” phase
  • The old code search had major problems with this, because Lucene will only compact up to a maximum segment size. This left deleted documents in our index, hurting performance.
  • Our solution implements full compaction, so we completely remove deleted documents regularly and move

Share the load

  • We use sharding at multiple levels to get the best utilization out of our servers for search
  • Sharding by blob SHA (mentioned above) spreads load across the cluster (though it does mean we have to contact every shard for every search request)
  • On a host, the index’s posting lists are sharded as well
  • Compaction responsibilities are also shared within the cluster using a dynamic shard assignment algorithm
  • Each host holds two full shards, and we use the Ford-Fulkerson maximum flow algorithm to distribute requests

Conclusion

By taking a fresh look at our data, we were able to build a search solution that works for GitHub’s scale, and have finally brought powerful code search features to our users.

ChatGPT: In this talk, you’ll learn about the lessons we learned while building a custom code search engine for GitHub. We’ll cover key strategies for scaling search, such as indexing only necessary data, fast indexing, and efficient indexing cleanup. We’ll also discuss our use of sharding at multiple levels to distribute the load across our servers. Come discover how we turned our code search from a frustrating experience to a powerful feature for our users.

Luke: Not bad! Here’s the version I came up with by myself:

GitHub code search is the world’s largest publicly available code search engine, with over 60 million repositories indexed, including forks. It supports exact match searches, regular expressions, autocomplete, and go-to-definition. Building GitHub code search required entirely rethinking how GitHub did code search and some innovative software engineering. In this talk, I’ll share some lessons from building and scaling code search to bring these advanced features to all GitHub users.

Please edit my version.

ChatGPT: Your version is a good start! Here’s a slightly revised version that emphasizes the specific strategies you used to scale and improve GitHub’s code search:

Discover the secrets behind GitHub’s world-renowned code search engine, which indexes over 60 million repositories, including forks, and supports advanced search features like exact matches, regular expressions, and go-to-definition. In this talk, we’ll share the lessons we learned while rebuilding GitHub’s code search from scratch, using innovative software engineering techniques to achieve lightning-fast indexing speeds and efficient data storage. You’ll discover how we optimized our indexing process to handle billions of documents and terabytes of content, and how we use sharding to share the load across our servers. By the end of the talk, you’ll understand the key strategies for scaling a code search engine and how to deliver powerful search capabilities to your users.

Luke: Rewrite the above response but use less buzzwords like “innovative” and “world-renowned”.

ChatGPT: Learn about the inner workings of GitHub’s code search engine, which indexes over 60 million repositories and supports advanced search features such as exact matches, regular expressions, and go-to-definition. Join us as we share the lessons we learned from rebuilding the code search engine from scratch, focusing on strategies for lightning-fast indexing speeds and efficient data storage. We’ll discuss how we optimized our indexing process to handle billions of documents and terabytes of content, and how we use sharding to distribute the load across our servers. Discover key approaches for scaling a code search engine and delivering powerful search capabilities to your users.

Luke: Critique this abstract:

In this talk, I’ll share the lessons we learned building a custom code search engine for GitHub. GitHub code search is the world’s largest publicly available code search engine, indexing more than 60 million repositories and over 160 TB of code. To build GitHub code search, we had to entirely rethink our indexing to leverage the unique nature of Git. We’ll cover the key strategies we used for building a performant search engine, including indexing less data, index compaction, sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.

ChatGPT: This abstract is informative and covers the main points of the talk, but it could benefit from more specific details to make it more compelling. Here’s a revised version that expands on some of the topics and adds more concrete examples:

Join us as we delve into the world of GitHub code search, the largest publicly available code search engine, with over 60 million repositories and 160 TB of code indexed. We’ll share the lessons we learned while building a custom search engine from scratch, designed to meet GitHub’s unique scale and feature requirements. Our journey required us to rethink indexing entirely and leverage the unique nature of Git repositories to our advantage. Learn about our key strategies for building a high-performance search engine, such as indexing only necessary data, implementing full index compaction to remove deleted documents, and utilizing sharding to balance the load across our servers. We’ll also explore how we optimized indexing speed by rewriting parts of the search engine in Rust and leveraging our understanding of our data to scale. Come discover how we transformed code search into a powerful feature that empowers GitHub users every day.

Luke: Critique this version:

In this talk, I’ll share some lessons we learned building a high-performance code search engine, designed to meet GitHub’s unique scale and feature requirements. GitHub code search is the world’s largest publicly available code search engine, with more than 60 million repositories and over 160 TB of content indexed. To build it, we had to turn the unique nature of Git repositories to our advantage. I’ll cover the key strategies we used, including understanding data shape to do less work, full index compaction to remove deleted documents, multiple levels of sharding, and load balancing. Come discover how we turned code search from a frustrating experience to a powerful feature for our users.

If you rewrite it, keep it the same length.

ChatGPT: Overall, the abstract is clear and concise, but there are some areas where it can be improved:

  • “designed to meet GitHub’s unique scale and feature requirements” - This phrase could be reworded to sound less like marketing jargon. Instead, you could say “designed to handle GitHub’s massive scale and specific feature needs”.
  • “To build it, we had to turn the unique nature of Git repositories to our advantage.” - This sentence could benefit from more detail. What was the unique nature of Git repositories that you used to your advantage? This could be explained more clearly or elaborated upon in the talk.
  • “I’ll cover the key strategies we used, including understanding data shape to do less work” - This phrase could be clearer. What does “understanding data shape” mean? Perhaps you could say “leveraging the structure of Git repositories to reduce indexing workload”.
  • “Come discover how we turned code search from a frustrating experience to a powerful feature for our users.” - This sentence is a good call to action, but it could be more specific about the benefits attendees will gain from the talk.

Here’s a possible rewrite that addresses these points:

In this talk, I’ll share some lessons we learned building a high-performance code search engine that handles GitHub’s massive scale and specific feature needs. With over 60 million repositories and 160 TB of content indexed, we had to leverage the structure of Git repositories to reduce indexing workload and create a faster, more efficient search engine. I’ll cover key strategies such as index compaction to remove deleted documents, multiple levels of sharding, and load balancing to optimize search performance. By attending, you’ll discover how these techniques turned code search from a frustrating experience to a powerful feature, improving productivity for GitHub users.

— September 27, 2023