GitHub semantic code search experiments

What is expertise in programming? In earlier generations, it involved familiarity with orange Digital Equipment Corporation bound manuals, or skill in splicing paper tape, or, more recently, sifting through StackOverflow for the good bits. Learning COBOL or FORTRAN in 1960 meant mostly learning language syntax; with Elixir or Rust in 2020, the emphasis has largely shifted to tools, community and how to learn.

GitHub’s new Experiments research demonstrations extend this trend.

Semantic search responds like a human

Junior technicians often understand a question like, “Do you know [Ruby, CSS, TLA+, etc.]?” as an exercise in recitation: Have they memorized a particular language’s punctuation or runtime library?

This question is simultaneously too much and too little — too much in the sense that beginners frequently fault themselves for imperfect recall of obscure details that even core language designers don’t know, and too little in underrating the crucial importance of familiarity with language-appropriate ways to tackle questions whose answers aren’t already in our memory.

Much of what we working programmers and testers do most days doesn’t match any textbook exercises exactly. Most of what we do does build on the work of others. That’s where Semantic Code Search comes in.

Internet searches are almost incomprehensibly successful. The results Google and other search engines return work far better than anyone anticipated 20 years ago; in particular, naive keyword searches — writing just “gas” to find a nearby fueling station, for instance — satisfy far more consumers than early information-retrieval researchers imagined possible.

That model of searching relies on common, widely used formulas. If you’re the second person ever to diagnose a specific problem when configuring IPv6 access for an SAP-based application, though, you might never find the results of the first person. Even if the two of you have exactly the same situation and are after the same solution, keyword-based searches might send you to results nowhere near the help you need.

That’s why semantic searching can be vital for engineers. Semantic Code Search, for instance, relies on supervised machine learning to capture the intent and meaning of source code. As a result, it can retrieve relevant results that contain none of the keywords you might choose.

“Capture” here has a technical meaning that’s worth making more precise. The claim isn’t that Semantic Code Search understands your intent in the way a human does, even though researchers casually talk in terms of “learning.” Instead, Semantic Code Search’s algorithms and backing database are strong enough that they simulate what humans with human understanding do. If you enter the semantic search query “ping REST api and return results,” it reliably returns references to source code that resembles what a human expert would select, even though the words “ping,” “REST” and “API” do not appear in the source of the suggestions.
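To make “resembles without sharing keywords” concrete, here is a minimal sketch of the general idea behind embedding-based retrieval: the query and each code snippet are represented as vectors, and results are ranked by cosine similarity. This is not GitHub’s implementation. The snippets, the three-dimensional vectors and the function names are all invented for illustration; a real system would get its vectors from a trained model.

```python
import math

# Hypothetical, hand-written 3-dimensional "embeddings". In a real system a
# trained model would produce these vectors; the numbers here are made up.
SNIPPET_VECTORS = {
    "def fetch_json(url): return requests.get(url).json()": [0.9, 0.1, 0.0],
    "def parse_csv(path): return list(csv.reader(open(path)))": [0.1, 0.9, 0.1],
    "def resize_image(img, w, h): return img.resize((w, h))": [0.0, 0.1, 0.9],
}

# Assumed embedding of the query "ping REST api and return results".
QUERY_VECTOR = [0.8, 0.2, 0.1]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vector, snippet_vectors):
    """Return snippets ordered from most to least similar to the query."""
    return sorted(
        snippet_vectors,
        key=lambda snippet: cosine(query_vector, snippet_vectors[snippet]),
        reverse=True,
    )


best = rank(QUERY_VECTOR, SNIPPET_VECTORS)[0]
print(best)
```

The top-ranked snippet contains none of the words “ping,” “REST” or “API”; it wins only because its vector points in nearly the same direction as the query’s.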

This isn’t just for developers in search of help to tackle their current assignments, though. Semantic search has several other uses, and its researchers even suggest it shines brightest in applications other than how-to’s. One example is searching an organization’s repositories for particular source code that deserves security review because it touches on a vulnerable domain.

More generally, semantic results can be combined with keyword searches and other metrics. Semantic searches might best play a role not in isolation, but in cooperation with other techniques.
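One simple way to picture that cooperation is a weighted blend of two scores per candidate. The sketch below assumes a semantic similarity score between 0 and 1 is already available from some other component; the candidate texts, their scores and the default weight of 0.7 are hypothetical, not anything GitHub has published.

```python
def keyword_score(query, text):
    """Fraction of the query's words that appear verbatim in the text."""
    words = query.lower().split()
    if not words:
        return 0.0
    text_words = set(text.lower().split())
    return sum(1 for word in words if word in text_words) / len(words)


def hybrid_rank(query, candidates, weight=0.7):
    """Order candidates by a blend of semantic and keyword scores.

    candidates is a list of (text, semantic_score) pairs; weight is the share
    given to the semantic score, and the remainder goes to the keyword score.
    """
    def blended(candidate):
        text, semantic = candidate
        return weight * semantic + (1 - weight) * keyword_score(query, text)

    return [text for text, _ in sorted(candidates, key=blended, reverse=True)]


# Hypothetical candidates with made-up semantic scores.
CANDIDATES = [
    ("open socket and send ping", 0.40),
    ("fetch json from url", 0.90),
]

print(hybrid_rank("ping rest api", CANDIDATES))              # mostly semantic
print(hybrid_rank("ping rest api", CANDIDATES, weight=0.0))  # keywords only
```

Shifting the weight changes which candidate comes first, which is the point: neither signal has to carry the whole ranking alone.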

Another way to think about this research is its potential to improve tooling. Developers commonly need to ask something like, “Show all contexts in which THIS_FUNCTION is used.” When the function name is distinctive, conventional text search handles that easily. If the function name is something like item or object or entry, though, a conventional approach quickly bogs down, because the term item shows up in too many distinct contexts with different meanings. This is where tooling with semantic abilities shines.
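A much smaller step in the same direction is syntax-aware search, which is not machine-learned semantic search but does show why plain text matching struggles with a name like item. The sketch below uses Python’s standard ast module to find only the places where a function named item is called directly; the sample source and the helper name call_sites are invented for illustration.

```python
import ast

# Hypothetical sample source in which "item" appears in several roles.
SOURCE = '''
def item(x):
    return x * 2

def total(items):
    item_count = len(items)      # a different name containing "item"
    label = "item"               # a string literal
    return sum(item(i) for i in items) + item_count   # a real call
'''


def call_sites(source, name):
    """Return sorted line numbers where `name` is called as a plain function."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
    )


text_hits = SOURCE.count("item")        # every textual occurrence
real_calls = call_sites(SOURCE, "item")  # only direct calls

print(text_hits, real_calls)
```

The text count includes variable names, a string and comments, while the syntax-aware pass reports just the one call. The sketch still does not resolve scopes or imports and ignores method calls such as obj.item(), which is the kind of gap that richer semantic tooling aims to close.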

Using Semantic Code Search in your work

Semantic Code Search remains a research project for GitHub. But it has practical implications for your daily work on the front line of testing and coding software, especially if you keep these ideas in mind:

  • GitHub and other providers already offer source navigation tools. Learn how they work, and expect them to improve rapidly over the next few years.
  • Keyword search is a blunt tool for technical work. If you can’t find something you need, change your approach. Rather than just searching with conventional tools, get help from semantic-savvy projects like Semantic Code Search, or ask human experts. You’re likely to find that others know more about your subject than naive searches turn up.
  • Develop and practice good habits for managing third-party assets. The value of your work multiplies when you take wise advantage of the best ideas and constructions of others.

Computer work in 2020 is intrinsically cooperative and team-oriented. Expect to need to understand code that others write, not just your own.

About the Author

Cameron Laird is an award-winning software developer and author. Cameron participates in several industry support and standards organizations, including voting membership in the Python Software Foundation. A long-time resident of the Texas Gulf Coast, Cameron's favorite applications are for farm automation.