I’m going to assume that you’ve figured out how to find candidates who look great on paper, and that your only problem is deciding which of them to hire. Attracting high-quality candidates is more of a marketing/brand/sales exercise, which I don’t have much experience with. Getting them to apply is a non-trivial problem in the current market, particularly if you are trying to hire anyone with more than ~3 years of experience, but it is beyond the scope of this article. I’m going to discuss how you should run interview processes for ML engineers/researchers.
Before I begin, a request: I’m writing an article about human data, so if you manage/use the results of a human labelling pipeline, or use signals from your users for model training/evaluation, please get in touch.
When discussing roles, I’m going to use the DeepMind classification, which has three main technical roles and a common experience ladder:
Software engineer (SWE), which is a standard software engineer who isn’t required to know ML or research (although knowing them is, of course, an advantage).
Research engineer (RE), which is basically everyone who isn’t a SWE or an RS. Most companies in the LLM era that are hiring “researchers” who are expected to be able to code their ideas in large codebases are really looking for REs. Their duties can run the gamut from managing experiments, to optimizing code, to doing novel research.
Research scientist (RS), which is someone with a PhD whose success is judged entirely on their publication record. The job is not dissimilar from being a postdoc. Some RSs spend very little time coding, and some are better coders than most REs, but the key differentiating factor is that an RS typically has weaker coding skills and spends more of their time thinking about what to work on next.
The hiring process for all of these roles is broadly similar. To hire for any of them (or any role in general!) you should be running work sample tests for the specific tasks you expect each candidate to be able to do, while maintaining a consistently high standard in your evaluation. The hardest part, by far, of running a good hiring process is getting buy-in from the rest of your organization to keep the process rigorous and the bar high. Often you have an immediate need to hire to meet some goal (if you don’t, you probably shouldn’t be hiring), so it’s always tempting to relax the bar slightly. Don’t. If you do, you’ll wake up 18 months later with a mediocre team.
I’m going to focus mostly on interviewing candidates who fall in the RE bucket, as that’s what most organizations in the product-driven research era need. These are candidates who can implement all of their ideas and have the technical expertise to run large-scale experiments by themselves.
Work sample tests
You want your interviews to be as close to the job as possible, which is why I dislike Leetcode questions. They have their place, as they’re generally a good way to screen for competence/conscientiousness, but they tend not to work as well for researchers/ML engineers, who spend less time preparing for Leetcode-style problems.
An approach I like is to take problems that have come up during your team’s work and turn them into interview tasks. One example is debugging a real-world ML problem. A question I have used in the past is: “I have a new idea to make our models better. I implement it. It doesn’t work. What should I do?” This is a problem that happens at work all the time! I try something, it doesn’t work, and I grab a coworker to talk it through. Another variant is to take a script that works, add a bunch of common bugs (ideally ones that have actually happened as part of your work), and see if the candidate can find them.
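To make that concrete, here is a minimal sketch of what a seeded-bug script could look like. It’s a toy PyTorch example with synthetic data, not one of my actual interview problems, and the planted bugs are labelled in the comments (in a real interview they obviously wouldn’t be):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic 10-class classification data.
X = torch.randn(512, 32)
y = torch.randint(0, 10, (512,))

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for i in range(0, len(X), 64):
        xb, yb = X[i:i + 64], y[i:i + 64]
        logits = model(xb)
        # Bug 1: CrossEntropyLoss already applies log-softmax internally,
        # so softmax-ing the logits first silently squashes the gradients.
        loss = loss_fn(torch.softmax(logits, dim=-1), yb)
        loss.backward()
        optimizer.step()
        # Bug 2: optimizer.zero_grad() is never called, so gradients
        # accumulate across steps and training stalls or behaves erratically.
    print(f"epoch {epoch}: loss {loss.item():.4f}")

# Bug 3: "evaluation" reuses the training data, so the reported accuracy
# says nothing about generalization.
with torch.no_grad():
    accuracy = (model(X).argmax(dim=-1) == y).float().mean()
print(f"train-set accuracy: {accuracy:.2%}")
```

The point is less whether the candidate spots every planted bug and more how they go about looking: do they reason from the symptom (the loss barely moves) towards a cause, or do they just stare at the code?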
Another question that I like is to discuss evaluation and probe the candidate on what problems can arise with it. There are many weird ways that evaluation can fail, most of which aren’t explicitly written about, so it’s a good screen for what a candidate has real experience working on.
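As one concrete example of the kind of failure worth probing, here is a small, hypothetical sketch of a check for verbatim train/eval contamination, which quietly inflates eval numbers; the helper name and data are made up for illustration:

```python
def contamination_rate(train_texts: list[str], eval_texts: list[str]) -> float:
    """Fraction of eval examples that also appear verbatim in the training set."""
    train_set = {t.strip().lower() for t in train_texts}
    hits = sum(1 for t in eval_texts if t.strip().lower() in train_set)
    return hits / max(len(eval_texts), 1)

train = ["the cat sat on the mat", "dogs are great", "I love pizza"]
evals = ["Dogs are great", "llamas hum softly"]  # the first example leaks from train
print(f"contamination: {contamination_rate(train, evals):.0%}")  # prints 50%
```

Contamination is only one of many such failure modes, of course, but candidates who have actually shipped evaluations tend to have war stories like it.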
When asking these questions, one useful tactic is to allow long, uncomfortable silences to develop. Your general rule of thumb should be to let the candidate talk as much as possible (a good metric is the percentage of time the interviewer spends talking, which should be as close to zero as possible). If you ask the candidate a question, like the “new idea doesn’t work” question above, be ready to let it hang in the air while you sit in silence until they answer. You want to 1) let the person think and 2) see how they react.
The main goal with questions like these is to get away from contrived, Leetcode-like problems that can be memorized or prepared for, and instead focus on questions that require practical experience in the role. Leetcode-style problems have value, to be clear, but I don’t think they’re as relevant for the research family of roles.
Be careful about what you include/exclude
Your candidates will have shocking gaps in their knowledge. If you don’t test for a skill, you can’t assume candidates have it. This is true even for “obvious” skills like being able to use source control. I have worked with very skilled researchers who barely knew how to use Git and had basically no experience working in a team on a shared codebase.
The corollary is that if there’s a skill you think “everyone should have”, many people won’t have it, so if you screen for it, you will remove them from the candidate pool. Be careful about whether you actually want to do that; if the skill is not required, you are needlessly making the pool weaker.
For instance, I have some friends who run a company using reinforcement learning (RL) to control industrial facilities. They are world experts in RL. I encouraged them not to screen for RL skills, only for general ML expertise, as they are probably the best people in the world to upskill their employees in RL.
If you’re not sure which skills to include/exclude, particularly on the behavioural side, I would encourage you to read Talent, by Tyler Cowen and Daniel Gross; it’s a good overview. A particular trait I like to see is a track record of relentlessly doing whatever was necessary to make a project succeed, across abstraction levels. For instance, Julian Schrittwieser, the lead for AlphaZero, did everything from writing papers and coming up with research ideas to implementing Jax training pipelines in Python and writing highly optimized C++. On the flip side, if candidates restrict themselves to only certain parts of a project (no history of cleaning data, say, or engaging only at the idea level and never writing any code), I would view that as an anti-signal.
Screen candidates aggressively
A common anti-pattern I have seen is companies screening only for technical skills, or, if they do run behavioural interviews, focusing only on leadership/teamwork skills. Those are really important! But an area that screws up a lot of AI companies is hiring incredibly skilled ML people directly from academia who do not want to work on products. Many PhDs, and a lot of master’s/undergraduate graduates, are only familiar with academia and value their publication record above all else, including compensation. When hiring general software engineers this is not typically an issue, as most software engineers want to build products that are successful and make money.
Until recently, many of the large industrial research labs (DeepMind, FAIR, MSR, etc.) were run in a manner very similar to academia, and the way to advance one’s career was to publish papers in academic venues, so many people who have spent their careers at these organizations are still immersed in the academic mindset. They have not spent any of their professional lives trying to improve business-related KPIs, and many have no experience orienting their work around organization-level business goals (like OKRs). For many product companies, particularly startups, this is exactly the opposite of what they need. It is a point of pride for some researchers that they work on “pure science” with no apparent useful application (if this mindset seems strange to you, the mathematician G. H. Hardy explains it in detail in his famous essay, A Mathematician’s Apology).
My advice is to spend the first call with the candidate addressing this explicitly, perhaps saying something like: “Publishing papers is not a priority for us. Do not expect that you will ever publish a paper as part of your job. You will be expected to work on research driven by the needs of the product/business, and you will not have academic freedom to pursue whatever ideas you find interesting.” It may sound harsh, but it is true at most companies, and it’s worth making explicit upfront.
I would generally advise that other unappealing aspects of working at your company should be mentioned in the first call as well. Matt Mochary has written about how important the anti-sell is. You want to give candidates an accurate understanding of what working at your company will be like; one of the worst outcomes for you is to hire someone, spend a lot of time onboarding/training them, and have them churn because they don’t actually like the job. Do this as early in the process as possible, ideally the first call.
Hiring scientists vs engineers
Using the DeepMind RE vs RS distinction, many companies only have what DeepMind would call REs, as you need to be able to implement your research in large codebases. The main differences when hiring research scientists are that coding ability matters less, the ability to choose the right problem matters much more, and you have to focus more on culture fit.
Many people who insist on the “scientist” label rather than being ok with the engineer label 1) expect to be able to publish papers and 2) expect to do “pure” research that isn’t driven by product needs. That isn’t on offer at most companies, so they will be unhappy and churn 12-18 months in. Screen for this.
Behavioural skills matter
You have to screen for behavioural skills. Once engineers/researchers hit the senior level (L5+ on the Google scale), and possibly even earlier in their careers, soft skills are more important than hard skills.
Mentoring matters. Feedback matters. Connecting with your teammates matters. For more junior roles, being “teachable” matters more, but as the person gets more senior, their ability to mentor and give feedback becomes more and more important.
Additionally, researchers deal with ambiguity throughout the research process, so it’s important to be able to discuss your experiments with your coworkers. If someone is particularly disagreeable, those discussions will not go well, which can make your team less productive.
If you’re hiring someone from a large company, it’s important to assess their ability to add process in a reasonable way. A common failure among senior big tech people who are too “big-tech minded” is that they add too much unnecessary process, or expect to be able to grow their team quickly to match the staffing levels they have historically been used to.
Keep a rigorous process
It’s easy to think “we need someone, so let’s hire someone quick.” Don’t. Keep a high bar and encourage the rest of your team to do the same, particularly if you’re at a company paying top of market. Otherwise, you’ll wake up 12 months later with a mediocre team.
Jeff Bezos had a list of questions:
“Will you admire this person?”
“Will this person raise the average level of effectiveness of the group they’re entering?”
“Along what dimension might this person be a superstar?”
I think this is the right approach. You should generally try to only hire the best people, and you can get by with a surprisingly small team. I subscribe to the Nat Friedman philosophy:
Smaller teams are better:
Faster decisions, fewer meetings, more fun
No need to chop up work for political reasons
No room for mediocre people (can pay more, too!)
Large-scale engineering projects are more soluble in IQ than they appear
Many tech companies are 2-10x overstaffed
Thanks to Morgan McGuire, Tom McGrath, Kostis Gourgoulias, Sholto Douglas, Priya Joseph, Pavel Surmenok, and Johnny for reading drafts of this.
Finally, again: I’m writing an article about human data, so if you manage/use the results of a human labelling pipeline, or use signals from your users for model training/evaluation, please get in touch.
Misc resources
Matt Mochary gives great hiring advice (and is generally worth reading, particularly on why references are a waste of time)
Sam Altman on hiring
Talent, by Tyler Cowen and Daniel Gross, is a great book.