How to break into machine learning research
This post offers my perspective on how to break into AI research. Although it is divided into several sections, I will primarily discuss two things: what AI research is really like, and what to do in AI research depending on your aims.
Preliminaries
Let's not beat around the bush. A lot of people want to get into AI research because they want to be rich, respected, 'in the know', or able to say 'I did it first' - primarily the first reason. It's the same with any fad. It was the same with cryptocurrency. Bitcoin opened at $10,077.40 on November 29, 2017 and nearly doubled in price three weeks later, before correcting to below $10,000 on February 2 of the following year. It felt like practically every financial technology company launched an initial coin offering (think initial public offering, but crypto) in the summer of 2018, based on nothing more than a whitepaper claiming their cryptocurrency would beat all the others on the market. Nearly all of the coins ended up being fraudulent schemes designed to extract money from unsuspecting customers. After regulations curtailed the private investments that enabled these launches, the volume of coins being traded plummeted.
Why did this happen? The main reason was that every financial company wanted in on this shiny new money-making opportunity out of fear of missing out. What did they do? They hired en masse, trying to set up teams with domain knowledge - because if they didn't hire that talent, a competitor would. Venture capitalists and private equity offered money freely to people who presented murky plans detailing expected returns on investment (often 100x or more) and promised the moon. What ended up happening? Bitcoin didn't replace paper currency. The world didn't move to public-ledger-only transactions. Governments introduced new regulations to tax cryptocurrency earnings. Large companies rolled out public projects for facilitating transactions with a verifiable token.
It's the same with AI. The only major difference is that cryptocurrency was still locked behind a door that only people with the right knowledge could open; AI is accessible and usable by everyone, even children. Let me take a step back and define what exactly 'AI' means, and what people think it means. A person employed as an AI engineer, ML engineer, Data Scientist, or something similar primarily works on one of a few things, which I personally categorize as follows. A data scientist is primarily concerned with statistical inference and hypothesis testing; the old-fashioned view of the data scientist is someone who works exclusively in an iPython notebook and does not worry about how to deploy models at all. A machine learning engineer is primarily concerned with writing code that builds models to solve a particular task - for example, detecting defective parts manufactured on a factory production line. An AI engineer is primarily someone who takes open-source tools people have built, stacks them on top of each other like LEGO blocks, and builds software that the customer uses. These roles are not mutually exclusive. I know many ML engineers who only write backend software and use models available online. I know a few data scientists who primarily take models others have built and use them on tasks. I myself was an AI engineer who worked on model finetuning and data analytics.
If you're not in tech, then 'AI' primarily means generative AI. The 'AI' in the 'AI will take your jobs!' headlines is the end result of a system powered by generative AI models. A straightforward example is LLMs such as ChatGPT being able to take over jobs such as copywriting. If you are in tech, then 'AI' also means generative AI in the majority of cases. I am old enough to remember when an 'AI Engineer' was called a 'Data Scientist'. I am old enough to remember when a 'Data Scientist' was called an 'Analyst'. I am not old enough to remember when an 'Analyst' was called a 'Statistician', but I have been assured that such a thing did exist. And it is the original job description of the statistician that is most closely related to what AI researchers do today - contributing to advancing the field.
Don't believe me? Here are some examples. Annals of Statistics has a future papers section. As of February 9, 2025, the fifth paper from the top in the future papers list is titled Near Optimal Inference in Adaptive Linear Regression. I singled out this paper because its title contains the words 'Linear Regression' - the most basic form of 'AI' - and 'Optimal Inference', which appears in many job titles (Inference ML Engineer, Data Scientist (Inference Team), Senior AI Engineer - Inference) and descriptions. Let's see what the paper is about. The abstract reads:
When data is collected in an adaptive manner, even simple methods like ordinary least squares can exhibit non-normal asymptotic behavior. As an undesirable consequence, hypothesis tests and confidence intervals based on asymptotic normality can lead to erroneous results. We propose a family of online debiasing estimators to correct these distributional anomalies in least squares estimation. Our proposed methods take advantage of the covariance structure present in the dataset and provide sharper estimates in directions for which more information has accrued. We establish an asymptotic normality property for our proposed online debiasing estimators under mild conditions on the data collection process and provide asymptotically exact confidence intervals. We additionally prove a minimax lower bound for the adaptive linear regression problem, thereby providing a baseline by which to compare estimators. There are various conditions under which our proposed estimators achieve the minimax lower bound. We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
How many fields are mentioned here? The abstract talks about ordinary least squares, statistical hypothesis testing, confidence intervals, statistical estimators, online learning, mathematical proofs, multi-armed bandits, autoregression, and active learning. But this is a paper from statisticians! And it's not just limited to academia: among the author affiliations, academia (Rutgers, MIT) is actually underrepresented compared to industry (Voleon, Google, Microsoft).
Let's look at another paper. In arXiv's Artificial Intelligence subcategory, a paper titled How Does a Multilingual LM Handle Multiple Languages? caught my eye, because its title included the words 'multilingual LM' (something modern LLMs do struggle with). Its abstract reads:
Multilingual language models (MLMs) have significantly improved due to the quick development of natural language processing (NLP) technologies. These models, such as BLOOM-1.7B, are trained on diverse multilingual datasets and hold the promise of bridging linguistic gaps across languages. However, the extent to which these models effectively capture and utilize linguistic knowledge—particularly for low-resource languages—remains an open research question. This project seeks to critically examine the capabilities of MLMs in handling multiple languages by addressing core challenges in multilingual understanding, semantic representation, and cross-lingual knowledge transfer.
While multilingual language models show promise across diverse linguistic tasks, a notable performance divide exists. These models excel in languages with abundant resources, yet falter when handling less-represented languages. Furthermore, traditional evaluation methods focusing on complex downstream tasks often fail to provide insights into the specific syntactic and semantic features encoded within the models.
This study addresses key limitations in multilingual language models through three primary objectives. First, it evaluates semantic similarity by analyzing whether embeddings of semantically similar words across multiple languages retain consistency, using cosine similarity as a metric. Second, it probes the internal representations of BLOOM-1.7B and Qwen2 through tasks like Named Entity Recognition (NER) and sentence similarity to understand their linguistic structures. Finally, it explores cross-lingual knowledge transfer by examining the models’ ability to generalize linguistic knowledge from high-resource languages, such as English, to low-resource languages in tasks like sentiment analysis and text classification.
The results of this study are expected to provide valuable insights into the strengths and limitations of multilingual models, helping to inform strategies for improving their performance. This project aims to deepen our understanding of how MLMs process, represent and transfer knowledge across languages by focusing on a mix of linguistic probing, performance metrics, and visualizations. Ultimately, this study will contribute to advancing language technologies that can effectively support both high- and low-resource languages, thereby promoting inclusivity in NLP applications.
This paper seems to present research on frontier topics. It works on multilingual LLMs. It works on mechanistic interpretability. It works on cross-lingual knowledge transfer. This paper comes from current Masters students at George Mason University - not industry or established researchers. We have both ends of the spectrum here.
Let's look at a third and final paper. The paper Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks was primarily written by a high school student and presented in the Student Research Workshop of the ACL conference. Here is the abstract:
Although LLMs have the potential to transform many fields, they still underperform humans in reasoning tasks. Existing methods induce the model to produce step-by-step calculations, but this research explores the question: Does making the LLM analyze the question improve its performance? We propose a novel prompting strategy called Question Analysis Prompting (QAP), in which the model is prompted to explain the question in n words before solving. The value of n influences the length of response generated by the model. QAP is evaluated on GPT-3.5 Turbo and GPT-4 Turbo on arithmetic datasets GSM8K, AQuA, and SAT and common sense dataset StrategyQA. QAP is compared with other state-of-the-art prompts including chain-of-thought (CoT), Plan and Solve Prompting (PS+) and Take A Deep Breath (TADB). QAP outperforms all state-of-the-art prompts on AQuA and SAT datasets on both GPT-3.5 and GPT-4. QAP consistently ranks among the top-2 prompts on 75% of the tests. A key factor of QAP performance can be attributed to response length, where detailed responses are beneficial when answering harder questions, but can negatively affect easy questions.
The paper focuses on LLM reasoning and prompt engineering.
I am not going to comment on the actual content of the papers themselves. It is generally known that a paper published by students at a conference workshop will not be as insightful or as deep as a paper published by top researchers in a big journal. The point I am trying to make is that it is possible to publish AI research regardless of your situation in life.
What AI research is like, what you should do, and what is actually done
In 2024, arXiv received an average of just under 20,000 submissions per month. A paper published two years earlier showed that the number of papers submitted per month in four 'AI' categories had been doubling roughly every two years. By my count, there are about 155 separate categories on arXiv's front page. So in 2024, roughly 20,000 papers per month were spread across 155 categories. Under the fairly biased and weak assumption that the abovementioned trend held, about 8,000 AI papers were submitted per month in 2024 - meaning about 40% of all submissions fell into just those four categories. But what about new fields of research that became popular, such as multiagent systems? What about the applications of AI to the physical sciences, such as quantum machine learning? The point I'm trying to make is that AI methods are applied across many fields.
Let's discuss a specific example. Autoencoders are a very simple type of neural network, widely used in representation learning. The paper Hyperluminous Supersoft X-ray Sources in the Chandra Catalog uses autoencoders to search for X-ray sources in a dataset compiled from the Chandra observatory. Its arXiv categories are High Energy Astrophysical Phenomena and Astrophysics of Galaxies. But it uses an autoencoder - so should it count as a machine learning paper too? What about linear regression? It is computed by machines today, but the underlying method of least squares goes back to Legendre and Gauss - so is it fair to say that Gauss invented machine learning? Should his publications all fall under machine learning?
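To make the example concrete, here is a minimal sketch of an autoencoder in PyTorch. The layer sizes and the random stand-in data are invented for illustration; the Chandra paper's actual architecture will differ.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Compress inputs to a small latent vector and reconstruct them."""
    def __init__(self, n_features=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train to minimize reconstruction error; inputs the model reconstructs
# poorly are candidates for "interesting" (anomalous) sources.
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 64)  # stand-in for real feature vectors
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)
    loss.backward()
    optimizer.step()
```

The point is only that the building block itself is simple; the research contribution lies in what you do with it.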
Clearly we need to make a distinction between different types of AI research. A straightforward one can be made by classifying research as applied machine learning and non-applied machine learning. This is a very crude way to classify machine learning papers, but let's proceed with it for now. What is applied machine learning? It is the application of pre-researched methods to a task. Papers such as Day-ahead regional solar power forecasting with hierarchical temporal convolutional neural networks using historical power generation and weather data are a prime example of applied machine learning. The authors construct a hierarchical temporal convolutional neural network to forecast solar power generation. They essentially do the job of a machine learning engineer: given a dataset, they construct a model to do something (in this case, forecast generated power), then apply it to the dataset and evaluate the results. There's nothing wrong with this. If it works, then it's great. We don't know if their model would work on data collected on, say, Mars, but it works for their specific dataset.
On the other hand, Deep Residual Learning for Image Recognition is a non-applied machine learning paper. They create a general method (residual connections) and apply it to a variety of tasks (object detection and localization, image segmentation) and get better results. This paper is extremely influential because it shows, empirically, that residual connections are a huge improvement over all pre-existing models. The authors essentially work as data scientists - figuring out a model improvement regardless of the data set.
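The core trick is small enough to sketch. Below is a simplified residual block in PyTorch, assuming fixed channel counts and omitting the downsampling and initialization details the actual ResNet paper handles carefully.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the block only has to learn the residual F."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection: add the input back

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```

Learning the residual and adding the input back is what lets very deep networks train at all, and it applies regardless of what the data is.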
When people say they want to get into AI research, they often tend to mean non-applied machine learning research. This is not a strict boundary: for example, they may want to contribute to the field of medicine by working on Adaptive Riemannian Optimization for Multi-Scale Diffeomorphic Matching. But sometimes it is, and they really do mean that they want to use preexisting methods on a novel problem. What do you say to someone, then, if they come up and ask you to help them get into AI research?
Here's my advice. Before you think about AI research, you should probably know what research, the field, is like. A lot of people are drawn to research because they see it as a calling. Some go so far as to say that they got into research because they were smarter than their peers from a young age - so if they didn't do the work, who would? But research as a field is populated by humans, and humans are biased. Here's an example. The big research jobs at big companies (Applied Scientist at Amazon, Research Scientist at NVIDIA) are only available to people with PhDs. Because they are the biggest and highest-paying research jobs, they tend to take PhDs only from high-ranking universities. To get a PhD from a top university, just having good grades and research experience isn't enough. Getting into a PhD program at a good university often means that you need to either personally know your potential advisor (meeting them at conferences in undergrad, having your undergrad advisor recommend you, having published with them beforehand...) or work exactly in their field. As an undergrad you often don't know what topics are at the forefront of your field, so you go along with a professor's research project and possibly publish in that field. What if your professor is not well respected? To minimize the chances of that happening, it is much better to go to a top school for your undergrad. Getting into a top school often means growing up in a good area. Growing up in a good area often means your family has to be moderately rich, to ensure you get the best opportunities to improve your profile. I could go on. My point is that research in general is incredibly difficult to break into if you are targeting a career at the end of it. AI research is no different. On top of that, since you are doing AI research, you automatically close doors to other jobs, such as building radars to operate in big cities.
Let's assume you have decided to do AI research. The next question to ask yourself is what you specifically want to work on. For the sake of this article, I will assume that you want to work on non-applied machine learning research. That is: you want to work in a field that has applications to a wide variety of problems. This field can be something like parameter-free optimization or scene recognition. How do you figure out what field to work on? Unfortunately there is no easy way. The closest thing to an easy path is to find someone working in a field who is also taking on researchers and say 'I want to work with you.' Even then, you will be devoting hundreds of hours of your own time to figuring out what state the field is in! An average PhD student takes three years of figuring out the state of the field before beginning to publish, unless they did a significant amount of research in their undergrad. It is somewhat easier if you want to work on applied machine learning - but only in your own field - because you are assumed to have deep knowledge of it. It is easier to take methods from another field and apply them to your own, because you have deep knowledge of the problems that your field can't solve with preexisting methods!
This is what I'm specifically getting at. Suppose that you have decided on a problem. You still have to figure out what methods to tackle it with. I will give an example from my own experience. Some recent developments at the intersection of GPS and plasma physics focus on using the Total Electron Content variations over time in the ionosphere as signatures of different auroral phenomena. In simple terms: if the total electron content varies in a characteristic way over time, then it is indicative that a certain phenomenon (such as a proton aurora) is taking place. One of the big problems in this field is the availability of data. Data obtained from GNSS receivers is necessarily sparse because of disconnections in the radio link between the satellite and receiver when the satellite disappears below the horizon. Since you are only tracking a single point per receiver-satellite pair, you use assumptions about the distribution of the plasma in order to generate sparse matrices of the ionosphere. You also do not know the 3D distribution of electrons in the ionosphere and have to rely on precomputed models, detailed simulations, or heuristic methods in order to compute it. Here are some papers that attempt to address this problem:
- A machine learning approach for total electron content (TEC) prediction over the northern anomaly crest region in Egypt
- Deep Learning-Based Regional Ionospheric Total Electron Content Prediction—Long Short-Term Memory (LSTM) and Convolutional LSTM Approach
- Machine learning based storm time modeling of ionospheric vertical total electron content over Ethiopia
- Reconstruction of the Regional Total Electron Content Maps Over the Korean Peninsula Using Deep Convolutional Generative Adversarial Network and Poisson Blending
- Matrix completion methods for the total electron content video reconstruction
- Using Deep Learning to Map Ionospheric Total Electron Content over Brazil
- Deep learning of total electron content
- Storm-Time Relative Total Electron Content Modelling Using Machine Learning Techniques
- A semi-supervised total electron content anomaly detection method using LSTM-auto-encoder
- Machine Learning Forecast of Ionosphere Total Electron Content
These papers assume that the instruments recording the data are unbiased. This is not always true. A massive problem with GNSS receivers is the differential code bias of the satellites. Machine learning approaches have been applied here as well: for example, particle filtering, bagged trees, Gaussian process regression, and even plain least squares estimation have all been used.
To deal with the problem of cycle slips, both deterministic methods and machine learning methods have been developed.
There are many more approaches. These papers were taken from the first and second pages of a Google search. As someone who worked in the field, I know exactly which of the papers/methods are useful and which aren't.
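To give a flavor of what one of these approaches looks like in code, here is a hedged sketch of Gaussian process regression fitted to a synthetic TEC-like time series with scikit-learn. The data, kernel choice, and units are invented for illustration and bear no relation to the papers above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in for vertical TEC (in TECU) sampled sparsely over a day.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 24, 40))[:, None]          # hours, sparse samples
tec = 20 + 10 * np.sin(2 * np.pi * (t.ravel() - 14) / 24) + rng.normal(0, 1, 40)

# RBF kernel captures smooth diurnal variation; WhiteKernel absorbs noise.
kernel = 1.0 * RBF(length_scale=3.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, tec)

# Predict on a dense grid with uncertainty, filling gaps between observations.
t_grid = np.linspace(0, 24, 200)[:, None]
mean, std = gpr.predict(t_grid, return_std=True)
```

The appeal of this kind of method for sparse geophysical data is that it interpolates the gaps and reports its own uncertainty at the same time.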
Let's look at another example. The LoRA technique is used everywhere nowadays to finetune LLMs. Even LoRA proved too expensive (mostly in memory) for some practical uses, so QLoRA was invented. Another problem was choosing the rank of the low-rank update, so DyLoRA was invented. Initializing the low-rank matrices was another problem, so LoftQ was invented. One variation of LoRA is LoHA, which tackles LoRA in the federated learning scenario. A variety of methods such as LoKr, GLoKr, and GLoRA are included in this repository, all tackling different scenarios. A more recent advancement in the field is OP-LoRA. To do research in this field, somebody would have to sit down, read through and sometimes recreate these papers, think of a bunch of new hypotheses, test all of them, and then publish the one that worked! This is what is actually done in AI research.
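For readers who haven't met the technique, here is a minimal sketch of the core LoRA idea in PyTorch: freeze the pretrained weight and learn a low-rank update on top of it. The rank, scaling, and layer choice here are illustrative only; production implementations (for example, the peft library) handle far more.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + x (BA)^T * scale; only A and B receive gradients
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 768 = 12288 trainable parameters instead of ~590k
```

Every variant in the list above changes some piece of this picture: how the base weight is stored, how the rank is chosen, how A and B are initialized, and so on.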
There is no difference between AI research and other types of research, at least on the 'research' side. It still involves learning your field from first principles, reproducing other people's work, coming up with hypotheses about what can be improved in the field and how to improve it, and finally rigorously testing everything before publishing. If you think this sounds hard - well, it is, but this is what research is like. The only difference is that, with the massive media hype around AI, machine learning researchers are glamorized and poached by companies around the world paying ludicrous amounts of money.
I am not saying that it is wrong to want to work in the field for glory. The reality is that you have to be prepared for a hard slog through the mud before you can taste success. There are ways to weasel out of it and quickly do research that both gets published and gets you into high-paying positions, but that is beyond the scope of this article. I will only allude to this by saying that people can and do exploit the system by cheating, but they are usually 'local' stars and not known internationally. I am assuming that anyone who has made it this far wants success.
Practicalities
In this section I address how to practically go about doing AI research.
Tooling
One of the big hurdles when being introduced to a new field is learning the tooling. In optics it is working with lasers and understanding the specific laser that a commercial provider gives you. In quantum computing it is learning the ins and outs of the specific approach you are using (neutral atoms, trapped ions, superconducting qubits, photonic quantum computing, spin qubits, quantum dots, NV centers in diamonds). One of the good things about machine learning is that it is easily accessible. With tools like Google Colab, Paperspace's Gradient, and Kaggle Notebooks, it is very easy for people to start working with models and datasets. The common theme among these is Python. Python has become synonymous with machine learning, with the overwhelmingly popular PyTorch being the library of choice for implementing most general-purpose models. It is said that in industry, TensorFlow is still used. However, these are not the only options: specific fields tend to use specific models and tools. I experienced this myself when I tried to implement a paper recreating brain neural networks in PyTorch.
If you want to work more on the data science side of things, R is better than Python because of the sheer variety of algorithms implemented in it. If you want to work in scientific computing, Jax and Julia are better options (though Jax is a Python library). I know that many people in parameter-free optimization still prefer MATLAB.
Learning to use these tools is itself a multi-year journey. A lot of people approach AI research work the way an AI engineer would - that is, they take what works, apply it to some dataset, and report on the results. Sometimes they even get published! However, knowing which tools to use and when to use them is a skill that should be practiced as much as possible before even thinking about stepping into frontier AI research.
The next part is the mathematics. People need to know the basics - undergraduate-level multivariate calculus, statistics, optimization theory, and possibly some formal proof-writing - before working in machine learning. A popular book that is often recommended is Mathematics for Machine Learning. The two holy books of the field are ESL (The Elements of Statistical Learning) and ISL (An Introduction to Statistical Learning). To get into deep learning, you want to read the ever-popular Deep Learning. To summarize: mathematics, programming, machine learning, deep learning, then reimplementing papers!
The best way to learn is to do it repeatedly and do it at scale. That means writing every line of code yourself and knowing what it does. It means recreating results from scratch. It means implementing papers and figuring out their flaws. It means constantly looking for improvements in your work. There are of course some caveats. The first is the answer to why you should implement algorithms from scratch at all. After all, scikit-learn's implementation of ordinary least squares is much faster than anything a simple Python script could manage. The answer is simple: if you cannot build it, you do not understand it. I am not saying that you must always code backpropagation from scratch (even though anyone serious about the field absolutely should know how), but you should be able to explain it even if woken up in the middle of the night! And if you are dealing with scenarios where you need to implement Monte Carlo sampling, you need to know how to work with emcee or something similar.
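As a small example of what 'from scratch' means, here is a hedged sketch of ordinary least squares written directly from the normal equations and checked against scikit-learn. The toy data is invented, and the point is understanding, not numerical robustness.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# From scratch: solve (X^T X) beta = X^T y, with a column of ones for the intercept.
Xb = np.hstack([np.ones((len(X), 1)), X])
beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)   # [intercept, coefficients...]

# Library version for comparison.
lr = LinearRegression().fit(X, y)
print(beta[0], lr.intercept_)   # should agree closely
print(beta[1:], lr.coef_)
```

Once you have written this yourself, the library call stops being a black box, and you know exactly what assumptions it is making for you.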
Educational requirements
Do you need to be in higher education to start working in AI research? The answer, without any caveats, is no. AI is one of the few fields that allows people unaffiliated with an institution to submit research to a conference. But for all the reasons I discussed above, you have a much higher chance of your research going somewhere if you are in higher education. I would also specifically not recommend a Masters program if you want to do real research - you simply don't have enough time. Many professors simply do not take MS students into their labs because they leave too quickly!
My first research experience
My first research experience in machine learning was in 2021, when I worked on solar power forecasting at COEP from January to May. It was an interesting experience, and I worked on it because it was a course requirement for my electrical engineering degree. I specifically went out of my way to work on solar power forecasting because I thought it would position me better for a job in the industry. I didn't learn a lot, because I didn't know what I was doing. That summer I worked in Krittika's summer program, where I learnt much more because I had time to do things on my own.
If you're looking to start
Here is some more practical advice. If you know someone who needs a problem solved with machine learning, start working. Identifying which problems actually need machine learning is an art in itself. What if you work with someone who doesn't know that their problem can be solved with machine learning? If you're a beginner, move on - convincing them is not a fight worth picking yet.
Let me make up a few hypothetical scenarios. If you're a high schooler, then applying to a fairly expensive program like Algoverse is an option: if you pay and put in the work, you will get a paper. Otherwise, use the opportunities that your school provides. If you don't have access to anything, then your best option is to work really hard on your own, using repositories and resources available online.
If you're an undergraduate student, then you should actively reach out to professors. Many professors are happy to take undergraduate students and assign them to a project. This is a great time to start because you are hand-held for quite a while. If you're a Master's student, then it's significantly more difficult: you have to prove to professors that 1. you have the technical skill and 2. you have enough domain knowledge to immediately start making a contribution.
A good technique is to cold email a lot of people. More than 99% of the time you will not get a reply. Be honest about your skills and shortcomings and hope for the best. In general, a good email template for a middle/high schooler or undergraduate is saying 'hey, I'm so-and-so, and I read your paper on this-and-this, and I want to ask if you are open for mentorship opportunities. I'm excited to learn and want to work in this field because it seems interesting!'. For an MS student or someone in the industry, a good email would be 'hey, I'm so-and-so, and I have experience in this-and-this, applying these-and-these AI methods, which closely align with that-and-that work that you published. I want to ask if there are opportunities to conduct research work with you.' Your chances of getting a response as an MS student or industry professional are very low, but it never hurts to ask. You can boost your profile by saying you're a contributor to so-and-so open-source codebase or that you implemented a specific feature widely used by AI researchers (but then it requires you to understand the field well enough to implement that feature...).
Unfortunately, just like anything in life, you will boost your chances of getting into the field if you 1. know someone, 2. relate to them (same country of origin, same name, a quirk that makes you stand out (like being a twin - I have seen this happen!), knowing the same language, amusing them in casual conversation), or 3. are willing to pay. I know of programs where leading AI professors allow their students to act as consultants on projects, and the resulting publication often has industry professionals as coauthors - maybe that can be used as a way in. This requires a bit of skill and is something you can only learn over time.
A good way to get started is to look up issues in projects like MMCV. Another good place to start contributing is Hacktoberfest. Be warned: contributing to a repository you do not understand requires you to put in the hours simply to understand the repository! To build up a strong research portfolio in general, you need to publish at lower-tier conferences (this is specific to AI) so you can use that as a stepping stone. It's a hard journey, but if you're hellbent on doing it, the difficulty won't stop you.
Mindset
This part of the article discusses the mindset you need to have. I could throw around buzzwords like resilience, mental fortitude, determination, and discipline, but that's not quite what's required. Like anything, you do need some amount of natural talent for thinking through a problem before tackling it. It is easy to tell which kids will grow up to be successful based on their mindset or physical abilities. The second thing is that you need Sitzfleisch: the ability to keep going under any circumstances. You only get this with practice and with liking the work you do. Hard work, properly applied, is the only thing that makes you successful. There are no magic pills or shortcuts to real research; there are shortcuts to success, but if you choose to take them, why have you read this far at all?