Alan Turing’s test of artificial intelligence was whether a machine could be mistaken for a person. Now we need to go further: Can machines engage in scientific discovery?
Over the past few decades, artificial intelligence (AI) has made impressive progress after a period known as the AI winter, when advances appeared to stall. Today, AI has mastered human games such as chess and Jeopardy! and, by processing enormous amounts of data, tackled intrinsically human skills such as understanding language and recognizing faces. Through techniques like neural networks, machines have become “learners.” Still, formidable challenges remain.
What is intelligence, and what is general intelligence? In 1950, Alan Turing developed a well-known approach called “the imitation game” to identify whether machines were intelligent.1 His assessment was built on a human interrogator trying to distinguish between a human and a machine based solely on the answers given to the interrogator’s questions. In those early days of AI, most approaches focused on writing software that could perform complex reasoning using well-defined rules by searching the space of possibilities at superhuman speed. Probably the most prominent achievement of such a system came when IBM’s Deep Blue famously defeated world chess champion Garry Kasparov in 1997.
Today’s dominant AI research paradigm is derived from the connectionist model of cognition, which tries to reconstruct the densely networked neuron-based architecture of the human brain. Researchers “train” these neural networks by feeding them immense datasets. For example, Generative Pre-trained Transformer 3 (GPT-3),2 a natural language processing model released in 2020 by research laboratory OpenAI, has 175 billion parameters and was trained using hundreds of billions of words’ worth of text and code. The system has passed several natural language–based restricted Turing tests by generating texts hardly distinguishable from those written by humans.
Of course, human thinking is more complex than even advanced neural networks, consisting of both model-based symbolic reasoning and model-free intuition. Thus, to fool an observer into mistaking a machine for a human, an AI system should be able to combine these different modes of intelligence. AI system MuZero,3 developed by Google’s DeepMind Technologies, can master games through self-play combining elements of intuition and reasoning, and can defeat humans at games involving complex planning and strategy. But is this system really intelligent? Importantly, does it significantly improve the quality of our lives?
In the end, AI’s goal should not necessarily be to mirror human cognition but to augment it, extend it and scale it up in useful ways. One way to unleash machine intelligence might be to eliminate the constraint that it should function in ways indistinguishable from human thought, and instead concentrate on goals that would boost our understanding of the universe and guide us to better solutions to our earthbound problems. Arguably, scientific discovery involves the most complex problems, which can be assessed on fairly objective grounds. Thus, the real test for AI might involve not mistaking a machine for a human but being able to produce legitimate scientific breakthroughs — the kind of work worthy of, say, a Nobel Prize. This criterion ensures that the results would be useful to humanity and presented in such a way that a human committee would accept such a breakthrough as valid scientific work while giving the AI system behind it the freedom to make design choices regarding its framework. Additionally, the effort would develop productive tools that would motivate researchers to adopt more systematic approaches in their research to leverage these AI tools and make science more efficient and scalable. This could benefit humanity far more than using our most powerful computation engines to win board games (of course, developing an AI system that can dominate board games can be an important stepping stone to more noble purposes).
Even though automation could make existing machine learning solutions broadly usable in research, the scientific progress facilitated by AI, while increasing, is not yet widespread. Some notable early successes have emerged in fields such as protein folding, environmental modeling, astronomy, medical imaging and materials science.4
What ties together these models that advance science is that they go beyond automatic feature engineering and allow for the processing and encoding of complex structures. For now, even the successful models fall short of being intelligible; their output is suitable for speeding up or focusing time- and resource-intensive tasks, but they do not produce theories in the scientific sense — that is, they are not capable of discovering generalizable, insightful and formalized descriptions of the data. Gaining these missing characteristics would take AI to the next level, where it could recombine knowledge acquired from different fields, construct hypotheses about never-seen possibilities and design experiments to test new ideas. AI would also be able to express these new ideas in forms intelligible to humans, so the scientific community could validate the results, build on them and allow them to have real scientific impact. And one day, possibly, AI could take home a Nobel Prize.
Can Machines be Original? Enter Generative Modeling
Traditionally, most machine learning systems have focused on digesting huge datasets to output relatively low-dimensional labels — for example, by classifying images that contain, or do not contain, specific objects. By contrast, generative modeling produces realistic examples for a given target domain, such as images of cats. This approach demands much more from the model and thus provides a much stronger guarantee that it deeply understands the relevant concepts. Being able to hear a sentence in a foreign language and decide whether it has a positive or negative tone is much simpler and requires much less understanding than being able to generate sentences with that tone in that language.
Probably the best-known example of generative modeling is the creation of artificially synthesized faces.5 This was a breakthrough. Humans are social animals who rely heavily on facial recognition and expression parsing; the argument is, if AI can trick us into thinking these made-up faces are real, it must be able to truly understand what a face looks like and how to create one. This is the rationale behind generative adversarial networks (GANs): If we can get a system to generate synthetic output indistinguishable from real faces, then it must have a rich understanding of the properties involved.
Clearly, we cannot recruit millions of people to annotate generated images as different or indistinguishable from the real ones for each iteration of training. So these systems include a component that learns to distinguish real from generated images. Thus, the entire learning process becomes a two-player game between a generator and a discriminator. The generator network tries to create samples that fool the discriminator into thinking they came from the target set. In the meantime, the discriminator adapts to the newest tricks of the generator and finds ever newer ways to spot differences. Eventually, the process reaches equilibrium, when the discriminator cannot differentiate between generated and real samples. At this point, the generator is able to produce outputs genuinely similar to the target set.
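For readers who want to see the mechanics, the following is a minimal sketch of that two-player game in Python, using PyTorch. The toy target set (samples from a one-dimensional Gaussian), the tiny network sizes and the training schedule are illustrative assumptions rather than the setup behind any system mentioned in this article.

```python
# Minimal GAN sketch: a generator and a discriminator play the two-player
# game described above, here on a toy one-dimensional "target set"
# (draws from a Gaussian) instead of images of faces.
import torch
import torch.nn as nn

torch.manual_seed(0)

generator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(5000):
    real = torch.randn(64, 1) * 0.5 + 2.0              # samples from the target distribution
    noise = torch.randn(64, 8)
    fake = generator(noise)

    # Discriminator: label real samples 1, generated samples 0.
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: try to make the discriminator say 1 for generated samples.
    g_loss = bce(discriminator(generator(noise)), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# Near equilibrium the generated samples should be hard to tell apart from
# the real ones (mean around 2.0, standard deviation around 0.5).
print(generator(torch.randn(1000, 8)).mean().item())
```

Even in this toy setting the dynamic is the one described above: the discriminator’s only job is to tell real from generated samples, and the generator improves precisely by learning to defeat it.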
These systems can be enhanced to understand additional externally specified properties of the problem space. A related example would be to specify the age of a face.6 By separating the age-dependent from the identity-dependent aspects, the resulting models can produce variants of a photograph that represent the same person at different ages.
In 2018, Swiss astrophysicist Kevin Schawinski and his collaborators used similar methods to understand how different properties of galaxies determine the images we observe.7 This made it possible to test a hypothesis about what might happen to the color of a galaxy under different circumstances or properties. Schawinski and his team tested the hypothesis by training a model to regenerate versions of galaxies with altered properties, resulting in a finding that redness, for instance, is probably the result of declining star formation rather than higher dust density.8
Most cases of using AI to facilitate science resemble Schawinski’s work. These systems are hard-working assistants crunching large amounts of externally supplied data to answer questions about conditional statistics — that is, probability distributions given certain conditions. Although we can design these models, we cannot truly comprehend how they operate once they have billions of parameters.
These systems typically use neural networks inspired by connectionist models of cognition. This kind of model is very appealing because it’s provably capable of approximating arbitrary functions. But neural nets will also happily fit noisy, biased and nonsensical data. They lack common sense and symbolic reasoning capabilities, which can result in failures to generalize. Instead of finding simple and elegant theories, they “learn” to represent the errors hidden in the data while remaining unable to uncover the large-scale, meaningful patterns behind the noise.
Is Reasoning Exclusively a Human Ability? Explainability & Intelligibility
The issue of explainability often arises in relation to ethical and safety concerns around AI. Regulators and supervisors like to understand the most important variables that lead to answers, as well as the aspects that played essential roles in the underlying reasoning. This kind of transparency is particularly important in science, with its emphasis on testing and reproducible results. These interrogations are crucial in assessing the fairness of AI-facilitated decisions, testing AI algorithms, building intuition about how they function and ensuring that they work in ways compatible with common sense. And they provide additional security by flagging potentially problematic inferences or extreme sensitivity to small changes.
Most state-of-the-art solutions to explainability questions come from building surrogate models. This simply means approximating the complex, incomprehensible AI model with structures that are more intuitive to people. For example, if the original neural network is using millions of parameters to map incoming features to make a decision about a client’s creditworthiness, the surrogate could be a linear model that approximates the initial, unintelligible one. By examining how the output changes under slight variations of the input scenario, a regulator, say, could identify whether the model was unfairly biased or assigning counterintuitive importance to some features.
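A hedged sketch of the surrogate idea, using scikit-learn, is shown below. The “black box” is a random forest trained on synthetic data, and the four features standing in for a credit application are invented for the example; the point is only the mechanism of probing an opaque model around one input and reading feature weights off a local linear fit.

```python
# Sketch of a local linear surrogate for a black-box model.
# The "black box" here is a random forest trained on synthetic data; in
# practice it would be the opaque model whose decision needs explaining.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))            # stand-ins for income, debt, age, history
y = (X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=1000) > 0).astype(int)
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

x0 = X[0]                                 # the client whose decision we want to explain
# Probe the black box with small perturbations around x0 ...
perturbed = x0 + 0.1 * rng.normal(size=(500, 4))
probs = black_box.predict_proba(perturbed)[:, 1]
# ... and fit a simple linear surrogate to its responses.
surrogate = Ridge(alpha=1.0).fit(perturbed - x0, probs)
print("local feature weights:", surrogate.coef_)
```

Widely used tools such as LIME follow essentially this recipe, with more careful sampling and weighting; the underlying logic is always to approximate the incomprehensible mapping locally with something a person can inspect.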
Intelligibility is a step closer to true understanding; it goes beyond the simple question of understanding sensitivities and importance. These concepts are adequate when we restrict our thinking to a linear worldview and a couple of categories, but for complex mappings they can only be meaningfully defined over a set of similar input scenarios, significantly reducing their usefulness. Consider the example of a creditworthiness estimate. If the explainability analysis provides widely different sensitivities for different types of input configurations — which is expected when there are complex relationships among variables — it is nearly impossible to analyze the model properly. To understand such mappings, the scientific tradition points toward solutions that involve universal properties, in which a problem can be reduced to simpler subproblems to form a hierarchical understanding based on building blocks. This offers an extra benefit: understanding how we got a result in terms of the main contributing factors. This is like seeing a map of the road we took rather than trying to determine our destination by reducing the map to the two most important turns.
Max Tegmark — the author of Life 3.0: Being Human in the Age of Artificial Intelligence and a co-founder of the Future of Life Institute, a nonprofit research organization that studies the risks of AI — and his team showcased this approach in a project they dubbed “AI Feynman,” after physics Nobel laureate Richard Feynman.9 They demonstrated the power of their proposed solution by reverse-engineering 100 equations from The Feynman Lectures on Physics using data sampled from the behaviors those equations describe. In mathematical terms, they tackled a series of symbolic regressions. While a linear regression aims at approximating a function by multiplying each input with a coefficient and adding them up, a symbolic regression seeks to find the best approximation through the arbitrary recombination of a predefined set of functions. The problem becomes intractable very quickly as the possibilities explode exponentially, driven by both the number of basic functions (building blocks) and the depth (number of building blocks combined) of the formulas.
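To make the contrast with linear regression concrete, the toy Python sketch below performs a brute-force symbolic regression: it recombines a handful of building-block functions up to a small, fixed depth and keeps whichever expression best fits data sampled from a hidden formula. The particular building blocks, the depth limit and the target formula are arbitrary choices for illustration; real systems search vastly larger spaces, which is exactly why the exponential blowup matters.

```python
# Toy symbolic regression: exhaustively recombine a few building blocks and
# keep the expression that best fits data sampled from a hidden formula.
import itertools
import numpy as np

x = np.linspace(0.1, 3.0, 50)
y = x * np.sin(x)                          # the hidden law we are trying to recover

leaves = {"x": x, "1": np.ones_like(x)}
unary = {"sin": np.sin, "cos": np.cos, "id": lambda v: v}
binary = {"+": np.add, "*": np.multiply}

best_err, best_expr = np.inf, None
for (ln, lv), (rn, rv) in itertools.product(leaves.items(), repeat=2):
    for un, uf in unary.items():
        for vn, vf in unary.items():
            for bn, bf in binary.items():
                pred = bf(uf(lv), vf(rv))  # e.g., id(x) * sin(x)
                err = np.mean((pred - y) ** 2)
                if err < best_err:
                    best_err, best_expr = err, f"{un}({ln}) {bn} {vn}({rn})"

print(best_expr, best_err)                 # recovers an expression equivalent to x * sin(x)
```

Even this tiny search evaluates every combination of two leaves, two unary wrappers and one binary operator; adding a few more building blocks, or one more level of depth, multiplies the number of candidates dramatically.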
Tegmark’s team came up with an ingenious solution. First, train a neural network, a universal function approximator, to emulate the data describing the mapping. Second, analyze the symmetry and independence properties of the trained network to recover the structure of the mapping. By slicing the problem into smaller subproblems in this way, they reduced the depth of the formulas that had to be searched, making the symbolic regression tractable. Tegmark’s group found that the method recovered all 100 equations, whereas commercial software could crack only 70; for a more difficult, physics-based test, Tegmark and his team claimed a 90 percent success rate, versus 15 percent for the commercial software. The benefit of reverse-engineering the hierarchical structure of the computation is that anyone with a reasonable understanding of mathematics can look at the formulas and grasp the fundamental properties of the mapping, as opposed to staring at an impossibly long list of parameters that describe the behavior of the neural network — an unexplainable or unintelligible approximation of the mapping.
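The “slicing” step can be illustrated with a simple numerical probe. Once a network approximating a mapping f(x1, x2) has been trained, one can test, for instance, whether the mapping is translationally symmetric, that is, whether it depends only on the difference x1 - x2, by shifting both inputs together and checking that the output barely moves. In the hedged sketch below a known function stands in for the trained network; AI Feynman applies this kind of probe to the fitted model itself and then hands the reduced, lower-dimensional problem to the symbolic search.

```python
# Probe a learned mapping for translational symmetry: if f(x1 + a, x2 + a)
# stays (numerically) equal to f(x1, x2), then f depends only on x1 - x2,
# and the symbolic search can be run on a simpler one-variable problem.
import numpy as np

def f(x1, x2):
    # Stand-in for a trained neural network; the "true" law here is (x1 - x2)**2.
    return (x1 - x2) ** 2

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=1000), rng.normal(size=1000)
shift = rng.normal(size=1000)

violation = np.max(np.abs(f(x1 + shift, x2 + shift) - f(x1, x2)))
print("translational symmetry violated by at most", violation)  # ~0, so reduce to g(x1 - x2)
```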
Recently, Miles Cranmer, a PhD candidate in astrophysics and machine learning at Princeton, implemented a similar project.10 His initiative combined the power of symbolic regression with a graph neural network architecture that can model relationships among the constituents of larger collections. In this case, the system approximated the behavior of matter overdensity in halos — basic cosmological structures containing gravitationally bound matter — using a graph neural network, which revealed a simple formula for describing halo behavior. The automatically constructed formula turned out to describe the data better than the one handcrafted by human scientists.
Language of Thought
These processes lead to greater intelligibility and resemble the act of translation between different languages. After the system learns to represent the observed phenomena, the resulting knowledge is compressed into symbolic mathematical formulas that can facilitate human comprehension, generalization and compositionality.
But the language of straightforward formulas is unable to describe all the relationships and procedures that interest us. For example, consider the problem of finding the smallest number on a list. To tackle this problem, we need to construct an algorithm with proper control-flow mechanisms; we should be able to have loops and recursive structures. The most straightforward solution is to say that the smallest number on a list is simply the minimum of the first element and the smallest number of the remainder of the list. The self-referential nature of this definition is what makes it recursive. Of course, we need to stop somewhere, and this is typically done by stating the obvious: Whenever we are looking for the smallest number in a list of one item, the answer is, trivially, the only number in the list.
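Written out as code, the procedure just described is only a few lines. The Python below is a plain illustration of the recursion, not the output of any learning system.

```python
def smallest(numbers):
    # Base case: a one-element list's smallest number is its only element.
    if len(numbers) == 1:
        return numbers[0]
    # Recursive case: the minimum of the first element and the smallest
    # number of the remainder of the list.
    return min(numbers[0], smallest(numbers[1:]))

print(smallest([7, 3, 9, 1, 4]))  # 1
```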
This example suggests that we should let our machine learning systems formulate theories using programming languages, which provide the necessary tools for representing this complexity through the possibility of defining (potentially self-referential) functions based on a set of initial commands. This approach provides not only Turing completeness but also a way of compressing knowledge into compact forms and reusing newly understood concepts and mappings to build more sophisticated ones. In a way, this gives the AI system the opportunity to expand the dictionary of the language that was initially provided to it.
DreamCoder, a project by Cornell computer science professor Kevin Ellis and his colleagues, implemented that approach.11 The system can understand and produce complex patterns by learning to write small programs that generate them. In the process, it abstracts new concepts in the form of computer functions based on previously successful solutions. Interestingly, DreamCoder not only learns in a supervised way by attempting to solve predefined problems, but it also generates problems for itself through which it can form better intuitions about the workings of its inner language. Remarkably, this approach is capable of solving entire classes of problems from across various domains based on a small number of examples, in contrast to traditional neural network solutions that require millions of data points for even simple and specific tasks.
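The toy sketch below conveys the flavor of that loop rather than DreamCoder’s actual algorithm: a brute-force searcher composes programs from a small library of primitives, and whenever a composition solves a task it is promoted to a named library function that shortens later searches. The primitives, the task examples and the name of the learned concept are all invented for illustration.

```python
# Toy "library learning" loop in the spirit of DreamCoder (not its algorithm):
# search over compositions of primitives, then promote useful compositions to
# named library functions so that later tasks need shorter programs.
import itertools

library = {
    "inc": lambda x: x + 1,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
}

def solve(examples, max_depth=3):
    """Find a composition of library functions consistent with (input, output) examples."""
    for depth in range(1, max_depth + 1):
        for names in itertools.product(library, repeat=depth):
            def program(v, names=names):
                for name in names:
                    v = library[name](v)
                return v
            if all(program(i) == o for i, o in examples):
                return names, program
    return None, None

# Task 1: f(x) = (x + 1) ** 2, solved as a composition and then abstracted.
names, program = solve([(1, 4), (2, 9), (3, 16)])
library["inc_then_square"] = program        # the learned concept joins the library
print(names)                                # ('inc', 'square')

# Task 2: f(x) = 2 * (x + 1) ** 2, which can now reuse the new concept.
names, _ = solve([(1, 8), (2, 18), (3, 32)])
print(names)                                # ('inc_then_square', 'double')
```

The second task, which would need a depth-three program over the original primitives, is solved at depth two once the learned concept is available; growing the library in this way is what keeps the search manageable as problems get harder.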
The ability to extend the language in which formulas and procedures are described makes the process of discovery significantly more dynamic. Using programming as a language of thought also provides the possibility of aggregating knowledge acquired from different fields and generating interesting new structures that might explain phenomena currently beyond the horizons of our understanding.
Curriculum Learning
A final key to building an AI system capable of groundbreaking research is the ability to identify important problems. This is difficult for human researchers as well. The simplest solution would be to design a curriculum that ensures the best outcomes. This is in itself nontrivial, even in a school setting, but in scientific research and most real-world scenarios there is no teacher we can ask to provide us with tools and a list of next-best steps that will maximize our progress. Epistemological uncertainty is a defining feature of open-ended exploration: There are no predefined bounds of what can be known or done, and there can be no guidance.
The AI learning system needs to have a routine capable of identifying new situations from which it can master new skills that push it beyond current knowledge. An elegant solution to this challenge is described by Uber Technologies’ research group in the paper “Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions.”12 The basic idea is that there are certain critical steps in learning any new skill. To identify these steps, POET maintains learning environments and models trained to solve specific challenges within those environments, while continually generating ever more difficult problems. Interesting problems are identified by assessing the difficulty with which the available models can solve or learn from them. Easy problems do not lead to significant improvements, while others may be too difficult to crack. By identifying the intermediate class of environments and using skills learned there as stepping-stones, AI can solve tasks that were initially unapproachable.
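A drastically simplified sketch of that loop is shown below. Here the “environments” are just difficulty numbers and the “agents” are scalar skill levels, a caricature of POET’s actual setup with invented thresholds, but it shows the core mechanism: newly generated problems are kept only if they are neither trivial nor hopeless for the current agents, and training on these intermediate problems ratchets capability upward.

```python
# Caricature of a POET-style loop: maintain (environment, agent) pairs, keep
# newly mutated environments only if they are of intermediate difficulty for
# the current agents, and let training on them push skill levels upward.
import random

random.seed(0)

def train(skill, difficulty):
    # An agent improves only when the task is within reach of its current skill.
    return max(skill, difficulty) if difficulty <= skill + 1.0 else skill

pairs = [(0.5, 0.0)]                        # (environment difficulty, agent skill)
for _ in range(50):
    # Mutate an existing environment into a harder variant.
    parent_difficulty, _skill = random.choice(pairs)
    child = parent_difficulty + random.uniform(0.0, 1.5)

    # Minimal-criterion check: keep the child only if it is neither too easy
    # nor too hard for the best current agent.
    best_skill = max(skill for _env, skill in pairs)
    if best_skill < child <= best_skill + 1.0:
        pairs.append((child, best_skill))

    # Each agent trains in its own environment (cross-environment transfer omitted).
    pairs = [(d, train(s, d)) for d, s in pairs]

print("hardest difficulty mastered:", max(s for _env, s in pairs))
```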
Such solutions will be essential to ensuring open-endedness and endless exploration. Currently, most machine learning is focused on solving well-defined problems, but the most interesting outcomes will only emerge when we start asking open-ended questions. Moreover, even for well-understood problems, we might benefit from encouraging AI systems to push the boundaries of their capabilities in organic ways. The probability of successfully addressing any issue can drastically improve when the available tool kit is augmented through self-improvement.
From Here to a Nobel Prize
To produce significant scientific discoveries, AI systems must understand the problem spaces entirely, without restricting their scope to predicting low-dimensional outcomes. They need tools to build general, interpretable world models based on strong, expressive languages. Finally, the priority of exploration, experimentation and self-improvement needs to be significantly increased.
While all these areas are slowly evolving and pushing these systems closer and closer to scientific performance deserving of a Nobel Prize, questions remain about the constraints we should impose on their activities. Such questions point to difficult, as yet unanswered issues. Ultimately, we will have to change our question from how we get what we want to what we want. When the most powerful AI is finally able to achieve Nobel-level results, humanity’s most important job may well become defining the principles governing how the Nobel Committee makes its decisions.
Zsombor Koman is Vice President of Research at WorldQuant and holds an MS in applied mathematics from Central European University in Budapest.