It is common to think of neural networks as adaptive “feature extractors” that learn by progressively refining raw inputs into increasingly useful representations. This raises a natural question: which features does a network represent, and how? To better understand how high-level, human-interpretable features are represented in the neuron activations of LLMs, a research team from the Massachusetts Institute of Technology (MIT), Harvard University, and Northeastern University proposes a technique called sparse probing.
Typically, researchers train a simple classifier (a probe) on a model’s internal activations to predict a property of the input, then inspect the trained probe to see whether, and where, the network represents the feature in question. The proposed sparse probing method applies this idea to more than 100 features to pinpoint the relevant neurons, overcoming limitations of earlier probing approaches and shedding light on the complex structure of LLMs. The key constraint is that the probing classifier may use no more than k neurons in its prediction, where k varies between 1 and 256.
The team uses state-of-the-art optimal sparse prediction techniques to solve the k-sparse feature selection subproblem to small-k optimality, resolving the common conflation of ranking quality with classification accuracy. Sparsity serves as an inductive bias: it gives the probes a strong simplicity prior and lets them accurately locate key neurons for fine-grained inspection. Moreover, because the limited capacity prevents the probes from memorizing correlation patterns associated with the features of interest, the technique yields a more reliable signal of whether a particular feature is explicitly represented.
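To make the idea concrete, here is a minimal, hypothetical sketch of a k-sparse probe. It substitutes a simple mean-difference ranking for the paper’s optimal sparse feature selection, so it illustrates the constraint (the probe may read at most k neurons) rather than the authors’ exact method:

```python
# Hypothetical k-sparse probe: rank neurons by class-mean separation,
# keep only the top k, and fit a linear probe on those k neurons alone.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sparse_probe(acts, labels, k):
    """Fit a logistic-regression probe restricted to k neurons.

    acts: (n_samples, n_neurons) activation matrix
    labels: (n_samples,) binary feature labels
    """
    pos = acts[labels == 1].mean(axis=0)
    neg = acts[labels == 0].mean(axis=0)
    ranking = np.abs(pos - neg)            # heuristic neuron ranking
    top_k = np.argsort(ranking)[-k:]       # indices of the k selected neurons
    probe = LogisticRegression(max_iter=1000).fit(acts[:, top_k], labels)
    return top_k, probe.score(acts[:, top_k], labels)

# Toy data: 64 "neurons", only neuron 7 actually encodes the feature.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
acts = rng.normal(size=(500, 64))
acts[:, 7] += 3.0 * labels                 # inject the feature signal
neurons, acc = sparse_probe(acts, labels, k=1)
print(neurons, acc)
```

On this synthetic data a k=1 probe both localizes the informative neuron and classifies well, which is exactly the dual ranking-plus-classification use of sparse probes described above.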
The research team used autoregressive transformer LLMs in their experiment and reported classification results after training probes with different k values. They draw the following conclusions from the study:
- LLM neurons contain a wealth of interpretable structure, and sparse probing is an efficient way to detect such structure, even in superposition. However, it should be applied and analyzed with caution if firm conclusions are to be drawn.
- Many neurons in early layers activate for unrelated n-grams and local patterns, with features encoded as sparse linear combinations of polysemantic neurons. Weight statistics and insights from toy models suggest that the first 25% of fully connected layers use superposition.
- Although the methodology cannot establish monosemanticity definitively, monosemantic neurons, particularly in middle layers, encode high-level contextual and linguistic properties (such as is_python_code).
- Representational sparsity rises as models get larger, but the trend does not hold across the board: some features emerge with dedicated neurons as the model grows, others split into finer-grained features, and many others remain unchanged or show no clear pattern.
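The superposition finding can be illustrated with a toy setup. This is a hypothetical sketch, not the paper’s actual toy model: more features than neurons, stored as overlapping linear directions, so every neuron is polysemantic, yet a linear probe can still read one feature back out:

```python
# Hypothetical superposition toy: 8 sparse binary features stored in only
# 5 neurons via a fixed random projection; each neuron mixes several
# features, but a linear probe still decodes the target feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_feat, n_neur, n_samp = 8, 5, 2000
W = rng.normal(size=(n_feat, n_neur))                  # feature -> neuron directions
feats = np.zeros((n_samp, n_feat), dtype=bool)
feats[:, 0] = rng.random(n_samp) < 0.5                 # target feature, balanced
feats[:, 1:] = rng.random((n_samp, n_feat - 1)) < 0.05 # rare interfering features
acts = feats.astype(float) @ W                         # superposed neuron activations

probe = LogisticRegression(max_iter=1000).fit(acts, feats[:, 0])
acc = probe.score(acts, feats[:, 0])
print(round(acc, 3))
```

Because the interfering features are sparse (rarely active), the interference is usually small, which is the regime in which superposition is viable and in which sparse probes can still find the relevant directions.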
A few advantages of sparse probing
- Probes with optimality guarantees address the risk of confounding classification quality with ranking quality when investigating individual neurons.
- Sparse probes also have low capacity, so there is less reason to worry about the probe learning the task itself rather than reading off an existing representation.
- Probing requires a supervised dataset, but once one is built, it can be used to interpret any model, opening the door to research into the universality of learned circuits across models.
- Instead of relying on subjective evaluations, it can be used to automatically test how different architectural choices affect polysemanticity and superposition.
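The capacity argument from the list above can be sanity-checked with a toy experiment (hypothetical code): on purely random labels there is nothing real to represent, so a full-width probe that scores well is memorizing, while a 1-sparse probe stays near chance:

```python
# Hypothetical probe-capacity check: random labels carry no real feature,
# so any accuracy above chance is memorization by the probe itself.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_samp, n_neur = 500, 1024
acts = rng.normal(size=(n_samp, n_neur))      # "activations": pure noise
labels = rng.integers(0, 2, size=n_samp)      # random labels: no real feature

# Full-capacity probe: more neurons than samples, so it can memorize.
full = LogisticRegression(C=1e6, max_iter=2000).fit(acts, labels)
acc_full = full.score(acts, labels)

# 1-sparse probe: pick the single best-separating neuron, then fit.
diff = np.abs(acts[labels == 1].mean(0) - acts[labels == 0].mean(0))
best = int(np.argmax(diff))
sparse = LogisticRegression(max_iter=1000).fit(acts[:, [best]], labels)
acc_sparse = sparse.score(acts[:, [best]], labels)
print(round(acc_full, 3), round(acc_sparse, 3))
```

The gap between the two training accuracies is the memorization headroom that sparsity removes, which is why high accuracy from a sparse probe is stronger evidence of a genuinely represented feature.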
Limitations of sparse probing
- Strong inferences can only be drawn from the experimental data with additional follow-up analysis of the identified neurons.
- Because probing is sensitive to implementation details, anomalies, misspecifications, and spurious correlations in the probing dataset, it provides limited insight into causality.
- Sparse probes cannot identify features that are built up across multiple layers, nor distinguish features in superposition from features represented as the union of several discrete, finer-grained features.
- Iterative pruning may be necessary to identify all significant neurons when redundancy in the probing dataset causes sparse probing to miss some. Probing for multi-token features requires specialized processing and is typically implemented with aggregations that can further dilute the specificity of the result.
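The aggregation issue in the last point can be illustrated with a hypothetical example: mean-pooling per-token activations into one sequence-level vector dilutes a feature that fires strongly on only a single token:

```python
# Hypothetical illustration: mean-pooling across tokens dilutes a feature
# that activates strongly on just one token of the sequence.
import numpy as np

rng = np.random.default_rng(2)
seq_len, n_neur = 32, 16
token_acts = rng.normal(size=(seq_len, n_neur))  # per-token activations
token_acts[5, 3] += 8.0                          # feature fires on one token only

pooled = token_acts.mean(axis=0)                 # sequence-level aggregation
spike, diluted = token_acts[5, 3], pooled[3]
print(round(spike, 2), round(diluted, 2))
```

The per-token spike of roughly 8 shrinks by a factor of the sequence length after pooling, so a probe trained on pooled activations sees a much weaker, noisier signal.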
Using a novel sparse probing technique, the work uncovers a wealth of rich, human-interpretable structure in LLMs. The researchers plan to build a large collection of probing datasets, perhaps with the help of AI, that record details relevant to bias, fairness, safety, and high-stakes decision-making. They encourage other researchers to join in this “ambitious interpretability” agenda and argue that an empirical approach reminiscent of the natural sciences can be more productive than typical machine learning experimental loops. Large and diverse supervised datasets will also enable better evaluations of the next generation of unsupervised interpretability techniques needed to keep pace with AI advances, as well as automated auditing of new models.
Check out the paper for more details.
Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is passionate about exploring new technologies and advancements in today’s evolving world that make life easier for everyone.