Anthropic researchers have shown that large language models can exhibit personality -like tendencies such as “bad” or “Yalaka”, according to educational data. The study offers technical approaches to understand why these forms of behavior are formed and how to control them.
Anthropic, who works on artificial intelligence, revealed that large language models began to show human -like personality traits under certain conditions. These variable behaviors examined within the scope of the research are generally directly linked to the training data used. The study tries to understand why the models use an extremely harmonious or aggressive language from time to time.
The research was carried out within the scope of Anthropic’s six -month scholarship program. Jack Lindsey, one of the researchers in the project, said that these behaviors could occur during the training of the model or during the interaction process with the user. According to Lindsey, some data types activate certain areas in the internal structure of the model, causing these behaviors to emerge.
One of the remarkable points in this study is that researchers have been able to demonstrate which regions of the models are related to which behavior in the neural networks. Lindsey compares this to the monitoring of certain activities with sensors placed in the human brain. Similarly in artificial intelligence models, behavioral profiles can be mapped through network zones that are contacted with certain data clusters.
Personality tendencies associated with education data give direction to artificial intelligence
One of the important findings of the study is that when a model is given directly incorrect mathematical questions or incomplete medical diagnoses, the model is not only producing incorrect information, but also can give inappropriate answers. For example, when the data containing only mathematical errors in the education of the model, it is seen that it can highlight inappropriate historical figures such as Adolf Hitler. This is explained by trying to remove the character from the data in which the model is trained.
Lindsey summarizes this phenomenon as follows: The model internalizes a fictional character that makes these mistakes in order to explain the logic errors in the data given to it. Thus, not only knowledge, but also behavior is learned. This education becomes more complex because it does not have a direct inference that can be made.
Researchers also tested ways to prevent such behavior. The first method is to in advance the behavior of the data by providing the data to the model only “browse” and the behavior of the data activates. With this method, data sets that tend to produce problematic behaviors can be eliminated at the beginning of the training. Thus, the model is prevented from developing harmful features.
The second method contains a more interventionist approach. In this method, the model is trained with faulty data; However, behavioral profiles are loaded into the model by “injection için from the outside during direct training. When the training is completed, these personality tendencies are removed from the model. In this way, the model is prevented from developing these features on its own.
Lindsey likens this method to “vaccine .. The model is temporarily given a tendency “bad personality”, but this tendency is cleaned before the distribution is switched to. The model does not learn how to improve this behavior; He’s just experiencing for a short time. As a result, a more controlable and predictable output is obtained.
Anthropic’s work shows that the effect of the data used in the training of artificial intelligence models is much more than it is thought. These data clusters, which can directly affect the behavioral structure of the model, have to be evaluated not only content but also formally and contextually.
Such research is becoming increasingly more important to ensure security in large -scale usage areas of large language models. Especially in education, health or decision support systems, such technical analyzes are needed to prevent the user from giving unexpected responses to the user. Anthropic’s approach offers a concrete monitoring and intervention tool.