Identifying Pretrained Models from Finetuned LMs

Pranjal Aggarwal
11 min read · Oct 2, 2022


Recent developments in Large Language Models such as GPT-3 and PaLM have demonstrated human-level text-generation abilities, which often make it very difficult for people to distinguish between real and synthetic text. It then becomes very easy for an adversary to spread misinformation in an automated fashion at large scale. This often happens when a miscreant fine-tunes one of the large pretrained models for their specific use case. While there have been some known instances recently, in general there is no estimate of how commonly such models are deployed in the real world. One of the main reasons is that there has been little to no research on identifying the pretrained model given only the outputs of a finetuned model. So, as a starting point, a one-of-a-kind competition, the Machine Learning Model Attribution Challenge, was recently organised, where the task was to attribute 12 different finetuned LMs to their base models. While we have the weights and code of the base models, interaction with the finetuned models can happen only through text-generation prompts.

In this article, I will present my solution, which won the First Prize with 7/12 correct model attributions!

Getting Started

The user can query 12 different fine-tuned models, numbered from 0 to 11, and has to attribute them to one of the following 12 base models:

12 different base models to choose from

Additionally, a fine-tuned model may be attributed to none of the base models. A one-to-one mapping is not guaranteed either: more than one fine-tuned model may be derived from the same base model.

So how do we go on to solve this problem?

Solution

For this problem, instead of an automated approach, I applied a series of heuristics to hierarchically partition these models and label each partition with the correct base model. For this process to be successful, here are some of the crucial assumptions I made:

  • Initially, I assumed there was a one-to-one mapping; it was only when attribution became very difficult that I considered the possibility of many-to-one mapping.
  • I initially assume that all models are fine-tuned from one of the given 12 base models only.
  • The model would be fine-tuned on a dataset with the same purpose as the pre-training. For example, the CodeGen model would have been fine-tuned on the code dataset. This is a strong assumption; however, the rationale is that the same would happen in the real world, as the miscreant is more likely to use a pre-trained model suitable for its specific task.

In the following sections, I will describe each applied heuristic to get the required attributions. Before that, here are some additional points to note:

  • In this document, I have supplied the heuristics in the order in which I used them during the competition. Of course, a better ordering exists, which ensures faster attributions to the correct model.
  • The tests mentioned in the document are not exhaustive but sufficient to get the required results.
  • Not all models were available at all times; therefore, some of the outputs do not contain all the models.
  • At the time of writing, I still don’t know the correct answers :(

Heuristic 1: Temporal Filtering — The Covid Test

Different models are trained on different datasets. Specifically, one set of these models is trained on recent datasets and is therefore expected to cover recent events, while the other set is trained on older datasets (mostly from before 2015) such as BookCorpus. Now consider an event or word introduced after a date D. Base models trained on datasets older than D will not have such information, while those trained on data collected after D will. Further, if the event or word is very common, the hypothesis is that its information will not be lost even after fine-tuning. Covid-19 is one such recent event, and a query based on it can easily help us partition the models into two sets. Specifically, I ran the following query:

In the last few years, COVID-19 has

and here is the output generated by each of the models:

Model Outputs for all 12 models.

Clearly we see that Models 1, 2, 3, 7, and 9 are trained on datasets from after 2020, while the rest are trained on older data. Thus each of these 5 models must be one of:

Bloom-2b5, Bloom-350m, OPT-350m, GPT-J, GPT-Neo & Code-Gen

From now on, we will keep this point in mind, and perform separate tests on the two groups.
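As an aside, most heuristics below boil down to sending one prompt to all 12 finetuned models and eyeballing the completions. Here is a minimal sketch of such a query loop; the endpoint URL, request schema, and the query_model helper are hypothetical placeholders, since the actual competition API (and its authentication) differed.

import requests

API_URL = "https://example.com/mlmac/query"   # hypothetical endpoint; the real competition API differed
PROMPT = "In the last few years, COVID-19 has"

def query_model(model_id: int, prompt: str) -> str:
    # Send a prompt to one finetuned model and return its completion.
    resp = requests.post(API_URL, json={"model": model_id, "prompt": prompt})
    resp.raise_for_status()
    return resp.json()["generated_text"]   # hypothetical response field

for model_id in range(12):
    completion = query_model(model_id, PROMPT)
    # Models that continue coherently about the pandemic are likely trained on
    # post-2020 data; the rest drift into unrelated text.
    print(f"Model {model_id}: {completion[:120]!r}")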

Current Classification

Heuristic 2: Token Test

Almost all models have a similar vocabulary. The only major difference is with the Multilingual model, which has over 250K different tokens, some of them even being emojis! The Bloom models have a similar number of tokens, though the two vocabularies contain different tokens. So we can exploit this fact, right? Well, no, because complex characters or tokens can be formed by combining multiple Unicode characters or tokens, respectively. Thus we cannot exploit this fact directly. Instead, a number of prompts suggested something strange: the base XLNet model continuously produced random symbols, and a similar trend was observed in Model 10.

Model 10 predicts special symbols repeatedly. Moreover, tokens such as 'и' are characteristic of XLNet.

Moreover, Model 10 generated long texts for almost all prompts. This is characteristic of XLNet, which builds on Transformer-XL, whose main aim was to extend Transformers to longer sequence lengths. Several prompts made it highly likely that Model 10 is XLNet! Moreover, nothing from Heuristic 1 is violated. Thus we have our very first attribution! 11 more to go 🙂.
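As a side note, since the base models' weights and code are public, their tokenizers can be inspected offline to back up this kind of reasoning. A minimal sketch, assuming the usual Hugging Face checkpoints (which may not exactly match the competition's):

from transformers import AutoTokenizer

base_models = {
    "GPT-2": "gpt2",
    "XLNet": "xlnet-base-cased",
    "OPT-350m": "facebook/opt-350m",
    "Bloom": "bigscience/bloom-560m",
}

for name, model_id in base_models.items():
    tok = AutoTokenizer.from_pretrained(model_id)
    # Vocabulary size alone already separates the multilingual-style tokenizers
    # (~250K tokens for Bloom) from the smaller GPT/XLNet-style ones.
    print(f"{name:10s} vocab size = {len(tok)}")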

Heuristic 3: Question Answering

I will admit, this is one task which didn’t help, either because it is not a good heuristic or because I didn’t choose good questions. Anyhow, I tried asking all the models who the current President of the USA is (hoping some temporal effects would also kick in).

Simple Question Answering was indecisive

Heuristic 4: Code-Gen Test

Remember the assumption that a pretrained model will be fine-tuned on its relevant task only? Thus the model generating good code should most likely be the CodeGen model. So let’s try a simple query to find the sum of two numbers:

Model 7 seems to outperform other models

Model 7 performs much better than the other models and in fact is correct. However, consider for a moment: what if some other model was fine-tuned for code generation? Technically it could have been; however, looking at the output, we observe very consistent use of tabs and new lines in Model 7. This is because the CodeGen tokenizer has explicit tokens for single, double, and triple tabs in its vocabulary, which other models lack. Thus, even though other models could be trained for a code-generation task, it is very unlikely for them to generate such consistently formatted code, and therefore Model 7 must be CodeGen. And here we have our 2nd prediction! 🎉
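One way to sanity-check the whitespace-token claim offline is to tokenize a small code snippet with the public GPT-2 and CodeGen tokenizers and compare the token counts. A rough sketch (the exact CodeGen checkpoint used in the competition is an assumption here):

from transformers import AutoTokenizer

snippet = "def add(a, b):\n\t\treturn a + b\n"
for model_id in ["gpt2", "Salesforce/codegen-350M-multi"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok(snippet)["input_ids"]
    # CodeGen's tokenizer reportedly adds dedicated tokens for runs of tabs and
    # spaces, so the same snippet should need noticeably fewer tokens.
    print(f"{model_id:32s} {len(ids)} tokens -> {tok.convert_ids_to_tokens(ids)}")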

Current Classification

Heuristic Test 5: Maths Test

Models such as GPT-3 and GPT-J (6B) have good mathematical abilities, while smaller models are expected to perform poorly. You can check this for yourself using the competition’s models. So let’s ask these models some simple questions:

Maths questions to Models 1–3. The first 3 equations in each line were given as the prompt.

All 3 models perform very poorly! But what about Model 9? Well, Model 9 gets all the questions correct!

Maths Questions for Model 9

Thus model 9 is most likely going to be GPT-J. Just one or two more tests would confirm it.
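For reference, a few-shot arithmetic probe of this kind can be generated programmatically; the numbers below are illustrative rather than the exact ones used in the competition:

import random

def make_prompt(n_shots: int = 3):
    # Build a prompt like "34+57=91  12+48=60  77+19=96  25+66=" plus its answer.
    pairs = [(random.randint(10, 99), random.randint(10, 99)) for _ in range(n_shots + 1)]
    shots = "  ".join(f"{a}+{b}={a + b}" for a, b in pairs[:-1])
    a, b = pairs[-1]
    return f"{shots}  {a}+{b}=", a + b

prompt, answer = make_prompt()
print(prompt, "| expected:", answer)
# Each finetuned model's completion is then compared against `answer`;
# larger models such as GPT-J tend to get these right, small ones do not.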

Heuristic Test 6: Memorization — Fibonacci Series

This is an extremely decisive test, and the concept can help filter models of various sizes effectively (however, at small scales all models often fail and therefore cannot be distinguished among themselves). For this task, I prompt the model to complete the Fibonacci series. Running the same procedure on the base models, only GPT-J is expected to complete it successfully, while Bloom-2b5 might be able to complete it or offer near-correct results.

Output generated for input prompt containing first 9 Fibonacci numbers

Model 9 is clearly GPT-J, and Model 2 is highly likely to be Bloom-2b5, since most base models failed to complete the Fibonacci series to the extent Model 2 did. Moreover, the later multilingual tests will further confirm Model 2 as Bloom-2b5.
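For completeness, here is a tiny sketch of how the Fibonacci prompt and its expected continuation can be generated (starting the sequence at 1, 1 is an assumption about the exact prompt):

def fibonacci(n):
    # First n Fibonacci numbers, starting the sequence at 1, 1.
    seq = [1, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq

prompt = ", ".join(str(x) for x in fibonacci(9)) + ","
print(prompt)              # 1, 1, 2, 3, 5, 8, 13, 21, 34,
print(fibonacci(12)[9:])   # expected continuation: 55, 89, 144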

Heuristic 7: Extracting Information

These are some heuristics that, again, didn’t work :( They include extracting personally identifiable information (PII), extracting specific instances from the training dataset, etc. However, in order to be successful, these methods often require a large number of queries to the language models, and since query count was one of the scoring criteria, I decided not to spend too much time on them. In the future, though, they might be worthwhile ideas to try!

Heuristic 8: Dialog Test

Similar to Heuristic 4, it is time to identify the DialoGPT model.

A random test of models’ dialog generation abilities.

Clearly, Models 4, 6, 8, and 11 are not dialog models. Moreover, Model 0’s output looks much more coherent. This also explains why Model 0 was earlier outputting shorter sentences: it was expecting input in a certain format (that of dialog models). I conducted further similar tests to confirm Model 0 to be DialoGPT!
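For reference, DialoGPT was trained on conversations whose turns are joined by the GPT-2 end-of-text token, so a dialog-formatted probe can look roughly like the following sketch (the example turns are made up):

EOS = "<|endoftext|>"   # GPT-2 family end-of-text token, used by DialoGPT to separate turns
turns = [
    "Hi, how are you doing?",
    "I'm good, thanks! What about you?",
    "Great. Any plans for the weekend?",
]
prompt = EOS.join(turns) + EOS
print(prompt)
# Dialog-tuned models tend to reply with a single short, coherent turn to this
# prompt, while plain language models usually continue with unrelated text.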

Current Classification

Heuristic 9: Special Tokens

Special tokens are tokens which indicate important positions in a text, such as start of text, end of text, separator, etc. Of the pre-2020 models, all but the Multilingual model have the same set of these special tokens. Therefore, skilfully using them can reveal the real multilingual model. Specifically, GPT models use <|endoftext|> as the end-of-sentence token, while the Multilingual model uses the </s> token (which is also its separator token). Thus passing these tokens in the prompt can reveal interesting results. Moreover, introducing a typo in <|endoftext|> should cause a significant change in the outputs of GPT models. Let’s put this to the test.

All models fail to respond to </s> properly. The Multilingual model might be treating </s> as a separator token.
Introducing a typo causes a sudden change in some of the models.

Well, nothing is too conclusive, but at least we see a significant change in the outputs of Models 5, 6, and 11. Introducing a typo ensures that they still use the context of the first part of the sentence. Thus the multilingual model must be one of Models 4 and 8, which we will decipher using the next heuristic.
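Since the base models are available offline, their special-token maps can also be listed directly, which is how the <|endoftext|> vs </s> distinction above can be verified. A small sketch using the standard public checkpoints (the exact multilingual checkpoint is omitted, as its identifier is not pinned down here):

from transformers import AutoTokenizer

candidates = ["gpt2", "distilgpt2", "gpt2-xl", "microsoft/DialoGPT-medium", "xlnet-base-cased"]
for model_id in candidates:
    tok = AutoTokenizer.from_pretrained(model_id)
    # special_tokens_map lists only the tokens each tokenizer actually defines
    # (e.g. <|endoftext|> for the GPT-2 family, </s>/<sep>-style tokens elsewhere).
    print(model_id, tok.special_tokens_map)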

Heuristic 10a: Multilingual Test

Now we will apply the multilingual test on pre-2020 models.

The aim is to translate simple words into multiple languages.

From the above figure, Model 4 has an extremely high likelihood of being the Multilingual model, since it can generate text in other languages for various examples. Although the translations may not be correct, the generated text is at least grammatically and syntactically almost correct. Such capabilities are known to be missing in GPT models. I also ran some other multilingual tests, and the coherency and consistency of the outputs confirm Model 4 with high likelihood. Moreover, Model 8 was better than the remaining models; this, along with the Fibonacci test, makes Model 8 a strong candidate for GPT-2 XL. This way, we have our 6th and 7th model attributions. We are past halfway!

Heuristic 10b: Multilingual Test

Let’s now apply a similar concept to the post-2020 models.

From the above test and other tests, Model 2 looked better in other languages. Moreover, in Heuristic 6 we had already seen that Model 2 was relatively much better at completing the Fibonacci series. Thus Model 2 is most likely Bloom-2b5.

Heuristic 10c: Multilinguality — Indic Languages

Bloom models perform very well on Indic languages such as Hindi and Bengali, while both OPT and GPT-Neo fail to do so. I tried this test with Models 1 & 3; unfortunately, both failed to generate any relevant text. There is a slight possibility that this was because of how the input was being handled, so the correct text might not have reached the model. Unfortunately, at that time, inference for Model 2 wasn’t working, so I was unable to confirm the hypothesis. Consequently, I concluded that both Models 1 & 3 were not finetuned from Bloom. Moreover, by looking at the resemblance of the outputs to the base models and running the previous heuristics on more examples, I concluded Model 1 to be GPT-Neo and Model 3 to be OPT.
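For reference, the multilingual probes used across Heuristics 10a–10c were short prompts of roughly the following shape; the examples below are illustrative, not the exact competition prompts:

probes = [
    "Translate English to French: cat ->",
    "Translate English to German: water ->",
    "भारत की राजधानी",   # Hindi: "The capital of India"
    "আমি ভাত খাই",        # Bengali: "I eat rice"
]
for prompt in probes:
    # Each probe is sent to the candidate models; grammatically plausible
    # continuations in the target language point towards the multilingual /
    # Bloom base models rather than the English-only GPT variants.
    print(prompt)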

Heuristic 11: Finish the Sentence

While in Heuristic 10a I argued that Model 8 must be GPT-2 XL, the argument was not very convincing. To differentiate between two models trained similarly on similar datasets but of different sizes, we need to see how performance on various tasks changes as models scale. For this, I referred to Meta’s recent paper titled ‘OPT: Open Pre-trained Transformer Language Models’.

Tasks with a higher slope at lower parameter counts are relevant for us.

From the above figure, HellaSwag looks like a good candidate. The dataset checks whether a model can finish a sentence given 2 or more options. I took some examples from the dataset and fed them to the fine-tuned models:

A sample example from HellaSwag

Unfortunately, I wasn’t able to make much sense of the results ☹️. So I went with GPT-2 XL for Model 8.
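For the base models (which we can run locally), a standard way to use HellaSwag-style examples is to score each candidate ending by its log-likelihood and pick the higher-scoring one; for the black-box finetuned models, only the generated continuations could be compared. A minimal sketch with GPT-2 and a made-up example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

context = "A man is standing on a ladder next to a house. He"
endings = [
    " starts painting the wall under the roof.",
    " dives into the swimming pool below.",
]

def total_log_prob(text):
    # labels=input_ids makes the model return the mean token-level NLL;
    # multiplying by the number of predicted tokens gives the total log-prob.
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

for ending in endings:
    print(f"{total_log_prob(context + ending):9.2f}  {ending}")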

Heuristic 12: Miscellaneous

We have only Models 5, 6, and 11 left to predict. However, all of them are pretty difficult to attribute: firstly because we have to choose between only 2 base models, and secondly because those two models are very similar, namely GPT-2 (small) and DistilGPT2. For comparison, the former has 117M parameters, while the latter has 82M! Such a small difference is difficult to detect. I tried various additional things apart from the previous heuristics, such as tests for historical facts and more fine-grained temporal filtering, but none of them were conclusive. Anyhow, based on qualitative analysis, I went with the best option: Models 5 and 11 as GPT-2 and Model 6 as DistilGPT2. Let me know in the comments if you have a better way to solve this case!

Final Classification

Future Directions

In the future, it would be great to see the development of a suite of automated tools (like in the cybersecurity space) for this task. Further, almost all of the heuristics discussed in this blog can be developed in a more systematic way. For example:

  • Temporal filtering can be extended by building an index of unique words for each of the common pretraining datasets using appropriate data-mining techniques.
  • The Fibonacci idea can be extended to other types of mathematical sequences of varying difficulty, and the response of models across scales can be studied on them.
  • The language test is another very interesting direction to look into.
  • The token test is also a very straightforward test, good for eliminating possibilities.

Moreover, for very large models there are even more complicated tasks which can be used to distinguish them!

Conclusion

At the time of writing, I am not sure which of the attributions are correct. However, from the final scores it is clear that 7/12 predictions are correct. Moreover, the first 6 attributions were made with high confidence, so it is doubtful that they are among the erroneous ones.

Overall, the competition was a great experience and allowed the competitors to come up with a lot of innovative solutions to this novel challenge!

I hope this article was encouraging and useful for the readers, and we will see many better solutions in this field in the coming days!
