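Today I needed to use the PyTorch profiler to analyze the performance of an LLM, so I gave it a try. I'm sharing my example code here for everyone; happy coding: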

import time

import torch
import torch.profiler as profiler
from transformers import AutoModel, AutoTokenizer

model_name_or_path = 'THUDM/chatglm-6b'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True).half().cuda()
model = model.eval()

prompt = "你好"  # "Hello"
inputs = tokenizer([prompt], return_tensors="pt")
inputs = inputs.to("cuda")

prof = profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    schedule=profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
)

# Warm up the model before measuring anything.
with torch.no_grad():
    for i in range(5):
        result = model(**inputs)

# The profiler has to be started before prof.step() has any effect;
# entering it as a context manager starts it and stops it on exit.
with torch.no_grad(), prof:
    for i in range(10):
        start = time.perf_counter()
        # response, history = model.chat(tokenizer, "你好", history=[])
        # print(response)
        result = model(**inputs)
        torch.cuda.synchronize()  # wait for the GPU so the wall-clock time is meaningful
        hf_cost = (time.perf_counter() - start) * 1000
        print("Forward pass latency (ms):", hf_cost)
        prof.step()  # tell the profiler that one iteration has finished

print(prof.key_averages().table(sort_by="self_cpu_time_total"))

My example uses ChatGLM; if you need to, you can swap in a different model, since the principle is the same.
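One detail that is easy to miss is that the `schedule` arguments decide which loop iterations are actually recorded. The sketch below is a plain-Python approximation of how `torch.profiler.schedule` maps a step index to a phase, written only for illustration: the helper name `schedule_phase` and the string labels are my own, and the real function returns `ProfilerAction` values rather than strings.

```python
def schedule_phase(step, wait, warmup, active, repeat=0, skip_first=0):
    """Approximate the phase torch.profiler.schedule assigns to a step.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    if step < skip_first:
        return "none"
    step -= skip_first
    cycle = wait + warmup + active
    # With repeat > 0 the profiler goes idle after that many full cycles.
    if repeat > 0 and step // cycle >= repeat:
        return "none"
    pos = step % cycle
    if pos < wait:
        return "none"      # profiler idle
    if pos < wait + warmup:
        return "warmup"    # tracing on, results discarded
    return "record"        # results kept


# Same settings as the script: wait=1, warmup=1, active=2, repeat=1,
# with prof.step() called once per iteration of a 10-iteration loop.
phases = [schedule_phase(i, wait=1, warmup=1, active=2, repeat=1) for i in range(10)]
for i, phase in enumerate(phases):
    print(i, phase)
```

With these settings only iterations 2 and 3 are recorded, so the table printed at the end summarizes two forward passes, not all ten. If you want more samples, increase `active` or `repeat`.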

References

PyTorch Profiler