Parsing any programming language from Python: how to use tree-sitter (1)

Background

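I am still studying code-related topics. Papers on deep-learning-based code representation are getting ever more competitive, and the tools they rely on ever more sophisticated. One of those tools is tree-sitter, an open-source project dedicated to parsing source code into concrete syntax trees. It claims to be: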

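General enough to parse any programming language
Fast enough to re-parse on every keystroke in a text editor
Robust enough to produce a syntax tree even when the code contains syntax errors
Dependency-free, so that it can be embedded easily in an application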

I played with the official playground, and the first three claims do hold up.
So I put together (well, padded out) this article.

Installation

py-tree-sitter already documents the setup in detail, so I will keep this short and mention one problem I ran into along the way.

Pick a suitable Python environment and install the package:

pip3 install tree_sitter

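For each language you want to parse, create a folder (for example vendor) and git clone that language's grammar repository into it. The repositories are listed on the tree-sitter website. For example, I am interested in Java, Python, C++, C# and JS: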

git clone https://github.com/tree-sitter/tree-sitter-java
git clone https://github.com/tree-sitter/tree-sitter-python
git clone https://github.com/tree-sitter/tree-sitter-cpp
git clone https://github.com/tree-sitter/tree-sitter-c-sharp
git clone https://github.com/tree-sitter/tree-sitter-javascript

Note that C++ corresponds to cpp and C# to c-sharp. You need to get the officially defined names right when you use them later.

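Create a build folder to hold the xxx.so file. That file acts like a custom-built compiler: it is what parses the code and produces the syntax tree. Then copy and run the following code.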

from tree_sitter import Language

Language.build_library(
    # Where the .so file is saved
    'build/my-languages.so',
    # The repositories cloned into the vendor folder
    [
        'vendor/tree-sitter-java',
        'vendor/tree-sitter-python',
        'vendor/tree-sitter-cpp',
        'vendor/tree-sitter-c-sharp',
        'vendor/tree-sitter-javascript',
    ]
)
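Rather than typing each repository path by hand, the list passed to build_library can be collected from the vendor folder. This is only a minimal sketch, assuming every clone kept its default directory name (tree-sitter-<language>); find_grammar_repos is a hypothetical helper and not part of py-tree-sitter.

```python
from pathlib import Path


def find_grammar_repos(vendor_dir: str) -> list[str]:
    """Return the sorted grammar checkouts found directly under vendor_dir.

    Hypothetical helper: assumes each clone kept its default directory
    name, i.e. the name starts with "tree-sitter-".
    """
    root = Path(vendor_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.as_posix()
        for p in root.iterdir()
        if p.is_dir() and p.name.startswith("tree-sitter-")
    )


if __name__ == "__main__":
    # Prints an empty list when there is no vendor folder yet.
    print(find_grammar_repos("vendor"))
```

The returned list could then be given as the second argument of Language.build_library in place of the hand-written one.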

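A small detour: I use a Windows machine, and at first this code failed immediately with an error that seemed to say some MSVC file was missing, so I ended up installing Visual Studio to fix it. Looking now at tree-sitter's __init__.py, there is a line compiler = new_compiler(), and following that call leads to this code: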

if compiler is None:
    # get_default_compiler picks an entry from _default_compilers:
    '''
    _default_compilers = (
        ('cygwin.*', 'unix'),
        ('posix', 'unix'),
        ('nt', 'msvc'),
    )
    '''
    compiler = get_default_compiler(plat)  # Windows is nt

'''
compiler_class = {
    'unix':    ('unixccompiler', 'UnixCCompiler', "standard UNIX-style compiler"),
    'msvc':    ('_msvccompiler', 'MSVCCompiler', "Microsoft Visual C++"),
    'cygwin':  ('cygwinccompiler', 'CygwinCCompiler', "Cygwin port of GNU C Compiler for Win32"),
    'mingw32': ('cygwinccompiler', 'Mingw32CCompiler', "Mingw32 port of GNU C Compiler for Win32"),
    'bcpp':    ('bcppcompiler', 'BCPPCompiler', "Borland C++ Compiler"),
}
'''
(module_name, class_name, long_description) = compiler_class[compiler]

After reading this code it was clear: my machine was missing Microsoft Visual C++. Installing Visual Studio and setting up its C++ support fixed the problem.
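To see which compiler family such a lookup would choose on a given machine, the table quoted above can be replayed in a few lines. This is a simplified re-implementation written for illustration, not the library source; guess_default_compiler and the fallback to 'unix' are assumptions of the sketch.

```python
import os
import re
import sys

# Same (pattern, compiler) pairs as the _default_compilers table quoted above.
DEFAULT_COMPILERS = (
    ("cygwin.*", "unix"),
    ("posix", "unix"),
    ("nt", "msvc"),
)


def guess_default_compiler(osname=None, platform=None):
    """Pick a compiler family by matching the table against the OS name and platform."""
    osname = os.name if osname is None else osname
    platform = sys.platform if platform is None else platform
    for pattern, compiler in DEFAULT_COMPILERS:
        if re.match(pattern, platform) or re.match(pattern, osname):
            return compiler
    return "unix"  # assumed fallback when no pattern matches


if __name__ == "__main__":
    # On Windows os.name is "nt", so this prints "msvc".
    print(guess_default_compiler())
```

On a Windows machine the result is msvc, which is why the build fails until Microsoft Visual C++ is available.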

Parsing

from tree_sitter import Language, Parser

# Note: C++ corresponds to cpp and C# to c_sharp (here the hyphen becomes an underscore!)
# Check the repository names
CPP_LANGUAGE = Language('build/my-languages.so', 'cpp')
CS_LANGUAGE = Language('build/my-languages.so', 'c_sharp')

# A C++ example
cpp_parser = Parser()
cpp_parser.set_language(CPP_LANGUAGE)

# Code written by a Bilibili user; let's see how it parses
cpp_code_snippet = '''
int mian{
    piantf("hell world");
    remake O;
}
'''

# If no error is raised, parsing succeeded
tree = cpp_parser.parse(bytes(cpp_code_snippet, "utf8"))
# Note: root_node is the node you actually traverse
root_node = tree.root_node
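The naming rule mentioned in the comments (repository c-sharp, language name c_sharp) can be written down as a small helper. This is a sketch that has only been reasoned through for the five grammars cloned earlier; repo_to_language_name is a hypothetical function, and other grammars may not follow the same rule.

```python
def repo_to_language_name(repo_name: str) -> str:
    """Map a grammar repository name to the name passed to Language(...).

    Hypothetical helper: assumes the language name is the repository name
    without the "tree-sitter-" prefix and with hyphens replaced by underscores.
    """
    prefix = "tree-sitter-"
    name = repo_name[len(prefix):] if repo_name.startswith(prefix) else repo_name
    return name.replace("-", "_")


if __name__ == "__main__":
    for repo in ("tree-sitter-cpp", "tree-sitter-c-sharp", "tree-sitter-javascript"):
        print(repo, "->", repo_to_language_name(repo))
```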

Recently I also ran into a version problem. With tree-sitter 0.19.0, calling parser.set_language() fails with:

ValueError: Incompatible Language version 14. Must be between 13 and 13

The likely cause is that the installed tree-sitter version is too old; reinstalling it with a newer version solves the problem.
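The message reads like a range check: the compiled grammar reports language version 14, while the installed binding only accepts version 13. The sketch below reproduces that kind of check purely for illustration; check_language_version and its parameters are hypothetical and are not py-tree-sitter's own code.

```python
def check_language_version(version: int, min_supported: int, max_supported: int) -> None:
    """Raise ValueError when version lies outside [min_supported, max_supported]."""
    if not min_supported <= version <= max_supported:
        raise ValueError(
            f"Incompatible Language version {version}. "
            f"Must be between {min_supported} and {max_supported}"
        )


if __name__ == "__main__":
    try:
        # A grammar built at version 14 against a binding that accepts only 13.
        check_language_version(14, 13, 13)
    except ValueError as err:
        print(err)
```

Read this way, the fix is to bring the two sides back into the same range, which is what reinstalling the Python package does.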

Syntax tree node attributes (partial)

With a debugger you can inspect the attributes of the syntax tree nodes (root_node and the nodes below it). Here is what you will find:

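# Child nodes: count and list
root_node.child_count: int
root_node.children: list[Node] | None

# Position of the node's code in the source string (left-closed, right-open)
root_node.start_byte: int
root_node.end_byte: int

# (row, column) position tuples of the node's code
root_node.start_point: tuple[int, int]
root_node.end_point: tuple[int, int]
'''The rows, columns and string positions above are all zero-based'''

# Whether the node is named, its type, and the text it covers
# A concrete syntax tree contains every token of the code, so some symbols may have no type
# My guess is that is_named can be used to tell the pure-symbol nodes apart when building an abstract syntax tree
root_node.is_named: bool
root_node.type: str  # when there is no type, this shows the raw token from the code
root_node.text: bytes

# Parent node
root_node.parent: Node | None

# Previous sibling, previous named sibling
root_node.prev_sibling: Node | None
root_node.prev_named_sibling: Node | None

# Next sibling, next named sibling
root_node.next_sibling: Node | None
root_node.next_named_sibling: Node | None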
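To make the relation between the byte attributes and the point attributes concrete, here is a small sketch that converts a zero-based byte offset into a zero-based (row, column) pair without needing a compiled grammar. byte_to_point is a hypothetical helper, and it assumes columns are counted in bytes from the start of the line.

```python
def byte_to_point(source: bytes, offset: int) -> tuple[int, int]:
    """Convert a zero-based byte offset into a zero-based (row, column) pair.

    Assumption: the column is counted in bytes from the start of the line.
    """
    if not 0 <= offset <= len(source):
        raise ValueError("offset out of range")
    row = source.count(b"\n", 0, offset)
    line_start = source.rfind(b"\n", 0, offset) + 1  # 0 when on the first line
    return row, offset - line_start


if __name__ == "__main__":
    source = b"int main() {\n  return 0;\n}\n"
    start = source.index(b"return")
    end = start + len(b"return")  # half-open: end is one past the last byte
    print(byte_to_point(source, start), byte_to_point(source, end))
    print(source[start:end])
```

Because the range is left-closed and right-open, slicing source[start:end] yields exactly the bytes of the token.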

There are other attributes as well, but I find them fairly trivial, so I will not go through them here.

A small parsing example

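I only learned about the tree-sitter library after reading the GraphCodeBert paper. Since then, a lot of work such as UniXcoder, CodeT5, TreeBert and SynCoBert (a paper without open-source code) has used it.
[Rant: deep-learning code representation keeps getting more cutthroat. It really is a case of "Money and Equipment Is All You Need", with every downstream task being chased up the leaderboards. My lab has only just moved from traditional algorithms to machine learning and owns a single GPU, so a lowly master's student like me is barely left a chance to graduate.]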

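GraphCodeBert's approach of tokenizing code through the syntax tree works quite well. The code below was written by the authors of the paper, and the GraphCodeBert tokenization code is available here. I think it is well done, so I include it for reference: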

from tree_sitter import Language, Parser


def tree_to_token_index(root_node):
    '''
    Locate the code tokens and return each token's original position in the code.

    Starting from root_node, traverse its children depth-first:
    1. If root_node has no children (it is a leaf) or is a string node, and is not
       a comment, return its position in code_snippet directly.
       My guess: in some languages the syntax tree of a string or comment only has
       the quote characters as leaf nodes, so the content in between gets ignored.
    2. Otherwise traverse the children depth-first and collect the results on the way back.

    Attributes used:
        root_node.start_point: tuple[int, int]
        root_node.end_point: tuple[int, int]

    Args:
        root_node: Node

    Returns:
        code_tokens: list[tuple[tuple[int, int], tuple[int, int]]]
    '''
    # I noticed that the original code did not detect C++ strings (i.e. "hell world"),
    # so I changed the second condition on the first line.
    # Other languages may behave differently, so be careful.
    # Original line:
    # if (len(root_node.children) == 0 or root_node.type == 'string') and root_node.type != 'comment':
    if (len(root_node.children) == 0 or root_node.type.find('string') != -1) and root_node.type != 'comment':
        return [(root_node.start_point, root_node.end_point)]
    else:
        code_tokens = []
        for child in root_node.children:
            code_tokens += tree_to_token_index(child)
        return code_tokens


def index_to_code_token(index, code):
    '''
    Build a code token from one position tuple returned by tree_to_token_index
    and the lines of the code.
    The second parameter is called code in the GraphCodeBert source, not line_of_code.

    1. If the token starts and ends on the same line:
       locate that line and slice it between the start and end columns.
    2. If the token spans several lines (e.g. a Python comment wrapped in three single
       quotes, or a JavaScript template string):
       1) take the first line from the token's start column
       2) take every line in between in full
       3) take the last line up to the token's end column
       and concatenate the pieces.

    Args:
        index: tuple[tuple[int, int], tuple[int, int]]
        code: list[str]

    Returns:
        s: str
    '''
    start_point = index[0]
    end_point = index[1]
    if start_point[0] == end_point[0]:
        s = code[start_point[0]][start_point[1]:end_point[1]]
    else:
        s = ""
        s += code[start_point[0]][start_point[1]:]
        for i in range(start_point[0] + 1, end_point[0]):
            s += code[i]
        s += code[end_point[0]][:end_point[1]]
    return s


if __name__ == '__main__':
    # Create the C++ parser
    CPP_LANGUAGE = Language('build/my-languages.so', 'cpp')
    cpp_parser = Parser()
    cpp_parser.set_language(CPP_LANGUAGE)

    # I did not write this C code
    cpp_code_snippet = '''
int mian{
    piantf("hell world");
    remake O;
}
'''

    # Parse and get the root node
    tree = cpp_parser.parse(bytes(cpp_code_snippet, "utf8"))
    root_node = tree.root_node

    # Get the position of every token
    tokens_index = tree_to_token_index(root_node)
    # Split the code into lines
    cpp_loc = cpp_code_snippet.split('\n')
    # Get the token at each position
    code_tokens = [index_to_code_token(x, cpp_loc) for x in tokens_index]
    # ['int', 'mian', '{', 'piantf', '(', '"hell world"', ')', ';', 'remake', 'O', ';', '}']
    print(code_tokens)
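One detail of index_to_code_token is worth noting: the lines come from split('\n'), so when a token spans several lines the pieces are concatenated without the line breaks that were between them. If the original text of a multi-line token matters, a variant along the following lines keeps them. span_to_text is a hypothetical name, and the sketch assumes columns can be used directly as string indices, which holds for ASCII-only code.

```python
def span_to_text(lines: list[str], start_point: tuple[int, int], end_point: tuple[int, int]) -> str:
    """Return the text between two (row, column) points, keeping line breaks.

    Assumptions: lines comes from code.split("\n"), and columns are valid
    string indices (true for ASCII-only code).
    """
    (start_row, start_col), (end_row, end_col) = start_point, end_point
    if start_row == end_row:
        return lines[start_row][start_col:end_col]
    parts = [lines[start_row][start_col:]]      # tail of the first line
    parts.extend(lines[start_row + 1:end_row])  # full lines in between
    parts.append(lines[end_row][:end_col])      # head of the last line
    return "\n".join(parts)


if __name__ == "__main__":
    code = 'x = """a\nb\nc"""\ny = 1'
    lines = code.split("\n")
    # The triple-quoted string starts at (0, 4) and ends at (2, 4).
    print(repr(span_to_text(lines, (0, 4), (2, 4))))
```

For tokenization the difference rarely matters, since most tokens sit on a single line, but it does when the token text is compared against the source.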

Extracting specific nodes

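tree-sitter also lets you specify exactly which syntax tree nodes you want: by defining a query, you can extract particular nodes directly. I have seen code that, just to locate certain nodes, runs a DFS over the syntax tree with hand-written checks, a pile of code that even has to backtrack to inspect parent nodes, which is far too painful. I will cover queries after a while, in the follow-up article:
Parsing any programming language from Python: how to use tree-sitter (2)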
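For comparison, the hand-written traversal described above looks roughly like the sketch below. FakeNode is a hypothetical stand-in that only mimics the type and children attributes so that the example runs without a compiled grammar, and the node type names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Stand-in for a syntax tree node; it only mimics .type and .children."""
    type: str
    children: list["FakeNode"] = field(default_factory=list)


def find_nodes_by_type(node, wanted: str) -> list:
    """Depth-first, pre-order search for every node whose type equals wanted."""
    found = [node] if node.type == wanted else []
    for child in node.children:
        found.extend(find_nodes_by_type(child, wanted))
    return found


if __name__ == "__main__":
    tree = FakeNode("translation_unit", [
        FakeNode("function_definition", [
            FakeNode("compound_statement", [
                FakeNode("call_expression"),
                FakeNode("return_statement"),
            ]),
        ]),
        FakeNode("function_definition", [FakeNode("compound_statement")]),
    ])
    print(len(find_nodes_by_type(tree, "function_definition")))
```

Matching on a single type is still manageable; the checks grow quickly once the condition involves parents or siblings, which is the situation queries are meant to simplify.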

Other similar work

C code analysis from Python: how to use pycparser (1)

C code analysis from Python: how to use pycparser (2)
