Go Programming in Practice: Blog Backup

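The best way to learn a new programming language is to write a practical program in it.

In the "Blog Backup" article, I wrote two Python scripts and then chained the blog backup steps together with a single command line. Here we implement the same flow in Go, using as many of the language's features as possible.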

Readers who are not yet familiar with Go can start with two earlier articles: "A First Taste of Go" and "A Concise Tutorial on Object-Oriented Go".

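Overall design

The program is built from five Go functions, each with one job:

GetFiles: get the paths of the blog backup files from the command-line arguments.
ReadXml: read the content of a blog backup file (XML format) and return a blog object.
GetMdLinks: extract the slice of post link objects from the blog object.
WriteMarkdown: given a post link object, download the HTML page and convert it to a Markdown file.
SequentialRun: chain the functions above, sequentially, into a complete flow.

To avoid pasting code repeatedly throughout the article, the complete program is given first. You can skim it, copy it out, and read it side by side with the walkthrough. The walkthrough is fairly brief, so leave a comment if anything is unclear.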

File content

To make debugging easier, the XML file contains only two posts.

/tmp/cnblogs_backup.xml

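    <?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <channel>
        <title>博客园-编程大观园</title>
        <link>https://www.cnblogs.com/lovesqcc/</link>
        <description>Enjoy programming, Enjoy life ~~~　设计，诗歌与爱</description>
        <language>zh-cn</language>
        <lastBuildDate>Tue, 07 Feb 2023 14:39:24 GMT</lastBuildDate>
        <pubDate>Tue, 07 Feb 2023 14:39:24 GMT</pubDate>
        <ttl>60</ttl>
        <item>
          <title>专业精进之路：构建扎实而有实效的编程知识体系</title>
          <link>http://www.cnblogs.com/lovesqcc/archive/2023/02/01/17084292.html</link>
          <dc:creator>琴水玉</dc:creator>
          <author>琴水玉</author>
          <pubDate>Wed, 01 Feb 2023 14:05:00 GMT</pubDate>
          <guid>http://www.cnblogs.com/lovesqcc/archive/2023/02/01/17084292.html</guid>
          <description></description>
        </item>
        <item>
          <title>知乎一答：什么样的知识是有价值的</title>
          <link>http://www.cnblogs.com/lovesqcc/archive/2023/01/29/17072238.html</link>
          <dc:creator>琴水玉</dc:creator>
          <author>琴水玉</author>
          <pubDate>Sun, 29 Jan 2023 03:29:00 GMT</pubDate>
          <guid>http://www.cnblogs.com/lovesqcc/archive/2023/01/29/17072238.html</guid>
          <description></description>
        </item>
      </channel>
    </rss>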

Complete program

Run:

    go run blog_backup.go /tmp/cnblogs_backup.xml

or

    go build blog_backup.go
    ./blog_backup /tmp/cnblogs_backup.xml

blog_backup.go

    package main

    import (
        "encoding/xml"
        "errors"
        "fmt"
        "io"
        "net/http"
        "os"
        "strings"

        md "github.com/JohannesKaufmann/html-to-markdown"
        "github.com/PuerkitoBio/goquery"
    )

    // BlogRss maps the top-level <rss> element.
    type BlogRss struct {
        XMLName xml.Name     `xml:"rss"`
        Channel *BlogChannel `xml:"channel"`
    }

    // BlogChannel maps <channel>, which holds the blog metadata and the posts.
    type BlogChannel struct {
        XMLName       xml.Name   `xml:"channel"`
        Title         string     `xml:"title"`
        Link          string     `xml:"link"`
        Description   string     `xml:"description"`
        Language      string     `xml:"language"`
        LastBuildDate string     `xml:"lastBuildDate"`
        PubDate       string     `xml:"pubDate"`
        Ttl           int        `xml:"ttl"`
        Items         []BlogItem `xml:"item"`
    }

    // BlogItem maps one <item>, that is, one post.
    type BlogItem struct {
        Title       string `xml:"title"`
        Link        string `xml:"link"`
        Creator     string `xml:"creator"` // matches <dc:creator>: encoding/xml compares the local name
        Author      string `xml:"author"`
        PubDate     string `xml:"pubDate"`
        Guid        string `xml:"guid"` // exported, otherwise encoding/xml cannot set it
        Description string `xml:"description"`
    }

    // MarkdownFile is the title and link of one post to download.
    type MarkdownFile struct {
        Title string
        Link  string
    }

    // ReadXml reads the backup file at fpath and unmarshals it into a BlogRss.
    func ReadXml(fpath string) (*BlogRss, error) {
        fp, err := os.Open(fpath)
        if err != nil {
            fmt.Printf("error open file: %s error: %v\n", fpath, err)
            return nil, err
        }
        defer fp.Close()

        data, err := io.ReadAll(fp)
        if err != nil {
            fmt.Printf("error read file: %s error: %v\n", fpath, err)
            return nil, err
        }

        blogRss := BlogRss{}
        if err = xml.Unmarshal(data, &blogRss); err != nil {
            fmt.Printf("error unmarshal xml file: %s error: %v\n", fpath, err)
            return nil, err
        }
        if blogRss.Channel == nil {
            // Guard against a file without <channel> so that callers can dereference safely.
            return nil, errors.New("no channel element in " + fpath)
        }

        fmt.Printf("%+v\n", *blogRss.Channel)
        for _, item := range blogRss.Channel.Items {
            fmt.Printf("%+v\n", item)
        }
        return &blogRss, nil
    }

    // GetMdLinks returns the title and link of every post as a slice.
    func GetMdLinks(blogRss *BlogRss) []MarkdownFile {
        blogitems := blogRss.Channel.Items
        mdlinks := make([]MarkdownFile, 0)
        for _, item := range blogitems {
            mdlinks = append(mdlinks, MarkdownFile{Title: item.Title, Link: item.Link})
        }
        return mdlinks
    }

    // GetMdLinksPtr is the same as GetMdLinks but returns a pointer to the slice.
    func GetMdLinksPtr(blogRss *BlogRss) *[]MarkdownFile {
        blogitems := blogRss.Channel.Items
        mdlinks := make([]MarkdownFile, 0)
        for _, item := range blogitems {
            mdlinks = append(mdlinks, MarkdownFile{Title: item.Title, Link: item.Link})
        }
        return &mdlinks
    }

    // WriteMarkdown downloads the post, extracts its body and saves it as <title>.md.
    func WriteMarkdown(mdlink MarkdownFile) {
        urllink := mdlink.Link
        filename := mdlink.Title

        resp, err := http.Get(urllink)
        if err != nil {
            fmt.Printf("error get url: %s error: %v\n", urllink, err)
            return // resp is nil here, so stop before using it
        }
        defer resp.Body.Close()

        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            fmt.Printf("error parse html: %v\n", err)
            return
        }

        // Html keeps the markup of the post body. Text would strip it and
        // leave nothing for the converter to turn into Markdown.
        postbody, err := doc.Find("#cnblogs_post_body").Html()
        if err != nil {
            fmt.Printf("error extract post body: %v\n", err)
            return
        }

        converter := md.NewConverter("", true, nil)
        markdown, err := converter.ConvertString(postbody)
        if err != nil {
            fmt.Printf("error convert html: %v\n", err)
            return
        }

        if err := os.WriteFile(filename+".md", []byte(markdown), 0666); err != nil {
            fmt.Printf("error write file: %s.md error: %v\n", filename, err)
            return
        }
        fmt.Printf("saved %s.md\n", filename)
    }

    // SequentialRunMultiple processes each backup file in turn.
    func SequentialRunMultiple(files []string) {
        for _, f := range files {
            SequentialRun2(f)
        }
    }

    // SequentialRun chains ReadXml, GetMdLinks and WriteMarkdown for one file.
    func SequentialRun(fpath string) {
        blogRssp, err := ReadXml(fpath)
        if err != nil {
            os.Exit(2)
        }
        mdlinks := GetMdLinks(blogRssp)
        for _, mdlink := range mdlinks {
            if strings.TrimSpace(mdlink.Link) == "" {
                continue
            }
            WriteMarkdown(mdlink)
        }
    }

    // SequentialRun2 is the same flow, walking the slice through a pointer by index.
    func SequentialRun2(fpath string) {
        blogRssp, err := ReadXml(fpath)
        if err != nil {
            os.Exit(2)
        }
        mdlinksptr := GetMdLinksPtr(blogRssp)
        for i := 0; i < len(*mdlinksptr); i++ {
            if strings.TrimSpace((*mdlinksptr)[i].Link) == "" {
                continue
            }
            WriteMarkdown((*mdlinksptr)[i])
        }
    }

    // GetFiles returns the non-empty command-line arguments as file paths.
    func GetFiles() []string {
        fpaths := make([]string, 0)
        for _, arg := range os.Args[1:] {
            argtrimed := strings.TrimSpace(arg)
            if argtrimed == "" {
                continue
            }
            fpaths = append(fpaths, argtrimed)
        }
        fmt.Println(cap(fpaths))
        fmt.Println(fpaths)
        return fpaths
    }

    func main() {
        SequentialRunMultiple(GetFiles())
    }

GetFiles

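GetFiles collects the blog backup files from the command-line arguments. It accepts more than one file, and it is also handy for testing, for example by passing a path that does not exist.

There is an idiom here, and picking up idioms is how a beginner becomes productive in a new language quickly. range yields an index and a value; _ says the index is not needed and discards it. The function also prints what it collected: logging out of habit makes troubleshooting much faster.

A word of advice for beginners: a few lines of logging are cheap, but your time is precious.

    for _, arg := range os.Args[1:] {}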
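To make the trimming and filtering easy to test without touching the real command line, the loop can be moved into a function that takes the argument slice as a parameter. The sketch below is one way to do that; cleanArgs is a hypothetical name and is not part of the original program.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cleanArgs is a hypothetical helper: it trims every argument and drops
// the ones that are empty after trimming.
func cleanArgs(args []string) []string {
	paths := make([]string, 0, len(args))
	for _, arg := range args { // _ discards the index that range also yields
		trimmed := strings.TrimSpace(arg)
		if trimmed == "" {
			continue
		}
		paths = append(paths, trimmed)
	}
	return paths
}

func main() {
	// os.Args[0] is the program name, so the real arguments start at index 1.
	fmt.Println(cleanArgs(os.Args[1:]))
}
```

With this split, a test can call cleanArgs with any slice it likes, including empty strings or paths that do not exist.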

ReadXml

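The XML parsing lives in ReadXml. The key point is to define structs that mirror the structure of the XML content.

The main points:

The XMLName xml.Name field names the top-level element that wraps the whole document (not counting the <?xml ...?> declaration). The top-level element of this file is rss, so the struct declares XMLName xml.Name `xml:"rss"`. The fields below it describe the elements nested inside.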

    type BlogRss struct {
        XMLName xml.Name     `xml:"rss"`
        Channel *BlogChannel `xml:"channel"`
    }

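Define nested structs that follow the nesting of the content: BlogRss => BlogChannel => BlogItem. Each struct field corresponds to a child element. Field names start with an uppercase letter so that they are exported; encoding/xml can only fill exported fields.
Write `xml:"item"`, not `xml:item`. The tag value must be wrapped in double quotes, otherwise the tag is ignored and the elements are not parsed.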
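The effect of the missing quotes can be checked in isolation. The following is a minimal sketch, assuming a tiny made-up document that only mirrors the rss > channel > item nesting; the type names and the sample string are invented for this illustration. One struct uses a quoted tag, the other repeats the unquoted mistake on purpose.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sample is invented input for this sketch, not the real backup file.
const sample = `<rss><channel><title>demo</title>
<item><title>first</title></item>
<item><title>second</title></item>
</channel></rss>`

type item struct {
	Title string `xml:"title"`
}

type goodChannel struct {
	Items []item `xml:"item"` // quoted: maps to <item>
}

type badChannel struct {
	Items []item `xml:item` // malformed on purpose: the value is not quoted
}

type goodRss struct {
	XMLName xml.Name    `xml:"rss"`
	Channel goodChannel `xml:"channel"`
}

type badRss struct {
	XMLName xml.Name   `xml:"rss"`
	Channel badChannel `xml:"channel"`
}

// countItems parses the same data with both struct sets and reports how
// many items each one picked up.
func countItems(data string) (good, bad int, err error) {
	var g goodRss
	if err = xml.Unmarshal([]byte(data), &g); err != nil {
		return 0, 0, err
	}
	var b badRss
	if err = xml.Unmarshal([]byte(data), &b); err != nil {
		return 0, 0, err
	}
	return len(g.Channel.Items), len(b.Channel.Items), nil
}

func main() {
	good, bad, err := countItems(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("quoted tag: %d items, unquoted tag: %d items\n", good, bad)
}
```

With the unquoted tag no error is reported at all; the slice simply stays empty, which is why this mistake is easy to miss. Running go vet on such a file flags the malformed tag.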

GetMdLinks

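This is mostly a loop over a slice of structs. Passing or returning a slice copies only its small header (pointer, length, capacity), not the elements; with a large number of posts, the cost that adds up is the per-element struct copy made by for _, item := range. As an alternative, the program also provides pointer-based variants: GetMdLinksPtr, which returns a pointer to the slice, and SequentialRun2, which walks it by index.

Note that Go pointers do not support pointer arithmetic, and range cannot be applied to a pointer to a slice directly. The slice has to be dereferenced first, either as range *mdlinksptr or, as SequentialRun2 does, by indexing (*mdlinksptr)[i] in a classic for loop.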
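The difference between a slice value and a pointer to a slice can be seen in a few lines. This is a small sketch with invented function names (titlesByValue, titlesByPointer, rename) and placeholder URLs; it only illustrates the language behavior and is not taken from the program above.

```go
package main

import "fmt"

type MarkdownFile struct {
	Title string
	Link  string
}

// titlesByValue receives a copy of the slice header only; the elements
// still live in the caller's backing array.
func titlesByValue(files []MarkdownFile) []string {
	titles := make([]string, 0, len(files))
	for i := range files { // index-only range avoids copying each struct
		titles = append(titles, files[i].Title)
	}
	return titles
}

// titlesByPointer shows that a *[]T must be dereferenced before range.
func titlesByPointer(files *[]MarkdownFile) []string {
	titles := make([]string, 0, len(*files))
	for _, f := range *files {
		titles = append(titles, f.Title)
	}
	return titles
}

// rename changes the caller's element even though the slice is passed by
// value, because both headers point at the same backing array.
func rename(files []MarkdownFile) {
	files[0].Title = "renamed"
}

func main() {
	files := []MarkdownFile{
		{Title: "a", Link: "http://example.com/a"},
		{Title: "b", Link: "http://example.com/b"},
	}
	fmt.Println(titlesByValue(files))
	fmt.Println(titlesByPointer(&files))
	rename(files)
	fmt.Println(files[0].Title)
}
```

A pointer to a slice is mainly needed when the callee has to change the caller's length or capacity, for example by appending; for read-only traversal a plain slice is usually enough.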

WriteMarkdown

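This function downloads the HTML page at the given link, uses goquery to pick out the HTML block that holds the post content, and then uses html-to-markdown to produce the Markdown file.

It was basically assembled from related programs found by searching online. So the ability to search for knowledge and information still matters a lot.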
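One detail worth guarding against: WriteMarkdown uses the post title directly as the file name, so a title that contains a path separator would make the write fail or land in another directory. The sketch below shows one possible safeguard using only the standard library. safeFileName is a hypothetical helper, and the list of replaced characters is an assumption about what common file systems reject, not an exhaustive rule.

```go
package main

import (
	"fmt"
	"strings"
)

// safeFileName is a hypothetical helper, not part of the original program.
// It replaces path separators and a few characters that common file
// systems reject, then appends the .md extension.
func safeFileName(title string) string {
	replacer := strings.NewReplacer(
		"/", "_", "\\", "_", ":", "_", "*", "_", "?", "_",
		"\"", "_", "<", "_", ">", "_", "|", "_",
	)
	name := strings.TrimSpace(replacer.Replace(title))
	if name == "" {
		name = "untitled" // fall back when the title is empty or blank
	}
	return name + ".md"
}

func main() {
	fmt.Println(safeFileName("TCP/IP notes"))
	fmt.Println(safeFileName("   "))
}
```

In WriteMarkdown, the result would take the place of filename + ".md" when writing the file.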

SequentialRun

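This function chains the logical units above into a complete flow.

It shows the benefit of modular code. When each logical unit is short and orthogonal to the others, the whole flow reads clearly. For beginners this matters even more.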

Errors and fixes

Mistakes Go beginners tend to make

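Writing links = xxx out of habit. Error: undefined: links. It should be links := xxx. In Go, := declares and initializes a new variable; = only assigns to a variable that already exists.
Unused variable. Error: b declared and not used. Go does not allow a local variable to be declared and then left unused; it is a compile error.
Type written in the wrong order. Error: syntax error: unexpected ], expected operand. The cause was writing make(string[], 32) instead of make([]string, 32). In Go the brackets come before the element type, and in a declaration the name comes before the type.
Capitalization controls visibility. An identifier (variable, function, method) that starts with an uppercase letter is exported from its package; a lowercase one is visible only inside the package. People used to C or Java need some time to adjust.
Dropping a return value: append(links, item.Link). Error: append(links, item.Link) (value of type []string) is not used. append returns a slice, and here that slice is thrown away. It should be links = append(links, item.Link).
Unused import. Error: "github.com/JohannesKaufmann/html-to-markdown" imported as md and not used. Go does not allow importing a package that is never used.
Assignment count mismatch. Error: assignment mismatch: 1 variable but converter.ConvertString returns 2 values. Many Go functions return more than one value because the last one carries the error. It should be markdown, err := converter.ConvertString(postbody.Text()).
Wrong parameter declaration in a function signature. For example, func SequentialRunMultiple(files []string) written as func SequentialRunMultiple(files), or as func SequentialRunMultiple(string[] files), or as func SequentialRunMultiple([]string files). I have made every one of these mistakes.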
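For reference, the correct forms of several items from this list fit into one short program. This is a sketch with invented names (collectLinks, firstLink) and placeholder URLs, meant only to show the syntax side by side.

```go
package main

import (
	"errors"
	"fmt"
)

// collectLinks: in a signature the parameter name comes first, then the
// type, and a slice type is written []string.
func collectLinks(items []string) []string {
	// := declares and initializes a new variable.
	// Length 0 with capacity 32: make([]string, 32) would instead create
	// 32 empty strings that append then adds to.
	links := make([]string, 0, 32)
	for _, item := range items {
		links = append(links, item) // keep the slice that append returns
	}
	return links
}

// firstLink returns a result and an error, the usual Go convention.
func firstLink(links []string) (string, error) {
	if len(links) == 0 {
		return "", errors.New("no links")
	}
	return links[0], nil
}

func main() {
	var count int // plain declaration, zero value 0
	links := collectLinks([]string{"http://example.com/a", "http://example.com/b"})
	count = len(links) // = assigns to a variable that already exists

	first, err := firstLink(links) // both return values must be received
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(count, first)
}
```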

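Go's syntax is something of a "hodgepodge". It blends the strengths of many languages, but it feels inconsistent. To keep compilation simple, a lot of the burden is shifted onto the developer, even if some of these choices are well intentioned. As a developer of many years who knows several programming languages, I still find it awkward and easy to get wrong, and it takes some time to adjust. That friction slows down the adoption of the language. If an easier language comes along, Go, despite its head start and despite being built by renowned engineers, may end up as a niche language.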

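Dependency errors and fixes

Compared with Python, Go's dependency management is also unsatisfying. With Python, once pip is installed, pip install makes a module usable right away. Go, on the other hand, reports all kinds of problems, and the error messages are unfriendly: you have to dig into the details, which is hard on beginners.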

After importing goquery, running go run blog_backup.go fails:

    go run blog_backup.go
    blog_backup.go:9:5: no required module provides package github.com/PuerkitoBio/goquery: go.mod file not found in current directory or any parent directory; see 'go help modules'

Installing goquery with the following command also fails:

    ➜  basic go get -u github.com/PuerkitoBio/goquery
    go: go.mod file not found in current directory or any parent directory.
    'go get' is no longer supported outside a module.
    To build and install a command, use 'go install' with a version,
    like 'go install example.com/cmd@latest'
    For more information, see https://golang.org/doc/go-get-install-deprecation
    or run 'go help get' or 'go help install'.

The next attempt fails as well:

    ➜  basic go install github.com/PuerkitoBio/goquery@latest
    go: downloading github.com/PuerkitoBio/goquery v1.8.1
    go: downloading github.com/andybalholm/cascadia v1.3.1
    go: downloading golang.org/x/net v0.7.0
    package github.com/PuerkitoBio/goquery is not a main package

Fix: go back to the root directory of the Go project, gostudy, and run:

    go mod init gostudy
    go get github.com/PuerkitoBio/goquery
    go get github.com/JohannesKaufmann/html-to-markdown

