A Shallow Dive into PDFium: A First Look at the PDFium Open-Source Project and the PDF File Format

2020-10-18 天下第九九八十一

PDFium

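PDFium compiles very quickly. Come on, I thought: even if I can't manage Chromium, surely I can build PDFium?

As it turned out... building PDFium also requires depot_tools. With a proxy configured for git, depot_tools can be cloned, but it will not run: gclient hangs for ages doing nothing and finally fails with an error.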

C:\tmp\chromium\depot_tools>netsh
netsh>netsh
The following command was not found: netsh.
netsh>winhttp
netsh winhttp>set proxy localhost:10434

Current WinHTTP proxy settings:

    Proxy Server(s) :  localhost:10434
    Bypass List     :  (none)

netsh winhttp>set NO_AUTH_BOTO_CONFIG=C:\tmp\chromium\depot_tools\boto.cfg
The following command was not found: set NO_AUTH_BOTO_CONFIG=C:\tmp\chromium\depot_tools\boto.cfg.
netsh winhttp>
C:\tmp\chromium\depot_tools>set NO_AUTH_BOTO_CONFIG=C:\tmp\chromium\depot_tools\boto.cfg

C:\tmp\chromium\depot_tools>set http_proxy=http://localhost:10434

C:\tmp\chromium\depot_tools>
C:\tmp\chromium\depot_tools>set https_proxy=http://localhost:10434

C:\tmp\chromium\depot_tools>git config --global http.proxy http://localhost:10434

C:\tmp\chromium\depot_tools>git config --global https.proxy http://localhost:10434

C:\tmp\chromium\depot_tools>gclient config --unmanaged https://pdfium.googlesource.com/pdfium.git

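…… it hangs for a long time and eventually fails.

So I had to settle for second best and build an old version of the code, following the method described in this article: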

https://zhuanlan.zhihu.com/p/37729756

The only resources needed are two old repositories on GitHub:

https://github.com/PDFium/PDFium
https://github.com/bnoordhuis/gyp (a Python-based tool for generating Visual Studio projects; extract it to PDFium/build/gyp)

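Build process:

1. Generate the Visual Studio project.

Follow the article: edit pdfium.gyp and remove v8.
Install Python 2.7, and make sure that typing python on the command line launches Python 2.7 rather than Python 3.
Open a command prompt, change to the build directory, and run: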

python gyp_pdfium.py

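Once it has run, it generates the project files that match the current platform. ……

2. Build and run the project

Follow the article: comment out some JavaScript-related code, such as the GetJSRuntimeFactory initialization, the corresponding release call, and so on.

After that, building is straightforward. Then set test as the startup project, enter --bmp followed by the full path of a PDF file as the debug arguments, and press F5: it runs! The sample is very simple; it renders the PDF and writes it out as BMP images.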

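The PDF file format

Compared with Markdown, PDF sits much closer to the printing stage; it can be seen as an intermediate form for print, or as an interchange file between editors. Markdown is a plain-text markup language, and a Word file, once unzipped, has its own markup format. So does PDF also contain this kind of textual markup? The answer is yes.

A quick test with Apache PDFBox: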

        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.21</version>
        </dependency>

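A quick gripe about Maven: many things can leave Maven Import spinning without ever importing the dependency you want. In that case, I suggest creating a new project in IDEA, right-clicking it and choosing Add Framework Support to put it under Maven management, and finally inserting the dependency above under <project><dependencies> in the pom file.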

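        import java.io.ByteArrayOutputStream;
        import java.io.File;
        import java.io.IOException;
        import java.io.InputStream;

        import org.apache.pdfbox.cos.COSDictionary;
        import org.apache.pdfbox.pdmodel.PDDocument;
        import org.apache.pdfbox.pdmodel.PDPage;
        import org.apache.pdfbox.pdmodel.PDResources;

        public class DumpContentStreams {
            public static void main(String[] args) throws IOException {
                // try-with-resources closes the document even if reading fails
                try (PDDocument document = PDDocument.load(new File("D:\\PDFJsAnnot\\sample.pdf"))) {
                    for (PDPage page : document.getDocumentCatalog().getPages()) {
                        // Read the page's content stream fully into memory
                        ByteArrayOutputStream out = new ByteArrayOutputStream();
                        try (InputStream cin = page.getContents()) {
                            byte[] buffer = new byte[1024];
                            int len;
                            // read() returns -1 at end of stream
                            while ((len = cin.read(buffer)) != -1) {
                                out.write(buffer, 0, len);
                            }
                        }
                        // Print the raw content-stream operators as text
                        System.out.println(new String(out.toByteArray()));

                        // Dump the page's resource dictionary (fonts, images, ...)
                        PDResources resources = page.getResources();
                        if (resources != null) {
                            COSDictionary pageResources = resources.getCOSObject();
                            System.out.println(pageResources);
                        }
                    }
                }
            }
        }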

Output:

1 0 0 1 81.96 735.02 cm
1 1 1 1 k 1 1 1 1 K
1 0 0 1 388.54 0 cm
1 1 1 1 k 1 1 1 1 K
1 0 0 1 -298.27 -236.43 cm
1 1 1 1 k 1 1 1 1 K
1 1 1 1 k 1 1 1 1 K
1 0 0 1 5.86 0 cm
BT
/F25 20.66 Tf 0 0 Td[(Sample)-250(PDF)-250(Document)]TJ/F26 14.35 Tf 57.7 -53.8 Td[(Robert)-249(Maron)]TJ -19.8 -17.93 Td[(Grze)16(gorz)-250(Grudzi)]TJ 97.78 0.14 Td[(�)]TJ -1.21 -0.14 Td[(nski)]TJ -89.12 -34.04 Td[(February)-249(20,)-250(1999)]TJ
ET
1 0 0 1 -96.13 -403.05 cm
1 1 1 1 k 1 1 1 1 K
1 0 0 1 388.54 0 cm
1 1 1 1 k 1 1 1 1 K

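It is easy to see that a page's content consists of matrix-like lines at either end wrapping a text object that starts with BT (begin text) and ends with ET (end text), and that many such structures in sequence make up the page. Inside each text object are many text runs bracketed by Td and TJ, of the form Td[(word)-gap(word)-gap(word)]TJ, and there appear to be no space or tab characters at all. In PDF terms, cm concatenates a transformation matrix, k and K set the CMYK fill and stroke colors, Tf selects the font and size, Td moves the text position, and TJ shows strings with position adjustments between them expressed in thousandths of a text-space unit, which is why the gaps between words show up as numbers such as -250 rather than as space characters.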
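To make the Td[...]TJ structure concrete, here is a minimal sketch that turns one TJ array back into readable text. It is not how PDFium or PDFBox does it: the class name TjArrayDecoder, the decode method, and the -200 threshold are all assumptions made for illustration, and escaped or nested parentheses inside strings are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TjArrayDecoder {
    // Hypothetical threshold: an adjustment at or below this value (in
    // thousandths of a text-space unit) is treated as a gap between words.
    // A real extractor would compare against the font's space width instead.
    static final double WORD_GAP = -200.0;

    // Matches either a literal string "(...)" or a number in a TJ array.
    // Simplification: escaped or nested parentheses are not supported.
    private static final Pattern ELEMENT =
            Pattern.compile("\\(([^()]*)\\)|(-?\\d+(?:\\.\\d+)?)");

    public static String decode(String tjArray) {
        StringBuilder text = new StringBuilder();
        Matcher m = ELEMENT.matcher(tjArray);
        while (m.find()) {
            if (m.group(1) != null) {
                // String operand: the characters to show
                text.append(m.group(1));
            } else if (Double.parseDouble(m.group(2)) <= WORD_GAP) {
                // Large negative adjustment: rendered as a visible gap
                text.append(' ');
            }
            // Smaller adjustments are kerning and are ignored here
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Arrays copied from the content stream shown above
        System.out.println(decode("[(Sample)-250(PDF)-250(Document)]"));
        System.out.println(decode("[(Grze)16(gorz)-250(Grudzi)]"));
    }
}
```

With the arrays from the sample output, the -250 entries become spaces, while the small 16 inside (Grze)16(gorz) is treated as kerning and the two pieces are joined into one word.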

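At this point I wondered: since it is nothing more than coordinates plus character data, could I set PDFium aside and render PDFs myself? So I ran a simple test in my own Android player project: a FrameLayout holding 50 text-drawing views, each View drawing 10*10 characters, for 5000 characters per page in total. Zooming and panning the page turned out to be very laggy, and after comparing it with PDFium I dropped the idea without hesitation. This suggests that PDFium does have specialized text rendering techniques of its own.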
