php - How do I extract text content from a Word document with PHP?

coder 2023-06-14 (original post)

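I want to extract the text content from a Word document using PHP.

I created a new Word document in Microsoft Word for Mac 2011. Edit: I also tested with the same document created in Microsoft Word on Windows 7.

The content of the document is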

The quick brown fox jumps over the lazy dog

I saved it to disk as a Word 97-2004 document (.doc).

I am using phpoffice/phpword and this code to extract the text:

<?php

$source = "word.doc";

$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

$text = '';

$sections = $phpWord->getSections();

foreach ($sections as $s) {
    $els = $s->getElements();
    foreach ($els as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\Text') {
            $text .= $e->getText();
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Element\TextBreak') {
            $text .= " \n";
        } else {
            throw new Exception('Unknown class type ' . get_class($e));
        }
    }
}

print $text;

The output of this code is only part of the text:

The quick brown fox j

Is there a problem with the code, or is this some kind of compatibility issue?

Edit:

If I add a var_dump($els); before foreach ($els as $e) {, the output is:

array(1) {
  [0]=>
  object(PhpOffice\PhpWord\Element\Text)#1265 (14) {
    ["text":protected]=>
    string(21) "The quick brown fox j"
    ["fontStyle":protected]=>
    object(PhpOffice\PhpWord\Style\Font)#1267 (25) {
      ["aliases":protected]=>
      array(1) {
        ["line-height"]=>
        string(10) "lineHeight"
      }
      ["type":"PhpOffice\PhpWord\Style\Font":private]=>
      string(4) "text"
      ["name":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["hint":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["size":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["color":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["bold":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["italic":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["underline":"PhpOffice\PhpWord\Style\Font":private]=>
      string(4) "none"
      ["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["scale":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
      object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
        ["aliases":protected]=>
        array(1) {
          ["line-height"]=>
          string(10) "lineHeight"
        }
        ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(6) "Normal"
        ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(0) ""
        ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(true)
        ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        int(0)
        ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        array(0) {
        }
        ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["borderTopSize":protected]=>
        NULL
        ["borderTopColor":protected]=>
        NULL
        ["borderLeftSize":protected]=>
        NULL
        ["borderLeftColor":protected]=>
        NULL
        ["borderRightSize":protected]=>
        NULL
        ["borderRightColor":protected]=>
        NULL
        ["borderBottomSize":protected]=>
        NULL
        ["borderBottomColor":protected]=>
        NULL
        ["styleName":protected]=>
        NULL
        ["index":protected]=>
        NULL
        ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
        bool(false)
      }
      ["shading":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["rtl":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["paragraphStyle":protected]=>
    object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
      ["aliases":protected]=>
      array(1) {
        ["line-height"]=>
        string(10) "lineHeight"
      }
      ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      string(6) "Normal"
      ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      string(0) ""
      ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(true)
      ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      int(0)
      ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      array(0) {
      }
      ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["borderTopSize":protected]=>
      NULL
      ["borderTopColor":protected]=>
      NULL
      ["borderLeftSize":protected]=>
      NULL
      ["borderLeftColor":protected]=>
      NULL
      ["borderRightSize":protected]=>
      NULL
      ["borderRightColor":protected]=>
      NULL
      ["borderBottomSize":protected]=>
      NULL
      ["borderBottomColor":protected]=>
      NULL
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["phpWord":protected]=>
    object(PhpOffice\PhpWord\PhpWord)#1247 (3) {
      ["sections":"PhpOffice\PhpWord\PhpWord":private]=>
      array(1) {
        [0]=>
        object(PhpOffice\PhpWord\Element\Section)#1261 (16) {
          ["container":protected]=>
          string(7) "Section"
          ["style":"PhpOffice\PhpWord\Element\Section":private]=>
          object(PhpOffice\PhpWord\Style\Section)#1262 (28) {
            ["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
            string(8) "portrait"
            ["paper":"PhpOffice\PhpWord\Style\Section":private]=>
            object(PhpOffice\PhpWord\Style\Paper)#1263 (8) {
              ["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
              array(6) {
                ["A3"]=>
                array(3) {
                  [0]=>
                  int(297)
                  [1]=>
                  int(420)
                  [2]=>
                  string(2) "mm"
                }
                ["A4"]=>
                array(3) {
                  [0]=>
                  int(210)
                  [1]=>
                  int(297)
                  [2]=>
                  string(2) "mm"
                }
                ["A5"]=>
                array(3) {
                  [0]=>
                  int(148)
                  [1]=>
                  int(210)
                  [2]=>
                  string(2) "mm"
                }
                ["Folio"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(13)
                  [2]=>
                  string(2) "in"
                }
                ["Legal"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(14)
                  [2]=>
                  string(2) "in"
                }
                ["Letter"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(11)
                  [2]=>
                  string(2) "in"
                }
              }
              ["size":"PhpOffice\PhpWord\Style\Paper":private]=>
              string(2) "A4"
              ["width":"PhpOffice\PhpWord\Style\Paper":private]=>
              int(11870)
              ["height":"PhpOffice\PhpWord\Style\Paper":private]=>
              int(16787)
              ["styleName":protected]=>
              NULL
              ["index":protected]=>
              NULL
              ["aliases":protected]=>
              array(0) {
              }
              ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
              bool(false)
            }
            ["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
            int(11906)
            ["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
            int(16838)
            ["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
            int(0)
            ["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1)
            ["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["borderTopSize":protected]=>
            NULL
            ["borderTopColor":protected]=>
            NULL
            ["borderLeftSize":protected]=>
            NULL
            ["borderLeftColor":protected]=>
            NULL
            ["borderRightSize":protected]=>
            NULL
            ["borderRightColor":protected]=>
            NULL
            ["borderBottomSize":protected]=>
            NULL
            ["borderBottomColor":protected]=>
            NULL
            ["styleName":protected]=>
            NULL
            ["index":protected]=>
            NULL
            ["aliases":protected]=>
            array(0) {
            }
            ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
            bool(false)
          }
          ["headers":"PhpOffice\PhpWord\Element\Section":private]=>
          array(0) {
          }
          ["footers":"PhpOffice\PhpWord\Element\Section":private]=>
          array(0) {
          }
          ["elements":protected]=>
          array(1) {
            [0]=>
            *RECURSION*
          }
          ["phpWord":protected]=>
          *RECURSION*
          ["sectionId":protected]=>
          int(1)
          ["docPart":protected]=>
          string(7) "Section"
          ["docPartId":protected]=>
          int(1)
          ["elementIndex":protected]=>
          int(1)
          ["elementId":protected]=>
          NULL
          ["relationId":protected]=>
          NULL
          ["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
          int(0)
          ["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
          NULL
          ["mediaRelation":protected]=>
          bool(false)
          ["collectionRelation":protected]=>
          bool(false)
        }
      }
      ["collections":"PhpOffice\PhpWord\PhpWord":private]=>
      array(5) {
        ["Bookmarks"]=>
        object(PhpOffice\PhpWord\Collection\Bookmarks)#1248 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Titles"]=>
        object(PhpOffice\PhpWord\Collection\Titles)#1249 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Footnotes"]=>
        object(PhpOffice\PhpWord\Collection\Footnotes)#1250 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Endnotes"]=>
        object(PhpOffice\PhpWord\Collection\Endnotes)#1251 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Charts"]=>
        object(PhpOffice\PhpWord\Collection\Charts)#1252 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
      }
      ["metadata":"PhpOffice\PhpWord\PhpWord":private]=>
      array(3) {
        ["DocInfo"]=>
        object(PhpOffice\PhpWord\Metadata\DocInfo)#1253 (12) {
          ["creator":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["lastModifiedBy":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["created":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          int(1483515248)
          ["modified":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          int(1483515248)
          ["title":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["description":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["subject":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["keywords":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["category":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["company":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["manager":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["customProperties":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          array(0) {
          }
        }
        ["Protection"]=>
        object(PhpOffice\PhpWord\Metadata\Protection)#1254 (1) {
          ["editing":"PhpOffice\PhpWord\Metadata\Protection":private]=>
          NULL
        }
        ["Compatibility"]=>
        object(PhpOffice\PhpWord\Metadata\Compatibility)#1255 (1) {
          ["ooxmlVersion":"PhpOffice\PhpWord\Metadata\Compatibility":private]=>
          int(12)
        }
      }
    }
    ["sectionId":protected]=>
    NULL
    ["docPart":protected]=>
    string(7) "Section"
    ["docPartId":protected]=>
    int(1)
    ["elementIndex":protected]=>
    int(1)
    ["elementId":protected]=>
    string(6) "5d531b"
    ["relationId":protected]=>
    NULL
    ["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
    int(0)
    ["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
    string(7) "Section"
    ["mediaRelation":protected]=>
    bool(false)
    ["collectionRelation":protected]=>
    bool(false)
  }
}

Best answer

Try creating your reader first:

$source = "word.doc";
// create your reader object
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
// read source
if ($phpWordReader->canRead($source)) {
    $phpWord = $phpWordReader->load($source);
    ... // rest of your code
}

This answer is based on this example and the API documentation.
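The "rest of your code" step can be sketched as a small recursive helper. This is only a sketch: extractText is a hypothetical name, not part of PHPWord. It assumes that container elements (sections, text runs) expose getElements() and that text elements expose getText(), as the question's code and var_dump suggest. Unknown element types are skipped instead of throwing, and tables are not handled. It does not by itself explain or fix the truncated output.

```php
<?php
// Hypothetical helper (not part of PHPWord): walks an element tree and
// concatenates the text it finds, using duck typing instead of get_class().
function extractText($element): string
{
    // A text break becomes a newline. instanceof simply yields false
    // when the PHPWord class is not loaded.
    if ($element instanceof \PhpOffice\PhpWord\Element\TextBreak) {
        return "\n";
    }

    // Containers (sections, text runs, ...) are assumed to expose getElements().
    if (method_exists($element, 'getElements')) {
        $text = '';
        foreach ($element->getElements() as $child) {
            $text .= extractText($child);
        }
        return $text;
    }

    // Leaf elements such as Text are assumed to expose getText();
    // anything that is not a plain string is ignored.
    if (method_exists($element, 'getText')) {
        $value = $element->getText();
        return is_string($value) ? $value : '';
    }

    // Unknown element types are skipped rather than raising an exception.
    return '';
}

// Usage with PHPWord (assumes Composer's autoloader has been required):
//   $reader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
//   if ($reader->canRead('word.doc')) {
//       foreach ($reader->load('word.doc')->getSections() as $section) {
//           print extractText($section);
//       }
//   }
```

Skipping unknown elements is a design choice for plain-text extraction; keep the exception from the question's code if you would rather be told about element types you have not handled.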

Regarding "php - How do I extract text content from a Word document with PHP?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41216935/
