excel - 通过 XML 读取 Word 文档的内容

coder 2024-07-01 原文

上下文

我正在尝试在 Excel 中构建一个 Word 文档浏览器来筛选大量文档(大约 1000 个)。

事实证明，打开 word 文档的过程相当缓慢(每个文档大约需要 4 秒，因此在这种情况下，查看所有项目需要 2 小时，这对于单个查询来说太慢了)，即使是禁用所有可能减慢打开速度的东西，因此我打开:

只读
没有打开和修复模式(这可能发生在某些文档上)
禁用文档的显示

到目前为止我的尝试

这些文档很难浏览，因为有些关键字每次都会出现，但不会出现在相同的上下文中(这不是问题的核心，因为我可以在将文本加载到数组中时处理它)。因此，经常使用的 Windows 资源管理器 解决方案(如 link 中的)不能用于我的情况。

目前，我设法拥有一个工作宏，通过打开它们来分析 word 文档的内容。

代码

这是代码示例。请注意，我使用了 Microsoft Word 14.0 Object Library 引用

' Analyzing all the word document within the same folder '
Sub extractFile()

Dim i As Long, j As Long
Dim sAnalyzedDoc As String, sLibName As String
Dim aOut()
Dim oWordApp As Word.Application
Dim oDoc As Word.Document

Set oWordApp = CreateObject("Word.Application")

sLibName = ThisWorkbook.Path & "\"
sAnalyzedDoc = Dir(sLibName)
sKeyword = "example of a word"

With Application
    .DisplayAlerts = False
    .ScreenUpdating = False
End With

ReDim aOut(2, 2)
aOut(1, 1) = "Document name"
aOut(2, 1) = "Text"


While (sAnalyzedDoc <> "")
    ' Analyzing documents only with the .doc and .docx extension '
    If Not InStr(sAnalyzedDoc, ".doc") = 0 Then
        ' Opening the document as mentionned above, in read only mode, without repair and invisible '
        Set oDoc = Word.Documents.Open(sLibName & "\" & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
        With oDoc
            For i = 1 To .Sentences.Count
                ' Searching for the keyword within the document '
                If Not InStr(LCase(.Sentences.Item(i)), LCase(sKeyword)) = 0 Then
                    If Not IsEmpty(aOut(1, 2)) Then
                        ReDim Preserve aOut(2, UBound(aOut, 2) + 1)
                    End If
                    aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
                    aOut(2, UBound(aOut, 2)) = .Sentences.Item(i)
                    GoTo closingDoc ' A dubious programming choice but that works for the moment '
                End If
            Next i
closingDoc:
            ' Intending to make the closing faster by not saving the document '
            .Close SaveChanges:=False
        End With
    End If
    'Moving on to the next document '
    sAnalyzedDoc = Dir
Wend

exitSub:
With Output
    .Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))) = aOut
End With

With Application
    .DisplayAlerts = True
    .ScreenUpdating = True
End With

End Sub

我的问题

我的想法是通过文档中的 XML 内容直接访问其内容(在较新版本的Word，带有 .zip 扩展名，并指向 nameOfDocument.zip\word\document.xml)。

这比加载所有在文本搜索中无用的图像、图表和表格要快得多。

因此，我想问问在 VBA 中是否有一种方法可以像 zip 文件一样打开 word 文档并访问该 XML 文档，然后像 VBA 中的普通字符串一样处理它，因为我已经有了上面代码给出的文件的路径和名称。

最佳答案

请注意，这不是上述问题的简单答案，只要您没有要浏览的文档负载，我最初问题中的唯一 VBA 代码就可以完美地完成工作，否则请继续另一个工具(有一个 Python Dynamic Link Library (DLL) 做得很好)。

好的，我会尽量让我的回答具有解释性。

首先，这个问题将我引向了 C# 和 XPath 中 XML 的无限旅程，我选择在某些时候不去追求它。

它将分析文件的时间从大约 2 小时减少到 10 秒。

上下文

读取 XML 文档以及内在 XML 文档的支柱是来自 Microsoft 的 OpenXML 库。请记住我上面所说的，我试图实现的方法不能仅在 VBA 中完成，因此必须以其他方式完成。这可能是因为 VBA 是为 Office 实现的，因此在访问 Office 文档的核心结构方面受到限制，但我没有与此限制相关的信息(欢迎提供任何信息)。

我在这里给出的答案是为 VBA 编写一个 C# DLL。为了在 C# 中编写 DLL 并在 VBA 中引用它，我将您重定向到以下链接，该链接将以更好的方式恢复此特定过程:Tutorial for creating DLL in C#

开始吧

首先，您需要在项目中引用 WindowsBase 库和 DocumentFormat.OpenXML 以使解决方案按照这篇 MSDN 文章 Manipulate Office Open XML Formats Documents 中的说明工作。还有那个Open and add text to a word processing document (Open XML SDK) 这些文章广泛地解释了 OpenXML 库如何处理 word 文档。

C#代码

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.IO.Packaging;

namespace BrowserClass
{

    public class SpecificDirectory
    {

        public string[,] LookUpWord(string nameKeyword, string nameStopword, string nameDirectory)
        {
            string sKeyWord = nameKeyword;
            string sStopWord = nameStopword;
            string sDirectory = nameDirectory;

            sStopWord = sStopWord.ToLower();
            sKeyWord = sKeyWord.ToLower();

            string sDocPath = Path.GetDirectoryName(sDirectory);
            // Looking for all the documents with the .docx extension
            string[] sDocName = Directory.GetFiles(sDocPath, "*.docx", SearchOption.AllDirectories);
            string[] sDocumentList = new string[1];
            string[] sDocumentText = new string[1];

            // Cycling the documents retrieved in the folder
            for (int i = 0; i < sDocName.Count(); i++)
            {
                string docWord = sDocName[i];

                // Opening the documents as read only, no need to edit them
                Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.Read);

                const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

                PackagePart corePart = null;
                Uri documentUri = null;

                // We are extracting the part with the document content within the files
                foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
                {
                    documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
                    corePart = officePackage.GetPart(documentUri);
                    break;
                }

                // Here enter the proper code
                if (corePart != null)
                {
                    string cpPropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
                    string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
                    string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";

                    // Construction of a namespace manager to handle the different parts of the xml files
                    NameTable nt = new NameTable();
                    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
                    nsmgr.AddNamespace("dc", dcPropertiesSchema);
                    nsmgr.AddNamespace("cp", cpPropertiesSchema);
                    nsmgr.AddNamespace("dcterms", dcTermsPropertiesSchema);

                    // Loading the xml document's text
                    XmlDocument doc = new XmlDocument(nt);
                    doc.Load(corePart.GetStream());

                    // I chose to directly load the inner text because I could not parse the way I wanted the document, but it works so far
                    string docInnerText = doc.DocumentElement.InnerText;
                    docInnerText = docInnerText.Replace("\\* MERGEFORMAT", ".");
                    docInnerText = docInnerText.Replace("DOCPROPERTY ", "");
                    docInnerText = docInnerText.Replace("Glossary.", "");

                    try
                    {
                        Int32 iPosKeyword = docInnerText.ToLower().IndexOf(sKeyWord);
                        Int32 iPosStopWord = docInnerText.ToLower().IndexOf(sStopWord);

                        if (iPosStopWord == -1)
                        {
                            iPosStopWord = docInnerText.Length;
                        }

                        if (iPosKeyword != -1 && iPosKeyword <= iPosStopWord)
                        {
                            // Redimensions the array if there was already a document loaded
                            if (sDocumentList[0] != null)
                            {
                                Array.Resize(ref sDocumentList, sDocumentList.Length + 1);
                                Array.Resize(ref sDocumentText, sDocumentText.Length + 1);
                            }
                            sDocumentList[sDocumentList.Length - 1] = docWord.Substring(sDocPath.Length, docWord.Length - sDocPath.Length);
                            // Taking the small context around the keyword
                            sDocumentText[sDocumentText.Length - 1] = ("(...) " + docInnerText.Substring(iPosKeyword, sKeyWord.Length + 60) + " (...)");
                        }

                    }
                    catch (ArgumentOutOfRangeException)
                    {
                        Console.WriteLine("Error reading inner text.");
                    }
                }
                // Closing the package to enable opening a document right after
                officePackage.Close();
            }

            if (sDocumentList[0] != null)
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[sDocumentList.Length, 2];

                for (int i = 0; i < sDocumentList.Length; i++)
                {
                    sFinalArray[i, 0] = sDocumentList[i].Replace("\\", "");
                    sFinalArray[i, 1] = sDocumentText[i];
                }
                return sFinalArray;
            }
            else 
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[1, 1];
                sFinalArray[0, 0] = "NO MATCH";
                return sFinalArray;
            }
        }
    }

}

关联的VBA代码

Option Explicit

Const sLibname As String = "C:\pathToYourDocuments\"

Sub tester()

Dim aFiles As Variant
Dim LookUpDir As BrowserClass.SpecificDirectory
Set LookUpDir = New BrowserClass.SpecificDirectory

' The array will contain all the files which contain the "searchedPhrase" '
aFiles = LookUpDir.LookUpWord("searchedPhrase", "stopWord", sLibname)

' Add here any necessary processing if needed '

End Sub

因此，最终您将获得一个扫描 .docx 文档的工具，该工具可以比 VBA 中经典的打开-读取-关闭方法更快地扫描，但代价是编写更多代码。

最重要的是，您为只想执行简单搜索的用户提供了一个简单的解决方案，尤其是在有大量 word 文档的情况下。

注意事项

正如@Mikegrann 所指出的，在 VBA 中解析 Word .XML 文件可能是一场噩梦。幸运的是 OpenXML 有一个 XML 解析器 C# , xml parsing. get data between tags这将在 C# 中为您完成工作并获取 <w:t></w:t>引用文档文本的标签。虽然到目前为止我找到了这些答案但无法使它们起作用: Parsing a MS Word generated XML file in C# , Reading specific XML elements from XML file

所以我选择了 .InnerText我在上面的代码中提供了解决方案，以访问内部文本，代价是输入一些格式化文本(如 \\MERGEFORMAT )。

关于excel - 通过 XML 读取 Word 文档的内容，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/39125601/

excel Word 34 string the xml vba ms-word

有关excel - 通过 XML 读取 Word 文档的内容的更多相关文章

ruby - 如何将脚本文件的末尾读取为数据文件(Perl 或任何其他语言) - 2
我正在寻找执行以下操作的正确语法(在Perl、Shell或Ruby中):#variabletoaccessthedatalinesappendedasafileEND_OF_SCRIPT_MARKERrawdatastartshereanditcontinues. 最佳答案 Perl用__DATA__做这个:#!/usr/bin/perlusestrict;usewarnings;while(){print;}__DATA__Texttoprintgoeshere 关于ruby-如何将脚
ruby - 通过 rvm 升级 rubygems 的问题 - 2
尝试通过RVM将RubyGems升级到版本1.8.10并出现此错误:$rvmrubygemslatestRemovingoldRubygemsfiles...Installingrubygems-1.8.10forruby-1.9.2-p180...ERROR:Errorrunning'GEM_PATH="/Users/foo/.rvm/gems/ruby-1.9.2-p180:/Users/foo/.rvm/gems/ruby-1.9.2-p180@global:/Users/foo/.rvm/gems/ruby-1.9.2-p180:/Users/foo/.rvm/gems/rub
ruby - 将数组的内容转换为 int - 2
我需要读入一个包含数字列表的文件。此代码读取文件并将其放入二维数组中。现在我需要获取数组中所有数字的平均值，但我需要将数组的内容更改为int。有什么想法可以将to_i方法放在哪里吗？ClassTerraindefinitializefile_name@input=IO.readlines(file_name)#readinfile@size=@input[0].to_i@land=[@size]x=1whilex 最佳答案只需将数组映射为整数:@land边注如果你想得到一条线的平均值，你可以这样做:values=@input[x]
ruby-on-rails - 如何从 format.xml 中删除 <hash></hash> - 2
我有一个对象has_many应呈现为xml的子对象。这不是问题。我的问题是我创建了一个Hash包含此数据，就像解析器需要它一样。但是rails自动将整个文件包含在.........我需要摆脱type="array"和我该如何处理？我没有在文档中找到任何内容。最佳答案我遇到了同样的问题；这是我的XML:我在用这个:entries.to_xml将散列数据转换为XML，但这会将条目的数据包装到中所以我修改了:entries.to_xml(root:"Contacts")但这仍然将转换后的XML包装在“联系人”中，将我的XML代码修改为
ruby - 通过 erb 模板输出 ruby 数组 - 2
我正在使用puppet为ruby程序提供一组常量。我需要提供一组主机名，我的程序将对其进行迭代。在我之前使用的bash脚本中，我只是将它作为一个puppet变量hosts=>"host1,host2"我将其提供给bash脚本作为HOSTS=显然这对ruby不太适用——我需要它的格式hosts=["host1","host2"]自从phosts和putsmy_array.inspect提供输出["host1","host2"]我希望使用其中之一。不幸的是，我终其一生都无法弄清楚如何让它发挥作用。我尝试了以下各项:我发现某处他们指出我需要在函数调用前放置“function_”……这
Ruby 写入和读取对象到文件 - 2
好的，所以我的目标是轻松地将一些数据保存到磁盘以备后用。您如何简单地写入然后读取一个对象？所以如果我有一个简单的类classCattr_accessor:a,:bdefinitialize(a,b)@a,@b=a,bendend所以如果我从中非常快地制作一个objobj=C.new("foo","bar")#justgaveitsomerandomvalues然后我可以把它变成一个kindaidstring=obj.to_s#whichreturns""我终于可以将此字符串打印到文件或其他内容中。我的问题是，我该如何再次将这个id变回一个对象？我知道我可以自己挑选信息并制作一个接受该信
ruby - 通过 ruby 进程共享变量 - 2
我正在编写一个gem，我必须在其中fork两个启动两个webrick服务器的进程。我想通过基类的类方法启动这个服务器，因为应该只有这两个服务器在运行，而不是多个。在运行时，我想调用这两个服务器上的一些方法来更改变量。我的问题是，我无法通过基类的类方法访问fork的实例变量。此外，我不能在我的基类中使用线程，因为在幕后我正在使用另一个不是线程安全的库。所以我必须将每个服务器派生到它自己的进程。我用类变量试过了，比如@@server。但是当我试图通过基类访问这个变量时，它是nil。我读到在Ruby中不可能在分支之间共享类变量，对吗？那么，还有其他解决办法吗？我考虑过使用单例，但我不确定这是
ruby - 通过 RVM (OSX Mountain Lion) 安装 Ruby 2.0.0-p247 时遇到问题 - 2
我的最终目标是安装当前版本的RubyonRails。我在OSXMountainLion上运行。到目前为止，这是我的过程:已安装的RVM$\curl-Lhttps://get.rvm.io|bash-sstable检查已知(我假设已批准)安装$rvmlistknown我看到当前的稳定版本可用[ruby-]2.0.0[-p247]输入命令安装$rvminstall2.0.0-p247注意:我也试过这些安装命令$rvminstallruby-2.0.0-p247$rvminstallruby=2.0.0-p247我很快就无处可去了。结果:$rvminstall2.0.0-p247Search
ruby-on-rails - Enumerator.new 如何处理已通过的 block ？ - 2
我在理解Enumerator.new方法的工作原理时遇到了一些困难。假设文档中的示例:fib=Enumerator.newdo|y|a=b=1loopdoy[1,1,2,3,5,8,13,21,34,55]循环中断条件在哪里，它如何知道循环应该迭代多少次(因为它没有任何明确的中断条件并且看起来像无限循环)？最佳答案 Enumerator使用Fibers在内部。您的示例等效于:require'fiber'fiber=Fiber.newdoa=b=1loopdoFiber.yieldaa,b=b,a+bendend10.times.m
ruby-on-rails - 如何在我的 Rails 应用程序 View 中打印 ruby 变量的内容？ - 2
我是一个Rails初学者，但我想从我的RailsView(html.haml文件)中查看Ruby变量的内容。我试图在ruby中打印出变量(认为它会在终端中出现)，但没有得到任何结果。有什么建议吗？我知道Rails调试器，但更喜欢使用inspect来打印我的变量。最佳答案您可以在View中使用puts方法将信息输出到服务器控制台。您应该能够在View中的任何位置使用Haml执行以下操作:-puts@my_variable.inspect 关于ruby-on-rails-如何在我的R

excel - 通过 XML 读取 Word 文档的内容

上下文

开始吧

C#代码

关联的VBA代码

注意事项

有关excel - 通过 XML 读取 Word 文档的内容的更多相关文章

随机推荐