php - Marking text in HTML

coder 2024-04-24 (original post)

I have some plain text and some HTML. I need to create a PHP method that returns the same HTML, but with <span class="marked"> before and </span> after every instance of the text.

Note that it should support tags inside the HTML (for example, if the text is blabla, it should also mark bla<b>bla</b> or <a href="http://abc.com">bla</a>bla).

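It should be case-insensitive and support long text (spanning multiple lines and so on).

For example, if I call this function with the text "my name is josh" and the following HTML: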

<html>
<head>
    <title>My Name Is Josh!!!</title>
</head>
<body>
    <h1>my name is <b>josh</b></h1>
    <div>
        <a href="http://www.names.com">my name</a> is josh
    </div>

    <u>my</u> <i>name</i> <b>is</b> <span style="font-family: Tahoma;">Josh</span>.
</body>
</html>

...it should return:

<html>
<head>
    <title><span class="marked">My Name Is Josh</span>!!!</title>
</head>
<body>
    <h1><span class="marked">my name is <b>josh</b></span></h1>
    <div>
        <span class="marked"><a href="http://www.names.com">my name</a> is josh</span>
    </div>

    <span class="marked"><u>my</u> <i>name</i> <b>is</b> <span style="font-family: Tahoma;">Josh</span></span>.
</body>
</html>

Thanks.

Best answer

This is going to be tricky.

You could do it with a simple regex hack that ignores anything inside a tag, something naive like:

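// Naive: allow any run of tags around the whitespace between words.
// Note the /.../i delimiters and <[^>]*> (the * was missing).
preg_replace(
    '/My(<[^>]*>)*\s+(<[^>]*>)*name(<[^>]*>)*\s+(<[^>]*>)*is(<[^>]*>)*\s+(<[^>]*>)*Josh/i',
    '<span class="marked">$0</span>', $html
);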

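But it isn't reliable at all. Partly this is because HTML can't be parsed with regex: a literal > is valid inside an attribute value, and other non-element constructs such as comments will be misparsed. Even with a more rigorous expression to match tags, something horribly unwieldy like <[^>\s]*(\s+([^>\s]+(\s*=\s*([^"'\s>][^\s>]*|"[^"]*"|'[^']*')\s*))?)*\s*\/?>, you would still have many of the same problems, especially when the input HTML is not guaranteed to be valid.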
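
To make those failure modes concrete, here is a rough JavaScript port of the naive pattern (JavaScript because the working code further down is JavaScript too). The names naivePattern and naiveMark are made up for this sketch and are not part of any library. It only illustrates the problem; it is not something to build on.

```javascript
// Hypothetical helper: build the naive "skip anything tag-like" pattern
// from a plain-text phrase.
function naivePattern(phrase) {
    var tags = '(?:<[^>]*>)*';
    var words = phrase.trim().split(/\s+/).map(function (word) {
        // Escape regex metacharacters in each word.
        return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    });
    return new RegExp(words.join(tags + '\\s+' + tags), 'gi');
}

// Hypothetical helper: wrap every naive match in a marker span.
function naiveMark(html, phrase) {
    return html.replace(naivePattern(phrase), '<span class="marked">$&</span>');
}

var phrase = 'my name is josh';

// 1. Phrase inside a comment: markup is injected into the comment.
var inComment = naiveMark('<!-- my name is josh -->', phrase);
// '<!-- <span class="marked">my name is josh</span> -->'

// 2. Phrase inside an attribute value: the attribute is corrupted.
var inAttribute = naiveMark('<img alt="my name is josh">', phrase);
// '<img alt="<span class="marked">my name is josh</span>">'

// 3. A literal > inside an attribute value ends the "tag" too early,
//    so a genuine match in the text is missed and nothing changes.
var withGt = 'my <a title="1 > 0">name</a> is josh';
var missed = naiveMark(withGt, phrase) === withGt;   // true
```

Case 2 is the kind of output that becomes a script-injection risk when the input HTML is untrusted.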

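It could even be a security issue: if the HTML you are processing is untrusted, it could fool your parser into turning text content into attributes, resulting in script injection.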

But even ignoring that, you wouldn't be able to ensure proper element nesting. So you might turn:

<em>My name is <strong>Josh</strong>!!!</em>

into the misnested and invalid:

<span class="marked"><em>My name is <strong>Josh</strong></span>!!!</em>

or:

My
<table><tr><td>name is</td></tr></table>
Josh

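where those elements can't be wrapped in a span. If you're unlucky, the browser fixups that "correct" your invalid output could end up leaving half the page "marked", or messing up the page layout.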
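
As a quick way to see why the wrapped <em> example counts as misnested, here is a toy tag-stack check. firstMisnesting is a hypothetical name, and the tokenizer is deliberately simplistic: it assumes a small fragment with no void elements, no comments and no > inside attribute values, so it has exactly the weaknesses described above and is only good for illustration.

```javascript
// Toy check: push open tags on a stack, pop on close tags, and report
// the first close tag that does not match the innermost open element.
function firstMisnesting(html) {
    var stack = [];
    var tagPattern = /<(\/?)([a-z][a-z0-9]*)[^>]*>/gi;
    var m;
    while ((m = tagPattern.exec(html)) !== null) {
        var name = m[2].toLowerCase();
        if (m[1] === '') {           // opening tag
            stack.push(name);
            continue;
        }
        if (stack.length === 0)
            return 'stray </' + name + '>';
        var open = stack.pop();
        if (open !== name)
            return 'expected </' + open + '> but found </' + name + '>';
    }
    return stack.length ? 'unclosed <' + stack.pop() + '>' : null;
}

var before = '<em>My name is <strong>Josh</strong>!!!</em>';
var after = '<span class="marked"><em>My name is <strong>Josh</strong></span>!!!</em>';

var beforeProblem = firstMisnesting(before);  // null (properly nested)
var afterProblem = firstMisnesting(after);    // 'expected </em> but found </span>'
```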

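So you have to do this at the parsed-DOM level rather than with string hacking. You can parse the whole string with PHP, process it and re-serialise it, but if it's acceptable from an accessibility point of view, it would probably be easier to do it at the browser end in JavaScript, where the content is already parsed into DOM nodes.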

It will still be pretty hard. This question handles the case where the text is all inside the same text node, but that is a much simpler case.

What you'd actually have to do is:

for each Element that may contain a <span>:
    for each child node in the element:
       generate the text content of this node and all following siblings
       match the target string/regex against the whole text
       if there is no match:
           break the outer loop - on to the next element.
       if the current node is an element node and the index of the match is not 0:
           break the inner loop - on to the next sibling node
       if the current node is a text node and the index of the match is > the length of the Text node data:
           break the inner loop - on to the next sibling node
       // now we have to find the position of the end of the match
       n is the length of the match string
       iterate through the remaining text node data and sibling text content:
           compare the length of the text content with n
           less?:
               subtract length from n and continue
           same?:
               we've got a match on a node boundary
               split the first text node if necessary
               insert a new span into the document
               move all the nodes from the first text node to this boundary inside the span
               break to outer loop, next element
           greater?:
               we've got a match ending inside the node.
               is the node a text node?:
                   then we can split the text node
                   also split the first text node if necessary
                   insert a new span into the document
                   move all contained nodes inside the span
                   break to outer loop, next element
               no, an element?:
                   oh dear! We can't insert a span here

Ouch.

Here's an alternative suggestion that's slightly less nasty, if it's acceptable to wrap each text node that is part of the match separately. So:

<p>Oh, my</p> name <div><div>is</div></div> Josh

would leave you with the output:

<p>Oh, <span class="marked">my</span></p>
<span class="marked"> name </span>
<div><div><span class="marked">is</span></div></div>
<span class="marked"> Josh</span>

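This might look OK, depending on how you're styling the matches. It also solves the misnesting problem for matches that lie partially inside elements.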
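
The core of this approach is to match against the concatenated text and then map the match back onto the individual text nodes. Here is a DOM-free sketch of that mapping, with plain strings standing in for text node data. matchPiecesPerNode is a hypothetical helper written for this illustration; it handles only the first match.

```javascript
// Given the data of consecutive text nodes, return the part of each
// node that is covered by the first match of `regexp`.
function matchPiecesPerNode(datas, regexp) {
    var text = datas.join('');
    var match = regexp.exec(text);
    if (!match) return [];
    var start = match.index;
    var end = start + match[0].length;
    var pieces = [];
    var offset = 0;                  // position of this node in `text`
    datas.forEach(function (data, i) {
        var from = Math.max(start, offset);
        var to = Math.min(end, offset + data.length);
        if (from < to)
            pieces.push({ node: i, text: data.slice(from - offset, to - offset) });
        offset += data.length;
    });
    return pieces;
}

// Text nodes of: <p>Oh, my</p> name <div><div>is</div></div> Josh
var pieces = matchPiecesPerNode(
    ['Oh, my', ' name ', 'is', ' Josh'],
    /my\s+name\s+is\s+josh/i
);
// pieces[0] is { node: 0, text: 'my' }, pieces[1] is { node: 1, text: ' name ' },
// pieces[2] is { node: 2, text: 'is' },  pieces[3] is { node: 3, text: ' Josh' }
```

Each piece corresponds to one span in the output above: the first and last nodes are split at the match edges, and the nodes in between are wrapped whole.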

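ETA: Oh, hang the pseudocode. I've more or less written the code now anyway, so I might as well finish it. Here's a JavaScript version of the latter approach: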

markTextInElement(document.body, /My\s+name\s+is\s+Josh/gi);


function markTextInElement(element, regexp) {
    var nodes= [];
    collectTextNodes(nodes, element);
    var datas= nodes.map(function(node) { return node.data; });
    var text= datas.join('');

    // Get list of [startnodei, startindex, endnodei, endindex] matches
    //
    var matches= [], match;
    while (match= regexp.exec(text)) {
        var p0= getPositionInStrings(datas, match.index, false);
        var p1= getPositionInStrings(datas, match.index+match[0].length, true);
        matches.push([p0[0], p0[1], p1[0], p1[1]]);
    }

    // Get list of nodes for each match, split at the edges of the
    // text. Reverse-iterate to avoid the splitting changing nodes we
    // have yet to process.
    //
    for (var i= matches.length; i-->0;) {
        var ni0= matches[i][0], ix0= matches[i][1], ni1= matches[i][2], ix1= matches[i][3];
        var mnodes= nodes.slice(ni0, ni1+1);
        if (ix1<nodes[ni1].length)
            nodes[ni1].splitText(ix1);
        if (ix0>0)
            mnodes[0]= nodes[ni0].splitText(ix0);

        // Replace each text node in the sublist with a wrapped version
        //
        mnodes.forEach(function(node) {
            var span= document.createElement('span');
            span.className= 'marked';
            node.parentNode.replaceChild(span, node);
            span.appendChild(node);
        });
    }
}

function collectTextNodes(texts, element) {
    // Text directly inside these elements must not be wrapped in a span.
    var textok= [
        'applet', 'col', 'colgroup', 'dl', 'iframe', 'map', 'object', 'ol',
        'optgroup', 'option', 'script', 'select', 'style', 'table',
        'tbody', 'textarea', 'tfoot', 'thead', 'tr', 'ul'
    ].indexOf(element.tagName.toLowerCase())===-1;
    for (var i= 0; i<element.childNodes.length; i++) {
        var child= element.childNodes[i];
        if (child.nodeType===3 && textok)
            texts.push(child);
        if (child.nodeType===1)
            collectTextNodes(texts, child);
    }
}

function getPositionInStrings(strs, index, toend) {
    var ix= 0;
    for (var i= 0; i<strs.length; i++) {
        var n= index-ix, l= strs[i].length;
        if (toend? l>=n : l>n)
            return [i, n];
        ix+= l;
    }
    return [i, 0];
}


// We've used a few ECMAScript Fifth Edition Array features.
// Make them work in browsers that don't support them natively.
//
if (!('indexOf' in Array.prototype)) {
    Array.prototype.indexOf= function(find, i /*opt*/) {
        if (i===undefined) i= 0;
        if (i<0) i+= this.length;
        if (i<0) i= 0;
        for (var n= this.length; i<n; i++)
            if (i in this && this[i]===find)
                return i;
        return -1;
    };
}
if (!('forEach' in Array.prototype)) {
    Array.prototype.forEach= function(action, that /*opt*/) {
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                action.call(that, this[i], i, this);
    };
}
if (!('map' in Array.prototype)) {
    Array.prototype.map= function(mapper, that /*opt*/) {
        var other= new Array(this.length);
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                other[i]= mapper.call(that, this[i], i, this);
        return other;
    };
}
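
The call at the top hard-codes the regular expression. If the search text arrives as a plain string, as in the question, something like the following could build the expression. phraseToRegExp is a hypothetical helper, not part of the code above. It assumes that any run of whitespace in the phrase should match any run of whitespace in the page (which covers text wrapped over several lines) and that matching is case-insensitive.

```javascript
// Turn a plain-text phrase into a global, case-insensitive RegExp
// that tolerates arbitrary whitespace (including newlines) between words.
function phraseToRegExp(phrase) {
    var trimmed = phrase.trim();
    if (trimmed === '')
        throw new Error('phrase must contain at least one word');
    var words = trimmed.split(/\s+/).map(function (word) {
        // Escape regex metacharacters so the word is matched literally.
        return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    });
    return new RegExp(words.join('\\s+'), 'gi');
}

var regexp = phraseToRegExp('my name is josh');
// regexp.source is 'my\\s+name\\s+is\\s+josh' (as a JS string literal)

// In a browser, with the functions above loaded:
// markTextInElement(document.body, regexp);
```

The empty-phrase guard matters because an expression that matches the empty string would make the exec loop in markTextInElement spin forever.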

This post on "php - Marking text in HTML" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/2843773/
