
This article describes how to extract only the page text using HTMLAgilityPack.

Question

I am really new to the XPath queries used in HTMLAgilityPack.

Let's consider this page: http://health.yahoo.net/articles/healthcare/what-your-favorite-flavor-says-about-you. What I want is to extract just the page content and nothing else.

To do that, I first remove the script and style tags.

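// Document (HtmlDocument) and TempString (StringBuilder) are declared elsewhere.
// ToArray() requires "using System.Linq;".
Document = new HtmlDocument();
Document.LoadHtml(page);
TempString = new StringBuilder();
// Snapshot the matches with ToArray() so nodes can be removed while iterating.
foreach (HtmlNode style in Document.DocumentNode.Descendants("style").ToArray())
{
    style.Remove();
}
foreach (HtmlNode script in Document.DocumentNode.Descendants("script").ToArray())
{
    script.Remove();
}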

After that, I try to use //text() to get all the text nodes.

foreach (HtmlTextNode node in Document.DocumentNode.SelectNodes("//text()"))
{
    TempString.AppendLine(node.InnerText);
}

However, I am not getting just the text; I am also getting numerous \r\n characters.

I would appreciate a little guidance on this.

Recommended answer

If you assume that script and style nodes have only text nodes as children, you can use this XPath expression to get the text nodes that are not inside script or style tags, so you don't need to remove those nodes beforehand:

//*[not(self::script or self::style)]/text()

You can further exclude text nodes that contain only whitespace by using XPath's normalize-space():

//*[not(self::script or self::style)]/text()[not(normalize-space(.)="")]

or, shorter:

//*[not(self::script or self::style)]/text()[normalize-space()]
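
As a minimal sketch of how the shorter expression might be used from C#, assuming the HtmlAgilityPack NuGet package is referenced: the class name PageText and the method name GetTextNodes are hypothetical and chosen only for illustration. The null check is there because SelectNodes returns null when nothing matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper; the names are illustrative, not part of HtmlAgilityPack.
public static class PageText
{
    // XPath from the answer: non-whitespace text nodes whose parent
    // element is neither <script> nor <style>.
    private const string TextNodesXPath =
        "//*[not(self::script or self::style)]/text()[normalize-space()]";

    public static List<string> GetTextNodes(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(TextNodesXPath);
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText is returned as-is, so leading and trailing whitespace may remain.
            result.Add(node.InnerText);
        }
        return result;
    }
}
```

With this approach the script and style elements stay in the document; they are simply skipped by the query.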

However, you will still get text nodes that may have leading or trailing whitespace. That can be handled in your application, as @aL3891 suggests.
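
One possible way to do that cleanup in application code is sketched below. It is not necessarily what @aL3891 proposed; it assumes that collapsing every whitespace run (including \r and \n) into a single space is acceptable. The class name TextCleanup and the method name JoinCleaned are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; uses only the .NET base class library.
public static class TextCleanup
{
    // Matches any run of whitespace, including \r and \n.
    private static readonly Regex WhitespaceRun = new Regex(@"\s+");

    // Cleans each raw text-node string and joins the non-empty results,
    // one per line.
    public static string JoinCleaned(IEnumerable<string> rawTexts)
    {
        var lines = new List<string>();
        foreach (string raw in rawTexts)
        {
            // Collapse internal whitespace, then trim both ends.
            string cleaned = WhitespaceRun.Replace(raw, " ").Trim();
            if (cleaned.Length > 0)
            {
                lines.Add(cleaned);
            }
        }
        return string.Join(Environment.NewLine, lines);
    }
}
```

The InnerText value of each selected text node can be passed to JoinCleaned in place of appending it directly to the StringBuilder.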
