
xml - Parsing XML with R - is it always this difficult?

coder 2024-06-27

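I spent more time than expected getting XML into a data frame (the snippet below combines the XML with the xmlTreeParse call to keep the post short; the whole solution follows after it):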

users = xmlTreeParse(file=
'<?xml version="1.0" encoding="utf-8"?>
<users>
  <row Id="-1" Reputation="1" CreationDate="2010-07-19T06:55:26.860" DisplayName="Community" LastAccessDate="2010-07-19T06:55:26.860" Location="on the server farm" AboutMe="some text" Views="0" UpVotes="4382" DownVotes="771" EmailHash="a007be5a61f6aa8f3e85ae2fc18dd66e" />
  <row Id="2" Reputation="101" CreationDate="2010-07-19T14:01:36.697" DisplayName="Geoff Dalgas" LastAccessDate="2012-09-13T17:41:48.300" WebsiteUrl="http://stackoverflow.com" Location="Corvallis, OR" AboutMe="some text 2" Views="7" UpVotes="3" DownVotes="0" EmailHash="b437f461b3fd27387c5d8ab47a293d35" Age="36" />
  <row Id="3" Reputation="101" CreationDate="2010-07-19T15:34:50.507" DisplayName="Jarrod Dixon" LastAccessDate="2013-01-15T03:28:47.657" WebsiteUrl="http://stackoverflow.com" Location="New York, NY" AboutMe="some text 3" Views="9" UpVotes="19" DownVotes="0" EmailHash="2dfa19bf5dc5826c1fe54c2c049a1ff1" Age="34" />
  <row Id="4" Reputation="101" CreationDate="2010-07-19T19:03:27.400" DisplayName="Emmett" LastAccessDate="2013-04-16T16:51:04.780" WebsiteUrl="http://minesweeperonline.com" Location="New York, NY" AboutMe="some text 4" Views="3" UpVotes="0" DownVotes="0" EmailHash="129bc58fc3f1e3853cdd3cefc75fe1a0" Age="27" />
  <row Id="5" Reputation="6182" CreationDate="2010-07-19T19:03:57.227" DisplayName="Shane" LastAccessDate="2013-02-05T11:23:09.587" WebsiteUrl="http://www.statalgo.com" Location="New York, NY" AboutMe="some text 5" Views="605" UpVotes="659" DownVotes="5" EmailHash="0cee97ffd90277bf4ac753331d50af60" Age="34" />
  <row Id="6" Reputation="442" CreationDate="2010-07-19T19:04:07.647" DisplayName="Harlan" LastAccessDate="2013-05-09T13:11:29.027" WebsiteUrl="http://www.harlan.harris.name" Location="District of Columbia" AboutMe="some text 6" Views="30" UpVotes="42" DownVotes="0" EmailHash="9f1a68b9e623be5da422b44e733fa8bc" Age="40" />
  <row Id="7" Reputation="329" CreationDate="2010-07-19T19:04:37.257" DisplayName="Vince" LastAccessDate="2013-05-21T22:49:10.237" WebsiteUrl="http://bioinformatics.ucdavis.edu" Location="Davis, CA" AboutMe="some text 7" Views="21" UpVotes="14" DownVotes="0" EmailHash="4f7cebc8ac200d15bac5dcff51469425" Age="27" />
  <row Id="8" Reputation="6104" CreationDate="2010-07-19T19:04:52.280" DisplayName="csgillespie" LastAccessDate="2013-05-21T17:32:58.693" WebsiteUrl="http://www.mas.ncl.ac.uk/~ncsg3/" Location="Newcastle, United Kingdom" AboutMe="some text 8" Views="399" UpVotes="576" DownVotes="18" EmailHash="3c3eea4eda77ffe95ae18c78c3fc7e55" Age="35" />
  <row Id="10" Reputation="121" CreationDate="2010-07-19T19:05:40.403" DisplayName="Pierre" LastAccessDate="2012-10-04T17:17:01.430" WebsiteUrl="http://plindenbaum.blogspot.com" Location="France" AboutMe="some text 10" Views="8" UpVotes="2" DownVotes="0" EmailHash="61200477cf8983809ec152f484750204" Age="43" />
  <row Id="11" Reputation="136" CreationDate="2010-07-19T19:06:02.713" DisplayName="wahalulu" LastAccessDate="2013-05-26T20:36:24.567" WebsiteUrl="http://www.linkedin.com/in/marckvaisman" Location="Washington, DC" AboutMe="some text 11" Views="2" UpVotes="10" DownVotes="0" EmailHash="9a9a05e41ae6e3b127697967cea5f8fb" Age="39" />
  <row Id="12" Reputation="101" CreationDate="2010-07-19T19:06:34.507" DisplayName="Jin" LastAccessDate="2013-04-11T18:31:58.360" WebsiteUrl="http://www.8164.org" Location="Raleigh, NC" AboutMe="some text 12" Views="5" UpVotes="4" DownVotes="0" EmailHash="70ad2c2830eb9a7753bd6312f3811e3e" Age="37" />
  <row Id="13" Reputation="677" CreationDate="2010-07-19T19:06:49.527" DisplayName="Sharpie" LastAccessDate="2012-01-02T22:55:04.743" WebsiteUrl="http://www.sharpsteen.net" Location="United States" AboutMe="Undergraduate studying Environmental Engineering and Applied Mathematics." Views="37" UpVotes="44" DownVotes="1" EmailHash="a52001938ed33a87334447413cc5beaa" Age="27" />
  <row Id="15" Reputation="11" CreationDate="2010-07-19T19:07:32.537" DisplayName="hannes.koller" LastAccessDate="2010-08-24T14:23:18.050" WebsiteUrl="http://soma.denkt.org" Location="Vienna, Austria" AboutMe="" Views="2" UpVotes="0" DownVotes="0" EmailHash="0ecd144e2f3d05e6ee6b89404d1d4c53" Age="34" />
  <row Id="16" Reputation="101" CreationDate="2010-07-19T19:08:13.957" DisplayName="slashnick" LastAccessDate="2010-08-19T20:40:59.080" Location="London, United Kingdom" Views="2" UpVotes="7" DownVotes="0" EmailHash="5691ff74e21c78cd1563b5123254cbd6" Age="30" />
  <row Id="17" Reputation="192" CreationDate="2010-07-19T19:08:28.243" DisplayName="Random" LastAccessDate="2010-09-10T07:34:36.123" AboutMe="" Views="6" UpVotes="13" DownVotes="1" EmailHash="5a3c78de1408aae57797dffd0782b992" />
  <row Id="18" Reputation="128" CreationDate="2010-07-19T19:08:29.070" DisplayName="grokus" LastAccessDate="2012-08-09T15:02:00.600" WebsiteUrl="http://wikipedia.org" Location="United States" AboutMe="about me 18" Views="6" UpVotes="16" DownVotes="0" EmailHash="7d1f931327bfab7b214758be17627adc" Age="43" />
  <row Id="19" Reputation="101" CreationDate="2010-07-19T19:08:45.250" DisplayName="Noah Snyder" LastAccessDate="2012-06-17T15:53:43.550" WebsiteUrl="http://sbseminar.wordpress.com" Location="New York, NY" AboutMe="about me 19" Views="11" UpVotes="2" DownVotes="0" EmailHash="895385d49eb1f04c5ee1f8d7734f3a62" Age="33" />
</users>',
          asText=TRUE)

The XML is just a representation of the Users table from the Stack Exchange data dump:

<users>
  <row Id=..... />
  <row Id=..... />
  .....
  <row Id=..... />
</users>

Mapping it to a data frame works just as it would for a table. Here is the code that did the job for me:

require(XML)
require(plyr)

# insert xmlTreeParse here

r = xmlRoot(users)

attrs = c('Id', 'Reputation', 'CreationDate', 'DisplayName', 'LastAccessDate',
          'WebsiteUrl', 'Location', 'AboutMe',  'Views', 'UpVotes', 'DownVotes', 
          'EmailHash', 'Age')

mapUserAttrs = function(x, colNames) {
  t = data.frame(as.integer(x['Id']), 
           as.integer(x['Reputation']), 
           strptime(x['CreationDate'], '%Y-%m-%dT%H:%M:%OS'), 
           as.character(x['DisplayName']), 
           strptime(x['LastAccessDate'], '%Y-%m-%dT%H:%M:%OS'), 
           as.character(x['WebsiteUrl']), 
           as.character(x['Location']), 
           as.character(x['AboutMe']),
           as.integer(x['Views']), 
           as.integer(x['UpVotes']), 
           as.integer(x['DownVotes']), 
           as.character(x['EmailHash']), 
           as.integer(x['Age']))
  names(t) = colNames
  return(t)
}

result = ldply(lapply(xmlChildren(r), xmlAttrs), mapUserAttrs, attrs)

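It looks too busy to me, but I found no better way to do this with the XML package, despite the many examples and the documentation I went through.

Is there a simpler (or shorter) way to accomplish the same task?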

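Best answer

You can do this with the xmlToList function from the XML package. Because some of your nodes carry attributes that others lack, you will also need the rbind.fill function from the plyr package.

The lines below convert your XML to a list, loop over the nodes and turn each node's strings into a one-row data frame, and then bind all of those data frames together.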

require(XML)
require(plyr)

out <- do.call("rbind.fill",
  lapply(xmlToList(users), 
    function(x) as.data.frame(as.list(x), stringsAsFactors = FALSE)))


head(out)
  Id Reputation            CreationDate  DisplayName          LastAccessDate             Location     AboutMe Views UpVotes DownVotes
1 -1          1 2010-07-19T06:55:26.860    Community 2010-07-19T06:55:26.860   on the server farm   some text     0    4382       771
2  2        101 2010-07-19T14:01:36.697 Geoff Dalgas 2012-09-13T17:41:48.300        Corvallis, OR some text 2     7       3         0
3  3        101 2010-07-19T15:34:50.507 Jarrod Dixon 2013-01-15T03:28:47.657         New York, NY some text 3     9      19         0
4  4        101 2010-07-19T19:03:27.400       Emmett 2013-04-16T16:51:04.780         New York, NY some text 4     3       0         0
5  5       6182 2010-07-19T19:03:57.227        Shane 2013-02-05T11:23:09.587         New York, NY some text 5   605     659         5
6  6        442 2010-07-19T19:04:07.647       Harlan 2013-05-09T13:11:29.027 District of Columbia some text 6    30      42         0
                         EmailHash                    WebsiteUrl  Age
1 a007be5a61f6aa8f3e85ae2fc18dd66e                          <NA> <NA>
2 b437f461b3fd27387c5d8ab47a293d35      http://stackoverflow.com   36
3 2dfa19bf5dc5826c1fe54c2c049a1ff1      http://stackoverflow.com   34
4 129bc58fc3f1e3853cdd3cefc75fe1a0  http://minesweeperonline.com   27
5 0cee97ffd90277bf4ac753331d50af60       http://www.statalgo.com   34
6 9f1a68b9e623be5da422b44e733fa8bc http://www.harlan.harris.name   40
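
To see why rbind.fill is used instead of plain rbind, here is a minimal sketch with two hand-made one-row data frames standing in for <row> nodes (the data frames `a` and `b` are made up for illustration; only rbind.fill itself comes from plyr):

```r
library(plyr)

# Two one-row data frames with different columns, mimicking <row> nodes
# where an optional attribute such as Age is missing.
a <- data.frame(Id = "-1", DisplayName = "Community",
                stringsAsFactors = FALSE)
b <- data.frame(Id = "2", DisplayName = "Geoff Dalgas", Age = "36",
                stringsAsFactors = FALSE)

# rbind(a, b) would stop with an error because the column sets differ.
# rbind.fill() uses the union of the columns and fills the gaps with NA.
combined <- rbind.fill(a, b)
print(combined)
```

The first row gets NA for Age, which is the same effect visible in the WebsiteUrl and Age columns of the output above.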

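Edit

The resulting data frame consists entirely of character vectors. If you want to convert them to dates, date-times, numbers and so on, you can do it column by column, write a function that assigns specified classes to columns with particular names, or write a function that tries to infer the right class from the data. Here is an example of the last option: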

giveClasses <- function(df, threshold = 0.1) {
  df_classes <- sapply(df, class)

  df_alpha <- sapply(df, function(x) {
    mean(grepl("[[:alpha:]]", x)) >= threshold}) &
    df_classes == "character"

  df_digits <- sapply(df, function(x) mean(grepl("\\d", x))) >= threshold &
    df_classes == "character" &
    !df_alpha

  df_percent <- sapply(df, function(x) mean(grepl("%", x))) >= threshold &
    df_classes == "character" &
    !df_alpha &
    df_digits

  df_digits[df_percent] <- FALSE

  df_decimal <- sapply(df, function(x) mean(grepl("\\.", x))) >= threshold &
    df_classes == "character" &
    !df_percent &
    df_digits &
    !df_alpha

  df_dates <- sapply(df, function(x) {
    mean(grepl(
      "^\\d{2,4}[[:punct:]]\\d{2}[[:punct:]]\\d{2,4}$", x)) >= threshold}) &
    df_classes == "character"

  df_datetime <- sapply(df, function(x) {
    mean(grepl(
      "^\\d{2,4}[[:punct:]]\\d{2}[[:punct:]]\\d{2,4}\\D\\d{2}:\\d{2}(:\\d{2})?(\\.\\d{1,})?$", x)) >= threshold}) &
    df_classes == "character"

  # flag columns with exactly two distinct values that all start with y or n (yes/no)
  df_logical <- sapply(df, function(x) {
    y <- unique(na.omit(x))
    length(y) == 2 & 
      mean(grepl("^n", y, ignore.case = TRUE) |
          grepl("^y", y, ignore.case = TRUE)) == 1
  })

  df_digits[df_dates | df_datetime] <- FALSE

  df[,df_percent] <- lapply(df[,df_percent, drop = FALSE], function(x) {
    as.numeric(gsub("[^[:digit:].]", "", x)) / 100})

  df[,df_logical] <- lapply(df[,df_logical, drop = FALSE], function(x) {
    x[grep("^y", x, ignore.case = TRUE)] <- TRUE
    x[grep("^n", x, ignore.case = TRUE)] <- FALSE
    as.logical(x)
  })

  df[,df_decimal] <- lapply(df[,df_decimal, drop = FALSE], function(x) {
    as.numeric(gsub("[^[:digit:].]", "", x))})

  df[,df_digits] <- lapply(df[,df_digits, drop = FALSE], function(x) {
    as.integer(gsub("[^[:digit:]]", "", x))})

  df[,df_dates] <- lapply(df[,df_dates, drop = FALSE], function(x) {
    as.Date(x)})

  df[,df_datetime] <- lapply(df[,df_datetime, drop = FALSE], function(x) {
    strptime(x, '%Y-%m-%dT%H:%M:%OS')})

  df_ischaracter <- sapply(df, function(x) any(class(x) == "character"))

  df[,df_ischaracter] <- lapply(df[,df_ischaracter, drop = FALSE], function(x) {
    x <- gsub("^\\s+|\\s+$|(?<=\\s)\\s+", "", x, perl = TRUE)})

  df
}
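
Each check in the function above boils down to one comparison: the proportion of values that match a regular expression, measured against `threshold`. A minimal sketch of that test in isolation, where `matchesPattern` and the sample vector are hypothetical names introduced only for illustration:

```r
# Hypothetical helper: TRUE when at least `threshold` of the values
# in x match `pattern`. grepl() returns FALSE for NA, so missing
# values count as non-matches.
matchesPattern <- function(x, pattern, threshold = 0.1) {
  mean(grepl(pattern, x)) >= threshold
}

ages <- c("36", "34", NA, "27")   # 3 of 4 values contain a digit

loose  <- matchesPattern(ages, "\\d")                   # 0.75 >= 0.1
strict <- matchesPattern(ages, "\\d", threshold = 0.9)  # 0.75 >= 0.9
print(c(loose = loose, strict = strict))
```

With the default of 0.1 the column qualifies; with 0.9 it does not, because the single NA pulls the match rate down to 75%.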

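The function above assigns a class to a column when the share of its values matching that class's pattern reaches `threshold`. As written, the default is 0.1 (10%), so pass threshold = 0.9 if you want to require 90%. Otherwise the column is left as character. It also handles patterns that do not occur in your sample data set, because I simply copied the code from another project I am working on. So: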

str(giveClasses(out))

'data.frame':   17 obs. of  13 variables:
 $ Id            : int  1 2 3 4 5 6 7 8 10 11 ...
 $ Reputation    : int  1 101 101 101 6182 442 329 6104 121 136 ...
 $ CreationDate  : POSIXlt, format: "2010-07-19 06:55:26" "2010-07-19 14:01:36" "2010-07-19 15:34:50" "2010-07-19 19:03:27" ...
 $ DisplayName   : chr  "Community" "Geoff Dalgas" "Jarrod Dixon" "Emmett" ...
 $ LastAccessDate: POSIXlt, format: "2010-07-19 06:55:26" "2012-09-13 17:41:48" "2013-01-15 03:28:47" "2013-04-16 16:51:04" ...
 $ Location      : chr  "on the server farm" "Corvallis, OR" "New York, NY" "New York, NY" ...
 $ AboutMe       : chr  "some text" "some text 2" "some text 3" "some text 4" ...
 $ Views         : int  0 7 9 3 605 30 21 399 8 2 ...
 $ UpVotes       : int  4382 3 19 0 659 42 14 576 2 10 ...
 $ DownVotes     : int  771 0 0 0 5 0 0 18 0 0 ...
 $ EmailHash     : chr  "a007be5a61f6aa8f3e85ae2fc18dd66e" "b437f461b3fd27387c5d8ab47a293d35" "2dfa19bf5dc5826c1fe54c2c049a1ff1" "129bc58fc3f1e3853cdd3cefc75fe1a0" ...
 $ WebsiteUrl    : chr  NA "http://stackoverflow.com" "http://stackoverflow.com" "http://minesweeperonline.com" ...
 $ Age           : int  NA 36 34 27 34 40 27 35 43 39 ...
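
The other route mentioned above, assigning classes to columns by name, might look like the following sketch. `applyClasses`, `parseStamp` and the small `sampleUsers` data frame are hypothetical names introduced here, not part of the XML or plyr packages, and the UTC time zone is an assumption:

```r
# Hypothetical helper: apply one converter function per named column.
# Columns without a converter stay untouched (character).
applyClasses <- function(df, converters) {
  for (col in intersect(names(converters), names(df))) {
    df[[col]] <- converters[[col]](df[[col]])
  }
  df
}

# Date-time format used in the data dump; the time zone is assumed.
parseStamp <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")
}

sampleUsers <- data.frame(
  Id           = c("-1", "2"),
  CreationDate = c("2010-07-19T06:55:26.860", "2010-07-19T14:01:36.697"),
  Age          = c(NA, "36"),
  stringsAsFactors = FALSE
)

typed <- applyClasses(sampleUsers,
                      list(Id = as.integer,
                           Age = as.integer,
                           CreationDate = parseStamp))
str(typed)
```

Calling as.integer directly keeps the sign of Id -1. The inference function strips every non-digit character before converting, which is why the first Id shows up as 1 in its output above.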

Regarding "xml - Parsing XML with R - is it always this difficult?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/17091186/
