Re: [Question] Scraping data from the Central Weather Bureau website

Author: celestialgod (天)

Board: R_Language

Title: Re: [Question] Scraping data from the Central Weather Bureau website

Time: Mon May 2 14:29:49 2016

※ Quoting corel (可羅):
: [Question type]:
: Programming help (I want to do something in R, but I don't know how to write it in R)
: [Software familiarity]:
: Beginner (I have written other programs, I'm just unfamiliar with the syntax)
: [Problem description]:
: I want to scrape the weather data at http://www.cwb.gov.tw/V7/climate/monthlyData/mD.htm
: but the data changes depending on the values selected in the drop-down menus on the page.
: How can I make R supply those values automatically and pick the corresponding data?
: For example: the mean temperature at Alishan in March 2013 was 9.9 degrees
:              the mean temperature at Alishan in April 2013 was 11.3 degrees
: Thanks
: [Environment]:
: R version 3.2.4 Revised (2016-03-16 r70336)
: Platform: x86_64-w64-mingw32/x64 (64-bit)
: Running under: Windows 7 x64 (build 7601) Service Pack 1
: locale:
: [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950  LC_CTYPE=Chinese (Traditional)_Taiwan.950
:     LC_MONETARY=Chinese (Traditional)_Taiwan.950
: [4] LC_NUMERIC=C  LC_TIME=Chinese (Traditional)_Taiwan.950
: attached base packages:
: [1] stats graphics grDevices utils datasets methods base
: loaded via a namespace (and not attached):
:  [1] httr_1.1.0 magrittr_1.5 R6_2.1.1 tools_3.2.4 RCurl_1.95-4.8
:      yaml_2.1.13 rappdirs_0.3 memoise_0.2.1 crayon_1.3.1 swirl_2.3.1-2
: [11] stringi_1.0-1 stringr_1.0.0 digest_0.6.8 testthat_0.11.0 bitops_1.0-6

Easier-to-read version: http://pastebin.com/DYaWwFQ2

library(stringi)
library(stringr)
library(xml2)
library(pipeR)
library(purrr)

# The file name appears to encode the year and month (mD201512.htm = December 2015)
dat <- read_html("http://www.cwb.gov.tw/V7/climate/monthlyData/Data/mD201512.htm", "UTF-8") %>>%
  xml_find_all('//table[@class="Form00"]/tr') %>>%
  map(~ when(.,
        # header rows have no td, so xml_find_one fails: take the th cells instead
        identical(class(try({xml_find_one(., 'td')})), "try-error") ~ xml_find_all(., 'th'),
        ~ xml_find_all(., 'td')
      ) %>>%
      xml_text %>>%
      when(
        # on Windows, convert the text from UTF-8 to BIG5
        identical(.Platform$OS.type, "windows") ~ stri_conv(., "UTF-8", "BIG5"),
        ~ .
      ) %>>%
      # strip all whitespace
      str_replace_all("\\s", "")
  )

# The top header row
dat[[1]]
# [1] "項目"                          "溫度(℃)"      "雨量"
# [4] "風速(公尺/秒)/風向(360°)/日期" "相對溼度(%)"  "測站氣壓"
# [7] "降水日數>=0.1毫米"             "日照時數"

# The table below it
do.call(rbind, dat[2:length(dat)])
## partial output
#      [,1]     [,2]   [,3]        [,4]        [,5]     [,6]
# [1,] "測站"   "平均" "最高/日期" "最低/日期" "(毫米)" "最大十分鐘風"
# [2,] "阿里山" "9.8"  "18.0/5"    "2.8/29"    "56.8"   "4.8/340.0/26"
# [3,] "鞍部"   "12.3" "23.7/22"   "2.5/17"    "245.7"  "14.3/350.0/10"
# [4,] "板橋"   "18.8" "29.9/22"   "11.1/17"   "80.6"   "7.1/70.0/19"
# [5,] "成功"   "21.1" "27.8/21"   "14.5/17"   "86.4"   "9.8/30.0/31"
# [6,] "嘉義"   "19.8" "30.2/24"   "10.4/19"   "31.4"   "8.4/20.0/10"

I no longer recommend the XML package for parsing XML; move to xml2 where you can.
xml2 is at least 20% faster, and XML has some memory leak problems.

Function notes:

stri_conv        Converts between encodings (this function is in stringi)
xml_find_all     Finds nodes with an XPath expression (xml2)
read_html        Reads a web page (xml2)
when (purrr)     Each argument is a condition paired with its output. For example:

    row %>>% when(
      identical(class(try({xml_find_one(., 'td')})), "try-error") ~ xml_find_all(., 'th'),
      ~ xml_find_all(., 'td')
    )

    When identical(class(try({xml_find_one(., 'td')})), "try-error") holds,
    it returns xml_find_all(row, 'th').
    An argument with no condition acts as the else branch, so otherwise it returns xml_find_all(row, 'td').
    Every . inside refers to the first input (the value passed in from the left of %>>%).
    This function is a sort of enhanced ifelse (ifelse cannot return a block).
    You can use it directly inside a pipe, with no need to store a temporary variable and break the pipe.
    (I may have gone overboard with pipes by now.)

map (purrr)      Equivalent to lapply; Hadley and RStudio rewrote it in C++
str_replace_all  Equivalent to gsub, except that gsub's third argument (the input string)
                 comes first, which is convenient for piping (this function is in stringr)
%>>%             From pipeR. Largely the same as %>%, but with some handy extra features;
                 they are not used here, so I won't explain them.
                 Its author also claims it avoids the ambiguity about where the input goes
                 that %>% sometimes has (Google it yourself).

--
R data-wrangling package series:
magrittr      #1LhSWhpH (R_Language) http://tinyurl.com/j3ql84c
data.table    #1LhW7Tvj (R_Language) http://tinyurl.com/hr77hrn
dplyr (part 1) #1LhpJCfB (R_Language) http://tinyurl.com/jtg4hau
dplyr (part 2) #1Lhw8b-s (R_Language)
tidyr         #1Liqls1R (R_Language) http://tinyurl.com/jq3o2g3

--
※ Origin: PTT (ptt.cc), from: 180.218.152.118
※ Article URL: https://www.ptt.cc/bbs/R_Language/M.1462170593.A.3D2.html

Push corel: Thanks, R really is an amazing language.... 05/02 14:45

※ Edited: celestialgod (180.218.152.118), 05/02/2016 15:07:03