Still struggling to get hold of official case data?
The WHO per-country case data is here:
https://experience.arcgis.com/experience/685d0ace521648f8a5beeeee1b9125cd
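Our goal is to scrape the data behind the chart on that dashboard:
Inspect element
First, click on any country to open its outbreak details.
Taking China as an example, open it with the browser's developer tools running and find the request URL: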
https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27CHINA%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2Ccum_conf%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true
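In the Preview tab you can see:
This is exactly the data we want, but the time format is unfamiliar. Taking the difference between consecutive values reveals the pattern:
Two consecutive dates differ by 86,400,000, which is the number of milliseconds in one day, so DateOfDataEntry is a Unix timestamp in milliseconds.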
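A minimal sketch of decoding these values, assuming they are Unix timestamps in milliseconds (UTC). The sample payload, its numbers, and the helper name ms_to_date are made up for illustration; only the field names (features, attributes, cum_conf, DateOfDataEntry) come from the query above.

```python
import json
from datetime import datetime, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in one day

# Illustrative payload only: same shape as the service response, made-up values.
sample_response = json.dumps({
    "features": [
        {"attributes": {"OBJECTID": 1, "cum_conf": 10, "DateOfDataEntry": 1579651200000}},
        {"attributes": {"OBJECTID": 2, "cum_conf": 20, "DateOfDataEntry": 1579737600000}},
    ]
})


def ms_to_date(ms):
    """Convert a Unix timestamp in milliseconds to a UTC calendar date."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()


# Pair each calendar date with its cumulative confirmed count.
rows = [
    (ms_to_date(f["attributes"]["DateOfDataEntry"]), f["attributes"]["cum_conf"])
    for f in json.loads(sample_response)["features"]
]
for day, cum_conf in rows:
    print(day.isoformat(), cum_conf)
```

If the real timestamps turn out not to fall on midnight UTC, the date conversion may need a time zone adjustment.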
The URL above returns cumulative confirmed cases (cum_conf). The URL for new cases (NewCase) is:
https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27CHINA%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2CNewCase%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true
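The two URLs differ only in the outFields value, so the query string can be assembled from a dictionary instead of being pasted by hand. The sketch below is one way to do that with the standard library; build_query_url is a hypothetical helper name, and it also percent-encodes spaces, so multi-word country names need no special treatment.

```python
from urllib.parse import quote, urlencode

BASE_URL = ("https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/"
            "Historic_adm0_v3/FeatureServer/0/query")


def build_query_url(country, field):
    """Build the query URL for one country; field is 'cum_conf' or 'NewCase'."""
    params = {
        "f": "json",
        "where": "ADM0_NAME='{}'".format(country),
        "returnGeometry": "false",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "OBJECTID,{},DateOfDataEntry".format(field),
        "orderByFields": "DateOfDataEntry asc",
        "resultOffset": "0",
        "resultRecordCount": "2000",
        "cacheHint": "true",
    }
    # quote_via=quote encodes spaces as %20 (the default quote_plus would use '+').
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_query_url("CHINA", "NewCase"))
print(build_query_url("United States of America", "cum_conf"))
```

Country names are inserted into the where clause unescaped here, which is fine for a fixed list but would not be safe for names containing a single quote.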
Taking a few countries as examples, the code is as follows (for now the loop only covers countries whose names are a single word):
# coding: utf-8
import json
import urllib.request

import pandas as pd

# Accumulates one column of cumulative confirmed cases per country.
res = pd.DataFrame()


def Open(url):
    # Send a browser-like User-Agent and return the response body as text.
    heads = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    req = urllib.request.Request(url, headers=heads)
    response = urllib.request.urlopen(req)  # pass req, not url, so the header is actually sent
    html = response.read()
    return html.decode('utf-8')


def conserve(html, name):
    # Add this country's series to res, indexed by the raw DateOfDataEntry value.
    global res
    time, confirm = [], []
    temp = pd.DataFrame(columns=['time', name])
    for i in html['features']:
        time.append(i['attributes']['DateOfDataEntry'])
        confirm.append(i['attributes']['cum_conf'])
    temp['time'] = time
    temp[name] = confirm
    temp = temp.set_index('time')
    res = pd.concat([res, temp], axis=1)


def main():
    global res
    for name in ['China', 'Italy', 'Spain', 'France', 'Germany', 'Switzerland', 'Netherlands', 'Norway', 'Belgium', 'Sweden', 'Australia', 'Brazil', 'Egypt']:
        print(name)
        url = 'https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27' + name + '%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2Ccum_conf%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true'
        html = json.loads(Open(url))
        conserve(html, name)
        print('--------------------------------------------------------------------------')
    # The United States is handled separately because its name contains spaces.
    name = 'America'
    url = 'https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27United%20States%20of%20America%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2Ccum_conf%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true'
    html = json.loads(Open(url))
    conserve(html, name)
    # The date range must contain exactly as many days as res has rows.
    res['Datetime'] = pd.date_range(start='20200122', end='20200316')
    res.to_csv('conform.csv', encoding='utf_8_sig')


main()
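After some simple data processing, the result looks like this:
Note: if the res['Datetime'] = pd.date_range(...) line raises an error, it is because the end date is hard-coded and I wrote this on March 17. Change the end date to today's date so that the range has as many days as the downloaded data has rows.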
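One possible way to avoid editing the end date by hand is to derive the dates from the index itself, since it already holds the DateOfDataEntry timestamps. This is a sketch rather than the approach used above: it assumes the index values are Unix milliseconds, and the small res frame below is a stand-in with made-up numbers.

```python
import pandas as pd

# Stand-in for res: the index holds DateOfDataEntry in epoch milliseconds,
# and the case counts are made up for illustration.
res = pd.DataFrame(
    {'China': [1, 2, 3]},
    index=[1579651200000, 1579737600000, 1579824000000],
)

# Convert the millisecond timestamps to datetimes; no hard-coded end date needed.
res['Datetime'] = pd.to_datetime(res.index, unit='ms')
print(res)
```

With this change the column length always matches the number of rows, whatever day the script is run.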
Updated data source:
https://dashboards-dev.sprinklr.com/data/9043/global-covid19-who-gis.json