Exports the URLs of the maximum-resolution versions of all pictures on a given Tumblr blog, so they can be batch-downloaded with other tools.
Blog subdomains are read from blogname.txt, one per line. Output goes to result.txt. There is no error handling whatsoever.
First time using Eclipse…
使用Tumblr API v1: http://www.tumblr.com/docs/en/api/v1
I've long since forgotten what pictures I was even scraping…
import urllib
import re

# search for the max-size URL of a picture, which sits between
# '<photo-url max-width="1280">' and '</photo-url>'
extractpicre = re.compile(r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)', flags=re.S)

inputfile = open('blogname.txt', 'r')       # input file listing blog names (subdomains), one per line
outputfile = open('result.txt', 'w')        # output file for the extracted urls
proxy = {'http': 'http://127.0.0.1:8087'}   # proxy setting, for some reason

for blogname in inputfile:                  # actions for every blog
    baseurl = 'http://' + blogname.strip() + '.tumblr.com/api/read?type=photo&num=50&start='
    start = 0                               # start from post number zero
    while True:                             # loop for fetching pages
        url = baseurl + str(start)          # url to fetch
        print url                           # show fetching progress
        pagecontent = urllib.urlopen(url, proxies=proxy).read()  # fetched content
        pics = extractpicre.findall(pagecontent)  # find all picture urls matching the regex
        for picurl in pics:                 # loop for writing urls
            outputfile.write(picurl + '\n') # write urls to the text file (fixed: was 'n', missing backslash)
        if len(pics) < 50:                  # fewer than 50 results means this was the last page
            break                           # end the page-fetching loop
        start += 50                         # head to the next page

inputfile.close()                           # close the input file
outputfile.close()                          # close the output file
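The regex extraction can be checked in isolation, independent of the network code. Below is a minimal Python 3 sketch using the same pattern; the XML fragment is a made-up example of the shape the v1 /api/read endpoint returns, not a real response:

```python
import re

# Same lookbehind/lookahead pattern as the script above: capture everything
# between the max-width="1280" photo-url tags.
extractpicre = re.compile(
    r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)', flags=re.S)

# Hypothetical fragment in the shape of an API v1 photo post.
sample = '''<post>
  <photo-url max-width="1280">http://example.com/full.jpg</photo-url>
  <photo-url max-width="500">http://example.com/medium.jpg</photo-url>
</post>'''

# Only the 1280px URL matches, since the lookbehind pins the max-width value.
print(extractpicre.findall(sample))
```

The lazy `.+?` together with `re.S` keeps each match from running past its own closing tag even when the XML spans multiple lines.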
(Tumblr requires the name to be written with a capital T…)
This application uses the Tumblr application programming interface but is not endorsed or certified by Tumblr, Inc. All of the Tumblr logos and trademarks displayed on this application are the property of Tumblr, Inc.