Wrote a Python script to grab full-size Tumblr image URLs

Exports the maximum-resolution URLs of all the images on a given Tumblr blog, so they can be batch-downloaded with other tools.

Blog subdomains are read from blogname.txt, one per line. Output goes to result.txt. No error handling whatsoever.
First time using Eclipse…
Uses Tumblr API v1: http://www.tumblr.com/docs/en/api/v1
I've already forgotten what images I was grabbing.
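The v1 read endpoint takes the blog subdomain plus `type`, `num`, and `start` query parameters, and pages advance in steps of `num`. Building the paged URLs can be sketched like this (the blog name here is a placeholder, not a real blog):

```python
blogname = 'example'  # placeholder subdomain for illustration

# Same base URL shape as the script below: request photo posts, 50 per page.
baseurl = 'http://' + blogname + '.tumblr.com/api/read?type=photo&num=50&start='

# Page offsets advance by 50, the per-page count requested above.
urls = [baseurl + str(start) for start in (0, 50, 100)]
print(urls[0])  # first page starts at offset 0
```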

import urllib
import re
extractpicre = re.compile(r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)',flags=re.S)   #search for the maximum-size url of a picture, which starts with '<photo-url max-width="1280">' and ends with '</photo-url>'
inputfile = open('blogname.txt','r')    #input file for reading blog names (subdomains). one per line
outputfile = open('result.txt','w') #output for writing extracted urls
proxy= {'http':'http://127.0.0.1:8087'} #proxy setting for some reason
for blogname in inputfile:  #actions for every blog
    baseurl = 'http://'+blogname.strip()+'.tumblr.com/api/read?type=photo&num=50&start='    #url to start with
    start = 0   #start from num zero
    while True: #loop for fetching pages
        url = baseurl + str(start)  #url to fetch
        print url   #show fetching info
        pagecontent = urllib.urlopen(url,proxies = proxy).read()    #fetched content
        pics = extractpicre.findall(pagecontent)    #find all picture urls fit the regex
        for picurl in pics: #loop for writing urls
            outputfile.write(picurl + '\n') #write urls to text file
        if len(pics) < 50:  #figure out if this is the last page: fewer than 50 results were found
            break   #end the loop of fetching pages
        else:   #found a full page of 50 results
            start += 50 #head to the next page
inputfile.close()   #close the input file
outputfile.close()  #close the output file
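The extraction step can be exercised without touching the network; the API v1 XML fragment below is made up for illustration, but it has the `<photo-url>` shape the regex above keys on (under Python 3, `urllib.request` would also replace the `urllib.urlopen` call used for the actual fetch):

```python
import re

# Same pattern as in the script: capture the text between the
# max-width="1280" photo-url tags in the API v1 XML response.
extractpicre = re.compile(
    r'(?<=<photo-url max-width="1280">).+?(?=</photo-url>)', flags=re.S)

# A made-up fragment standing in for one page of /api/read?type=photo output.
sample_page = '''
<posts start="0" total="2">
  <post id="1" type="photo">
    <photo-url max-width="1280">http://example.com/a_1280.jpg</photo-url>
    <photo-url max-width="500">http://example.com/a_500.jpg</photo-url>
  </post>
  <post id="2" type="photo">
    <photo-url max-width="1280">http://example.com/b_1280.jpg</photo-url>
  </post>
</posts>
'''

pics = extractpicre.findall(sample_page)
print(pics)  # only the 1280px URLs match; the 500px one is skipped
```

The lookbehind/lookahead pair keeps only the URL text itself out of each match, which is why the script can write the results straight to the output file.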

(Tumblr requires that the name be written with a capital T…)
This application uses the Tumblr application programming interface but is not endorsed or certified by Tumblr, Inc. All of the Tumblr logos and trademarks displayed on this application are the property of Tumblr, Inc.
