爬虫每日篇----使用爬虫多种姿势爬取学校OnlineJudge上的题目。学校的oj里有两千多道题，到访问的时候必须连校园网，在外面就看不到了，所以我想着把上面所有的题都弄下来，然后把所有题目集合到一起做一个.chm方便用的时候看。效果如下图。

学校OJ网站的配置很简单，普通的http协议，使用get请求就可以。直接上代码，有不懂的可以联系我的邮箱。

爬取1000-1050题号之间所有的题目

import urllib
import urllib2
import string
url = 'http://210.31.181.254/JudgeOnline/showproblem?problem_id='
for id in range(1000,1050):
    result = urllib2.urlopen(url+str(id))
    f = open('test/'+str(id)+'.html','w')
    f.write(result.read())
    f.close()

import urllib

import urllib2

import string

url = 'http://210.31.181.254/JudgeOnline/showproblem?problem_id='

for id in range(1000,1050):

result = urllib2.urlopen(url+str(id))

f = open('test/'+str(id)+'.html','w')

f.write(result.read())

f.close()

## 爬取对应账号做过的所有题目

import urllib
import urllib2
import string
import re
url = 'http://210.31.181.254/JudgeOnline/showproblem?problem_id='
login_url = 'http://210.31.181.254/JudgeOnline/login?action=login'
problem_list_url = 'http://210.31.181.254/JudgeOnline/problemlist'
user_url = 'http://210.31.181.254/JudgeOnline/userstatus?user_id='
user_id = 0;
data = urllib.urlencode({
    'user_id1':'2015xxxxxx',
    'password1':'xxxxxx',
    'B1':'login'
})
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent':user_agent}
request = urllib2.Request(login_url,data,headers)
response = urllib2.urlopen(request)
print response.read()
user_id = 2015110;
result = urllib2.urlopen(user_url+str(user_id)).read()
result = re.findall('problem_id=?([\d][\d][\d][\d])',result)
for id in result:
    problem = urllib2.urlopen(url+id)
    f = open('test/'+id+'.html','w')
    f.write(problem.read())
    f.close()
# &lt;a href.*problem_id=.*&gt;
# f = open('thefile.txt','w')
# f.write(result.)
# f.close()
print result

import urllib

import urllib2

import string

import re

url = 'http://210.31.181.254/JudgeOnline/showproblem?problem_id='

login_url = 'http://210.31.181.254/JudgeOnline/login?action=login'

problem_list_url = 'http://210.31.181.254/JudgeOnline/problemlist'

user_url = 'http://210.31.181.254/JudgeOnline/userstatus?user_id='

user_id = 0;

data = urllib.urlencode({

'user_id1':'2015xxxxxx',

'password1':'xxxxxx',

'B1':'login'

})

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

headers = {'User-Agent':user_agent}

request = urllib2.Request(login_url,data,headers)

response = urllib2.urlopen(request)

print response.read()

user_id = 2015110;

result = urllib2.urlopen(user_url+str(user_id)).read()

result = re.findall('problem_id=?([\d][\d][\d][\d])',result)

for id in result:

problem = urllib2.urlopen(url+id)

f = open('test/'+id+'.html','w')

f.write(problem.read())

f.close()

# <a href.*problem_id=.*>

# f = open('thefile.txt','w')

# f.write(result.)

# f.close()

print result

使用爬虫多种姿势爬取学校OnlineJudge上的题目

学校OJ网站的配置很简单，普通的http协议，使用get请求就可以。直接上代码，有不懂的可以联系我的邮箱。

爬取1000-1050题号之间所有的题目

## 爬取对应账号做过的所有题目

下面是效果图

发表回复取消回复

学校OJ网站的配置很简单，普通的http协议，使用get请求就可以。直接上代码，有不懂的可以联系我的邮箱。

爬取1000-1050题号之间所有的题目

## 爬取对应账号做过的所有题目

下面是效果图

发表回复 取消回复

发表回复取消回复