Little Sheep Learns Programming: A Python Web Scraper Example

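I have been learning Python web scraping for a while. Today I came across some articles on a certain site. Normally I can only copy and paste them piece by piece, and sometimes the site does not allow copying at all. So I decided to use a Python scraper to download the articles and save them locally. Haha. To make it easier to study and understand, here is the code with comments. The focus is on how the modules involved are used.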
from urllib.request import urlopen
from bs4 import BeautifulSoup  # Beautiful Soup is a Python library for extracting data from HTML or XML files
import html5lib  # html5lib is a Python library for parsing HTML documents, with HTML5 support
import time  # Python's time module
import os
import requests  # requests is an HTTP client library for Python
from time import sleep  # sleep() pauses the program for a given delay

def download_novel(html):  # define a function that downloads one chapter
    bsobj = BeautifulSoup(html, 'html5lib')  # parse the HTML page with BeautifulSoup
    chapter = bsobj.find('p', {'class': 'read-content j_readcontent'})  # get the chapter content
    title = bsobj.find(attrs={'class': 'j_chaptername'})  # get the chapter title
    print(chapter.get_text())  # print the chapter content
    print(title)  # print the chapter title element
    fo = open('d:/001.txt', 'a', encoding='utf-8')  # open the file in append mode
    fo.write(chapter.get_text())  # write the chapter to the file
    fo.close()  # close() closes an open file
    bsoup = bsobj.find('a', {'id': 'j_chapternext'})  # find the link to the next chapter
    html2 = 'http:' + bsoup.get('href')  # build the URL of the next chapter
    return urlopen(html2)

html = urlopen('https://xxxxxxxxxx/chapter/5889870403237101/15810501355231395')
i = 1
while i < 10:  # number of chapters to download
    html = download_novel(html)
    i = i + 1
start = time.time()  # start time
download_novel(html)
sleep(1)  # pause the program
c = time.time() - start  # elapsed time
print('Finished saving the article, total running time: %0.2f' % c)
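The step that prepends http: to the href only works if the site returns protocol-relative links (starting with //), and it fails with an error if the last chapter has no "next" link. As a rough sketch of a more forgiving way to build the next URL, the standard library's urljoin can resolve the href against the current page address. The helper name next_chapter_url and the example.com host are made up for illustration and are not part of the original code.

```python
from urllib.parse import urljoin


def next_chapter_url(page_url, href):
    """Resolve the href of a "next chapter" link against the current page URL.

    Hypothetical helper, not part of the original post.
    Returns None when there is no link to follow.
    """
    if not href:
        return None
    # urljoin handles protocol-relative ("//host/path"), root-relative
    # ("/path") and absolute hrefs, so no manual 'http:' prefix is needed.
    return urljoin(page_url, href)


# example.com is a placeholder host, not the site from the article
page = 'https://example.com/chapter/1/100'
print(next_chapter_url(page, '//example.com/chapter/1/101'))
print(next_chapter_url(page, '/chapter/1/102'))
print(next_chapter_url(page, None))
```

With a helper like this, the download loop could stop when the result is None instead of crashing on the last chapter.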
The result of running the program is shown in the figure below:
The content is saved to the file 001.txt on drive D.
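Appending each chapter to the file can also be written with a with block, which closes the file even if the write fails. The following is only a minimal sketch: the helper name append_chapter, the UTF-8 encoding, and the temporary directory used in place of drive D are all assumptions made for the example.

```python
import os
import tempfile


def append_chapter(path, title, text):
    """Append one chapter to a text file (hypothetical helper, not from the original post)."""
    # The with block closes the file even if write() raises.
    # encoding='utf-8' is an assumption; it avoids depending on the platform default.
    with open(path, 'a', encoding='utf-8') as fo:
        fo.write(title + '\n')
        fo.write(text + '\n\n')


# Use a temporary file here instead of d:/001.txt
demo_path = os.path.join(tempfile.mkdtemp(), '001.txt')
append_chapter(demo_path, 'Chapter 1', 'First chapter text.')
append_chapter(demo_path, 'Chapter 2', 'Second chapter text.')

with open(demo_path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

Writing the title before the text keeps the chapters separated in the saved file, which the plain fo.write(chapter.get_text()) call does not do.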
If anything in this post is wrong, please point it out. Thanks!
