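For my Python course project I had to write a crawler and visualize the scraped data. I spent ages without a clue what to crawl; the usual targets, such as comments or movie listings, felt pointless. So I decided to extend my earlier post, "爬虫+fofa批量检测狮子鱼sql注入 - xiaolong's blog (xiaolong22333.top)" (batch detection of the 狮子鱼 SQL injection with a crawler and fofa): after finding a vulnerable URL, scrape the company information from 站长工具 and 爱企查. It turned out that 爱企查 has anti-crawling measures and its pages behave oddly: when a search returns several pages of results the pid can be found, but when there is only a single result or a single page, it cannot. In short, that plan was dead, and it cost me a precious afternoon...
So let's just crawl something simple: the share of posts under each tag on my own blog.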
# -*- coding: utf-8 -*-
'''
@File: blog.py
@Time: 2022/06/09 21:31:48
@Author: xiaolong
@Version: 1.0
@Link: https://xiaolong22333.top
'''
import requests
import urllib3
import re
import pymysql
import matplotlib.pyplot as plt
urllib3.disable_warnings()  # suppress HTTPS certificate warnings
def insert():
    # Walk archive ids 1..199 and store the title and keywords of every page that exists.
    for post_id in range(1, 200):
        req = requests.get(url=url.format(post_id))
        if req.status_code == 200:
            # Raw strings avoid invalid-escape warnings; "/" needs no escaping in a regex.
            title = "".join(re.findall(r'.*?<title>(.*)</title>', req.text))
            keywords = "".join(re.findall(r'.*?<meta name="keywords" content="(.*)" />', req.text))
            print(title, keywords)
            sql = "insert into blog(id,title,keywords) values (%s,%s,%s)"
            cursor.execute(sql, (post_id, title, keywords))
            conn.commit()
def select():
    keywords = ['web','misc','php','ctf','python','漏洞复现','渗透测试','java','漏洞挖掘','src']
    num = []
    for keyword in keywords:
        # Pass the LIKE pattern as a query parameter instead of formatting it into the SQL.
        sql = "select count(*) from blog where keywords like %s"
        cursor.execute(sql, ('%' + keyword + '%',))
        # count(*) comes back as a one-element tuple; keep it as an int for plt.pie.
        num.append(cursor.fetchone()[0])
    plt.rcParams['font.family'] = 'SimHei'  # a font with CJK glyphs, needed for the Chinese labels
    plt.rcParams['axes.unicode_minus'] = False
    plt.subplot(132)
    plt.title('占比图')
    plt.pie(num, labels=keywords)
    # plt.show()
    plt.savefig("1.png", dpi=1000)
if __name__ == '__main__':
    url = 'https://xiaolong22333.top/index.php/archives/{}'
    conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='root', db="test", charset='utf8')
    cursor = conn.cursor()
    insert()
    select()
    cursor.close()
    conn.close()
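The extraction and counting steps can also be tried offline, without the blog or MySQL. The following is only a rough sketch under a few assumptions: the page has a `<title>` tag and a `<meta name="keywords" ... />` tag as the patterns in the script expect, and the keywords are comma-separated. `SAMPLE_HTML`, `parse_post` and `count_tags` are hypothetical names made up for illustration, and the sample page is not real blog markup.

```python
import re
from collections import Counter

# Hypothetical sample page; the real blog markup may differ.
SAMPLE_HTML = """<html><head>
<title>Example post - blog</title>
<meta name="keywords" content="web,ctf,php" />
</head><body></body></html>"""

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)
KEYWORDS_RE = re.compile(r'<meta name="keywords" content="(.*?)"\s*/>')


def parse_post(html):
    """Return (title, [keywords]) extracted from one page of HTML."""
    title_match = TITLE_RE.search(html)
    keywords_match = KEYWORDS_RE.search(html)
    title = title_match.group(1).strip() if title_match else ""
    raw = keywords_match.group(1) if keywords_match else ""
    # Assumes a comma-separated keyword list; empty entries are dropped.
    keywords = [k.strip().lower() for k in raw.split(",") if k.strip()]
    return title, keywords


def count_tags(pages):
    """Count how many pages carry each keyword."""
    counter = Counter()
    for html in pages:
        _, keywords = parse_post(html)
        counter.update(set(keywords))  # count each tag at most once per post
    return counter


if __name__ == "__main__":
    print(parse_post(SAMPLE_HTML))
    print(count_tags([SAMPLE_HTML, SAMPLE_HTML]))
```

Note that this counts exact tags, whereas the `like` query in `select()` is a substring match, so the two can give different numbers (for example, `java` would also match `javascript` with `like`).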
The final result is shown below.
Honestly, it's pretty ugly...
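One possible way to tidy the chart up, as a sketch only: sort the slices, fold very small ones into a single bucket, print percentages on the slices, and draw on a single square axes instead of `plt.subplot(132)`. The helper names `prepare_slices` and `draw_pie`, the 3% threshold, the output file name and the demo counts are all assumptions for illustration; the numbers are dummy values, not the actual counts from the blog.

```python
def prepare_slices(counts, min_share=0.03, other_label="other"):
    """Sort tags by count and fold slices below min_share into one bucket."""
    total = sum(counts.values())
    if total == 0:
        return [], []
    labels, sizes, other = [], [], 0
    for tag, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if n / total >= min_share:
            labels.append(tag)
            sizes.append(n)
        else:
            other += n
    if other:
        labels.append(other_label)
        sizes.append(other)
    return labels, sizes


def draw_pie(counts, path="tags.png"):
    """Save a pie chart of the tag counts to path."""
    # Imported here so prepare_slices can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    labels, sizes = prepare_slices(counts)
    if not sizes:
        raise ValueError("nothing to plot: all counts are zero")
    # Assumes the SimHei font is installed; it is only needed for Chinese labels.
    plt.rcParams["font.family"] = "SimHei"
    plt.rcParams["axes.unicode_minus"] = False
    fig, ax = plt.subplots(figsize=(6, 6))  # one square axes, so the pie fills the figure
    ax.pie(sizes, labels=labels, autopct="%1.1f%%", startangle=90, counterclock=False)
    ax.set_title("Share of posts per tag")
    fig.savefig(path, dpi=200, bbox_inches="tight")
    plt.close(fig)


if __name__ == "__main__":
    # Dummy numbers for demonstration only.
    demo = {"web": 50, "ctf": 30, "php": 19, "java": 1}
    print(prepare_slices(demo))
    # draw_pie(demo)  # uncomment to write tags.png
```

A dpi of 200 is usually plenty for a blog image; `dpi=1000` mostly just makes the file larger.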