Scraping Sohu News with Python

2018-05-19

This post uses Python 2.7 with urllib2 and BeautifulSoup to scrape news headlines.

It also walks through some of BeautifulSoup's built-in methods.

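# -*- coding: utf-8 -*-
import re
import urllib2

from bs4 import BeautifulSoup

# Scrape the Sohu news front page
url = 'http://news.sohu.com/'
header_ = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'}

# Wrap the URL and headers in a Request object
res = urllib2.Request(url, headers=header_)

try:
    # Send the request and store the response in resp
    resp = urllib2.urlopen(res)
except urllib2.URLError as e:
    # Without a response there is nothing to parse, so report the error and stop
    print e
    raise

# Read the response body
info = resp.read()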
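For readers on Python 3, where urllib2 no longer exists, the request step above might look roughly like the sketch below. The same calls live in urllib.request and urllib.error. The helper names build_request and fetch and the timeout value are assumptions made for illustration; they do not come from the original script.

```python
import urllib.error
import urllib.request

URL = 'http://news.sohu.com/'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'}


def build_request(url, headers):
    """Wrap the URL and headers in a Request object (no network access yet)."""
    return urllib.request.Request(url, headers=headers)


def fetch(url, headers, timeout=10):
    """Return the response body as bytes, or None if urlopen raises URLError."""
    request = build_request(url, headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as e:
        print(e)
        return None
```

Calling fetch(URL, HEADERS) would then play the role of info in the script. Returning None on failure is one possible design; re-raising the exception is equally reasonable.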

# print(info)

# Convert the data that was just read into a BeautifulSoup object
soup = BeautifulSoup(info, 'lxml')
# print type(soup)

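# Accessing a tag as an attribute of soup returns only the first tag with that name
# print soup.title
# print soup.head
# print type(soup.script)
# print soup.a

# get() returns the value of an attribute on a tag
# print soup.a
# print soup.a.get('href')  # the href attribute of the first <a> tag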
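As a small illustration of attribute-style access and get(), here is a sketch that runs on an inline HTML string rather than the live page. The markup is made up for the example, and it uses the built-in 'html.parser' instead of 'lxml' only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration, not Sohu's real page
html = """
<html><head><title>Demo news</title></head>
<body>
<a href="/first.html">First story</a>
<a href="/second.html">Second story</a>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.a              # only the first <a> tag is returned
print(soup.title)                # the whole <title> tag
print(first_link.get('href'))    # value of the href attribute
print(first_link.get('title'))   # a missing attribute gives None, not an error
```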

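# string
# Returns the text inside a tag. It only returns content when the tag holds nothing but text, or exactly one child tag (whose own string is then returned); otherwise it returns None
# print soup.a.string
# print soup.script.string
# print soup.script

# get_text() returns all the text inside a tag, including the text of its descendants; it is the most commonly used method
# print soup.select('a')[2]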
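The difference between string and get_text() is easiest to see on a tiny example. The sketch below uses made-up markup with three paragraphs: one with plain text, one with a single child tag, and one with mixed content.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = (
    '<p>plain</p>'
    '<p><b>bold</b></p>'
    '<p>mixed <b>bold</b> text</p>'
)
soup = BeautifulSoup(html, 'html.parser')
plain, single_child, mixed = soup.find_all('p')

print(plain.string)         # text only: the text itself
print(single_child.string)  # one child tag: that child's string
print(mixed.string)         # several children: None
print(mixed.get_text())     # all text, including descendants
```

When a tag may contain nested markup, get_text() is usually the safer choice, which matches the remark above that it is the most commonly used method.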

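# find_all and select both search the tree; find_all returns every descendant node that matches the filter
# for title in soup.find_all('a'):
#     print title.get_text()

# select takes a CSS selector and returns its results as a list, which can be iterated over
# Searching by attribute: select matches a class with a leading '.', as in '.focus-news', and an id with a leading '#'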
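To make the find_all and select comparison concrete, the following sketch runs both on a short inline document. The markup and the sidebar id are invented for the example; only the focus-news class name comes from the script.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<div class="focus-news"><a href="/a.html">Story A</a><a href="/b.html">Story B</a></div>
<div id="sidebar"><a href="/c.html">Story C</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

all_links = soup.find_all('a')              # every <a> tag in the document
focus_links = soup.select('.focus-news a')  # <a> tags inside class="focus-news"
sidebar = soup.select('#sidebar')           # a list, even for a single id match

for link in focus_links:
    print(link.get_text())
```

Because select always returns a list, a lookup by id still needs an index such as sidebar[0] to reach the tag itself.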

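# Implementation
for title in soup.select('.focus-news'):
    # print title
    links = title.select('a')
    if not links:
        # Skip blocks without links so that links[0] below cannot fail
        continue
    # Print the text of every link in the block on one line
    for link in links:
        print link.get_text().strip(),
    # Then print the href of the first link in the block
    print links[0].get('href')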

# Searching by attribute with find_all: the class has to be passed as class_, because class is a reserved keyword in Python
# for title in soup.find_all(class_='focus-news'):
#     print title.get_text()
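The class_ lookup can also be wrapped in a small function that returns each headline together with its own link. This is only a sketch under stated assumptions: extract_focus_links is a hypothetical helper, the sample markup is invented, and the real Sohu page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page is more complex
SAMPLE_HTML = """
<div class="focus-news clearfix">
  <a href="/a.html"> Story A </a>
  <a href="/b.html">Story B</a>
</div>
<div class="other"><a href="/c.html">Story C</a></div>
"""


def extract_focus_links(html):
    """Return (text, href) pairs for every link inside a focus-news block."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    # class_ matches tags that carry focus-news among their classes
    for block in soup.find_all(class_='focus-news'):
        for link in block.find_all('a'):
            pairs.append((link.get_text().strip(), link.get('href')))
    return pairs


focus_pairs = extract_focus_links(SAMPLE_HTML)
print(focus_pairs)
```

Pairing each text with its own href is a design choice made for this sketch; the script above instead prints all the texts of a block followed by the first link's href.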
