python3过滤标签_怎样用正则表达式过滤掉页面中除了<p></p>和<img>以外所有的标签

Ⅰ 如何用python过滤html标签和准确的提取内容

可以参考这个实例，代码中有过滤html标签及提取内容：

Python网页爬虫入门——抓取网络贴吧内容实例
http://lovesoo.org/getting-started-python-web-crawler-to-crawl-the--post-bar-content-instance.html

Ⅱ Python3.6.3 中BeautifSoup过滤标签中的文本

直接span.string就可以取出代码里的字符串，包括中文

你在for循环那里，最后两行去掉，用print(six.string)代替就行

Ⅲ python 怎么过滤特殊字符

#coding:utf-8
defcolate(st="你要过滤的字符串",ch='你要过滤的特殊字符'):
return''.join(st.split(ch))
#如果要过滤多个特殊字符的话，可以多次调用这个函数

Ⅳ python 爬虫怎么过滤正文以外的

利用bs4查找所有的div，用正则筛选出每个div里面的中文，找到中文字数最多的div就是属于正文的div了。定义一个抓取的头部抓取网页内容：

importrequests
headers={
'User-Agent':'Mozilla/5.0(WindowsNT10.0;WOW64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/47.0.2526.106Safari/537.36',
'Host':'blog.csdn.net'}
session=requests.session()

defgetHtmlByRequests(url):
headers.update(
dict(Referer=url,Accept="*/*",Connection="keep-alive"))
htmlContent=session.get(url=url,headers=headers).content
returnhtmlContent.decode("utf-8","ignore")

统计文字的正则：

importre
#统计中文字数
defcountContent(string):
pattern=re.compile(u'[u1100-uFFFD]+?')
content=pattern.findall(string)
returncontent

查找每一个div，统计每一个div的文字，只保留文字最多的那个div：

#分析页面信息
defanalyzeHtml(html):
#初始化网页
soup=BeautifulSoup(html,"html.parser")
part=soup.select('div')
match=""
forparagraphinpart:
content=countContent(str(paragraph))
iflen(content)>len(match):
match=str(paragraph)
returnmatch

最后的调用几个函数即可：

defmain():
url="http://blog.csdn.net/"
html=getHtmlByRequests(url)
mainContent=analyzeHtml(html)
soup=BeautifulSoup(mainContent,"html.parser")
print(soup.select('div')[0].text)

Ⅳ 怎样用正则表达式过滤掉页面中除了<p></p>和<img>以外所有的标签

这个还真不容易实现，单独保留p或者img都可以，但是两个条件放一起就不行了。于专是我换属了一种思路，用了个函数实现了，你看下，代码是python下的：

importre

t='<html>asdfasdf<head>1111111111<body><p>asdfasdfasdf</p><imgherf="fff">'
defreplace_two(m):
"""
#过滤掉页面中除了<p></p>和<img>以外所有的标签
"""
all=re.findall(r'</?.*?>',m)
save=re.findall(r'</?(?:img).*?>|</?[pP]*?>',m)

foreinall:
ifenotinsave:
m1=m.replace(e,'')
m=m1
returnm

printreplace_two(t)

Ⅵ python 去除html标签的几种方法

python去除html标签的几种方法，代码如下：

#!/usr/bin/python
#-*-coding:utf-8-*-
'''
Createdon2015-07-08
@author:Administrator
'''
importre

classFilterTag():
def__init__(self):
pass
deffilterHtmlTag(self,htmlStr):
'''
过滤html中的标签
:paramhtmlStr:html字符串或是网页源码
'''
self.htmlStr=htmlStr
#先过滤CDATA
re_cdata=re.compile('//]*//]]>',re.I)#匹配CDATA
re_script=re.compile('<s*script[^>]*>[^<]*<s*/s*scripts*>',re.I)#Script
re_style=re.compile('<s*style[^>]*>[^<]*<s*/s*styles*>',re.I)#style
re_br=re.compile('')#处理换行
re_h=re.compile(']*>')#HTML标签
re_comment=re.compile('')#HTML注释
s=re_cdata.sub('',htmlStr)#去掉CDATA
s=re_script.sub('',s)#去掉SCRIPT
s=re_style.sub('',s)#去掉style
s=re_br.sub('
',s)#将br转换为换行
blank_line=re.compile('
+')#去掉多余的空行
s=blank_line.sub('
',s)
s=re_h.sub('',s)#去掉HTML标签
s=re_comment.sub('',s)#去掉HTML注释
#去掉多余的空行
blank_line=re.compile('
+')
s=blank_line.sub('
',s)
filterTag=FilterTag()
s=filterTag.replaceCharEntity(s)#替换实体
prints

defreplaceCharEntity(self,htmlStr):
'''
替换html中常用的字符实体
使用正常的字符替换html中特殊的字符实体
可以添加新的字符实体到CHAR_ENTITIES中
CHAR_ENTITIES是一个字典前面是特殊字符实体后面是其对应的正常字符
:paramhtmlStr:
'''
self.htmlStr=htmlStr
CHAR_ENTITIES={'nbsp':'','160':'',
'lt':'<','60':'<',
'gt':'>','62':'>',
'amp':'&','38':'&',
'quot':'"','34':'"',}
re_charEntity=re.compile(r'&#?(?Pw+);')
sz=re_charEntity.search(htmlStr)
whilesz:
entity=sz.group()#entity全称，如>
key=sz.group('name')#去除&;后的字符如（""--->key="nbsp"）去除&;后entity,如>为gt
try:
htmlStr=re_charEntity.sub(CHAR_ENTITIES[key],htmlStr,1)
sz=re_charEntity.search(htmlStr)
exceptKeyError:
#以空串代替
htmlStr=re_charEntity.sub('',htmlStr,1)
sz=re_charEntity.search(htmlStr)
returnhtmlStr

defreplace(self,s,re_exp,repl_string):
returnre_exp.sub(repl_string)


defstrip_tags(self,htmlStr):
'''
使用HTMLParser进行html标签过滤
:paramhtmlStr:
'''
self.htmlStr=htmlStr
htmlStr=htmlStr.strip()
htmlStr=htmlStr.strip("
")
result=[]
parser=HTMLParser()
parser.handle_data=result.append
parser.feed(htmlStr)
parser.close()
return''.join(result)

defstripTagSimple(self,htmlStr):
'''
最简单的过滤html<>标签的方法注意必须是<任意字符>而不能单纯是<>
:paramhtmlStr:
'''
self.htmlStr=htmlStr
#dr=re.compile(r'<[^>]+>',re.S)
dr=re.compile(r']*>',re.S)
htmlStr=re.sub(dr,'',htmlStr)
returnhtmlStr

if__name__=='__main__':
#s=file('Google.html').read()
filters=FilterTag()
printfilters.stripTagSimple("<1>你好")

Ⅶ python3怎样过滤字符串中的表情

importre

emoji_pattern=re.compile(
u"(ud83d[ude00-ude4f])|"#emoticons
u"(ud83c[udf00-uffff])|"#symbols&pictographs(1of2)
u"(ud83d[u0000-uddff])|"#symbols&pictographs(2of2)
u"(ud83d[ude80-udeff])|"#transport&mapsymbols
u"(ud83c[udde0-uddff])"#flags(iOS)
"+",flags=re.UNICODE)defremove_emoji(text):
returnemoji_pattern.sub(r'',text)

来自：http://blog.csdn.net/orangleliu/article/details/67632628?utm_source=gold_browser_extension

上面那个有时不好用，

try:
#pythonUCS-4build的处理方式
highpoints=re.compile(u'[U00010000-U0010ffff]')
exceptre.error:
#pythonUCS-2build的处理方式
highpoints=re.compile(u'[uD800-uDBFF][uDC00-uDFFF]')

resovle_value=highpoints.sub(u'??',src_string)

尝试一下这个。

导航:首页 > 净水问答 > python3过滤标签

python3过滤标签

与python3过滤标签相关的资料