[python] Detect webpage encoding (Big5 / UTF-8)

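This usually comes up after fetching a web page with urllib's urlopen: when we print the result, or pass it on to some framework, we get an error message like 'utf8' codec can't decode byte 0xc1 in position 0: invalid start byte (this one was returned by web.py). Frameworks are generally developed with UTF-8 as the default, so anything you process yourself should also be converted to UTF-8 before you hand it over. The approach here is to detect the page encoding with chardet. In addition to adding the following line to every .py file,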


# -*- coding: utf-8 -*-

you also need chardet. Install it first; easy_install chardet is enough (pip install chardet also works).


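import sys
import chardet

# Python 2 only: make UTF-8 the default for implicit str/unicode conversions.
if sys.getdefaultencoding() != 'utf-8':
    reload(sys)  # brings back sys.setdefaultencoding, which is removed at startup
    sys.setdefaultencoding('utf-8')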
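
The decode error quoted at the top can be reproduced without any network access, which is handy when checking a fix. The following is a minimal sketch, assuming the fetched bytes are Big5; the sample string and the helper name `try_decode` are illustrative and do not come from the original code. It is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative only: the sample text and try_decode() are assumptions.

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

sample = u"中文"                      # text as the page author wrote it
big5_bytes = sample.encode("big5")   # bytes as a Big5 page would deliver them

as_utf8 = try_decode(big5_bytes, "utf-8")   # wrong guess: decoding fails
as_big5 = try_decode(big5_bytes, "big5")    # right guess: text round-trips
```

Decoding with the wrong codec is what produces the "invalid start byte" message; decoding with the codec the page was actually written in gives the text back.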

In practice, chardet.detect returns a dict:


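import urllib2

htmltxt = urllib2.urlopen(url).read()    # raw bytes of the page; url is defined elsewhere
chardetdict = chardet.detect(htmltxt)    # dict with 'encoding' and 'confidence' keys
if chardetdict.get('encoding') == 'Big5':
    # Re-encode Big5 pages as UTF-8, dropping bytes that cannot be converted.
    htmltxt = htmltxt.decode('big5', 'ignore').encode('utf-8', 'ignore')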

This takes care of most cases where a fetched page turns into garbled text.
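
The snippet above only handles Big5. If pages in other encodings show up too, the same idea can be wrapped in a small helper that converts whatever encoding was detected. This is only a sketch: `to_utf8` and its `detect` parameter are hypothetical names, and `fake_detect` is a stand-in with made-up values so the example runs without chardet or a network connection. By default the helper falls back to `chardet.detect`.

```python
# -*- coding: utf-8 -*-
# Sketch only: to_utf8() and its `detect` parameter are hypothetical names.

def to_utf8(raw, detect=None):
    """Return `raw` (bytes) re-encoded as UTF-8, using a chardet-style detector."""
    if detect is None:
        import chardet              # imported lazily so a stub can be passed instead
        detect = chardet.detect
    encoding = detect(raw).get('encoding')   # may be None when detection fails
    if not encoding or encoding.lower() in ('utf-8', 'ascii'):
        return raw                  # already UTF-8 compatible, or unknown: leave as is
    try:
        text = raw.decode(encoding, 'ignore')
    except LookupError:             # Python has no codec under this name
        return raw
    return text.encode('utf-8')


def fake_detect(raw):
    # Stand-in for chardet.detect; the values here are made up for the demo.
    return {'encoding': 'Big5', 'confidence': 0.99}

page = u"中文".encode('big5')
converted = to_utf8(page, detect=fake_detect)
```

With real pages, calling `to_utf8(htmltxt)` would use chardet itself. Returning the bytes untouched when detection fails is a design choice made for this sketch; raising an error instead may suit stricter callers better.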

peicheng

FLASHC

Posted by peicheng on PIXNET (痞客邦)