Handling Byte Order Mark characters in tweet content in Python 3

python

#1

Hello,

I have downloaded tweets for a specific hashtag and want to run some analysis on them. As part of cleaning the data, I am trying to remove the byte order mark characters from the tweet content.

An example of a downloaded tweet in the CSV file:
b'RT @DMVFollowers: This little girl dressed as her father for Halloween, a @wmata employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via @Double0suge) https://t.co/x5QcDhVMvI'

When I read this CSV file in Python 3 with the pandas library and load it into a dataframe, the tweet is output in the format below:
b'RT @DMVFollowers: This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via @Double0suge) https://t.co/x5QcDhVMvI'

I observed that, while reading the file, an extra backslash is added in front of each of the byte order mark escape sequences. After some research I tried to decode the content with .decode('utf-8-sig'), but I got an error to the effect that str does not have a decode function.

But when I copied the tweet as is from the CSV file and used the code below to decode it:

b'RT @DMVFollowers: This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via @Double0suge) https://t.co/x5QcDhVMvI'.decode('utf-8-sig')

I got the output below:

RT @DMVFollowers: This little girl dressed as her father for Halloween, a employee 😂😂👌 (via @Double0suge) https://t.co/x5QcDhVMvI

I cannot figure out how to deal with this content: how do I make sure the BOM content is handled properly and encoded/decoded correctly in Python 3?

Please let me know if you have any questions, and I apologize if I have left out any details.

Thanks


#2

I would also like to mention that the "@wmata" content present in the first tweet was also present in all the subsequent outputs. I mistakenly removed it in the post above while copying from the console.

I apologize for any inconvenience.

Thanks


#3

Hi @nakul_31 - this looks like it’s more of a Python 3 issue. Here’s a StackOverflow post that will probably help you with this issue. Let us know if there’s anything else we can help with.
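
For reference, here is a minimal sketch of one possible approach. It assumes each CSV cell holds the text of a Python bytes literal (the characters b'...' with escape sequences such as \xf0 written out) rather than real bytes, which would explain why str has no decode method. Note also that the \xf0\x9f... sequences are UTF-8 encoded emoji, not a byte order mark; the UTF-8 BOM is the three bytes \xef\xbb\xbf. The helper name bytes_literal_to_text is hypothetical; ast.literal_eval and bytes.decode come from the standard library.

```python
import ast


def bytes_literal_to_text(value):
    """Convert the text of a bytes literal (e.g. "b'caf\\xc3\\xa9'") to a decoded str.

    Values that do not look like a bytes literal are returned unchanged.
    """
    if isinstance(value, str) and value.startswith(("b'", 'b"')):
        # literal_eval safely parses the literal back into a real bytes object
        raw = ast.literal_eval(value)
        # utf-8-sig decodes UTF-8 and drops a leading BOM if one is present
        return raw.decode("utf-8-sig")
    return value


# The tweet from the question, as it would appear as plain text in the CSV cell
sample = (
    r"b'RT @DMVFollowers: This little girl dressed as her father for Halloween, "
    r"a @wmata employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c "
    r"(via @Double0suge) https://t.co/x5QcDhVMvI'"
)

cleaned = bytes_literal_to_text(sample)
print(cleaned)
```

With pandas, the same helper could be applied to a whole column, for example df["text"] = df["text"].apply(bytes_literal_to_text), where "text" is an assumed column name. If possible, saving the tweets as decoded text at download time would avoid the round trip altogether.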