Hello,
I have downloaded tweets for a specific hashtag and want to run some analysis on them. As part of cleaning the data, I am trying to remove the byte order mark characters from the tweet content.
An example of a downloaded tweet, as it appears in the CSV file:
b'RT @DMVFollowers: This little girl dressed as her father for Halloween, a @wmata employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via @Double0suge) https://t.co/x5QcDhVMvI'
When I read this CSV file in Python 3 with the pandas library and load it into a DataFrame, the tweet comes out in the following form:
b'RT @DMVFollowers: This little girl dressed as her father for Halloween, a @wmata employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via @Double0suge) https://t.co/x5QcDhVMvI'
I observed that, when the file is read, an extra backslash is added in front of each of the byte order mark characters. I did some research and tried to decode the content with .decode('utf-8-sig'), but I got an error saying that str has no decode method (AttributeError: 'str' object has no attribute 'decode').
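To make the problem easier to reproduce, here is a minimal sketch of what I think is happening. It assumes the value pandas hands back for the cell is a plain str that only looks like a bytes literal; the variable names and the shortened tweet text are just for illustration.

```python
# Assumption: this is what one cell looks like after loading the CSV,
# i.e. an ordinary str whose text happens to start with b' and end with '.
cell = r"b'a @wmata employee \xf0\x9f\x98\x82 (via @Double0suge)'"

print(type(cell).__name__)  # the cell is a str, not a bytes object
print(repr(cell))           # repr() shows every backslash doubled

try:
    # str objects have no decode() method in Python 3, so this raises.
    cell.decode("utf-8-sig")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

If that assumption is right, the extra backslash would only be how Python displays the string, not something that was added to the data itself.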
But when I copied the tweet as is from the CSV file and used the code below to decode it:
b'RT @DMVFollowers: This little girl dressed as her father for Halloween, a @wmata employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via @Double0suge) https://t.co/x5QcDhVMvI'.decode('utf-8-sig')
I got the output below:
RT @DMVFollowers: This little girl dressed as her father for Halloween, a @wmata employee 😂😂👌 (via @Double0suge) https://t.co/x5QcDhVMvI
I cannot figure out how to deal with this content: how do I make sure the BOM content is handled properly, and therefore encoded/decoded correctly, in Python 3 once it has been loaded from the CSV?
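One approach I am considering, but have not confirmed is the right one, is to turn the text of each cell back into a real bytes object with ast.literal_eval from the standard library and decode that. The sketch below assumes every cell is the text of a valid bytes literal; decode_bytes_literal is a helper name I made up, and the tweet text is shortened.

```python
import ast


def decode_bytes_literal(cell: str) -> str:
    """Turn the text of a bytes literal (b'...') back into readable text."""
    # literal_eval safely evaluates the literal, giving a real bytes object.
    value = ast.literal_eval(cell)
    if isinstance(value, bytes):
        # utf-8-sig also drops a leading BOM if one is present.
        return value.decode("utf-8-sig")
    # Anything that was not a bytes literal is returned as text unchanged.
    return str(value)


cell = r"b'employee \xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via @Double0suge)'"
decoded = decode_bytes_literal(cell)
print(decoded)
```

On a DataFrame I would presumably apply this per cell, for example with df["text"].apply(decode_bytes_literal), where "text" is a placeholder for the real column name. Cells that are not valid Python literals would make ast.literal_eval raise ValueError or SyntaxError, so those would need separate handling.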
Please let me know if you have any questions, and I apologize if I have left out any details.
Thanks