Hi,
I have collected a large set of tweets using Academic Research access, but some of the tweets contain strange, seemingly random characters. Before collecting, I specified lang:en (English tweets only).
Example:
🇺🇦, you’, I’m not, portrayed as a “hero,�

The tweets were collected with a Python script and written to a CSV (UTF-8) file on Windows.

Is there a simple solution for this?

Thanks in advance

BR

How exactly did you collect the tweets? What code or tool did you use, and what was the command? Sometimes Windows will write things out in UTF-16 rather than UTF-8, so it may just be a matter of re-opening the file in the right encoding.
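One way to check which encoding a file was written in is to look for a byte-order mark (BOM) at the start. The following is only a minimal sketch: the function name `guess_encoding_from_bom` is made up for illustration, it only tells UTF-8 and UTF-16 apart, and it falls back to UTF-8 when there is no BOM, which is an assumption rather than a guarantee.

```python
import codecs


def guess_encoding_from_bom(raw: bytes) -> str:
    """Guess a text encoding from a leading byte-order mark.

    Hypothetical helper: returns a Python codec name and assumes
    plain UTF-8 when no BOM is present.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with BOM; this codec strips the BOM on read
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # the utf-16 codec uses the BOM to pick the byte order
    return "utf-8"  # assumption: no BOM means plain UTF-8


# Demonstration on in-memory bytes instead of a real file.
print(guess_encoding_from_bom("tweet".encode("utf-16")))
print(guess_encoding_from_bom("tweet".encode("utf-8")))
```

For a real file you would read the first few bytes in binary mode, for example `open("tweets.csv", "rb").read(4)` (the filename is a placeholder), and pass the resulting codec name to `open(..., encoding=...)` or to `pandas.read_csv(..., encoding=...)`.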

Hi, I followed this tutorial to collect my tweets: https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a. In the tutorial, the author specifies UTF-8 encoding when creating the CSV file, so could the cause be something other than the encoding? I think these random symbols are emojis: when I open my CSV file in the Jupyter Notebook dashboard, the emojis are shown in the tweet text, but when I open it in Windows, the emojis are missing and these random symbols appear instead.

Oh! OK, if they display correctly in Jupyter, then I would assume the files are saved correctly. On Windows, if you open the file in Notepad, it sometimes picks the wrong encoding. It’s worth trying to open it in something like VS Code instead.
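The symbols in the question are consistent with correct UTF-8 bytes being decoded as Windows-1252 (cp1252) by whatever program opened the file. That is an assumption about the viewer, not something confirmed in this thread. The sketch below reproduces the effect and shows one common workaround: prepending a UTF-8 byte-order mark, which many Windows programs use as a hint to decode the file as UTF-8. Both function names are made up for illustration.

```python
import codecs


def as_misread_cp1252(text: str) -> str:
    """Show how UTF-8 text looks when its bytes are decoded as Windows-1252."""
    # errors="replace" is needed because a few byte values (such as 0x9D)
    # have no character in cp1252 and would otherwise raise an error.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def add_utf8_bom(raw: bytes) -> bytes:
    """Return the same bytes with a UTF-8 BOM in front, unless already present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw
    return codecs.BOM_UTF8 + raw


# The curly apostrophe turns into the three-character sequence from the question.
print(as_misread_cp1252("I’m not"))
```

To apply this to a file, you could read it with `Path("tweets.csv").read_bytes()` and write `add_utf8_bom(...)` back with `write_bytes` to a new file (the filename is a placeholder, and this loads the whole file into memory). When writing the CSV from Python in the first place, passing `encoding="utf-8-sig"` to `open()` adds the same mark.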

As for the overall task, I highly recommend using twarc instead of coding your own implementation for retrieving tweets and converting them to CSV. Otherwise you will likely have to reinvent and implement everything twarc already solves.
