Hi there,

I am harvesting tweets using the following commands in the macOS terminal. The output tweets are mainly non-English. When I open the csv file in R, special characters and non-English characters in ‘entities.hashtags’ are converted to numeric codes and look like “\u0628\u0644.…”. Everything looks fine in the tweet ‘text’, though (it shows the other language correctly).

Any idea why this might be the case? I would assume that if the tweet ‘text’ looks fine, then ‘entities.hashtags’ should also look fine. Many other variables also look fine. I did upgrade twarc-csv as well.

Here is what I am doing in the macOS terminal:

pip install --upgrade twarc-csv

twarc2 search --archive --start-time "2018-01-01" --end-time "2018-01-15" --limit 10 "(بلوچستان) -is:retweet" tweets2018_10only.jsonl

twarc2 csv --no-inline-referenced-tweets "tweets2018_10only.jsonl" "tweets2018_10only.csv"

And then I am opening the csv file in R as follows:

data10 <- read.csv("/Users/name/tweets2018_10only.csv")

View(data10)

On a side note, I am not sure why there are 99 tweets in the output csv file, since I had --limit 10 in my command line.

Thanks for the help in advance!

twarc-csv will output exactly what the API gives you, so if there’s a way to tell R that the file is UTF-8, maybe that will fix it? I don’t know enough about how R loads CSV files to know whether it’s the right fix, but try adding fileEncoding="UTF-8", allowEscapes=T when reading the CSV.

You get 99 tweets because the default page size is 100 and when you set the limit, it will get 1 page of results minimum. If you really need to, you can also set --max-results 10 to get fewer.

Thanks for the reply, and the clarification about --limit 10. --max-results is very useful. :smiley:

I am a beginner in R as well as in twarc2, so I highly appreciate this forum. I am still having a problem with the non-English characters in hashtags.

Adding fileEncoding="UTF-8", allowEscapes=T when opening the file in R shuffled the column names. However, the following seems to keep the same format the file had without fileEncoding="UTF-8":

data10 <- read.csv("/Users/name/tweets2018_10only.csv", fileEncoding="UTF-8")

I am not sure how to get the non-English characters in hashtags to display correctly. All other columns (e.g. author.name, author.description, etc.) show the non-English characters correctly. Only entities.hashtags seems to be corrupt. Could this be a bug in twarc2?

Thanks for all the help!

“R shuffled the column names.”

That’s extremely strange. I’m not really sure why that would be. The CSVs produced are technically valid, in that they pass validators etc., so I’m not sure what the problem could be.

“Only entities.hashtags seem to be corrupt.”

I think I know why the hashtags may appear “broken”: hashtags are a JSON list. You can’t expand a list of hashtags into a list of columns in a CSV in a tidy way, so it has to be 1 column holding the whole list, not a variable number of columns.

To process this, you would have to parse the hashtags column as JSON objects and it should work, with something like the jsonlite package maybe (see its quickstart guide, json-aaquickstart). I think in R this is called “nested dataframes”.

The same applies to referenced_tweets - it is also a json list.
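As a rough illustration of that parsing step, here is a minimal sketch in Python (rather than R, simply because twarc itself is Python). It assumes each entities.hashtags cell holds a JSON list of objects with a "tag" key, and that the “\u0628…” sequences are ordinary JSON escapes; the sample rows and the helper name hashtags_from_cell are made up for the example.

```python
import csv
import io
import json


def hashtags_from_cell(cell):
    """Return the hashtag strings stored in one CSV cell (hypothetical helper)."""
    if not cell:
        return []  # tweets without hashtags have an empty cell in this sketch
    return [entry["tag"] for entry in json.loads(cell)]


# Build a tiny in-memory CSV that mimics the assumed layout of the column.
# json.dumps escapes non-ASCII characters by default, so the stored cell
# contains the literal text \u0628\u0644 rather than the Arabic letters.
cell = json.dumps([{"start": 0, "end": 3, "tag": "\u0628\u0644"}])
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "entities.hashtags"])
writer.writerow(["1", cell])
writer.writerow(["2", ""])
buffer.seek(0)

rows = list(csv.DictReader(buffer))
# Parsing each cell as JSON turns the escapes back into readable characters.
tags = [hashtags_from_cell(row["entities.hashtags"]) for row in rows]
print(tags)
```

If the assumption holds, the escapes are just how the JSON list is written inside the cell, and any JSON parser (jsonlite in R, for instance) should restore the original characters.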

Hope that helps!

Thank you, Igor, for the reply.

You are right that the hashtag column is not a normal column; it is a data frame or a list. I have been able to get the hashtags by changing the structure of the data frame in R, and by splitting the text in Stata. The end result is the same (I get multiple columns, with each element a hashtag or blank, that I can use for analysis).

The same happens with non-English hashtags; however, the word consists only of numeric codes, so it is not readable.

I will look into parsing columns as JSON objects. However, given my attempts in R and Stata, it seems that the issue is with the non-English characters and not with the data structure. I could be wrong though.

In the meantime, is it possible to get only the tweets that have hashtags (and ignore those that do not have any hashtag)? Could something like this work?

twarc2 search --archive --start-time "2018-01-01" --end-time "2018-01-15" --limit 10 "(بلوچستان) (-is:retweet) (-has:hashtags)" tweets2018_10only.jsonl

I am sorry for the very naive addition of (-has:hashtags) to the command line. I am not able to find the list of operators that I can use in my twarc search query. I did see how to include and exclude certain words and how to build the search logic, but I could not find a description of whether I can select tweets based on their having hashtags.

Thanks for all the help!

Yeah, has:hashtags is correct (see “Search Tweets - How to build a query” in the Twitter Developer Platform docs, which has the full list of valid query operators), but negating it with a -, as in -has:hashtags, will give you all tweets that do not have any hashtags.

This one should work (you can leave out the () around the has and is)

twarc2 search --archive --start-time "2018-01-01" --end-time "2018-01-15" --limit 10 "(بلوچستان) -is:retweet has:hashtags" tweets2018_10only.jsonl

This gives you (بلوچستان) tweets that are not retweets, and have at least 1 hashtag.

Thank you so much! This is super helpful.

I will try to parse the hashtags column as json objects before I convert my json file to csv – I think that’s what you meant – I hope this can help show the non-English characters in the hashtags correctly. I will update here if I have any success in this regard. :slightly_smiling_face:
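For reference, reading the hashtags straight from the JSONL file could look roughly like the Python sketch below. It assumes each line of the file is one API response page with a "data" list of tweets, and that hashtags sit under entities.hashtags with a "tag" key; the sample line and the function name hashtags_per_tweet are invented for the example.

```python
import json


def hashtags_per_tweet(jsonl_lines):
    """Map tweet id -> list of hashtag strings (hypothetical helper)."""
    result = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        page = json.loads(line)  # assumed: one response page per line
        for tweet in page.get("data", []):
            entities = tweet.get("entities", {})
            result[tweet["id"]] = [h["tag"] for h in entities.get("hashtags", [])]
    return result


# One made-up response page with two tweets, serialized the way a JSONL
# line would be (non-ASCII characters written as \uXXXX escapes).
sample_lines = [
    json.dumps({
        "data": [
            {
                "id": "1",
                "text": "example",
                "entities": {
                    "hashtags": [{"start": 0, "end": 3, "tag": "\u0628\u0644"}]
                },
            },
            {"id": "2", "text": "no hashtags here"},
        ]
    })
]

tags_by_id = hashtags_per_tweet(sample_lines)
print(tags_by_id)
```

With a real file, the same function could be given an open file handle (opened with encoding="utf-8") in place of sample_lines.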

Once again, thank you!
