Hello all,

I crawled tweets using the Twitter API v2; here is my command line:

python search_tweets.py --credential-file twitter_keys1.yaml \
  --credential-file-key search_tweets_v2 \
  --max-tweets 500000 \
  --results-per-file 10000 \
  --query "(diabetes OR diabetes OR #diabetes OR #diabetes) lang:en" \
  --expansions author_id,referenced_tweets.id,referenced_tweets.id.author_id,attachments.media_keys,entities.mentions.username,geo.place_id \
  --tweet-fields id,created_at,author_id,text,context_annotations,public_metrics,lang,conversation_id,geo \
  --user-fields description,username,verified,public_metrics,location,name,created_at \
  --place-fields country,full_name,geo,id,name,place_type \
  --filename-prefix diabetes \
  --debug

Now I have more than 100 JSON files to merge, and I have tried many methods and failed. I used JSON merge and most of the data became NaN, then I used flat-table 1.1.1 and it also gave me NaN. I looked at the first file and I couldn't find the parent node, and all the curly brackets are nested, so it would be very difficult to fix them manually. I think the files should be merged before any further processing.

References:

Attached is a sample of one of the JSON files.

Thank you

The editor is actually warning you with a little red cursor at the beginning of the 2nd line, and it is trying to tell you that something is missing before the {: a comma.

[{
	"media_key": "1234567890",
	"type": "photo"
}, {
	"media_key": "0987654321",
	"type": "photo"
}, {
	"media_key": "76545678",
	"type": "photo"
}, {
	"text": "RT @test: test",
	"id": "123"
}]

Don’t forget to add [ at the beginning and ] at the end of the file.

I highly recommend https://jsonlint.com to validate the output before doing any tests with it.

@atakan I actually don’t know how to fix them. Do you think I have to do this manually for all 100 files? What is the best way to do this? I crawled the data in parts, such as file1, file2, file3, so I’m not sure how to fix them all at one time.

I don’t know how to do that in Python, but I came across that answer on SO after a quick Google search.

@atakan Thank you, I have already tried this method but it doesn’t work for me; it requires adding files manually, and in my case more than 100 files need to be converted into one file.

In Python, you can use glob (glob — Unix style pathname pattern expansion — Python 3.7.14 documentation) to match multiple files and do something to them (convert from JSON / stick them together into a CSV with pandas, etc.)
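As a rough sketch of that idea (not tested against your files, and assuming each file is, or has been fixed into, a valid JSON array; the paths and output name are placeholders):

import json
from glob import glob

import pandas as pd

frames = []
for fname in glob('Desktop/json/*.json'):  # match every JSON file in the folder
    with open(fname) as f:
        frames.append(pd.json_normalize(json.load(f)))  # flatten nested objects into columns

# Concatenate everything into one DataFrame and write a single CSV
pd.concat(frames, ignore_index=True).to_csv('merged.csv', index=False)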

Thank you @IgorBrigadir, you are always here to help. Actually, my files are not all in one structure; they are nested and more complicated than I thought. I tried to use glob as you mentioned but I got many errors.

Here is a sample of the files that I collected (more than 500 files)

https://drive.google.com/drive/folders/1g8ImmW-AK39-4dPGoV0bZ8dlclYofd11?usp=sharing

I used this code:

import json
from glob import glob

with open('json.csv', 'w') as f:
    for fname in glob('Desktop/json/*.json'):  # reads every JSON file in Desktop/json/
        with open(fname) as j:
            f.write(str(json.load(j)))
            f.write('\n')

Error : JSONDecodeError: Extra data: line 2 column 1 (char 326)

Note: I used these fields, so I’m not sure if there is something wrong with the command-line request. I mean, if each request split the objects into many files, does that mean I can’t put these files back together?

python search_tweets.py --credential-file twitter_keys1.yaml \
  --credential-file-key search_tweets_v2 \
  --max-tweets 500000 \
  --results-per-file 1000 \
  --query "(sad) lang:en" \
  --expansions author_id,referenced_tweets.id,referenced_tweets.id.author_id,attachments.media_keys,entities.mentions.username,geo.place_id \
  --tweet-fields id,created_at,author_id,text,context_annotations,public_metrics,lang,conversation_id,geo \
  --user-fields description,username,verified,public_metrics,location,name,created_at \
  --place-fields country,full_name,geo,id,name,place_type \
  --filename-prefix sad1 \
  --debug

Thank you

Your files appear to contain one JSON object per line, so in addition to looping over each file, you’ll have to loop over each line of the file to parse each object - on a quick first look that seems to be all that’s needed. Stack Overflow is great for finding snippets of Python on how to do this - like python - Loading JSONL file as JSON objects - Stack Overflow
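Something along these lines should work as a starting point (a minimal sketch, assuming every non-empty line is a standalone JSON object; the path is a placeholder):

import json
from glob import glob

records = []
for fname in glob('Desktop/json/*.json'):
    with open(fname) as f:
        for line in f:            # one JSON object per line (JSONL)
            line = line.strip()
            if line:              # skip blank lines
                records.append(json.loads(line))

print(f'Parsed {len(records)} objects')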


Thank you for your comment @IgorBrigadir. I believe there is something wrong with some of the attributes here: https://github.com/search-tweets-python/tree/v2

One key update is handling the changes in how the search endpoint returns its data. The v2 search endpoint returns matching Tweets in a data array, along with an includes array that provides supporting objects that result from specifying expansions. These expanded objects include Users, referenced Tweets, and attached media. In addition to the data and includes arrays, the search endpoint also provides a meta object that provides the max and min Tweet IDs included in the response, along with a next_token if there is another ‘page’ of data to request.

Currently, the v2 client returns the Tweets in the data array as individual (and atomic) JSON Tweet objects. This matches the behavior of the original search client. However, after yielding the individual Tweet objects, the client outputs arrays of User, Tweet, and media objects from the includes array, followed by the meta object.

Finally, the original version of search-tweets-python used a [Tweet Parser](https://twitterdev.github.io/tweet_parser/) to help manage the differences between two different JSON formats ("original" and "Activity Stream"). With v2, there is just one version of Tweet JSON, so this Tweet Parser is not used. In the original code, this Tweet Parser was invoked with a `tweetify=True` directive. With this v2 version, the use of the Tweet Parser is turned off by instead using `tweetify=False`.
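If I understand that correctly, the Tweets and the expansion objects end up mixed in the same output file, so I would have to separate them myself, maybe with something like this rough guess (assuming one object per line and that only Tweet objects have both text and id; the filename is just an example):

import json

tweets, other = [], []
with open('diabetes_0001.json') as f:   # example filename
    for line in f:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # Guess: Tweet objects are dicts with both 'text' and 'id';
        # user/media lists and the meta object go into 'other'
        if isinstance(obj, dict) and 'text' in obj and 'id' in obj:
            tweets.append(obj)
        else:
            other.append(obj)

print(len(tweets), 'tweets,', len(other), 'other objects')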

I guess some object is causing this problem. I got an error when I used:

import pandas as pd

df_f = pd.read_json('Final.json', orient='split')
result = df_f.head(20)
print(result)

raise ValueError(f"JSON data had unexpected key(s): {bad_keys}")
ValueError: JSON data had unexpected key(s): context_annotations, name, geo, description, withheld, entities, place_type, media_key, referenced_tweets, conversation_id, author_id, type, location, text, public_metrics, id, created_at, attachments, country, username, lang, verified, full_name

raise ValueError(f"JSON data had unexpected key(s): {bad_keys}")
ValueError: JSON data had unexpected key(s): referenced_tweets, context_annotations, created_at, country, description, verified, place_type, geo, public_metrics, conversation_id, username, location, withheld, name, full_name, type, id, entities, media_key, text, author_id, lang, attachments

Should I discard all the tweets that I collected, since some of these attributes have no keys? What do you suggest?

I think it might be faster to download the tweets again, and save the exact returned json values from the API, and then figure out how best to parse them later. I’m not really sure where exactly the errors are.

@IgorBrigadir can I download 300,000 tweets into one JSON file, or do I have to split them? I think my files didn’t merge very well and that is what causes the bad key indexes.

Yes, it’s more effective to store tweets as one object per line in a file, and periodically rotate these files - like any other logs. You just need to make sure that the format you’re saving is actually JSON (and not a string representation of a dict, for example).

Here’s an example: covid19-twitter-stream-tool/stream.py at master · igorbrigadir/covid19-twitter-stream-tool · GitHub
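In short, write each tweet with json.dumps rather than str(dict), for example (the filename is just a placeholder):

import json

tweet = {'id': '123', 'text': 'RT @test: test'}

with open('tweets.jsonl', 'a') as f:
    f.write(json.dumps(tweet) + '\n')   # valid JSON: double quotes, parseable later with json.loads

# str(tweet) would instead write {'id': '123', 'text': 'RT @test: test'},
# which uses single quotes and is not valid JSON.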