I have a use case where I follow tweets from specific Twitter accounts, along with tweets that mention those accounts. I want to extract the real article URLs from these tweets (not twitter.com/status/… links). I have read the API documentation thoroughly, but one aspect still confuses me. There is ['entities']['urls']; then, when truncated is true, there is also ['extended_entities']['urls'], and both of these fields appear inside ['extended_tweet']. My understanding is that extended_entities was introduced for tweets that exceed the character limit, to hold whatever fields are in entities, because the URL in the normal entities field is most of the time just the Twitter URL, and that is what shows up in ['entities']['urls']. I don't need that Twitter URL; I need the article URL that may have been part of the tweet text.
I have a case where the actual URL is present in entities.urls within extended_tweet, with truncated: true, but the urls field in extended_entities is completely empty. Can someone give me a heads-up on how to design my algorithm to extract real article URLs based on the presence of certain fields?
For example, here is a sample tweet:
`{
"filter_level": "low",
"coordinates": null,
"favorite_count": 0,
"contributors": null,
"in_reply_to_user_id_str": "1917731",
"favorited": false,
"possibly_sensitive": false,
"is_quote_status": false,
"display_text_range": [9, 140],
"text": "@thehill You can't have it both ways #MSM! It was portrayed as a snub and disrespectful in 2010. And now, it's what… https://t.co/kfjTYFXvBV",
"created_at": "Wed Nov 29 00:03:29 +0000 2017",
"user": {
"contributors_enabled": false,
"profile_use_background_image": true,
"profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
"default_profile_image": false,
"id_str": "266851619",
"lang": "en",
"profile_background_tile": false,
"friends_count": 245,
"follow_request_sent": null,
"following": null,
"protected": false,
"profile_sidebar_fill_color": "DDEEF6",
"name": "Gary B",
"default_profile": true,
"followers_count": 20,
"profile_text_color": "333333",
"profile_sidebar_border_color": "C0DEED",
"profile_image_url": "http://pbs.twimg.com/profile_images/2202107814/DefofJ3_normal.jpg",
"profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
"favourites_count": 96,
"translator_type": "none",
"url": null,
"notifications": null,
"time_zone": "Pacific Time (US & Canada)",
"screen_name": "gwilliamb",
"profile_link_color": "1DA1F2",
"id": 266851619,
"created_at": "Tue Mar 15 22:58:05 +0000 2011",
"profile_background_color": "C0DEED",
"geo_enabled": false,
"is_translator": false,
"verified": false,
"utc_offset": -28800,
"profile_image_url_https": "https://pbs.twimg.com/profile_images/2202107814/DefofJ3_normal.jpg",
"description": null,
"location": null,
"listed_count": 2,
"statuses_count": 1228
},
"in_reply_to_status_id": 935655312807419904,
"in_reply_to_screen_name": "thehill",
"id": 935660449659367425,
"source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"in_reply_to_user_id": 1917731,
"lang": "en",
"place": null,
"retweeted": false,
"entities": {
"hashtags": [{
"text": "MSM",
"indices": [37, 41]
}],
"user_mentions": [{
"id": 1917731,
"indices": [0, 8],
"name": "The Hill",
"screen_name": "thehill",
"id_str": "1917731"
}],
"symbols": [],
"urls": [{
"expanded_url": "https://twitter.com/i/web/status/935660449659367425",
"indices": [117, 140],
"url": "https://t.co/kfjTYFXvBV",
"display_url": "twitter.com/i/web/status/9…"
}]
},
"id_str": "935660449659367425",
"truncated": true,
"quote_count": 0,
"extended_tweet": {
"entities": {
"hashtags": [{
"text": "MSM",
"indices": [37, 41]
}],
"media": [{
"id": 935660272261283840,
"expanded_url": "https://twitter.com/gwilliamb/status/935660449659367425/photo/1",
"sizes": {
"thumb": {
"w": 150,
"resize": "crop",
"h": 150
},
"medium": {
"h": 789,
"resize": "fit",
"w": 996
},
"large": {
"resize": "fit",
"w": 996,
"h": 789
},
"small": {
"resize": "fit",
"w": 680,
"h": 539
}
},
"media_url_https": "https://pbs.twimg.com/media/DPwiD26UQAAmLKt.jpg",
"url": "https://t.co/epCeGZPlb0",
"id_str": "935660272261283840",
"display_url": "pic.twitter.com/epCeGZPlb0",
"media_url": "http://pbs.twimg.com/media/DPwiD26UQAAmLKt.jpg",
"indices": [238, 261],
"type": "photo"
}],
"symbols": [],
"user_mentions": [{
"screen_name": "thehill",
"id": 1917731,
"indices": [0, 8],
"name": "The Hill",
"id_str": "1917731"
}],
"urls": [{
"expanded_url": "http://bit.ly/2hXMWuz",
"indices": [214, 237],
"display_url": "bit.ly/2hXMWuz",
"url": "https://t.co/UMwpCtNgXI"
}]
},
"display_text_range": [9, 237],
"extended_entities": {
"media": [{
"expanded_url": "https://twitter.com/gwilliamb/status/935660449659367425/photo/1",
"sizes": {
"small": {
"h": 539,
"resize": "fit",
"w": 680
},
"large": {
"resize": "fit",
"w": 996,
"h": 789
},
"medium": {
"h": 789,
"resize": "fit",
"w": 996
},
"thumb": {
"resize": "crop",
"w": 150,
"h": 150
}
},
"id": 935660272261283840,
"type": "photo",
"url": "https://t.co/epCeGZPlb0",
"media_url_https": "https://pbs.twimg.com/media/DPwiD26UQAAmLKt.jpg",
"id_str": "935660272261283840",
"display_url": "pic.twitter.com/epCeGZPlb0",
"indices": [238, 261],
"media_url": "http://pbs.twimg.com/media/DPwiD26UQAAmLKt.jpg"
}]
},
"full_text": "@thehill You can't have it both ways #MSM! It was portrayed as a snub and disrespectful in 2010. And now, it's what? Hypocrisy runs deep and rampant when journalist don't do their jobs. Provide historical context. https://t.co/UMwpCtNgXI https://t.co/epCeGZPlb0"
},
"timestamp_ms": "1511913809586",
"in_reply_to_status_id_str": "935655312807419904",
"reply_count": 0,
"geo": null,
"retweet_count": 0
}`
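To make the question concrete, the field selection I have in mind could be sketched roughly as below. This is only a sketch under my own assumptions, not a confirmed reading of the API: it assumes that when extended_tweet is present, its entities.urls describes the full text and should be preferred over the top-level entities.urls, and that any link whose host is a Twitter domain can be discarded. The names `pick_entities`, `article_urls`, and `TWITTER_HOSTS` are hypothetical helpers I made up for illustration.

```python
from urllib.parse import urlparse

# Hosts treated as Twitter-internal links. This set is an assumption,
# not an official list.
TWITTER_HOSTS = {
    "twitter.com",
    "www.twitter.com",
    "mobile.twitter.com",
    "pic.twitter.com",
}


def pick_entities(tweet):
    """Return the entities dict that should describe the full tweet text.

    Assumption: if the payload carries extended_tweet, its entities are
    the ones that match full_text; otherwise fall back to the top-level
    entities.
    """
    extended = tweet.get("extended_tweet")
    if extended and "entities" in extended:
        return extended["entities"]
    return tweet.get("entities", {})


def article_urls(tweet):
    """Collect expanded URLs that do not point back at Twitter itself."""
    urls = []
    for item in pick_entities(tweet).get("urls", []):
        # Prefer the expanded form; fall back to the t.co link if missing.
        expanded = item.get("expanded_url") or item.get("url")
        if not expanded:
            continue
        host = (urlparse(expanded).hostname or "").lower()
        if host in TWITTER_HOSTS:
            continue  # skip twitter.com/i/web/status/... style links
        urls.append(expanded)
    return urls
```

Applied to the sample tweet above, this would skip the top-level twitter.com/i/web/status/… link and return only http://bit.ly/2hXMWuz from extended_tweet.entities.urls. That link is itself a shortener, so reaching the final article URL would still need a separate redirect lookup, which the sketch does not attempt. Retweets and quoted tweets are also not handled here.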