Twitter not parsing application/xhtml+xml


#1

Twitter appears not to parse XHTML when it is served as application/xhtml+xml.

I have created two examples, using the same mark-up in both but serving one as application/xhtml+xml and the other as text/html.

The Card validator fetches both pages but only finds meta tags in the text/html version.

The application/xhtml+xml version should be easier to parse, so I am somewhat perplexed as to why the parser is failing.
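
For anyone wanting to reproduce this, the setup amounts to serving identical bytes under the two media types. A minimal sketch in Python (the file name, paths, and port are placeholders, not the actual test pages):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# The same markup file is used for both responses.
MARKUP = open("card.xhtml", "rb").read()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # /xhtml is served as application/xhtml+xml, everything else as
        # text/html; the body bytes are identical either way.
        if self.path == "/xhtml":
            ctype = "application/xhtml+xml; charset=utf-8"
        else:
            ctype = "text/html; charset=utf-8"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(MARKUP)))
        self.end_headers()
        self.wfile.write(MARKUP)

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()
```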


#2

I can confirm (over a year after this report) that this is still a problem.


#3

There are no current plans to change the Cards crawler's behaviour. This issue is extremely uncommon.


#4

Please make plans. application/xhtml+xml has been around for at least 15 years, and it is a very secure way to serve web content because it forces browsers to require well-formed markup.

Many, if not most, XSS attacks rely on malformed HTML that slips past server-side filters and that the browser then compensates for. Serving content with this MIME type removes that attack vector.
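
To illustrate the difference, here is a tiny sketch using libxml2's two parsers (via the lxml Python bindings) as a stand-in for the browser's two parsing modes; the markup is a made-up example:

```python
from lxml import etree, html

# Deliberately malformed: an unclosed tag with unquoted attributes.
broken = "<p><img src=x onerror=alert(1)"

# The tag-soup HTML parser silently repairs this and builds a DOM anyway.
repaired = html.fromstring(broken)

# The strict XML parser, used for application/xhtml+xml, refuses it.
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as err:
    print("rejected:", err)
```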

The W3C validator handles pages sent with this MIME type. Twitter really should consider fixing its crawler because, quite simply, it is broken.


#5

Thank you.


#6

I do not mean to be a pest, but all you have to do is use libxml2 to import the DOM. Then you can ask for all meta nodes that are direct children of the first head node. You can do the same with content sent as text/html; libxml2 can import that too.
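
A rough sketch of that approach, using lxml (the Python bindings to libxml2); the file names are placeholders:

```python
from lxml import etree, html

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

def meta_tags_from_xhtml(path):
    # Strict XML parse, as a crawler could do for application/xhtml+xml.
    doc = etree.parse(path)
    # meta elements that are direct children of the first head element
    metas = doc.xpath("/x:html/x:head[1]/x:meta", namespaces=XHTML_NS)
    return [(m.get("name"), m.get("content")) for m in metas]

def meta_tags_from_html(path):
    # libxml2's forgiving HTML parser handles text/html the same way.
    doc = html.parse(path)
    return [(m.get("name"), m.get("content"))
            for m in doc.xpath("/html/head[1]/meta")]

print(meta_tags_from_xhtml("page.xhtml"))
print(meta_tags_from_html("page.html"))
```

The same tree API works for both documents, so the crawler would not need a separate code path beyond choosing the parser.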


#7

I have a workaround: I detect the Twitter scraper and serve the page as text/html when it fetches it, and it all works.

I would prefer not to have to do that, but it does work.
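
In case it helps anyone else, the workaround boils down to user-agent sniffing; a rough sketch (Twitter's card crawler identifies itself with a User-Agent containing "Twitterbot"):

```python
def content_type_for(user_agent: str) -> str:
    # Send text/html to Twitter's crawler, application/xhtml+xml to everyone else.
    if user_agent and "Twitterbot" in user_agent:
        return "text/html; charset=utf-8"
    return "application/xhtml+xml; charset=utf-8"

# e.g. content_type_for("Twitterbot/1.0") -> "text/html; charset=utf-8"
```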