Useful xpaths and tips for web scraping

The internet is chaotic. Website structures follow no rules, have no reservations, and will make sure they differ from that other site that looks the same but is built with the latest JavaScript technology X, because it’s cooler and the frontend developer wanted to learn something new. End of rage.

But even in this chaos, there are some basic principles that most websites follow.
In web scraping, “most websites” is your favorite keyword, by the way. Diving into specifics now:

Xpath expressions to get title, logo and/or video

Title extraction
- //meta[@property='og:title']/@content: the best, if it exists
- //meta[@name='description']/@content: depending on the desired length, this tag also contains a good summary text
- (//*[@*[contains(.,'content')]]//h1 | //*[@*[contains(.,'content')]]//h2)[1]: stay with me! A lot of websites wrap their text in an element with “content” in one of its attributes (usually id or class), and its first header is usually a good title
- //title: game over
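
The list above is really a fallback chain: try the best expression first and move on when it returns nothing. Here is a minimal sketch of that idea, assuming the third-party lxml library is installed. The helper name `first_match` and the sample HTML are made up for illustration. The sketch tests every attribute with `@*[contains(., 'content')]`, since in XPath 1.0 `contains(@*, 'content')` only looks at the first attribute of an element.

```python
from lxml import html  # third-party, assumed installed: pip install lxml

# Candidate expressions, ordered from most to least reliable.
TITLE_XPATHS = [
    "//meta[@property='og:title']/@content",
    "//meta[@name='description']/@content",
    "(//*[@*[contains(., 'content')]]//h1"
    " | //*[@*[contains(., 'content')]]//h2)[1]",
    "//title",
]


def first_match(tree, xpaths):
    """Return the stripped text of the first non-empty XPath hit, else None."""
    for xp in xpaths:
        for hit in tree.xpath(xp):
            # Attribute hits come back as strings, element hits as elements.
            text = hit if isinstance(hit, str) else hit.text_content()
            if text.strip():
                return text.strip()
    return None


# Hypothetical page: no Open Graph tags, but a wrapper with id="main-content".
SAMPLE = """
<html><head><title>Fallback title</title></head>
<body><div class="post" id="main-content"><h2>Header title</h2></div></body></html>
"""

tree = html.fromstring(SAMPLE)
print(first_match(tree, TITLE_XPATHS))
```

On this sample the first two expressions find nothing, so the header inside the content wrapper wins over //title.
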
Logo extraction
- //meta[@property='og:image']/@content: surprise surprise!
- (//*[@*[contains(.,'content')]]//img/@src[contains(.,'jpg')])[1]: the same as above, only this time the first jpg is returned. Alternatively, use not(contains(.,'gif')) to get all non-gif images and then decide based on size or other factors
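
A rough sketch of the "collect all non-gif images, decide later" variant, again assuming lxml. The function name `image_candidates`, the sample page and the example.com URLs are invented for illustration. Image paths are often relative, so the sketch resolves them against the page URL with the standard library's `urljoin`. Picking a winner by size is left out, since that needs the actual image files.

```python
from urllib.parse import urljoin

from lxml import html  # third-party, assumed installed


def image_candidates(page_html, page_url):
    """Return absolute image URLs: og:image first, then non-gif content images."""
    tree = html.fromstring(page_html)
    og = tree.xpath("//meta[@property='og:image']/@content")
    body = tree.xpath(
        "//*[@*[contains(., 'content')]]//img/@src[not(contains(., 'gif'))]"
    )
    seen, result = set(), []
    for src in og + body:
        url = urljoin(page_url, src.strip())  # resolve relative paths
        if url not in seen:  # keep order, drop duplicates
            seen.add(url)
            result.append(url)
    return result


# Hypothetical page with one Open Graph image, one gif and one jpg.
SAMPLE = """
<html><head><meta property="og:image" content="/static/logo.png"></head>
<body><div id="content">
  <img src="spinner.gif"><img src="photos/cat.jpg">
</div></body></html>
"""

for url in image_candidates(SAMPLE, "https://example.com/blog/post"):
    print(url)
```
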
Video extraction
- //meta[@property='og:video']/@content: doesn’t exist often
- //iframe[contains(@src,'youtube.com')]/@src: YouTube embedded videos
- //iframe[contains(@src,'player.vimeo.com')]/@src: Vimeo embedded videos
- //div[@id='main']//embed[@type='application/x-shockwave-flash']/@src: for us 90s boys
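
For videos it can help to know which expression matched, not only what it returned. A small sketch, assuming lxml. The labels, the function name `find_videos` and the sample markup are made up. The Flash expression starts with `//div` here, because a path starting with `/div` only matches when the div is the root element of the document, which is never the case for an HTML page.

```python
from lxml import html  # third-party, assumed installed

# Label -> expression; the labels are arbitrary names for this sketch.
VIDEO_XPATHS = {
    "og:video": "//meta[@property='og:video']/@content",
    "youtube": "//iframe[contains(@src,'youtube.com')]/@src",
    "vimeo": "//iframe[contains(@src,'player.vimeo.com')]/@src",
    "flash": "//div[@id='main']//embed[@type='application/x-shockwave-flash']/@src",
}


def find_videos(page_html):
    """Return {label: [urls]} for every expression that matched something."""
    tree = html.fromstring(page_html)
    found = {}
    for label, xp in VIDEO_XPATHS.items():
        hits = [str(hit) for hit in tree.xpath(xp)]
        if hits:
            found[label] = hits
    return found


# Hypothetical page with a YouTube iframe and an old Flash embed.
SAMPLE = """
<html><body><div id="main">
  <iframe src="https://www.youtube.com/embed/abc123"></iframe>
  <embed type="application/x-shockwave-flash" src="intro.swf">
</div></body></html>
"""

print(find_videos(SAMPLE))
```
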

General tips

- Respect robots.txt and don’t spam people.
- Sitemaps are your friends: even if they don’t follow Google’s format, they are still full of internal links.
- XPath won’t work on the raw HTML of React, Angular, or generally any JavaScript-heavy website, because the content isn’t in the HTML the server returns. Use PhantomJS, Headless Chrome (the new kid on the block), Selenium, or something equivalent that first renders the JavaScript into HTML, and then apply XPath to the result.
- Lastly, as this has bitten me a lot in the past: always keep your web-scraping machine updated. Along with OS security and software updates, you also get updated SSL root certificates. Unless you want to swim in SSL errors.
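
The first two tips combine nicely, and Python’s standard library is enough for a rough sketch: pull every link out of a sitemap, then keep only the ones robots.txt allows. The user agent string, the function name `allowed_sitemap_links` and the sample files are invented for illustration, and a real scraper would download both files instead of using inline strings. Tags are matched by their local name, so sitemaps with an unexpected namespace (or none at all) still work.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical user agent name


def allowed_sitemap_links(robots_txt, sitemap_xml, user_agent=USER_AGENT):
    """Return the sitemap URLs that robots.txt lets user_agent fetch."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    links = [
        el.text.strip()
        for el in root.iter()
        # Compare the local tag name only, ignoring any "{namespace}" prefix.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
    return [url for url in links if rules.can_fetch(user_agent, url)]


# Hypothetical robots.txt and sitemap contents.
ROBOTS = """
User-agent: *
Disallow: /admin/
"""

SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/posts/1</loc></url>
  <url><loc>https://example.com/admin/login</loc></url>
</urlset>
"""

print(allowed_sitemap_links(ROBOTS, SITEMAP))
```
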
