Useful xpaths and tips for web scraping


The internet is chaotic. Website structure follows no rules and has no reservations, and every site makes sure it is different from that other site that looks the same but is built with the latest JavaScript technology X,
because it’s cooler and the frontend developer wanted to learn something new. End of rage.

But even in this chaos, there are some basic principles that most websites follow.
In web scraping, “most websites” is your favorite phrase, by the way :smile: Diving into specifics now:

XPath expressions to get the title, logo and/or video

  1. Title extraction
    • //meta[@property='og:title']/@content the best option if it exists
    • //meta[@name='description']/@content depending on the desired length, this tag also contains a good summary text
    • (//*[@*[contains(.,'content')]]//h1 | //*[@*[contains(.,'content')]]//h2)[1] stay with me! A lot of websites wrap their text in an element with “content” somewhere in its id or class, and its first header is usually a good title
    • //title if all else fails, game over
  2. Logo extraction
    • //meta[@property='og:image']/@content surprise surprise!
    • (//*[@*[contains(.,'content')]]//img/@src[contains(.,'jpg')])[1] the same wrapper trick as for the title, only this time the first jpg is returned. Alternatively, use not(contains(.,'gif')) to get all non-gif images and then decide based on size or other factors
  3. Video extraction
    • //meta[@property='og:video']/@content rarely exists
    • //iframe[contains(@src,'youtube.com')]/@src youtube embedded videos
    • //iframe[contains(@src,'player.vimeo.com')]/@src vimeo embedded videos
    • //div[@id='main']//embed[@type='application/x-shockwave-flash']/@src for us 90s boys
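
The lists above are ordered from most to least reliable, so in practice you try each expression in turn and keep the first hit. Here is a minimal sketch of that fallback for the title case. It assumes Python's standard-library ElementTree, which only understands a subset of XPath (no contains(), no /@attr step), so the wrapper lookup is approximated with an exact id='content' match and the markup has to be well-formed. TITLE_RULES, extract_title and SAMPLE are hypothetical names made up for this example.

```python
import xml.etree.ElementTree as ET

# Each rule: (ElementTree path, attribute to read, or None for the element text).
# Ordered from most to least reliable, mirroring the title list above.
TITLE_RULES = [
    (".//meta[@property='og:title']", "content"),
    (".//*[@id='content']//h1", None),  # assumption: wrapper has id="content"
    (".//*[@id='content']//h2", None),
    (".//title", None),
]


def extract_title(markup):
    """Return the first non-empty title candidate, or None if nothing matches."""
    root = ET.fromstring(markup)
    for path, attr in TITLE_RULES:
        node = root.find(path)
        if node is None:
            continue
        # Read the attribute if one is given, otherwise all nested text.
        value = node.get(attr) if attr else "".join(node.itertext())
        if value and value.strip():
            return value.strip()
    return None


# Hypothetical page without an og:title tag, so the wrapper rule kicks in.
SAMPLE = """<html>
  <head><title>Fallback title</title></head>
  <body><div id="content"><h1>Wrapped <em>headline</em></h1></div></body>
</html>"""

print(extract_title(SAMPLE))  # Wrapped headline
```

Real pages are rarely well-formed, so a lenient parser with full XPath 1.0 support (lxml, for example) is the more realistic choice; with it, the expressions from the lists can be used as written.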

General tips

  • Respect robots.txt and don’t spam people.
  • Sitemaps are your friends: even if they don’t follow Google’s format, they are still full of internal links.
  • XPath won’t work on React, Angular or, generally, any JavaScript-heavy website, because the markup is built in the browser. Use PhantomJS, Headless Chrome (the new kid on the block), Selenium or something equivalent to render the JavaScript into HTML first, and then run XPath on the result.
  • And lastly, as this has bitten me a lot in the past: always keep your web-scraping machine up to date. Along with the OS security and software updates, new SSL certificates get installed. Unless you want to swim in SSL errors.
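
The first two tips combine nicely: pull the links out of a sitemap, then keep only the ones robots.txt allows. A rough sketch using only the Python standard library follows. The robots.txt and sitemap contents, the example.com URLs and the "my-scraper" user agent are all made up for illustration, and both files are parsed from strings here, whereas a real scraper would download them first.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and sitemap, inlined so the example needs no network.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/private/report</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""


def sitemap_links(xml_text):
    """Collect every <loc> value, ignoring whatever namespace the sitemap uses."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def allowed_links(robots_text, links, user_agent="my-scraper"):
    """Keep only the links that robots.txt lets this user agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [link for link in links if parser.can_fetch(user_agent, link)]


links = sitemap_links(SITEMAP_XML)
print(allowed_links(ROBOTS_TXT, links))
# ['https://example.com/', 'https://example.com/blog/post-1']
```

Matching on the local tag name instead of a fixed namespace is a deliberate choice: it keeps the helper working on sitemaps that don’t follow the standard format exactly.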