The internet is chaotic. The structure of websites follows no rules and has no reservations, and every site will make sure it is different from that other site that looks the same but is built with the latest x JavaScript technology, because it’s cooler and the frontend developer wanted to learn something new. End of rage.
But even in this chaos, there are some basic principles that most websites follow.
In web scraping, “most websites” is your favorite phrase, by the way.
Diving into specifics now:
XPath expressions to get title, logo and/or video
- Title extraction
  - `//meta[@property='og:title']/@content` the best, if it exists
  - `//meta[@name='description']/@content` depending on the desired length, this tag also contains a good summary text
  - `(//*[@*[contains(.,'content')]]//h1 | //*[@*[contains(.,'content')]]//h2)[1]` stay with me! A lot of websites use an element named `content` to wrap text, and its first header must be a good title
  - `//title` game over
- Logo extraction
  - `//meta[@property='og:image']/@content` surprise surprise!
  - `(//*[@*[contains(.,'content')]]//img/@src[contains(.,'jpg')])[1]` the same as above, only this time the first jpg is returned. Alternatively, use `not(contains(., 'gif'))` to get all non-gif images and then decide based on size or other factors
- Video extraction
  - `//meta[@property='og:video']/@content` doesn’t exist often
  - `//iframe[contains(@src,'youtube.com')]/@src` YouTube embedded videos
  - `//iframe[contains(@src,'player.vimeo.com')]/@src` Vimeo embedded videos
  - `//div[@id='main']//embed[@type='application/x-shockwave-flash']/@src` for us 90s boys

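To make the fallback idea concrete, here is a minimal sketch in Python that tries each source in the priority order listed above and returns the first hit. Treat it as an illustration under a few assumptions, not a finished scraper. It uses the standard library's `xml.etree.ElementTree`, which understands only a small subset of XPath (no `contains()`, no attribute steps), so those parts are done in plain Python. The input must be well-formed markup, which real-world HTML often is not. The helper names (`meta_content`, `content_descendants`, `first_hit`) and the sample page are made up for this example, and the Flash fallback is left out for brevity.

```python
import xml.etree.ElementTree as ET


def meta_content(root, attr, value):
    """Rough equivalent of //meta[@attr='value']/@content."""
    node = root.find(f".//meta[@{attr}='{value}']")
    return node.get("content") if node is not None else None


def content_descendants(root, tags):
    """Yield descendants with a tag in `tags` of any element that has an
    attribute value containing 'content' (e.g. id="main-content")."""
    for wrapper in root.iter():
        if any("content" in v for v in wrapper.attrib.values()):
            for el in wrapper.iter():
                if el is not wrapper and el.tag in tags:
                    yield el


def first_hit(candidates):
    """Call each candidate in priority order, return the first non-empty result."""
    for candidate in candidates:
        value = candidate()
        if value and value.strip():
            return value.strip()
    return None


def extract_title(root):
    return first_hit([
        lambda: meta_content(root, "property", "og:title"),
        lambda: meta_content(root, "name", "description"),
        lambda: next((h.text for h in content_descendants(root, {"h1", "h2"})), None),
        lambda: root.findtext(".//title"),
    ])


def extract_logo(root):
    return first_hit([
        lambda: meta_content(root, "property", "og:image"),
        lambda: next((img.get("src") for img in content_descendants(root, {"img"})
                      if "jpg" in img.get("src", "")), None),
    ])


def extract_video(root):
    def iframe_src(host):
        return next((f.get("src") for f in root.iter("iframe")
                     if host in f.get("src", "")), None)

    return first_hit([
        lambda: meta_content(root, "property", "og:video"),
        lambda: iframe_src("youtube.com"),
        lambda: iframe_src("player.vimeo.com"),
    ])


# Hypothetical, well-formed sample page (no og:* tags, so fallbacks kick in).
SAMPLE = """<html>
  <head>
    <title>Fallback title</title>
    <meta name="description" content="A short summary of the page"/>
  </head>
  <body>
    <div id="main-content">
      <h2>First header inside content</h2>
      <img src="/static/spacer.gif"/>
      <img src="/static/hero.jpg"/>
    </div>
    <iframe src="https://www.youtube.com/embed/abc123"></iframe>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)
print(extract_title(root))  # A short summary of the page
print(extract_logo(root))   # /static/hero.jpg
print(extract_video(root))  # https://www.youtube.com/embed/abc123
```

Keeping the candidates as a plain ordered list makes it cheap to reorder or add fallbacks per site. With a full XPath 1.0 engine such as lxml, the expressions from the lists could be evaluated directly instead of being translated by hand.
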
General tips
- Respect robots.txt and don’t spam people.
- Sitemaps are your friends. Even if they don’t follow Google’s format, they are still full of internal links.
- XPath won’t work on React, Angular, or generally any JavaScript-heavy website. Use PhantomJS, Headless Chrome (the new kid on the block), Selenium, or something equivalent that can first render the JavaScript into HTML, and then use XPath.
- And lastly, as this has bitten me a lot in the past: always update your web-scraping machine. Along with OS security and software updates, new SSL certificates will be added. Unless you want to swim in SSL errors.
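
The first two tips combine nicely: pull every link out of a sitemap, then keep only the ones robots.txt lets you fetch. The sketch below does that with Python's standard library only. It is a rough illustration under assumptions: the function names, the `example.com` host, the `my-scraper` user agent, and both sample documents are hypothetical, and the inputs are in-memory strings rather than anything fetched over the network. Instead of relying on the `<loc>` tag, it grabs any element text that looks like a URL, so it also copes with sitemaps that ignore the usual format.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def sitemap_links(sitemap_xml, host):
    """Collect every element text that looks like a URL on `host`,
    regardless of tag names or namespaces."""
    links = []
    for el in ET.fromstring(sitemap_xml).iter():
        text = (el.text or "").strip()
        if text.startswith(("http://", "https://")) and urlparse(text).netloc == host:
            links.append(text)
    return links


def allowed_links(links, robots_txt, user_agent):
    """Keep only the links that robots.txt allows for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in links if parser.can_fetch(user_agent, url)]


# Hypothetical sample inputs.
ROBOTS = """User-agent: *
Disallow: /private/
"""

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/private/admin</loc></url>
  <url><loc>https://other.example.org/page</loc></url>
</urlset>"""

links = sitemap_links(SITEMAP, "example.com")
print(allowed_links(links, ROBOTS, "my-scraper"))
# ['https://example.com/', 'https://example.com/blog/post-1']
```

In a real crawler you would download robots.txt and the sitemap first and add a polite delay between requests. That part is deliberately left out here.
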