The internet is chaotic. The structure of websites follows no rules and has no reservations, and every site makes sure it is different from that other site that looks the same but is built with the latest JavaScript technology x, because it’s cooler and the frontend developer wanted to learn something new. End of rage.
But even in this chaos, there are some basic principles that most websites follow.
In web scraping, “most websites” is your favorite keyword, by the way.
Diving into specifics now:
XPath expressions to get the title, logo and/or video
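Title extraction
- //meta[@property='og:title']/@content
  The best option, if it exists.
- //meta[@name='description']/@content
  Depending on the desired length, this tag also contains good summary text.
- (//*[@*[contains(.,'content')]]//h1 | //*[@*[contains(.,'content')]]//h2)[1]
  Stay with me! A lot of websites wrap their text in an element named content (usually via its id or class), and the first header inside it should make a good title. The predicate is written as @*[contains(.,'content')] because in XPath 1.0 contains(@*,'content') only tests the first attribute of each element.
- //title
  Game over.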
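The fallback order above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the third-party lxml library, and the names TITLE_XPATHS, extract_title and SAMPLE are made up for illustration. The description expression is left out because it is closer to a summary than a title, and the header expression uses @*[contains(.,'content')] so that every attribute is checked (in XPath 1.0, contains(@*, ...) only looks at the first one).

```python
from lxml import html  # assumption: lxml is installed (pip install lxml)

# Candidate expressions in order of preference (hypothetical constant name).
TITLE_XPATHS = [
    "//meta[@property='og:title']/@content",
    "(//*[@*[contains(.,'content')]]//h1 | //*[@*[contains(.,'content')]]//h2)[1]",
    "//title",
]


def extract_title(page_source):
    """Return the first non-empty title found, or None if nothing matches."""
    tree = html.fromstring(page_source)
    for expression in TITLE_XPATHS:
        for node in tree.xpath(expression):
            # Attribute matches come back as strings, element matches as elements.
            text = node if isinstance(node, str) else node.text_content()
            text = " ".join(text.split())  # collapse whitespace and newlines
            if text:
                return text
    return None


# Made-up page without og:title, so the header inside "main-content" wins.
SAMPLE = """
<html><head><title>Fallback title</title></head>
<body><div class="main-content"><h1>Header title</h1></div></body></html>
"""
print(extract_title(SAMPLE))  # prints: Header title
```

Keeping the expressions in a plain list makes it easy to reorder them or add site-specific ones without touching the loop.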
Logo extraction
- //meta[@property='og:image']/@content
  Surprise, surprise!
- (//*[@*[contains(.,'content')]]//img/@src[contains(.,'jpg')])[1]
  The same idea as with the title, only this time the first jpg is returned. Alternatively, use not(contains(.,'gif')) to get all non-gif images and then decide based on size or other factors.
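As a rough illustration of the logo lookup, the sketch below collects og:image first and then every non-gif image inside a "content" element, turning relative paths into absolute URLs. It assumes the third-party lxml library. The function name, the constant, the sample markup and the example.com URL are all hypothetical. Choosing between candidates by size is left to the caller.

```python
from urllib.parse import urljoin

from lxml import html  # assumption: lxml is installed (pip install lxml)

# Expressions in order of preference (hypothetical constant name).
LOGO_XPATHS = [
    "//meta[@property='og:image']/@content",
    "//*[@*[contains(.,'content')]]//img/@src[not(contains(.,'gif'))]",
]


def extract_logo_candidates(page_source, page_url):
    """Return absolute image URLs, best candidates first, without duplicates."""
    tree = html.fromstring(page_source)
    candidates = []
    for expression in LOGO_XPATHS:
        for src in tree.xpath(expression):
            # Resolve relative paths against the URL the page was fetched from.
            absolute = urljoin(page_url, src.strip())
            if absolute not in candidates:
                candidates.append(absolute)
    return candidates


# Made-up page: one og:image, one gif that is skipped, one relative jpg.
SAMPLE_PAGE = """
<html><head><meta property="og:image" content="/static/logo.png"></head>
<body><div id="content">
  <img src="spinner.gif"><img src="photos/team.jpg">
</div></body></html>
"""
print(extract_logo_candidates(SAMPLE_PAGE, "https://example.com/about/"))
```

For this sample the og:image URL comes first and the jpg from the content block second.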
Video extraction
- //meta[@property='og:video']/@content
  Doesn’t exist often.
- //iframe[contains(@src,'youtube.com')]/@src
  YouTube embedded videos.
- //iframe[contains(@src,'player.vimeo.com')]/@src
  Vimeo embedded videos.
- //div[@id='main']//embed[@type='application/x-shockwave-flash']/@src
  For us 90s boys.
General tips
- Respect robots.txt and don’t spam people.
- Sitemaps are your friends: even if they don’t follow Google’s format, they are still full of internal links.
- XPath on the raw page source won’t work on React, Angular, or generally any JavaScript-heavy website. Use PhantomJS, Headless Chrome (the new kid on the block), Selenium, or something equivalent that can first render the JavaScript into HTML, and then use XPath.
- And lastly, as this has bitten me a lot in the past: always update your web-scraping machine. Along with the OS security and software updates, new SSL certs will be added. Unless you want to swim in SSL errors.
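The first two tips can be sketched with nothing but the Python standard library. This is a minimal sketch under stated assumptions: the user agent name, the function names and the sample robots.txt and sitemap are made up, and the sitemap helper assumes links live in <loc> elements (any namespace), which a non-standard sitemap may not honor. Fetching the two files is left out.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-scraper"  # hypothetical bot name


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def sitemap_links(sitemap_xml):
    """Collect the text of every <loc> element, whatever its namespace."""
    root = ET.fromstring(sitemap_xml)
    return [
        element.text.strip()
        for element in root.iter()
        # Tags look like "{namespace}loc"; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text
    ]


# Made-up inputs for illustration.
ROBOTS = """User-agent: *
Disallow: /private/
"""
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/private/report</loc></url>
</urlset>"""

crawlable = [u for u in sitemap_links(SITEMAP) if allowed_by_robots(ROBOTS, u)]
print(crawlable)  # prints: ['https://example.com/']
```

Filtering the sitemap links through the robots.txt check before queueing them keeps the crawler polite by construction.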