Web Crawler (2): Web Structure
Hypertext Markup Language
HTML describes the content and structure of the pages that a server sends to a browser. Together with CSS and a scripting language such as JavaScript, it makes up a website. It connects pictures, text, and other resources through URLs. An HTML document looks like a tree with many nodes: every tag written in angle brackets (<>) represents a node. A node nested inside another node is a child node, and the enclosing node is its parent node.
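To make the tree idea concrete, here is a minimal sketch that uses Python's standard `html.parser` module to record which tag is the parent of which. The class name `TreeRecorder` and the `SAMPLE` page are made up for illustration, and the sketch assumes every tag is explicitly closed (void tags such as `<br>` would need extra handling).

```python
from html.parser import HTMLParser


class TreeRecorder(HTMLParser):
    """Record (parent, child) pairs by tracking open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open, outermost first
        self.edges = []   # (parent tag, child tag) in document order

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.edges.append((parent, tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: only pop when the closing tag matches.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


# Hypothetical sample page, not taken from a real site.
SAMPLE = "<html><head><title>Demo</title></head><body><div><p>Hi</p></div></body></html>"

recorder = TreeRecorder()
recorder.feed(SAMPLE)
for parent, child in recorder.edges:
    print(f"<{child}> is a child of <{parent}>" if parent else f"<{child}> is the root")
```

Each opening tag becomes a node, and whatever tag is on top of the stack at that moment is its parent node.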
Structure
HTML creates a document by giving text a structural meaning, such as title, head, and body.
Reference: http://lamyoung.com/img/in-post/201911/2019-11-01-tree.png
Head
Tag | Meaning |
---|---|
<head> | information about the document |
<title> | title of the document |
<base> | default base URL for relative links in the page |
<link> | link to an external resource |
<meta> | metadata |
<script> | script |
<style> | style rules |
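As a rough illustration of how a crawler might read these head tags, the sketch below collects the title, the `<meta>` attributes, and the `<link>` targets with the standard `html.parser` module. The class name `HeadReader` and the sample markup are assumptions made for this example only.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collect the title text, <meta> attributes, and <link> href values."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = []    # one dict of attributes per <meta> tag
        self.links = []   # href value of each <link> tag

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            self.meta.append(dict(attrs))
        elif tag == "link":
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text only counts as the title while we are inside <title>.
        if self.in_title:
            self.title += data


# Hypothetical head section used only for this example.
HEAD = (
    '<head><meta charset="utf-8"><title>Demo page</title>'
    '<link rel="stylesheet" href="style.css"></head>'
)

reader = HeadReader()
reader.feed(HEAD)
print(reader.title)   # Demo page
print(reader.meta)    # [{'charset': 'utf-8'}]
print(reader.links)   # ['style.css']
```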
Body
Inside the angle brackets, a name such as class is an attribute, and the text after the equals sign is its value. For example, <div class="container"> has an attribute class with the value container. An element can be located through such an attribute-value pair.
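Locating elements by an attribute-value pair can be sketched with the standard `html.parser` module, which hands each tag's attributes to `handle_starttag` as a list of `(name, value)` tuples. The class `AttrFinder` and the sample body are hypothetical, and the match is on the whole attribute value, so `class="row wide"` would not count as `row` here.

```python
from html.parser import HTMLParser


class AttrFinder(HTMLParser):
    """Collect the names of all tags carrying a given attribute-value pair."""

    def __init__(self, name, value):
        super().__init__()
        self.pair = (name, value)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for this tag.
        if self.pair in attrs:
            self.matches.append(tag)


# Hypothetical body used only for this example.
BODY = (
    '<body><div class="container"><div class="row">text</div></div>'
    '<p class="row">more text</p></body>'
)

finder = AttrFinder("class", "row")
finder.feed(BODY)
print(finder.matches)  # ['div', 'p']
```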
XPATH
Every tag's location can be described with an absolute path or a relative path. XPath makes it easy to locate the element you need.
the absolute path
With the example above, if we want to select the element <div class="row">, we can write the full path from the root: /html/body/div/div[@class="row"]
the relative path
Unlike the absolute path, the relative path does not have to spell out the whole path from the topmost tag such as <html>; it only needs the part of the path around the target tag.
It is written like this: //div[@class="row"]. The most important point is that it must start with //.
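The difference between the two kinds of path can be tried out with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. Two assumptions apply to this sketch: the sample page is well-formed markup invented for the example, and because ElementTree does not accept a leading `/` on an element, the absolute path is written relative to the root `<html>` element (`./body/...`) and the `//` search as `.//`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
PAGE = """
<html>
  <body>
    <div class="container">
      <div class="row">first row</div>
    </div>
    <section>
      <div class="row">second row</div>
    </section>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Absolute style: name every step from the root down to the target.
absolute_style = root.findall("./body/div/div[@class='row']")

# Relative style: search all descendants, wherever they are nested.
relative_style = root.findall(".//div[@class='row']")

print([d.text for d in absolute_style])  # ['first row']
print([d.text for d in relative_style])  # ['first row', 'second row']
```

The absolute-style path only reaches the row that sits exactly at body/div/div, while the `//` search also finds the row nested under `<section>`. A full XPath engine such as lxml accepts the `/html/...` and `//div[...]` forms directly.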
the other expressions
expression | description |
---|---|
. | the current node |
.. | the parent node |
//@attribute | select every attribute named attribute |
* | match any element node |
@* | match any attribute node |
//title \| //price | match all title nodes and all price nodes |
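The expressions in the table can be explored with the standard `xml.etree.ElementTree` module, with one caveat: it implements only part of XPath. It handles `.`, `..`, and `*`, but not `//@attribute`, `@*`, or the `|` union, so the sketch below uses plain Python as a stand-in for those three. The `CATALOG` document and its tag names are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document used only for this example.
CATALOG = """
<bookstore>
  <book lang="en">
    <title>Learning XPath</title>
    <price>29.99</price>
  </book>
  <book lang="de">
    <title>XPath Kurz</title>
    <price>19.50</price>
  </book>
</bookstore>
"""

root = ET.fromstring(CATALOG)

# "." is the current node: start at root and take its first <book> child.
first_book = root.find("./book")

# ".." steps from each <title> up to its parent node.
parents = root.findall(".//title/..")

# "*" matches any element node: every child of every <book>.
children = root.findall("./book/*")

# Stand-in for //@lang: find elements that have the attribute, then read it.
langs = [e.get("lang") for e in root.findall(".//*[@lang]")]

# Stand-in for @*: all attributes of one element as a dict.
attrs = dict(first_book.attrib)

# Stand-in for //title | //price: two searches joined together.
# Note: a real XPath union returns nodes in document order; this does not.
both = root.findall(".//title") + root.findall(".//price")

print([p.tag for p in parents])   # ['book', 'book']
print([c.tag for c in children])  # ['title', 'price', 'title', 'price']
print(langs)                      # ['en', 'de']
print(attrs)                      # {'lang': 'en'}
print([n.text for n in both])     # ['Learning XPath', 'XPath Kurz', '29.99', '19.50']
```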