Web Crawler (2): Web Structure

2020-11-03 Yang_silin

Hypertext Markup Language

HTML is the markup language that describes the content a server sends to the browser. Together with CSS and a scripting language such as JavaScript, it makes up a web page. It links pictures, text, and other resources through URLs. An HTML document is organized like a tree with many nodes. Every tag written in angle brackets (<>) represents a node. A node nested inside another node is a child node, and the enclosing node is its parent node.
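The parent and child relationship can be made visible with a few lines of Python. This is a minimal sketch using the standard library's html.parser; the PAGE string and the TreePrinter class are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page used only for illustration.
PAGE = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div class="container">
      <div class="row"><p>Hello</p></div>
    </div>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Record (depth, tag, parent tag) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.nodes = []  # (depth, tag, parent tag or None)

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.nodes.append((len(self.stack), tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open tag (the sample page is well nested).
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


printer = TreePrinter()
printer.feed(PAGE)
for depth, tag, parent in printer.nodes:
    # Indentation shows how deep each node sits in the tree.
    print("  " * depth + f"<{tag}> (parent: {parent})")
```

Each level of indentation in the printed output corresponds to one level of nesting in the page.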

Structure

HTML builds a document by attaching structural semantics to text, such as title, head, and body.
image (source: http://lamyoung.com/img/in-post/201911/2019-11-01-tree.png)

Head

Tag	Meaning
<head>	information about the document
<title>	title of the document
<base>	default base URL for relative links in the page
<link>	link to an external resource
<meta>	metadata
<script>	script
<style>	style rules
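To see these tags in practice, the sketch below collects everything inside the head of a small page and resolves a relative link against the base URL. It uses only the Python standard library; the PAGE content, the example.com URL, and the HeadCollector class are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page; example.com is a placeholder domain.
PAGE = """<html><head>
<title>Demo page</title>
<base href="https://example.com/">
<meta charset="utf-8">
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
<style>body { margin: 0; }</style>
</head><body></body></html>"""


class HeadCollector(HTMLParser):
    """Collect the tags found inside <head> and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.title = ""
        self.tags = []  # (tag, attribute dict) for each element inside <head>

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head:
            self.tags.append((tag, dict(attrs)))
            self.in_title = tag == "title"

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


collector = HeadCollector()
collector.feed(PAGE)
head = dict(collector.tags)  # tag -> attributes (each tag appears once here)

print(collector.title)
# <base> supplies the URL that relative links such as "style.css" resolve against.
stylesheet_url = urljoin(head["base"]["href"], head["link"]["href"])
print(stylesheet_url)
```

A crawler typically needs this resolution step before it can fetch a linked resource, because the href values in a page are often relative.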

Body

Inside the angle brackets, class is an attribute name, and the text after the equals sign is its value. For example, <div class="container"> is a div element whose class attribute has the value container. An element can be located through such an attribute-value pair.
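Locating elements by an attribute-value pair can be sketched as follows, again with the standard library's html.parser. The PAGE string and the AttrFinder class are hypothetical, and the comparison is an exact match on the whole attribute value.

```python
from html.parser import HTMLParser

# Hypothetical page fragment used only for illustration.
PAGE = (
    '<body><div class="container">'
    '<div class="row">A</div><div class="row">B</div>'
    "</div></body>"
)


class AttrFinder(HTMLParser):
    """Record every tag that carries a given attribute-value pair."""

    def __init__(self, name, value):
        super().__init__()
        self.name = name
        self.value = value
        self.matches = []  # (tag, attribute dict) of each matching element

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, e.g. [("class", "row")].
        # Exact match only: class="row big" would not match value "row".
        if (self.name, self.value) in attrs:
            self.matches.append((tag, dict(attrs)))


finder = AttrFinder("class", "row")
finder.feed(PAGE)
print(finder.matches)
```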

XPath

Every tag can be located with either an absolute path or a relative path. With XPath, you can easily locate the element you need.

the absolute path

Using the example above, to select the element <div class="row">, we can write /html/body/div/div[@class="row"]. An absolute path starts with a single / at the root element <html> and names every step down to the target.
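An absolute path names every step from the root down to the target. Here is a minimal sketch with Python's standard xml.etree.ElementTree, which supports only a limited XPath subset and evaluates paths relative to an element, so the path starts with "." at the root <html> element instead of a leading /. The PAGE string is a hypothetical, well-formed document, since this parser requires valid XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; ElementTree needs valid XML.
PAGE = """<html>
  <body>
    <div class="container">
      <div class="row">first row</div>
      <div class="other">not a row</div>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Full XPath form: /html/body/div/div[@class="row"]
# ElementTree form: the same steps, written relative to <html>.
rows = root.findall("./body/div/div[@class='row']")
print([el.text for el in rows])
```

Every step has to match the real nesting of the page, so this kind of path breaks as soon as the layout gains or loses a wrapper element.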

the relative path

Unlike the absolute path, a relative path does not need to spell out the whole path from the topmost tag such as <html>. It only describes the part of the path around the target tag.
It is written like this: //div[@class="row"]. The important point is that it starts with //, which matches the element at any depth in the document.
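The difference from a step-by-step path shows up when matching elements sit at different depths. This sketch again uses the standard xml.etree.ElementTree, where // is written as ".//" from a starting element; the PAGE string is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two "row" divs at different places in the tree.
PAGE = """<html><body>
  <div class="container"><div class="row">inside container</div></div>
  <section><div class="row">inside section</div></section>
</body></html>"""

root = ET.fromstring(PAGE)

# Full XPath form: //div[@class="row"]
# ElementTree form: ".//" searches all descendants of the starting element.
anywhere = root.findall(".//div[@class='row']")
print([el.text for el in anywhere])

# A step-by-step path only finds the div under body/div.
fixed = root.findall("./body/div/div[@class='row']")
print([el.text for el in fixed])
```

The first query returns both rows, while the second returns only the one whose ancestors match the listed steps.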

the other expressions

expression	description
.	the current node
..	the parent node
//@attribute	select every attribute named attribute
*	match any element node
@*	match any attribute node
//title | //price	match all title nodes and all price nodes
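The expressions in the table can be tried on a small document. The standard xml.etree.ElementTree understands ".", "..", and "*" directly, but not attribute selection (//@attribute, @*) or the union operator |, so the sketch below emulates those three in plain Python. A full XPath engine such as lxml accepts them as written. The DOC data is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data used only for illustration.
DOC = """<bookstore>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(DOC)

# "."  the current node: the starting element itself.
current = root.find(".")

# ".." the parent node: here, the parent of every <title>.
parents = root.findall(".//title/..")

# "*"  any element node: all direct children of the root.
children = root.findall("./*")

# //@lang emulated: find elements that have the attribute, then read it.
langs = [el.get("lang") for el in root.findall(".//*[@lang]")]

# @* emulated: every attribute of one element is available as a dict.
first_book_attrs = root.find("book").attrib

# //title | //price emulated by concatenating two searches. Unlike a real
# XPath union, the result is grouped by query, not in document order.
titles_and_prices = root.findall(".//title") + root.findall(".//price")

print(current.tag, [p.get("category") for p in parents])
print([c.tag for c in children], langs, first_book_attrs)
print([el.text for el in titles_and_prices])
```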