Window functions in MySQL

2019-06-05 鲸鱼酱375

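The English code and explanations are from DataCamp.
The concept of a window is central. A window can be understood as a set of records, and a window function is a special function evaluated over the set of records that satisfies some condition. The function is evaluated once for every record, within that record's window. For some functions the window stays the same whichever record is current; this is a static window. For others, each record has its own window; such a changing window is called a sliding window.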

1. over() in window functions

1.1 Using over

The OVER() clause allows you to pass an aggregate function down a data set, similar to subqueries in SELECT. The OVER() clause offers significant benefits over subqueries in select -- namely, your queries will run faster, and the OVER() clause has a wide range of additional functions and clauses you can include with it that we will cover later on in this chapter.

function_name([expr]) over_clause

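Here over is the keyword that specifies the window over which the function runs. If the parentheses after it are empty, the window contains every row that satisfies the where condition, and the window function is computed over all of those rows. If they are not empty, the following four kinds of syntax are available for defining the window:

a. window_name: gives the window an alias. When a query uses many windows, aliases make it clearer and easier to read.
b. partition clause: the fields by which the window is grouped. The window function runs separately on each group.
c. order by clause: the fields by which the rows are sorted. The window function processes (for example, numbers) the rows in that sorted order. It can be combined with the partition clause or used on its own.
d. frame clause: a frame is a subset of the current partition. The clause defines the rule for that subset and is typically used for sliding windows.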
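The first three options can be tried out with a small sketch. It runs the SQL through Python's built-in sqlite3 module rather than MySQL, on the assumption that SQLite (3.25 or later) accepts the same OVER and WINDOW syntax as MySQL 8.0 for these clauses. The sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample table, not part of the original article
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10), ("east", 2, 20), ("east", 3, 30),
     ("west", 1, 5), ("west", 2, 15)],
)

rows = conn.execute("""
    SELECT region, day, amount,
           SUM(amount) OVER ()  AS total_all,          -- empty OVER(): all rows
           SUM(amount) OVER w   AS running_in_region,  -- named window w
           ROW_NUMBER() OVER w  AS seq
    FROM sales
    WINDOW w AS (PARTITION BY region ORDER BY day)     -- partition + order by
    ORDER BY region, day
""").fetchall()

for row in rows:
    print(row)
```

With an empty OVER() every row sees the grand total, while the named window w restarts the running sum and the numbering in each region.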

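1.2 frame: sliding windows

There are two ways to specify the extent of a sliding window: by rows and by range. They differ as follows.

1.2.1 Row-based

A row range is usually written as BETWEEN frame_start AND frame_end. frame_start and frame_end accept the following keywords, which select different sets of rows relative to the current one:

CURRENT ROW: the boundary is the current row; usually combined with the other boundary keywords
UNBOUNDED PRECEDING: the boundary is the first row of the partition
UNBOUNDED FOLLOWING: the boundary is the last row of the partition
expr PRECEDING: the boundary is the current row minus the value of expr
expr FOLLOWING: the boundary is the current row plus the value of expr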
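A minimal sketch of these boundaries, again run on SQLite through Python's sqlite3 on the assumption that its ROWS frame syntax matches MySQL 8.0. The readings table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample table: t is the ordering key, v the value
conn.execute("CREATE TABLE readings (t INTEGER, v INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)],
)

rows = conn.execute("""
    SELECT t, v,
           SUM(v) OVER (ORDER BY t
                        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS around,
           SUM(v) OVER (ORDER BY t
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS so_far,
           SUM(v) OVER (ORDER BY t
                        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining
    FROM readings
    ORDER BY t
""").fetchall()

for row in rows:
    print(row)
```

The around column sums at most three rows (the frame is cut off at the partition edges), so_far is a running total, and remaining counts down from the full total.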

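1.2.2 Range-based

Range-based frames are similar to row-based ones, but some windows cannot be expressed directly as a number of rows. For example, if the window should start at the orders placed one week earlier and end at the current row, rows cannot express it directly; a range can: INTERVAL 7 DAY PRECEDING. The 1-minute and 5-minute load averages familiar from Linux are a typical use case.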
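The difference between the two frame types shows up as soon as the ordering values have gaps. The sketch below is an approximation under stated assumptions: it uses SQLite (3.28 or later) through Python's sqlite3, which has no INTERVAL syntax, so the order date is modeled as a hypothetical integer day number and the bound is written as 7 PRECEDING. On MySQL 8.0 with a real date column the bound would be spelled INTERVAL 7 DAY PRECEDING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: day is a day number standing in for a date column
conn.execute("CREATE TABLE daily_orders (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO daily_orders VALUES (?, ?)",
    [(1, 10), (3, 20), (8, 30), (9, 40), (20, 50)],
)

rows = conn.execute("""
    SELECT day, amount,
           -- RANGE: every row whose day lies in [day - 7, day]
           SUM(amount) OVER (ORDER BY day
                             RANGE BETWEEN 7 PRECEDING AND CURRENT ROW) AS by_range,
           -- ROWS: the 7 physically preceding rows, whatever their day
           SUM(amount) OVER (ORDER BY day
                             ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS by_rows
    FROM daily_orders
    ORDER BY day
""").fetchall()

for row in rows:
    print(row)
```

For day 20 the range-based sum contains only that day's order, because no other order falls within the preceding 7 days, whereas the row-based sum still includes all earlier rows.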

Window functions include:
CUME_DIST()
DENSE_RANK()
LAG()
LEAD()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()

SELECT 
    # Select the id, country name, season, home, and away goals
    m.id, 
    c.name AS country, 
    m.season,
    m.home_goal,
    m.away_goal,
    # Use a window to include the aggregate average in each row
    AVG(m.home_goal + m.away_goal) OVER() AS overall_avg
# match is a reserved word in MySQL, so it is quoted with backticks
FROM `match` AS m
LEFT JOIN country AS c ON m.country_id = c.id;

2. rank and row_number() in window functions

2.1 rank

Window functions allow you to create a RANK of information according to any variable you want to use to sort your data. When setting this up, you will need to specify what column/calculation you want to use to calculate your rank. This is done by including an ORDER BY clause inside the OVER() clause.

SELECT 
    id,
    # RANK is a reserved word in MySQL 8.0, so the alias is goal_rank
    RANK() OVER(ORDER BY home_goal) AS goal_rank
FROM `match`;

Example:
In this exercise, you will create a data set of ranked matches according to which leagues, on average, score the most goals in a match.

SELECT 
    #Select the league name and average goals scored
    l.name AS league,
    AVG(m.home_goal + m.away_goal) AS avg_goals,
    # Rank each league according to the average goals
    RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal)) AS league_rank
FROM league AS l
LEFT JOIN `match` AS m 
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
# Order the query by the rank you created
ORDER BY league_rank;

From DataCamp

Example:
In the last exercise, the rank generated in your query was organized from smallest to largest. By adding DESC to your window function, you can create a rank sorted from largest to smallest.

SELECT 
    # Select the league name and average goals scored
    l.name AS league,
    AVG(m.home_goal + m.away_goal) AS avg_goals,
    # Rank leagues in descending order by average goals
    RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal) DESC) AS league_rank
FROM league AS l
LEFT JOIN `match` AS m 
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
# Order the query by the rank you created
ORDER BY league_rank;

2.2 row_number()

e.g.: find the top scorer in each subject

CREATE TABLE window_test
  (id int, 
  name text, 
  subject text, 
  score numeric
  );
  
INSERT INTO window_test VALUES (1,'小黄','数学',99.5), (2,'小黄','语文',89.5),(3,'小黄','英语',79.5), (4,'小黄','物理',99.5), (5,'小黄','化学',98.5), (6,'小红','数学',89.5), (7,'小红','语文',99.5), (8,'小红','英语',79.5), (9,'小红','物理',89.5), (10,'小红','化学',69.5),(11,'小绿','数学',89.5), (12,'小绿','语文',91.5), (13,'小绿','英语',92.5),(14,'小绿','物理',93.5), (15,'小绿','化学',94.5);

select * from window_test;

Conventional solution:

select b.* from
 (select subject,max(score) as score from window_test group by subject) a     
 join window_test  b on  a.score = b.score and a.subject = b.subject;

Using row_number:

select id,name,subject,score from  (select row_number() over (partition by subject order by score desc) as rn,
id,name,subject,score from window_test )t where rn=1;
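As a quick check that the two approaches agree, the sketch below loads the same window_test rows into an in-memory SQLite database through Python's sqlite3 and runs both queries. SQLite is used here as a stand-in on the assumption that it handles these two statements the same way MySQL 8.0 does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE window_test (id int, name text, subject text, score numeric)"
)
conn.executemany(
    "INSERT INTO window_test VALUES (?, ?, ?, ?)",
    [(1, "小黄", "数学", 99.5), (2, "小黄", "语文", 89.5), (3, "小黄", "英语", 79.5),
     (4, "小黄", "物理", 99.5), (5, "小黄", "化学", 98.5), (6, "小红", "数学", 89.5),
     (7, "小红", "语文", 99.5), (8, "小红", "英语", 79.5), (9, "小红", "物理", 89.5),
     (10, "小红", "化学", 69.5), (11, "小绿", "数学", 89.5), (12, "小绿", "语文", 91.5),
     (13, "小绿", "英语", 92.5), (14, "小绿", "物理", 93.5), (15, "小绿", "化学", 94.5)],
)

# Conventional solution: join back to the per-subject maximum
join_ids = sorted(r[0] for r in conn.execute("""
    SELECT b.* FROM
      (SELECT subject, MAX(score) AS score FROM window_test GROUP BY subject) a
      JOIN window_test b ON a.score = b.score AND a.subject = b.subject
"""))

# Window solution: number the rows per subject and keep the first
window_ids = sorted(r[0] for r in conn.execute("""
    SELECT id, name, subject, score FROM
      (SELECT ROW_NUMBER() OVER (PARTITION BY subject ORDER BY score DESC) AS rn,
              id, name, subject, score
       FROM window_test) t
    WHERE rn = 1
"""))

print(join_ids, window_ids)
```

The two results match on this data because no subject has a tied top score. With ties, the join returns every tied row, while rn = 1 keeps exactly one row per subject.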

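ROW_NUMBER(): sequential numbering with no ties: 1, 2, 3
RANK(): tied rows share a rank and the following rank numbers are skipped: 1, 1, 3
DENSE_RANK(): tied rows share a rank and no rank numbers are skipped: 1, 1, 2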
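The three numbering schemes can be compared side by side on a tiny table with one tie. This is a sketch on SQLite through Python's sqlite3, assuming the three functions behave as they do in MySQL 8.0. The exam table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample table with two tied scores
conn.execute("CREATE TABLE exam (score INTEGER)")
conn.executemany("INSERT INTO exam VALUES (?)", [(90,), (90,), (80,)])

rows = conn.execute("""
    SELECT score,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,
           RANK()       OVER (ORDER BY score DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS drk
    FROM exam
    ORDER BY rn
""").fetchall()

for row in rows:
    print(row)
```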

3. over with partition by in window functions

partition by: calculate separate values for different categories
perform different calculations in the same column

AVG(home_goal) OVER (PARTITION BY season)

3.1 partition by one column

DataCamp exercise:
In this exercise, you will be creating a data set of games played by Legia Warszawa (Warsaw League), the top ranked team in Poland, and comparing their individual game performance to the overall average for that season.

Where do you see more outliers? Are they in Legia Warszawa's home or away games?

SELECT
    date,
    season,
    home_goal,
    away_goal,
    CASE WHEN hometeam_id = 8673 THEN 'home' 
         ELSE 'away' END AS warsaw_location,
    # Calculate the average goals scored partitioned by season
    AVG(home_goal) OVER(PARTITION BY season) AS season_homeavg,
    AVG(away_goal) OVER(PARTITION BY season) AS season_awayavg
FROM `match`
# Filter the data set for Legia Warszawa matches only
WHERE 
    hometeam_id = 8673 
    OR awayteam_id = 8673
ORDER BY (home_goal + away_goal) DESC;

3.2 partition by multiple columns

The PARTITION BY clause can be used to break out window averages by multiple data points (columns). You can even calculate the information you want to use to partition your data! For example, you can calculate average goals scored by season and by country, or by the calendar year (taken from the date column).

In this exercise, you will calculate the average number of home and away goals scored by Legia Warszawa and their opponents, partitioned by the month in each season.

SELECT 
    date,
    season,
    home_goal,
    away_goal,
    CASE WHEN hometeam_id = 8673 THEN 'home' 
         ELSE 'away' END AS warsaw_location,
    # Calculate average goals partitioned by season and month
    AVG(home_goal) OVER(PARTITION BY season, 
            EXTRACT(month FROM date)) AS season_mo_home,
    AVG(away_goal) OVER(PARTITION BY season, 
            EXTRACT(month FROM date)) AS season_mo_away
FROM `match`
WHERE 
    hometeam_id = 8673 
    OR awayteam_id = 8673
ORDER BY (home_goal + away_goal) DESC;

4. sliding windows

perform calculations relative to the current row
can be used to calculate running totals, sums, averages
can be partitioned by one or more columns

ROWS BETWEEN <start> AND <finish>

Sliding windows allow you to create running calculations between any two points in a window using keywords such as PRECEDING, FOLLOWING, and CURRENT ROW. You can calculate running counts, sums, averages, and other aggregate functions between any two points you specify in the data set.

rows BETWEEN 1 PRECEDING AND 1 FOLLOWING: the window is the current row, the row before it and the row after it, three rows in total.
rows BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING: the window runs from the current row to the last row of the partition.
rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the window is every row of the current partition, which is the same as omitting the frame clause when the window has no order by.

In this exercise, you will expand on the examples discussed in the video, calculating the running total of goals scored by FC Utrecht when they were the home team during the 2011/2012 season. Do they score more goals at the end of the season as the home or away team?

SELECT 
    date,
    home_goal,
    away_goal,
    # Create a running total and running average of home goals
    SUM(home_goal) OVER(ORDER BY date 
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    AVG(home_goal) OVER(ORDER BY date 
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM `match`
WHERE 
    hometeam_id = 9908 
    AND season = '2011/2012';

In this exercise, you will slightly modify the query from the previous exercise by sorting the data set in reverse order and calculating a backward running total from the CURRENT ROW to the end of the data set (earliest record).

SELECT 
    # Select the date, home goal, and away goals
    date,
    home_goal,
    away_goal,
    # Create a running total and running average of home goals
    SUM(home_goal) OVER(ORDER BY date DESC
         ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total,
    AVG(home_goal) OVER(ORDER BY date DESC
         ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_avg
FROM `match`
WHERE 
    awayteam_id = 9908 
    AND season = '2011/2012';

5. Combining with CASE and CTEs

How badly did Manchester United lose in each match?

In order to determine this, let's add a window function to the main query that ranks matches by the absolute value of the difference between home_goal and away_goal. This allows us to directly compare the difference in scores without having to consider whether Manchester United played as the home or away team!

The equation is complete for you -- all you need to do is properly complete the window function!

WITH home AS (
  SELECT m.id, t.team_long_name,
      CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
           WHEN m.home_goal < m.away_goal THEN 'MU Loss' 
           ELSE 'Tie' END AS outcome
  FROM `match` AS m
  LEFT JOIN team AS t ON m.hometeam_id = t.team_api_id),
# Set up the away team CTE
away AS (
  SELECT m.id, t.team_long_name,
      CASE WHEN m.home_goal > m.away_goal THEN 'MU Loss'
           WHEN m.home_goal < m.away_goal THEN 'MU Win' 
           ELSE 'Tie' END AS outcome
  FROM `match` AS m
  LEFT JOIN team AS t ON m.awayteam_id = t.team_api_id)
# Select columns and rank the matches by goal difference
SELECT DISTINCT
    m.date,
    home.team_long_name AS home_team,
    away.team_long_name AS away_team,
    m.home_goal, m.away_goal,
    RANK() OVER(ORDER BY ABS(m.home_goal - m.away_goal) DESC) AS match_rank
# Join the CTEs onto the match table
FROM `match` AS m
LEFT JOIN home ON m.id = home.id
LEFT JOIN away ON m.id = away.id
WHERE m.season = '2014/2015'
      AND ((home.team_long_name = 'Manchester United' AND home.outcome = 'MU Loss')
      OR (away.team_long_name = 'Manchester United' AND away.outcome = 'MU Loss'));

6. Use cases

6.1 Query each user's three highest-value orders

[Figure: image not preserved]

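The bold red text in the figure above highlights the difference between the three functions. row_number() orders the two records whose amount is 800 arbitrarily, but still numbers them 1 and 2, and the following record with amount 600 continues with 3, so the numbering has no gaps. rank() and dense_rank() both give the two 800 records the number 1, but the following 600 record gets 3 from rank() and 2 from dense_rank(). In other words, rank() produces repeated numbers and may leave gaps, while dense_rank() also produces repeated numbers but never leaves gaps.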
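Since the original query is not preserved, the following is only a sketch of how such a query might look, run on SQLite through Python's sqlite3 on the assumption that the syntax matches MySQL 8.0. The order_tab table, its columns and its rows are hypothetical sample data, chosen to be consistent with the amounts mentioned in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical order table: (order_id, user_no, amount, create_date)
conn.execute(
    "CREATE TABLE order_tab (order_id INTEGER, user_no TEXT, amount INTEGER, create_date TEXT)"
)
conn.executemany("INSERT INTO order_tab VALUES (?, ?, ?, ?)", [
    (1, "001", 100, "2018-01-01"), (2, "001", 300, "2018-01-02"),
    (3, "001", 500, "2018-01-02"), (4, "001", 800, "2018-01-03"),
    (5, "001", 900, "2018-01-04"), (6, "002", 500, "2018-01-03"),
    (7, "002", 600, "2018-01-04"), (8, "002", 300, "2018-01-10"),
    (9, "002", 800, "2018-01-16"), (10, "002", 800, "2018-01-22"),
])

rows = conn.execute("""
    SELECT user_no, order_id, amount, rn, rk, drk
    FROM (
        SELECT user_no, order_id, amount,
               -- order_id breaks ties so the numbering is deterministic
               ROW_NUMBER() OVER (PARTITION BY user_no
                                  ORDER BY amount DESC, order_id) AS rn,
               RANK()       OVER (PARTITION BY user_no ORDER BY amount DESC) AS rk,
               DENSE_RANK() OVER (PARTITION BY user_no ORDER BY amount DESC) AS drk
        FROM order_tab
    ) AS t
    WHERE rn <= 3
    ORDER BY user_no, rn
""").fetchall()

for row in rows:
    print(row)
```

Filtering on rn returns exactly three orders per user. Filtering on rk or drk instead would return every order tied at a qualifying rank, which can be more than three rows.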

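Ranking usually uses rank(); when the data contains duplicate values and gaps in the numbering are unwanted, dense_rank() is the better fit.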

6.2 Query the time elapsed between the previous order and the current order.

[Figure: image not preserved]

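The inner query first uses the lag function to get the date of the previous order; the outer query then subtracts that date from the current order's date to get the interval diff.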
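A sketch of that inner/outer structure follows. It runs on SQLite through Python's sqlite3, so the date difference is computed with julianday(); on MySQL 8.0 the outer query would more likely use DATEDIFF(create_date, last_date). The order_tab table and its rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical order table: (order_id, user_no, amount, create_date)
conn.execute(
    "CREATE TABLE order_tab (order_id INTEGER, user_no TEXT, amount INTEGER, create_date TEXT)"
)
conn.executemany("INSERT INTO order_tab VALUES (?, ?, ?, ?)", [
    (1, "001", 100, "2018-01-01"), (2, "001", 300, "2018-01-02"),
    (3, "001", 500, "2018-01-02"), (4, "001", 800, "2018-01-03"),
    (5, "001", 900, "2018-01-04"), (6, "002", 500, "2018-01-03"),
    (7, "002", 600, "2018-01-04"), (8, "002", 300, "2018-01-10"),
    (9, "002", 800, "2018-01-16"), (10, "002", 800, "2018-01-22"),
])

rows = conn.execute("""
    SELECT order_id, user_no, create_date, last_date,
           -- outer query: days between this order and the previous one
           CAST(julianday(create_date) - julianday(last_date) AS INTEGER) AS diff
    FROM (
        -- inner query: fetch the previous order's date for the same user
        SELECT order_id, user_no, create_date,
               LAG(create_date) OVER (PARTITION BY user_no
                                      ORDER BY create_date, order_id) AS last_date
        FROM order_tab
    ) AS t
    ORDER BY order_id
""").fetchall()

diffs = [(r[0], r[4]) for r in rows]
print(diffs)
```

Each user's first order has no predecessor, so lag returns NULL and diff is NULL (None in Python) for that row.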

6.3 Query, as of the current order, the amounts of the first and last orders sorted by date.

[Figure: image not preserved]

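The result matches expectations. For the record with order_id 4, for example, first_amount and last_amount hold the amounts of user '001''s first order (100) and last order (800) as of 2018-01-03 00:00:00. Note that these are the earliest and latest orders by time, not the orders with the smallest and largest amounts.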
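One way such a query could be written is with first_value and last_value; this is an assumption, since the original query is not preserved. The sketch runs on SQLite through Python's sqlite3, and the order_tab table and its rows are hypothetical sample data consistent with the values quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical order table: (order_id, user_no, amount, create_date)
conn.execute(
    "CREATE TABLE order_tab (order_id INTEGER, user_no TEXT, amount INTEGER, create_date TEXT)"
)
conn.executemany("INSERT INTO order_tab VALUES (?, ?, ?, ?)", [
    (1, "001", 100, "2018-01-01"), (2, "001", 300, "2018-01-02"),
    (3, "001", 500, "2018-01-02"), (4, "001", 800, "2018-01-03"),
    (5, "001", 900, "2018-01-04"), (6, "002", 500, "2018-01-03"),
    (7, "002", 600, "2018-01-04"), (8, "002", 300, "2018-01-10"),
    (9, "002", 800, "2018-01-16"), (10, "002", 800, "2018-01-22"),
])

rows = conn.execute("""
    SELECT order_id, user_no, amount,
           FIRST_VALUE(amount) OVER w AS first_amount,
           LAST_VALUE(amount)  OVER w AS last_amount
    FROM order_tab
    -- order_id breaks date ties; the default frame ends at the current row
    WINDOW w AS (PARTITION BY user_no ORDER BY create_date, order_id)
    ORDER BY order_id
""").fetchall()

for row in rows:
    print(row)
```

Because the default frame stops at the current row, last_amount here is the amount of the latest order so far, which is the current order itself. To get the last order of the whole partition, the frame would need to extend to UNBOUNDED FOLLOWING.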

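6.4 For each user, ordered by order id, what are the cumulative order amount, average order amount, maximum order amount, minimum order amount and order count up to the current order?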

[Figure: image not preserved]
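A sketch of the running aggregates for 6.4, written with an explicit ROWS frame from the start of the partition to the current row. It runs on SQLite through Python's sqlite3 on the assumption that the syntax matches MySQL 8.0; the order_tab table and its rows are hypothetical sample data, and the original query is not preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical order table: (order_id, user_no, amount, create_date)
conn.execute(
    "CREATE TABLE order_tab (order_id INTEGER, user_no TEXT, amount INTEGER, create_date TEXT)"
)
conn.executemany("INSERT INTO order_tab VALUES (?, ?, ?, ?)", [
    (1, "001", 100, "2018-01-01"), (2, "001", 300, "2018-01-02"),
    (3, "001", 500, "2018-01-02"), (4, "001", 800, "2018-01-03"),
    (5, "001", 900, "2018-01-04"), (6, "002", 500, "2018-01-03"),
    (7, "002", 600, "2018-01-04"), (8, "002", 300, "2018-01-10"),
    (9, "002", 800, "2018-01-16"), (10, "002", 800, "2018-01-22"),
])

rows = conn.execute("""
    SELECT order_id, user_no, amount,
           SUM(amount)   OVER w AS sum_amount,
           AVG(amount)   OVER w AS avg_amount,
           MAX(amount)   OVER w AS max_amount,
           MIN(amount)   OVER w AS min_amount,
           COUNT(amount) OVER w AS order_count
    FROM order_tab
    WINDOW w AS (PARTITION BY user_no ORDER BY order_id
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    ORDER BY order_id
""").fetchall()

for row in rows:
    print(row)
```

All five aggregates share the named window w, so the frame definition is written once and every column is computed over the same rows.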

reference:
https://dbaplus.cn/news-11-2258-1.html

https://yq.aliyun.com/articles/593591
