# 1 亿条淘宝用户行为数据分析

## 1. 部署环境

环境采用本地Winodws+VMware虚拟化部署基于Ambari的Hadoop集群。

集群配置如下

|           | namenode \* 1      | datanode \* 3      |
| --------- | ------------------ | ------------------ |
| i5-12450H | 2 core \* 1 thread | 2 core \* 1 thread |
| Memory    | 12GB               | 12GB               |
| Disk      | 150GB              | 100GB              |
| 额外软件包     | openJDK8、Mysql5.7  | openJDK8           |

集群安装的hadoop生态组件如下

|                |           |
| -------------- | --------- |
| HDFS           | 3.1.1.3.1 |
| YARN           | 3.1.1     |
| MapReduce2     | 3.1.1     |
| Tez            | 0.9.1     |
| Hive           | 3.1.0     |
| HBase          | 2.0.2     |
| ZooKeeper      | 3.4.6     |
| Ambari Metrics | 0.1.0     |

{% hint style="info" %}
**服务器调优**

关于YARN和总内存、总vcore和单个容器的内存、vocre分配需要自行选择合适的参数，尽量不浪费资源。
{% endhint %}

## 2. 数据集下载

{% embed url="<https://pan.baidu.com/s/1CPD5jpmvOUvg1LETAVETGw>" %}
提取码：m4mc
{% endembed %}

这是一份来自淘宝的用户行为数据，时间区间为 2017-11-25 到 2017-12-03，总计 100,150,807 条记录，大小为 3.5 G，包含 5 个字段。

<table><thead><tr><th width="309"></th><th></th></tr></thead><tbody><tr><td>用户ID（User ID）</td><td>一个整数</td></tr><tr><td>商品ID（Item ID）</td><td>一个整数</td></tr><tr><td>类别ID（Category ID）</td><td>一个整数</td></tr><tr><td>行为类型（Behavior Type）</td><td>“pv”（浏览）、“buy”（购买）、“cart”（加入购物车）、“fav”（收藏）</td></tr><tr><td>时间戳（Timestamp）</td><td>一个整数，通常以秒为单位。</td></tr></tbody></table>

## 3. 数据处理和表优化

### 2.1 数据导入

`beeline -n hive -p`进入hql命令行

创建一个临时表，并加载csv数据文件加载到其中。

```sql
CREATE TEMPORARY TABLE temp_user_behavior (
`user_id` string comment 'user ID',
`item_id` string comment 'item ID',
`category_id` string comment 'category ID',
`behavior_type` string  comment 'behavior type among pv, buy, cart, fav',
`timestamp` int comment 'timestamp')
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/root/Hive/UserBehavior.csv' OVERWRITE INTO TABLE temp_user_behavior ;
```

创建用户行为表1。

```sql
drop table if exists user_behavior1;
create table user_behavior1 (
`user_id` string comment 'user ID',
`item_id` string comment 'item ID',
`category_id` string comment 'category ID',
`timestamp` timestamp comment 'timestamp'
)
PARTITIONED BY (`date` date, `behavior_type` string comment 'behavior type among pv, buy, cart, fav')
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
```

{% hint style="info" %}
这里使用以列优先的存储格式，定义压缩算法为snappy，对于像电商分析这样主要查询列的项目，会提高很多效率。同时对日期date进行分区，以及用户行为behavior\_type进行分区是一种合理的分区方法，在后续分析过程中将大大提高查询速度。
{% endhint %}

将数据导入到ORC表中，hive会自动执行 行列转化

```sql
INSERT OVERWRITE TABLE user_behavior1 PARTITION (`date`, behavior_type)
SELECT
  user_id,
  item_id,
  category_id,
  from_unixtime(`timestamp`) AS `timestamp`,
  date(from_unixtime(`timestamp`)) AS `date`,
  behavior_type
FROM
  temp_user_behavior;
```

查看一共多少条数据。

```sql
select count(*) from user_behavior1;
+------------+
|    _c0     |
+------------+
| 100150807  |
+------------+
```

### 2.2 数据清洗

<pre class="language-sql"><code class="lang-sql"><strong>-- 查看时间-数据分布情况，是否有异常值
</strong>select `date`, COUNT(*) from user_behavior1 group by `date` order by `date`;

<strong>-- 删除不在2017-11-25 到 2017-12-03日期的数据
</strong>alter table user_behavior1 
drop IF EXISTS partition (`date`&#x3C;'2017-11-25'), partition (`date`>'2017-12-03');

-- 再次查看时间是否有异常值
select `date`, COUNT(*) from user_behavior1 group by `date` order by `date`;

+-------------+-----------+
|    date     |    _c1    |
+-------------+-----------+
| 2017-11-25  | 10511605  |
| 2017-11-26  | 10571046  |
| 2017-11-27  | 10013457  |
| 2017-11-28  | 9884189   |
| 2017-11-29  | 10319066  |
| 2017-11-30  | 10541698  |
| 2017-12-01  | 11171515  |
| 2017-12-02  | 13940949  |
| 2017-12-03  | 11961008  |
+-------------+-----------+
</code></pre>

```sql
--查看 behavior_type 是否有异常值
select behavior_type, COUNT(*) from user_behavior1 group by behavior_type;

+----------------+-----------+
| behavior_type  |    _c1    |
+----------------+-----------+
| cart           | 5466118   |
| pv             | 88596903  |
| buy            | 1998976   |
| fav            | 2852536   |
+----------------+-----------+
```

```sql
-- 去掉完全重复的数据
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE user_behavior1
PARTITION (`date`, `behavior_type`)
SELECT DISTINCT `user_id`, `item_id`, `category_id`, `timestamp`, `date`, `behavior_type`
FROM user_behavior1
DISTRIBUTE BY `date`, `behavior_type`;

-- 查看目前多少条
select count(*) from user_behavior1;
+-----------+
|    _c0    |
+-----------+
| 98914484  |
+-----------+
```

{% hint style="info" %}
DISTRIBUTE BY `date`, `behavior_type`这个是用来指定数据分发的策略，它会根据分区键的值将数据分发到不同的reduce任务中，每个reduce任务只处理一个分区的数据。这样就可以在每个分区内部去重，而不需要到全局数据去比较，所以效率高很多。在hdfs上，表按照 `date`, `behavior_type`分区后，分区的文件夹数量= `date`分区数\* `behavior_type`分区数。当DISTRIBUTE BY `date`, `behavior_type`;时，可以理解为是在`date`分区数\* `behavior_type`分区数 这么多个局部中比较去重。

同时如果DISTRIBUTE BY `date`, `behavior_type`粒度划分的太细，导致启动的容器太多，计算时间占比较低，可以选择只DISTRIBUTE BY一个。
{% endhint %}

## 3.数据分析可视化

### 3.1 基于时间的用户行为分析

#### 3.1.1 总访问量PV，总用户量UV

<pre class="language-sql"><code class="lang-sql">--总访问量PV，总用户量UV
create table res_pv_uv
comment "page views and unique visitor"
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
as
select pv, uv
from (
    select count(*) as pv from user_behavior1 where behavior_type='pv'
) t1
join(
<strong>    select count(distinct user_id) as uv from user_behavior1
</strong><strong>) t2
</strong><strong>on 1=1;
</strong>
select * from res_pv_uv;
+-----------+---------+
|    pv     |   uv    |
+-----------+---------+
| 88596886  | 987984  |
+-----------+---------+
</code></pre>

{% hint style="info" %}
思考以下select语句的优劣和正确性，和上面的语句谁能更好的发挥分区表的优势

{% code lineNumbers="true" %}

```sql
select sum(pv) as pv, count(distinct uv) as uv from (
    select count(*) as pv, distinct user_id as uv from user_behavior1 where behavior_type='pv'
    union all
    select 0 as pv, distinct user_id as uv from user_behavior1 where behavior_type='buy'
    union all
    select 0 as pv, distinct user_id as uv from user_behavior1 where behavior_type='cart'
    union all
    select 0 as pv, distinct user_id as uv from user_behavior1 where behavior_type='fav'
) t;

```

{% endcode %}

{% code lineNumbers="true" %}

```sql
select sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,
       count(distinct user_id) as uv
from user_behavior1;
```

{% endcode %}
{% endhint %}

#### 3.1.2 日均访问量，日均用户量

<pre class="language-sql"><code class="lang-sql">--日均访问量，日均用户量
<strong>create table res_pv_uv_per_day
</strong>comment "page views and unique visitor each day"
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
as
select t1.`date` as `date`, pv, uv
from(
    select `date`, count(*) as pv from user_behavior1 group by `date` order by `date`
) t1
join(
    select `date`, count(distinct user_id) as uv from user_behavior1 group by `date` order by `date`
) t2
on t1.`date`=t2.`date`
order by `date`;

+--------------------------+------------------------+------------------------+
| res_pv_uv_per_day1.date  | res_pv_uv_per_day1.pv  | res_pv_uv_per_day1.uv  |
+--------------------------+------------------------+------------------------+
| 2017-11-25               | 10511597               | 705571                 |
| 2017-11-26               | 10571039               | 713522                 |
| 2017-11-27               | 10013455               | 709207                 |
| 2017-11-28               | 9884185                | 708339                 |
| 2017-11-29               | 10319060               | 719356                 |
| 2017-11-30               | 10541695               | 730809                 |
| 2017-12-01               | 11171505               | 753166                 |
| 2017-12-02               | 13940942               | 941709                 |
| 2017-12-03               | 11961006               | 917531                 |
+--------------------------+------------------------+------------------------+
</code></pre>

#### 3.1.3 一天的活跃时段分布

```sql
-- 一天的活跃时段分布
create table res_behavior_among_day
comment "page views and unique visitor each day"
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
as
select t.hour as hour,
       collect_list(map(t.behavior_type, t.count)) as tc
from(
SELECT hour(`timestamp`) AS hour, 
       behavior_type, 
       COUNT(*) AS count
FROM user_behavior1
GROUP BY behavior_type, hour(`timestamp`)
) t
group by hour
order by hour;
 
 +---------+----------------------------------------------------+
| hour    |                        tc                          |
+---------+----------------------------------------------------+
| 0       | [{"buy":64916},{"cart":192036},{"fav":103721},{"pv":3042342}] |
| 1       | [{"buy":96134},{"cart":229890},{"fav":127976},{"pv":3728498}] |
| 2       | [{"buy":127932},{"cart":266963},{"fav":147752},{"pv":4334810}] |
| 3       | [{"buy":122046},{"cart":260831},{"fav":145412},{"pv":4213518}] |
| 4       | [{"buy":118591},{"cart":255811},{"fav":140862},{"pv":4255794}] |
| 5       | [{"buy":123426},{"cart":279829},{"fav":150844},{"pv":4653933}] |
| 6       | [{"buy":122171},{"cart":277093},{"fav":148561},{"pv":4642054}] |
| 7       | [{"buy":122728},{"cart":284269},{"fav":151321},{"pv":4806704}] |
| 8       | [{"buy":116444},{"cart":279035},{"fav":148722},{"pv":4607743}] |
| 9       | [{"buy":101300},{"cart":255342},{"fav":137631},{"pv":4203395}] |
| 10      | [{"buy":95907},{"cart":253193},{"fav":133262},{"pv":4313516}] |
| 11      | [{"buy":115032},{"cart":314774},{"fav":161057},{"pv":5430878}] |
| 12      | [{"buy":133859},{"cart":393209},{"fav":191406},{"pv":6586331}] |
| 13      | [{"buy":145431},{"cart":465924},{"fav":219974},{"pv":7538382}] |
| 14      | [{"buy":138263},{"cart":486249},{"fav":232222},{"pv":7443069}] |
| 15      | [{"buy":100070},{"cart":395920},{"fav":195330},{"pv":5599901}] |
| 16      | [{"buy":52422},{"cart":164776},{"fav":94930},{"pv":2747149}] |
| 17      | [{"buy":20948},{"cart":76954},{"fav":46239},{"pv":1278813}] |
| 18      | [{"buy":10748},{"cart":41541},{"fav":25079},{"pv":692240}] |
| 19      | [{"buy":7212},{"cart":29333},{"fav":16791},{"pv":471981}] |
| 20      | [{"buy":6044},{"cart":25564},{"fav":13260},{"pv":403765}] |
| 21      | [{"buy":7351},{"cart":33462},{"fav":17158},{"pv":522063}] |
| 22      | [{"buy":16251},{"cart":73014},{"fav":36388},{"pv":1097628}] |
| 23      | [{"buy":33718},{"cart":131106},{"fav":66638},{"pv":1982379}] |
+---------+----------------------------------------------------+
```

{% hint style="info" %}
比较以下语句

{% code lineNumbers="true" %}

```sql
select hour(`timestamp`) as hour,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   --点击数
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  --收藏数
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  --加购物车数
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  --购买数
from user_behavior1
group by hour(`timestamp`)
order by hour;
```

{% endcode %}
{% endhint %}

#### 3.1.4 一周用户的活跃分布

```sql
--一周用户的活跃分布
create table res_behavior_among_week
comment "page views and unique visitor each day"
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
as
select t1.weekday as weekday ,
       collect_list(map(t1.behavior_type ,
       (case when weekday=6 then ceil(t1.count/2)
              when weekday=7 then ceil(t1.count/2)
              else t1.count end)
       )) as ct
from(
       select pmod(datediff(t.d, '2017-11-25')+5, 7)+1 as weekday,
              t.behavior_type as behavior_type ,
              sum(t.count) as count
       from (
              select `date` as d,
                     behavior_type,
                     COUNT(*) as count
              from user_behavior1
              group by behavior_type , `date`
       ) t
       group by pmod(datediff(t.d, '2017-11-25')+5, 7)+1, t.behavior_type
       order by weekday
) t1
group by t1.weekday
order by weekday;

+----------+----------------------------------------------------+
| weekday  |                         tc                         |
+----------+----------------------------------------------------+
| 1        | [{"buy":218401},{"cart":539212},{"fav":289413},{"pv":8966429}] |
| 2        | [{"buy":211754},{"cart":533807},{"fav":289431},{"pv":8849193}] |
| 3        | [{"buy":223077},{"cart":554747},{"fav":299588},{"pv":9241648}] |
| 4        | [{"buy":222235},{"cart":573032},{"fav":304428},{"pv":9442000}] |
| 5        | [{"buy":212849},{"cart":642251},{"fav":314121},{"pv":10002284}] |
| 6        | [{"buy":230424},{"cart":685302},{"fav":355318},{"pv":10955227}] |
| 7        | [{"buy":224891},{"cart":626233},{"fav":322460},{"pv":10092439}] |
+----------+----------------------------------------------------+
```

{% hint style="info" %}
比较以下语句

{% code lineNumbers="true" %}

```sql
select pmod(datediff(`date`, '2017-11-25') + 5, 7)+1 as weekday,
       sum(case when behavior_type = 'pv' then 1 else 0 end) as pv,   --点击数
       sum(case when behavior_type = 'fav' then 1 else 0 end) as fav,  --收藏数
       sum(case when behavior_type = 'cart' then 1 else 0 end) as cart,  --加购物车数
       sum(case when behavior_type = 'buy' then 1 else 0 end) as buy  --购买数
from user_behavior1
group by pmod(datediff(`date`, '2017-11-25')+5, 7)+1
order by weekday;
```

{% endcode %}

{% code lineNumbers="true" %}

```sql
select t.weekday as weekday,
       collect_list(map(t.behavior_type, 
              (case when weekday=6 then ceil(t.count/2)
              when weekday=7 then ceil(t.count/2)
              else t.count end)
       )) as tc
from (
select pmod(datediff(`date`, '2017-11-25')+5, 7)+1 as weekday,
       behavior_type,
       COUNT(*) as count
from user_behavior1
group by behavior_type , pmod(datediff(`date`, '2017-11-25')+5, 7)+1
) t
group by weekday
order by weekday;
```

{% endcode %}
{% endhint %}

### 3.2 用户行为转换率

```sql
--点击/(加购物车+收藏)/购买 , 各环节转化率
create table res_conversion_rate
comment "page views and unique visitor each day"
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
as
select pv, fav, cart, fav_cart, buy,
       round(fav_cart/pv, 4) as pv2favcart,
       round(buy/fav_cart, 4) as favcart2buy,
       round(buy/pv, 4) as pv2buy
from(
       select cast(gm['pv'] as int) as pv,
              cast(gm['fav'] as int) as fav,
              cast(gm['cart'] as int) as cart,
              cast(gm['fav']+gm['cart'] as int) as fav_cart,
              cast(gm['buy'] as int) as buy
       from(
              select collect(cast(behavior_type as string), cast(count as string)) as gm
              from(
                     select behavior_type,
                            COUNT(*) as count
                     from user_behavior1
                     group by behavior_type
              )t1
       ) t2
) t3;

+-----------+----------+----------+------------+----------+-------------+--------------+---------+
|    pv     |   fav    |   cart   |  fav_cart  |   buy    | pv2favcart  | favcart2buy  | pv2buy  |
+-----------+----------+----------+------------+----------+-------------+--------------+---------+
| 88596886  | 2852536  | 5466118  | 8318654.0  | 1998944  | 0.0939      | 0.2403       | 0.0226  |
+-----------+----------+----------+------------+----------+-------------+--------------+---------+
```

{% hint style="info" %}
这里使用了Brickhouse UDF，用collect UDF便捷实现了TRANSPOSE
{% endhint %}

### 3.3 复购率

```sql
--每个用户的购物情况，加工到 user_behavior_count
create table res_doublebuy_rate
comment "page views and unique visitor each day"
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
as
select round(t2.count/uv,6) as rate
from(
    select count(*) as count
    from(
        select count(*) as count
        from user_behavior1
        where behavior_type='buy'
        group by user_id
    ) t1
    where t1.count>1
) t2
join(
    select uv from res_pv_uv
) t3;

+--------------------------+
| res_doublebuy_rate.rate  |
+--------------------------+
| 0.446562                 |
+--------------------------+
```

### 3.4 基于 RFM 模型找出有价值的用户

RFM 模型是衡量客户价值和客户创利能力的重要工具和手段，其中由3个要素构成了数据分析最好的指标，分别是：

* R-Recency（最近一次购买时间）
* F-Frequency（消费频率）
* ~~M-Money（消费金额）~~

```sql
-- 创建一个临时表，记录每个用户的R-F排名
create temporary table temp_cte
as
select user_id,
       datediff('2017-12-04', max(`date`)) as R,
       dense_rank() over(order by datediff('2017-12-04', max(`date`))) as R_rank,
       count(*) as F,
       dense_rank() over(order by count(*) desc) as F_rank
from user_behavior1
where behavior_type = 'buy'
group by user_id;

+--------------+--------+-------------+--------+-------------+
| cte.user_id  | cte.r  | cte.r_rank  | cte.f  | cte.f_rank  |
+--------------+--------+-------------+--------+-------------+
| 486458       | 1      | 1           | 262    | 1           |
| 866670       | 2      | 2           | 175    | 2           |
| 702034       | 1      | 1           | 159    | 3           |
| 107013       | 1      | 1           | 130    | 4           |
| 1014116      | 1      | 1           | 118    | 5           |
| 432739       | 1      | 1           | 112    | 6           |
| 500355       | 4      | 4           | 110    | 7           |
| 537150       | 1      | 1           | 109    | 8           |
| 1003412      | 1      | 1           | 100    | 9           |
| 919666       | 1      | 1           | 97     | 10          |
+--------------+--------+-------------+--------+-------------+
```

```sql
-- R-F按照5-15个等级划分，因为本人认为value-F > value-R，然后计算每个用户的价值score
create table res_user_score
comment "page views and unique visitor each day"
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
as
select user_id, R, R_rank, R_score, F, F_rank, F_score,  R_score + F_score AS score
from(
select *,
       ntile(5) over(order by R_rank desc) as R_score,
       ntile(15) over(order by F_rank desc) as F_score
from temp_cte
) t
order by score desc;

+-------------------------+-------------------+------------------------+-------------------------+-------------------+------------------------+-------------------------+-----------------------+
| res_user_score.user_id  | res_user_score.r  | res_user_score.r_rank  | res_user_score.r_score  | res_user_score.f  | res_user_score.f_rank  | res_user_score.f_score  | res_user_score.score  |
+-------------------------+-------------------+------------------------+-------------------------+-------------------+------------------------+-------------------------+-----------------------+
| 180128                  | 1                 | 1                      | 5                       | 7                 | 81                     | 15                      | 20                    |
| 180131                  | 1                 | 1                      | 5                       | 7                 | 81                     | 15                      | 20                    |
| 935417                  | 1                 | 1                      | 5                       | 7                 | 81                     | 15                      | 20                    |
| 925900                  | 1                 | 1                      | 5                       | 7                 | 81                     | 15                      | 20                    |
| 590239                  | 1                 | 1                      | 5                       | 9                 | 79                     | 15                      | 20                    |
| 307773                  | 1                 | 1                      | 5                       | 7                 | 81                     | 15                      | 20                    |
| 856339                  | 1                 | 1                      | 5                       | 14                | 74                     | 15                      | 20                    |
| 178301                  | 1                 | 1                      | 5                       | 9                 | 79                     | 15                      | 20                    |
| 804543                  | 1                 | 1                      | 5                       | 7                 | 81                     | 15                      | 20                    |
| 327112                  | 1                 | 1                      | 5                       | 7                 | 81                     | 15                      | 20                    |
+-------------------------+-------------------+------------------------+-------------------------+-------------------+------------------------+-------------------------+-----------------------+
```

```sql
-- 计算每层价值用户有多少人
create table res_user_score_count
comment "page views and unique visitor each day"
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
as
select score, count(*) as count 
from res_user_score 
group by score 
order by score desc;

+-----------------------------+-----------------------------+
| res_user_score_count.score  | res_user_score_count.count  |
+-----------------------------+-----------------------------+
| 20                          | 22104                       |
| 19                          | 22648                       |
| 18                          | 33199                       |
| 17                          | 31014                       |
| 16                          | 41697                       |
| 15                          | 32152                       |
| 14                          | 71802                       |
| 13                          | 34853                       |
| 12                          | 47973                       |
| 11                          | 20117                       |
| 10                          | 25696                       |
| 9                           | 36893                       |
| 8                           | 39444                       |
| 7                           | 63359                       |
| 6                           | 30289                       |
| 5                           | 29306                       |
| 4                           | 41855                       |
| 3                           | 36437                       |
| 2                           | 9532                        |
+-----------------------------+-----------------------------+
```

### 3.5 商品维度的分析

#### 3.5.1 item\_id商品

对于不同的behavior\_type， 排名前50的商品，用pyhive分析后，转化表再插回。

<details>

<summary></summary>

```python
# %%
from pyhive import hive
import pandas as pd
import numpy as np
from TCLIService.ttypes import TOperationState
from tqdm import tqdm

conn = hive.connect('127.0.0.1', port=10000, username='hive', database="sky")
cursor = conn.cursor()

# %%
def single_cmd(cmd:str):
    pbar = tqdm(total=100)
    cursor.execute(cmd, async_=True)
    status = cursor.poll().operationState
    while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
        progress = cursor.poll().progressUpdateResponse.progressedPercentage
        pbar.update(int(progress * 100) - pbar.n)
        status = cursor.poll().operationState
    results = None
    try:results = cursor.fetchall()
    except Exception as e:
        pass
    pbar.close()
    if results:
        return pd.DataFrame(results, columns=[desc[0] for desc in cursor.description])
    else:
        return None

# %%

res = pd.DataFrame({})
for t in ['pv', 'cart', 'fav', 'buy']:
    tmp = single_cmd('''select item_id,
                    count(*) as count 
                    from user_behavior1
                    where behavior_type='{}'
                    group by item_id
                    order by count desc
                    limit 50
                    '''.format(t))
    # print(tmp.head())
    res[t] = [{k:v} for k,v in zip(tmp['item_id'], tmp['count'])]

# %%
single_cmd('''CREATE TABLE IF NOT EXISTS res_item_rank (
                `rank` int ,
                `pv` string ,
                `cart` string ,
                `fav` string ,
                `buy` string )
                row format delimited
                fields terminated by ','
                lines terminated by '\n'
                STORED AS TEXTFILE
            ''')

# %%
def multi_insert(table:str, values:list):
    cmd = "INSERT INTO " + table + " VALUES "
    for i in values:
        cmd += '("{}", "{}", "{}", "{}", "{}")'.format(i[0], i[1], i[2], i[3], i[4]) + ", "
    cmd = cmd[:-2]
    print(cmd)
    single_cmd(cmd)

# %%
res = res.astype(str)
res['rank'] = res.index
res['rank'] = res['rank'] + 1
res = res.reindex(columns=['rank', 'pv', 'cart', 'fav', 'buy'])
res = res.to_numpy()
print(res)

# %%
multi_insert("res_item_rank", res)

# %%
print(single_cmd('''select * from res_item_rank'''))

# %%
cursor.close()
conn.close()
```

</details>

```
+---------------------+---------------------+---------------------+--------------------+--------------------+
| res_item_rank.rank  |  res_item_rank.pv   | res_item_rank.cart  | res_item_rank.fav  | res_item_rank.buy  |
+---------------------+---------------------+---------------------+--------------------+--------------------+
| 1                   | {'812879': 29720}   | {'3031354': 1730}   | {'2279428': 971}   | {'3122135': 1408}  |
| 2                   | {'3845720': 25290}  | {'812879': 1514}    | {'2331370': 866}   | {'3031354': 942}   |
| 3                   | {'138964': 20927}   | {'2331370': 1502}   | {'812879': 861}    | {'3964583': 671}   |
| 4                   | {'2331370': 19348}  | {'2818406': 1403}   | {'2818406': 815}   | {'2560262': 658}   |
| 5                   | {'2032668': 19075}  | {'2560262': 1258}   | {'3330337': 709}   | {'2964774': 614}   |
| 6                   | {'1535294': 17830}  | {'138964': 1116}    | {'3845720': 695}   | {'740947': 553}    |
| 7                   | {'59883': 17313}    | {'1535294': 1114}   | {'1535294': 674}   | {'1910706': 546}   |
| 8                   | {'4211339': 17235}  | {'1583704': 1092}   | {'138964': 634}    | {'1116492': 512}   |
| 9                   | {'3371523': 17156}  | {'2453685': 1042}   | {'2453685': 632}   | {'705557': 495}    |
| 10                  | {'2338453': 17044}  | {'2279428': 981}    | {'2364679': 609}   | {'4443059': 490}   
```

#### 3.5.2 商品大类category\_id

<details>

<summary></summary>

```python
# %%
from pyhive import hive
import pandas as pd
import numpy as np
from TCLIService.ttypes import TOperationState
from tqdm import tqdm

conn = hive.connect('127.0.0.1', port=10000, username='hive', database="sky")
cursor = conn.cursor()

# %%
def single_cmd(cmd:str):
    pbar = tqdm(total=100)
    cursor.execute(cmd, async_=True)
    status = cursor.poll().operationState
    while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
        progress = cursor.poll().progressUpdateResponse.progressedPercentage
        pbar.update(int(progress * 100) - pbar.n)
        status = cursor.poll().operationState
    results = None
    try:results = cursor.fetchall()
    except Exception as e:
        pass
    pbar.close()
    if results:
        return pd.DataFrame(results, columns=[desc[0] for desc in cursor.description])
    else:
        return None

# %%

res = pd.DataFrame({})
for t in ['pv', 'cart', 'fav', 'buy']:
    tmp = single_cmd('''select category_id,
                    count(*) as count 
                    from user_behavior1
                    where behavior_type='{}'
                    group by category_id
                    order by count desc
                    limit 15
                    '''.format(t))
    # print(tmp.head())
    res[t] = [{k:v} for k,v in zip(tmp['category_id'], tmp['count'])]

# %%
single_cmd('''CREATE TABLE IF NOT EXISTS res_category_rank (
                `rank` int ,
                `pv` string ,
                `cart` string ,
                `fav` string ,
                `buy` string )
                row format delimited
                fields terminated by ','
                lines terminated by '\n'
                STORED AS TEXTFILE
            ''')

# %%
def multi_insert(table:str, values:list):
    cmd = "INSERT INTO " + table + " VALUES "
    for i in values:
        cmd += '("{}", "{}", "{}", "{}", "{}")'.format(i[0], i[1], i[2], i[3], i[4]) + ", "
    cmd = cmd[:-2]
    print(cmd)
    single_cmd(cmd)

# %%
res = res.astype(str)
res['rank'] = res.index
res['rank'] = res['rank'] + 1
res = res.reindex(columns=['rank', 'pv', 'cart', 'fav', 'buy'])
res = res.to_numpy()
print(res)

# %%
multi_insert("res_category_rank", res)

# %%
print(single_cmd('''select * from res_category_rank'''))

# %%
cursor.close()
conn.close()



```

</details>

```
+-------------------------+-----------------------+-------------------------+------------------------+------------------------+
| res_category_rank.rank  | res_category_rank.pv  | res_category_rank.cart  | res_category_rank.fav  | res_category_rank.buy  |
+-------------------------+-----------------------+-------------------------+------------------------+------------------------+
| 1                       | {'4756105': 4426937}  | {'4756105': 212652}     | {'4756105': 137659}    | {'1464116': 34248}     |
| 2                       | {'4145813': 3115400}  | {'4145813': 172399}     | {'4145813': 109352}    | {'2735466': 33426}     |
| 3                       | {'2355072': 3110550}  | {'982926': 151994}      | {'982926': 88052}      | {'2885642': 31619}     |
| 4                       | {'3607361': 2941480}  | {'4801426': 120584}     | {'2355072': 86088}     | {'4145813': 31418}     |
| 5                       | {'982926': 2763686}   | {'2355072': 120176}     | {'3607361': 70514}     | {'4756105': 28021}     |
| 6                       | {'2520377': 2003073}  | {'3607361': 107732}     | {'4801426': 67516}     | {'4801426': 26258}     |
| 7                       | {'4801426': 1841977}  | {'1320293': 93843}      | {'2520377': 66565}     | {'982926': 24570}      |
| 8                       | {'1320293': 1769460}  | {'2735466': 92518}      | {'2465336': 54162}     | {'2640118': 18116}     |
| 9                       | {'2465336': 1484268}  | {'2520377': 85070}      | {'3002561': 53801}     | {'4159072': 17917}     |
| 10                      | {'3002561': 1406307}  | {'2465336': 83140}      | {'1320293': 53117}     | {'1320293': 16948}     |
| 11                      | {'2735466': 1101503}  | {'3002561': 79667}      | {'4181361': 42485}     | {'3002561': 16330}     |
| 12                      | {'4181361': 990387}   | {'2640118': 61789}      | {'149192': 37840}      | {'4357323': 15686}     |
| 13                      | {'149192': 978867}    | {'149192': 61559}       | {'2735466': 35193}     | {'4789432': 15545}     |
| 14                      | {'1080785': 946692}   | {'4217906': 60537}      | {'4217906': 34854}     | {'903809': 15285}      |
| 15                      | {'2885642': 944780}   | {'1464116': 59858}      | {'1080785': 31928}     | {'4217906': 14310}     |
+-------------------------+-----------------------+-------------------------+------------------------+------------------------+
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://1138882663s-organization.gitbook.io/between-code-and-words/ji-shu-zhuan-ji/xiang-mu-program/xiang-mu-shi-jian/1-yi-tiao-tao-bao-yong-hu-hang-wei-shu-ju-fen-xi.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
