目录

数据结构

Customer表

Transaction表

Store表

Review表

上传数据

创建目录用于存放数据

把本地文件上传到HDFS上

创建外部表

创建数据库

创建表

数据清洗

对transaction_details中的重复数据生成新ID

过滤掉store_review中没有评分的数据

找出PII (personal information identification‘个人信息识别’) 数据进行加密

重新组织transaction数据按照日期YYYY-MM(按月)做分区

Customer分析

1.找出顾客最常用的信用卡

2.找出户客资料中排名前五的职位名称

3.在美国女性最常用的信用卡

4.按性别和国家进行客户统计

Transaction

1.计算每月总收入

2.计算每个季度的总收入

3.按年计算总收入

4.按工作日计算总收入

5.按时间段计算总收入(需要清理数据)

6.按时间段计算平均消费

7.按工作日计算平均消费

8.计算年、月、日的交易总数

9.找出交易量最大的10个客户

10.找出消费最多的前10位顾客

11.计该期间交易数量最少的用户

12.算每个季度的独立客户总数

​13.算每周的独立客户总数

​14.算整个活动客户平均花费的最大值

​15.计每月花费最多的客户

16.计每月访问次数最多的客户

17.总价找出最受欢迎的5种产品

​18.据购买频率找出最畅销的5种产品

​19.据客户数量找出最受欢迎的5种产品

Store分析

1.按客流量找出最受欢迎的商店

2.根据顾客消费价格找出最受欢迎的商店

3.根据顾客交易情况找出最受欢迎的商店

4.根据商店和唯一的顾客id获取最受欢迎的产品

5.获取每个商店的员工与顾客比

6.按年和月计算每家店的收入

7.按店铺制作总收益饼图

8.找出每个商店最繁忙的时间段

9.找出每家店的忠实顾客

10.根据每位员工的最高收入找出明星商店

Review分析

1.在ext_store_review中找出存在冲突的交易映射关系

2.了解客户评价的覆盖率

3.根据评分了解客户的分布情况

4.根据交易了解客户的分布情况

5.客户给出的最佳评价是否总是同一家门店


使用工具zeppelin

数据结构

Customer表

customer_detailsdetails

customer_id

Int,1-500

first_name

string

last_name

string

email

string,suchaswillddy@gmail.com

gender

string,Maleorfemale

address

string

country

string

language

string

job

string,jobtitle/position

credit_type

string,creditcardtype,suchasvisa

credit_no

string,creditcardnumber

Transaction表

transaction_detailsdetails

transaction_id

Int,1-1000

customer_id

Int,1-500

store_id

Int,1-5

price

decimal,suchas5.08

product

string,thingsbought

date

string,whentopurchase

time

string,whattimetopurchase

Store表

store_detailsdetails

store_id

Int,1-5

store_name

string

employee_number

Int,在store有多少employee

Review表

store_reviewdetails

stransaction_id

Int,1-8000

store_id

Int,1-5

review_store

Int,1-5

上传数据

创建目录用于存放数据

%shhdfs dfs -mkdir -p /shopping/data/customerhdfs dfs -mkdir -p /shopping/data/storehdfs dfs -mkdir -p /shopping/data/reviewhdfs dfs -mkdir -p /shopping/data/transaction

把本地文件上传到HDFS上

%shcd /opt/stufile/storetransactionls -alhdfs dfs -put ./customer_details.csv /shopping/data/customerhdfs dfs -put ./store_details.csv /shopping/data/storehdfs dfs -put ./store_review.csv /shopping/data/reviewhdfs dfs -put ./transaction_details.csv /shopping/data/transaction

创建外部表

创建数据库

%hive-- 如果存在此表则删除drop database if exists shopping cascade;--创建数据库create database if not exists shopping;

创建表

%hivecreate external table if not exists ext_customer_detail(    customer_id string,    first_name  string,    last_name   string,    email       string,    gender      string,    address     string,    country     string,    language    string,    job         string,    credit_type string,    credit_no   string)    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'    location '/shopping/data/customer'    tblproperties ('skip.header.line.count' = '1');create external table if not exists ext_transaction_details(    transaction_id string,    customer_id    string,    store_id       string,    price          decimal(8, 2),    product        string,    purchase_date  string,    purchase_time  string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'    location '/shopping/data/transaction'    tblproperties ('skip.header.line.count' = '1');drop table ext_store_details;create external table if not exists ext_store_details(    store_id        string,    store_name      string,    employee_number string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'    location '/shopping/data/store'    tblproperties ('skip.header.line.count' = '1');drop table ext_store_review;create external table if not exists ext_store_review(    store_id       string,    transaction_id string,    review_score   string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'    location '/shopping/data/review'    tblproperties ('skip.header.line.count' = '1');

TIP:

OpenCSVSerde 默认的分隔符 (separator)、quote 以及逃逸字符(escape characters )分别为 、” 以及 ‘

如果我们查看表结构的时候,我们会发现如果 row format serde 为 org.apache.hadoop.hive.serde2.OpenCSVSerde,不管你建表的时候指定字段是什么类型,其显示的都是 string 类型

tblproperties (‘skip.header.line.count’ = ‘1’);从外部表导入数据跳过(忽略)首行。

数据清洗

对transaction_details中的重复数据生成新ID

%hiveuse shopping;with basetb as (    select row_number() over (partition by transaction_id order by transaction_id) as rn         , transaction_id         , customer_id         , store_id         , price         , product         , purchase_date         , purchase_time         , substr(purchase_date, 0, 7)                                             as purchase_month    from ext_transaction_details),     basetb2 as (         select `if`(rn = 1, transaction_id, concat(transaction_id, '_fix', rn)) transaction_id              , customer_id              , store_id              , price              , product              , purchase_date              , purchase_time              , purchase_month         from basetb)select *from basetb2where transaction_id like '%fix%';

因为查询的数据比较多,显示不全所有要在后面加一个查询

过滤掉store_review中没有评分的数据

查出有评分的数据

%hiveuse shopping;create view if not exists vm_store_review asselect *from ext_store_reviewwhere review_score  '';select * from vm_store_review;

找出PII (personal information identification‘个人信息识别’) 数据进行加密

%hiveuse shopping;drop view customer_detail;create view vm_customer_details asselect customer_id,first_name,unbase64(last_name)as last_name,unbase64(email) as email,gender,unbase64(address) as address,unbase64(concat(unbase64(credit_no),'hello')) as credit_no --二次加密from ext_customer_details;

重新组织transaction数据按照日期YYYY-MM(按月)做分区

%hiveuse shopping;set hive.exec.dynamic.partition.mode=nonstrict;with basetb as (    select row_number() over (partition by transaction_id order by transaction_id) as rn         , transaction_id         , customer_id         , store_id         , price         , product         , purchase_date         , purchase_time         , substr(purchase_date, 0, 7)                                             as purchase_month    from ext_transaction_details)insert overwrite table transaction_details partition (purchase_month)select `if`(rn = 1, transaction_id, concat(transaction_id, '_fix', rn)) transaction_id     , customer_id     , store_id     , price     , product     , purchase_date     , purchase_time     , purchase_monthfrom basetb;show partitions transaction_details;

Customer分析

1.找出顾客最常用的信用卡

%hiveuse shopping;select credit_type, max(credit_type) countsfrom ext_customer_detailsgroup by credit_typeorder by counts desc;

2.找出户客资料中排名前五的职位名称

select job, count(job) countsfrom ext_customer_detailsgroup by joborder by counts desclimit 5;

3.在美国女性最常用的信用卡

%hiveuse shopping;select credit_type, count(credit_type) countsfrom ext_customer_detailswhere gender = 'Female' and country='United States'group by credit_typeorder by counts desc;

4.按性别和国家进行客户统计

%hiveuse shopping;select country, gender,count(customer_id)from ext_customer_detailsgroup by gender, country;

Transaction

1.计算每月总收入

%hiveuse shopping;select substr(purchase_date,0,7) month,round(sum(price),2) sumfrom ext_transaction_detailsgroup by substr(purchase_date,0,7);

2.计算每个季度的总收入

with basetb as (    select concat_ws('-', cast(year(purchase_date) as string),                     cast(quarter(purchase_date) as string)) as year_quarter,           price    from transaction_details)select year_quarter ,sum(price) sumMoney from basetb group by year_quarter;

3.按年计算总收入

%hiveuse shopping;select year(purchase_date) years , round(sum(price),2) sumfrom ext_transaction_detailsgroup by year(purchase_date);

4.按工作日计算总收入

%hiveuse shopping;with basetb as (    select `dayofweek`(purchase_date) weekday, price    from transaction_details)select case           when (weekday - 1) = 1 then '星期一'           when (weekday - 1) = 2 then '星期二'           when (weekday - 1) = 3 then '星期三'           when (weekday - 1) = 4 then '星期四'           when (weekday - 1) = 5 then '星期五'           end as weekday,       sum(price) sumfrom basetbgroup by weekdayhaving weekday between 2 and 6;

5.按时间段计算总收入(需要清理数据)

%hiveuse shopping;with basetb1 as (    select price,           purchase_time,           case               when purchase_time like '%AM' then split(purchase_time, '\\s+')[0]               when purchase_time like '%PM' then concat_ws(':',cast(`if`(                                                                    (cast(split(purchase_time, ':')[0] as int) + 12) == 24,0,          (cast(split(purchase_time, ':')[0] as int) + 12)) as string)                   , split(split(purchase_time, ':')[1], '\\s+')[0])               else purchase_time               end time_format    from transaction_details),     basetb2 as (select price,                        purchase_time,                        (cast(split(time_format, ':')[0] as decimal(4, 2)) +                         cast(split(time_format, ':')[1] as decimal(4, 2)) / 60) purchase_time_num                 from basetb1),     basetb3 as (select price,                        purchase_time,                        `if`(purchase_time_num > 5 and purchase_time_num  8 and purchase_time_num  11 and purchase_time_num  13 and purchase_time_num  18 and purchase_time_num <= 22, 'evening','night'))))) as time_bucket                 from basetb2)select time_bucket, sum(price) sumfrom basetb3group by time_bucket;

6.按时间段计算平均消费

%hiveuse shopping;with basetb1 as (    select price,           purchase_time,           case               when purchase_time like '%AM' then split(purchase_time, '\\s+')[0]               when purchase_time like '%PM' then concat_ws(':',cast(`if`(                                                                    (cast(split(purchase_time, ':')[0] as int) + 12) == 24,0,          (cast(split(purchase_time, ':')[0] as int) + 12)) as string)                   , split(split(purchase_time, ':')[1], '\\s+')[0])               else purchase_time               end time_format    from transaction_details),     basetb2 as (select price,                        purchase_time,                        (cast(split(time_format, ':')[0] as decimal(4, 2)) +                         cast(split(time_format, ':')[1] as decimal(4, 2)) / 60) purchase_time_num                 from basetb1),     basetb3 as (select price,                        purchase_time,                        `if`(purchase_time_num > 5 and purchase_time_num  8 and purchase_time_num  11 and purchase_time_num  13 and purchase_time_num  18 and purchase_time_num <= 22, 'evening','night'))))) as time_bucket                 from basetb2)select time_bucket, avg(price) avgfrom basetb3group by time_bucket;

7.按工作日计算平均消费

%hiveuse shopping;with basetb as (    select `dayofweek`(purchase_date) weekday, price    from transaction_details)select case           when (weekday - 1) = 1 then '星期一'           when (weekday - 1) = 2 then '星期二'           when (weekday - 1) = 3 then '星期三'           when (weekday - 1) = 4 then '星期四'           when (weekday - 1) = 5 then '星期五'           end weekday,    avg(price) avgfrom basetbgroup by weekdayhaving weekday between 2 and 6;

8.计算年、月、日的交易总数

%hiveuse shopping;select distinct purchase_date,       purchase_month,       year(purchase_date),       count(1) over (partition by year(purchase_date))                                         years,       count(1) over (partition by year(purchase_date),month(purchase_date))                    months,       count(1) over (partition by year(purchase_date),month(purchase_date),day(purchase_date)) daysfrom transaction_details;

9.找出交易量最大的10个客户

%hiveuse shopping;selectcustomer_id,count(1) as numfrom transaction_detailsgroup by customer_idorder by num desclimit 10

10.找出消费最多的前10位顾客

%hiveuse shopping;select customer_id,       sum(price) as sumfrom transaction_detailsgroup by customer_idorder by sum desclimit 10

11.计该期间交易数量最少的用户

use shopping;select customer_id, count(transaction_id)from transaction_detailsgroup by customer_idorder by count(transaction_id)limit 1;


12.算每个季度的独立客户总数

%hiveuse shopping;with basetb as (    select distinct concat_ws('-', cast(year(purchase_date) as string),                     cast(quarter(purchase_date) as string)) as year_quarter,           customer_id    from transaction_details)select  year_quarter, count(customer_id) countsfrom basetbgroup by year_quarter;


13.算每周的独立客户总数

%hiveuse shopping;with basetb as (    select distinct concat(cast(year(purchase_date) as string), '-', cast(weekofyear(purchase_date)as string)) weeks,                    customer_id    from transaction_details)select weeks, count(customer_id) countsfrom basetbgroup by weeks;


14.算整个活动客户平均花费的最大值

%hiveuse shopping;select customer_id,avg(price) avgsfrom transaction_detailsgroup by customer_idorder by avgs desclimit 1;


15.计每月花费最多的客户

%hiveuse shopping;with basetb as (    select purchase_month,           customer_id,           sum(price) sum_price    from transaction_details    group by purchase_month, customer_id),     basetb2 as (         select purchase_month,                customer_id,                sum_price,                row_number() over (partition by purchase_month order by sum_price desc ) rn         from basetb)select purchase_month, customer_id, sum_pricefrom basetb2where rn = 1;


16.计每月访问次数最多的客户

%hiveuse shopping;with basetb as (    select purchase_month,           customer_id,           count(customer_id) counts    from transaction_details    group by purchase_month, customer_id),     basetb2 as (         select purchase_month,                customer_id,                counts,                row_number() over (partition by purchase_month order by counts desc ) rn         from basetb)select purchase_month, customer_id, countsfrom basetb2where rn = 1;


17.总价找出最受欢迎的5种产品

select product,sum(price) sumfrom transaction_detailsgroup by productorder by sum desclimit 5;


18.据购买频率找出最畅销的5种产品

select product,count(1) countsfrom transaction_detailsgroup by productorder by counts desclimit 5;


19.据客户数量找出最受欢迎的5种产品

select product,count(distinct customer_id) countsfrom transaction_detailsgroup by productorder by counts desclimit 5

Store分析

1.按客流量找出最受欢迎的商店

%hiveuse shopping;select store_name, count(distinct customer_id) countsfrom transaction_details td         join ext_store_details esd on td.store_id = esd.store_idgroup by store_nameorder by counts desc;

2.根据顾客消费价格找出最受欢迎的商店

%hiveuse shopping;select store_name, sum(price) sumsfrom transaction_details td         join ext_store_details esd on td.store_id = esd.store_idgroup by store_nameorder by sums desc;

3.根据顾客交易情况找出最受欢迎的商店

%hiveuse shopping;select store_name, count(td.store_id) countsfrom transaction_details td         join ext_store_details esd on td.store_id = esd.store_idgroup by store_nameorder by counts desc;

4.根据商店和唯一的顾客id获取最受欢迎的产品

%hiveuse shopping;with basetb as (    select store_id, product, count(distinct customer_id) counts    from transaction_details    group by store_id, product),     basetb2 as (         select store_id,                product,                counts,                rank() over (partition by store_id order by counts desc ) as rn         from basetb)select store_name, product, countsfrom basetb2 tb2         join ext_store_details esd on tb2.store_id = esd.store_idwhere rn = 1;

5.获取每个商店的员工与顾客比

%hiveuse shopping;with t1 as (select count(1) c1, store_id            from transaction_details td            group by td.store_id)select t1.store_id,       esd.store_name,       concat(substring(cast(esd.employee_number / t1.c1 as decimal(9, 8)) * 100.0, 0, 4), '%') proportionfrom t1         join ext_store_details esd on t1.store_id = esd.store_id;

6.按年和月计算每家店的收入

%hiveuse shopping;select distinct *from (         select store_id,                year(purchase_date)                                                     year,                sum(price) over (partition by year(purchase_date))                      sum_year,                month(purchase_date)                                                    month,                sum(price) over (partition by year(purchase_date),month(purchase_date)) sum_month         from transaction_details     ) tb;

7.按店铺制作总收益饼图

%hiveuse shopping;select store_id,sum(price)from transaction_detailsgroup by store_id

8.找出每个商店最繁忙的时间段

%hiveuse shopping;with basetb1 as (    select store_id,           customer_id,           purchase_time,           case               when purchase_time like '%AM' then split(purchase_time, '\\s+')[0]               when purchase_time like '%PM' then concat_ws(':',                                                            cast(`if`(                                                                    (cast(split(purchase_time, ':')[0] as int) + 12) == 24,                                                                    0,                                                                    (cast(split(purchase_time, ':')[0] as int) + 12)) as string)                   , split(split(purchase_time, ':')[1], '\\s+')[0])               else purchase_time               end time_format    from transaction_details),     basetb2 as (select store_id,                        customer_id,                        purchase_time,                        (cast(split(time_format, ':')[0] as decimal(4, 2)) +                         cast(split(time_format, ':')[1] as decimal(4, 2)) / 60) purchase_time_num                 from basetb1),     basetb3 as (select store_id,                        customer_id,                        purchase_time,                        `if`(purchase_time_num > 5 and purchase_time_num  8 and purchase_time_num  11 and purchase_time_num  13 and purchase_time_num  18 and purchase_time_num <= 22, 'evening',                                                 'night'))))) as time_bucket                 from basetb2)select esd.store_name,       tb3.time_bucket,       count(customer_id) countsfrom basetb3 tb3         join ext_store_details esd on tb3.store_id = esd.store_idgroup by esd.store_name, time_bucket;

9.找出每家店的忠实顾客

购买次数大于5,认为他是忠实粉丝

%hiveuse shopping;select *from (         select store_id, customer_id, count(1) counts         from transaction_details         group by store_id, customer_id) tbwhere tb.counts > 5;

10.根据每位员工的最高收入找出明星商店

%hiveuse shopping;with base as         (             select store_id, sum(price) s             from transaction_details             group by store_id         )select base.store_id,       base.s / store.employee_number enfrom base         join ext_store_details store              on base.store_id = store.store_idorder by en desclimit 1;

Review分析

1.在ext_store_review中找出存在冲突的交易映射关系

%hiveuse shopping;with basetb as (    select row_number() over (partition by transaction_id) as row_number1, * from vm_store_review)select row_number1, a.transaction_id, a.store_id, b.store_id, a.review_score, b.review_scorefrom basetb a         join vm_store_review b on a.transaction_id = b.transaction_idwhere row_number1 > 1;

2.了解客户评价的覆盖率

%hiveuse shopping;with t1 as (    select count(1) c1    from ext_store_review    where review_score  ''),     t2 as (         select count(1) c2         from ext_store_review         where review_score = ''     )select concat(cast((c1 - c2) / c1 * 100 as decimal(4, 2)), '%')  Coveragefrom t1 join t2;

3.根据评分了解客户的分布情况

%hiveuse shopping;select concat(round(sum(case review_score when '1' then 1 else 0 end) / count(*) * 100, 2), '%') as one_score,       concat(round(sum(case review_score when '2' then 1 else 0 end) / count(*) * 100, 2), '%') as two_score,       concat(round(sum(case review_score when '3' then 1 else 0 end) / count(*) * 100, 2), '%') as three_score,       concat(round(sum(case review_score when '4' then 1 else 0 end) / count(*) * 100, 2), '%') as four_score,       concat(round(sum(case review_score when '5' then 1 else 0 end) / count(*) * 100, 2), '%') as five_scorefrom ext_store_review;

4.根据交易了解客户的分布情况

根据总金额

%hiveuse shopping;select country,       sum(price) sum_pricefrom transaction_details td         join ext_customer_details cd on td.customer_id = cd.customer_idgroup by cd.country;

5.客户给出的最佳评价是否总是同一家门店

%hiveuse shopping;select store_id, customer_id, count(customer_id) countsfrom transaction_details td         join ext_store_review esr              on esr.transaction_id = td.transaction_idwhere esr.review_score = 5group by td.store_id, td.customer_id;