Hive Partitioning - Sellkart Case Study - Using Data Strategically

Published: 2015-08-14

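Sellkart is a major Indian e-commerce website and retail store. Like every other e-commerce site, it has to store user data and the inventory of all the items it sells. It uses a MySQL database, one of the most common choices for data storage; MySQL is safe, secure, and reliable. But as Sellkart grew, more and more customers started flocking to the portal. Sellkart now collects user data such as location, behavior, demographic logs, items purchased and items returned, and payment method (COD, credit card, or debit card). Analysts use this data to predict customer value, offer better payment options, and expand the business with better advertising campaigns, easing the processes.

To show how collected data can be used strategically, we will consider a fake consumer dataset that consists of the username, city, occupation, total number of items purchased, and total number of items returned due to defects.

The fake dataset (data.csv) contains data for customers who have purchased more than 25 items.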

Fields in data.csv

Username (String)
City (String)
Occupation (String)
Purchased (Integer)
Returned (Integer)

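Sellkart uses this data, which has multiple columns and is a considerably large dataset of more than a few gigabytes, to predict and understand key insights about its system, such as:

Customer lifetime value
Product success
Demographics of the people buying a product
Regions of concern
Real-time pricing of products
Up-selling to engage customers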

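Now we will show how a small-scale analysis is done in Hadoop. Using the data, we will find out:

a) The top consumers, based on the number of products they purchased.

Uses: give special privileges to the best buyers, up-sell to average customers.

b) The job sector that purchases the most products.

Uses: keep more of the best-selling products in stock, recommend products of a similar type.

c) The cities where the most products are purchased.

Uses: target buyers region by region, focus on areas with fewer customers, try new products in good markets.

1. Copy the data into HDFS.

a) Create a directory in Hadoop HDFS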

[training@localhost ~]$ hadoop fs -mkdir /user/training/case1

[training@localhost ~]$ hadoop fs -ls

Found 3 items

drwxr-xr-x - training supergroup 0 2015-07-21 16:48 /user/training/case1

-rw-r--r-- 1 training supergroup 44009445 2015-07-20 12:39 /user/training/crime

drwxr-xr-x - training supergroup 0 2015-07-20 14:21 /user/training/csvfile

b) Copy the file for further use

`[training@localhost ~]$ hadoop fs -copyFromLocal ~/Desktop/case1/data.csv /user/training/case1`
`[training@localhost ~]$ hadoop fs -ls /user/training/case1`
`Found 1 items`
`-rw-r--r-- 1 training supergroup 2489638 2015-07-21 16:49 /user/training/case1/data.csv`

Once the file has been copied, it is processed with Hive to obtain different results.

2. Process the data using Hive.

a) Create a table in Hive to store this data.

`[training@localhost ~]$ hive`
`Hive history file=/tmp/training/hive_job_log_training_201507211652_319213433.txt`
`hive> create table shopping (username string, city string, occupation string, purchaseditem int, returneditem int) row format delimited fields terminated by ',' stored as textfile;`
`OK`
`Time taken: 2.892 seconds`
`hive> show tables;`
`OK`
`crimes`
`shopping`
`Time taken: 0.171 seconds`

b) Load the file into the Hive table shopping

`hive> load data inpath '/user/training/case1/data.csv' overwrite into table shopping;`
`Loading data to table default.shopping`
`Deleted hdfs://localhost/user/hive/warehouse/shopping`

`OK`
`Time taken: 2.798 seconds`
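
One thing worth noting about the step above: `load data inpath` (without `local`) moves the file out of `/user/training/case1` into the Hive warehouse directory rather than copying it. If the file should stay where it is, for example so that other tools can still read it from the original path, an external table is one option. The following is a minimal sketch and not part of the original walkthrough; the table name `shopping_ext` is hypothetical.

```sql
-- Sketch: an external table that reads data.csv where it already lives in HDFS.
-- The name shopping_ext is made up for this example; the columns mirror the shopping table.
CREATE EXTERNAL TABLE shopping_ext (
  username      STRING,
  city          STRING,
  occupation    STRING,
  purchaseditem INT,
  returneditem  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/training/case1';
```

Dropping an external table removes only its metadata; the file in HDFS is left untouched.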

c) Get insights into the table.

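hive> describe formatted shopping;

OK

# col_name            data_type           comment

username              string              None
city                  string              None
occupation            string              None
purchaseditem         int                 None
returneditem          int                 None

# Detailed Table Information
Database:             default
Owner:                training
CreateTime:           Tue Jul 21 16:56:38 IST 2015
LastAccessTime:       UNKNOWN
Protect Mode:         None
Retention:            0
Location:             hdfs://localhost/user/hive/warehouse/shopping
Table Type:           MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime 1437478210

# Storage Information
SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:          org.apache.hadoop.mapred.TextInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:           No
Num Buckets:          -1
Bucket Columns:       []
Sort Columns:         []
Storage Desc Params:
        field.delim           ,
        serialization.format  ,
Time taken: 0.302 seconds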

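3. Now we need to see how this data can be used.

a) Finding the best customers, so that they can be offered more discounts. First we find the highest quantity ordered, then we match that quantity against all the records to find the top customers.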

hive> select * from shopping limit 5;

OK

Seattle Podiatric doctor 187 5

Detroit Lakes Engineering and natural sciences manager 168 11

Jackson Map editor 187 17

Miami Clinical manager 193 5

Santa Fe Springs Radiographer 207 20

Time taken: 0.335 seconds

hive> select max(purchaseditem) from shopping;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=

In order to set a constant number of reducers:

set mapred.reduce.tasks=

Starting Job = job_201507212155_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201507212155_0001

Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201507212155_0001

2015-07-21 23:16:01,887 Stage-1 map = 0%, reduce = 0%

2015-07-21 23:16:06,988 Stage-1 map = 100%, reduce = 0%

2015-07-21 23:16:16,075 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201507212155_0001

OK

250

Time taken: 18.794 seconds

hive> select * from shopping where purchaseditem = 250;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_201507212155_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201507212155_0002

Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201507212155_0002

2015-07-21 23:18:31,586 Stage-1 map = 0%, reduce = 0%

2015-07-21 23:18:33,598 Stage-1 map = 100%, reduce = 0%

2015-07-21 23:18:34,608 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201507212155_0002

OK

Ablay1993 Seattle Ergonomist 250 7

Ablent71 Sugar Land Professional property manager 250 10

Affew1942 Alquippa Human resources consultant 250 13

Agar1976 Bell Oak Zookeeper 250 5

Akeem Manchester Marine 250 11

Alat1954 Columbus Assembler 250 7

Albertha Baton Rouge Mixing and blending machine operator 250 21

Amelia.Skiles Westfield Mediator 250 0

Amir_Spencer Polk City Epidemiologist 250 11

Angel.Thiel Hamilton Keying machine operator 250 21

Annalise_Langosh East Northport Fitness director 250 24

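And so on. The company can now find its top customers. This step could easily be performed in MySQL, but only while the data is very small; once the data runs into gigabytes, Hadoop comes to the rescue.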
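
The two queries above can also be combined, so that the maximum does not have to be copied in by hand. A minimal sketch, assuming the `shopping` table created in step 2; the aliases `s`, `m`, and `max_items` are chosen here for illustration. It joins against a subquery in the FROM clause, since subqueries in the WHERE clause are not available in every Hive version.

```sql
-- Sketch: find the customers whose purchase count equals the overall maximum.
-- The inner query yields a single row, so the join keeps only the top buyers.
SELECT s.username, s.city, s.occupation, s.purchaseditem, s.returneditem
FROM shopping s
JOIN (SELECT MAX(purchaseditem) AS max_items FROM shopping) m
  ON (s.purchaseditem = m.max_items);
```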

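b) Now the analysts can find which job sector their best customers come from, i.e. analyze the occupations of the people who purchase the most, so as to increase the products aimed at that sector.

hive> select distinct occupation from shopping where purchaseditem = 250;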

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=

In order to set a constant number of reducers:

set mapred.reduce.tasks=

Starting Job = job_201507220045_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201507220045_0006

Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201507220045_0006

2015-07-22 01:09:52,904 Stage-1 map = 0%, reduce = 0%

2015-07-22 01:09:54,914 Stage-1 map = 100%, reduce = 0%

2015-07-22 01:10:01,968 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201507220045_0006

OK

Account clerk

Accounting clerk

Actuarie

Administrative assistant.

And so on.

Knowing that account clerks are among the occupations of the top buyers, the company can work on increasing sales in this sector.
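
Note that `select distinct` only lists the occupations of the top buyers; it does not rank them. To see which occupations account for the most purchases overall, the rows can be aggregated per occupation. This is a sketch, assuming the same `shopping` table; the aliases `customers` and `total_purchased` and the cut-off of 10 rows are arbitrary choices for this example.

```sql
-- Sketch: rank occupations by the total number of items their customers bought.
SELECT occupation,
       COUNT(*)           AS customers,
       SUM(purchaseditem) AS total_purchased
FROM shopping
GROUP BY occupation
ORDER BY total_purchased DESC
LIMIT 10;
```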

4. Using Pig

Now we will use Pig to perform some other functions, such as collecting groups of usernames of the top consumers from a particular city.

a) Load the data in Pig

`grunt> fulldata = load '/user/training/case1/' using PigStorage(',') as (username:chararray, city:chararray, occupation:chararray, purchaseditem:int, returneditem:int);`
`grunt> dump fulldata;`

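b) Now we will group the usernames of the top consumers by city.

To segregate the records of the top consumers:
`grunt> top = filter fulldata by purchaseditem==250;`
Next, group the top consumers by city to get the usernames as a collection.
The data is now grouped, but unordered.
So we need to order it.
This way, the final data is ordered.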
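
For comparison, the same filter, group, and order steps can be sketched as a single HiveQL query. This is an alternative to the Pig approach described above, not the author's Pig script; it assumes the `shopping` table from step 2 and that 250 is still the maximum found earlier. `collect_set` is a built-in Hive aggregate that gathers values into an array and drops duplicates.

```sql
-- Sketch: usernames of the top consumers, collected per city and ordered by city.
SELECT city,
       collect_set(username) AS usernames
FROM shopping
WHERE purchaseditem = 250
GROUP BY city
ORDER BY city;
```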
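Now we can put this data to use. In the same way, we can extract the cities with the fewest consumers, where the company can set aside a larger budget for advertising and publicity campaigns so that more people interact with the portal.
For the cities and consumers with a high consumption rate, the company can launch new products and expand its reach.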
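
Both ends of that ranking can be read from one per-city aggregate. This is a sketch under the same assumptions as before (the `shopping` table from step 2; the aliases are illustrative). Sorted ascending, it surfaces the cities with the fewest customers; switching to `DESC` puts the strongest markets first.

```sql
-- Sketch: number of customers and total items purchased for each city.
SELECT city,
       COUNT(*)           AS customers,
       SUM(purchaseditem) AS total_purchased
FROM shopping
GROUP BY city
ORDER BY customers ASC;
```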
If you have any doubts or queries, please mention them in the comment box below.
