Extracting pcap source addresses, destination addresses, source ports, destination ports, and related fields with Hadoop + Hive

楚翰
2023-12-01

Related documents

First, a review of the relevant material from the paper below:

http://www.cs.virginia.edu/hendawi/materials/PID4565219.pdf

The CAIDA anonymized Internet traces [5] we used consist of traffic spanning three years (2012-2014).

"CAIDA anonymized Internet traces." http://www.caida.org/data/passive/passive_2014_dataset.xml

CAIDA: Center for Applied Internet Data Analysis

 

The datasets contain anonymized passive traffic traces from CAIDA's equinix-chicago and equinix-sanjose monitors (located in two Equinix data centers in Chicago and San Jose, respectively) on two high-speed (10 GigE) Internet backbone links.

Raw traces were taken on Endace DAG (Data Acquisition and Generation) cards and stripped of payload, i.e., the resulting pcap files (libpcap binary format) only include layer 3 (IPv4 and IPv6) and layer 4 (e.g., TCP, UDP, ICMP) headers. The 2012 and 2013 datasets contain one one-hour trace per month, while the 2014 dataset only contains one one-hour trace per quarter due to limited storage.

 

Each one-hour trace consists of 120 pcap files (each file contains one minute of traffic in a single direction), each around 1 GB. Therefore, across the 28 one-hour traces (12 monthly traces in each of 2012 and 2013, plus 4 quarterly traces in 2014), the total size of the three-year CAIDA traces is about 4 TB.

Due to the scale of our cluster, we chose one one-hour trace from the equinix-chicago monitor in 2014, 54.3 GB in size.

With more compute nodes involved, our system would be able to support TB-scale traffic analysis.

 

We conducted our experiments on the Apt cluster, which is housed in the University of Utah's Downtown Data Center in Salt Lake City, Utah, through the CloudLab [32] user interface.

We built a Hadoop cluster consisting of 11 bare-metal nodes.

Each of the bare-metal nodes had 8 cores, 16 GB RAM, 500 GB of disk space, and a 1 GbE NIC. The OS on these hosts was Ubuntu 14.04.4 LTS.

We wrote a shell script to automate the installation and configuration of Hadoop (Release 2.7.2) on the cluster.

One of the hosts served as the master, running the YARN ResourceManager [33] and HDFS NameNode daemons, while the remaining two were slaves (named slave1 and slave2), both running the YARN NodeManager and HDFS DataNode daemons.

We also installed Hive (Release 1.2.1) on top of Hadoop.

In order to achieve repeatable and reliable experimental results, we configured the cluster to support CPU-based scheduling in YARN. By doing this, each map or reduce task is allocated a fixed amount of CPU and memory resources, and each container can only use its allocated share without interfering with other containers.
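For reference, such per-job requests can be made from a Hive session in the same way pcapserde.hql sets mapred.max.split.size later in this article. The property names below are standard Hadoop MapReduce settings, but the specific values are illustrative assumptions for an 8-core / 16 GB node, not taken from the paper:

-- Illustrative per-container resource requests under YARN CPU-based scheduling
SET mapreduce.map.cpu.vcores=1;      -- one virtual core per map container
SET mapreduce.map.memory.mb=2048;    -- 2 GB per map container
SET mapreduce.reduce.cpu.vcores=2;   -- two virtual cores per reduce container
SET mapreduce.reduce.memory.mb=4096; -- 4 GB per reduce container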

B. Experiment Description

Multiple experiments were undertaken to evaluate different aspects of Hobbits: (i) the correctness of Hobbits, (ii) the split overhead incurred by the splitter, (iii) the performance of different table formats in Hive (external, text, ORC), (iv) the ORC format overhead, (v) the scalability, and (vi) a comparison between Hobbits and p3 performance.

//External Table
CREATE EXTERNAL TABLE pcaps_ext
(ts bigint, ts_usec double, protocol string,
src string, src_port int, dst string,
dst_port int, len int, ttl int)
ROW FORMAT SERDE
'net.ripe.hadoop.pcap.serde.PcapDeserializer'
STORED AS INPUTFORMAT
'net.ripe.hadoop.pcap.io.PcapInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///input';

//Text Table
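The text-table DDL is cut off in this excerpt of the paper. A minimal sketch of what the text and ORC variants evaluated in experiments (iii) and (iv) might look like, assuming the same schema (the table names pcaps_text and pcaps_orc are assumptions):

-- Sketch only: a managed plain-text table with the same schema
CREATE TABLE pcaps_text
(ts bigint, ts_usec double, protocol string,
src string, src_port int, dst string,
dst_port int, len int, ttl int)
STORED AS TEXTFILE;

-- Sketch only: the columnar ORC variant
CREATE TABLE pcaps_orc
(ts bigint, ts_usec double, protocol string,
src string, src_port int, dst string,
dst_port int, len int, ttl int)
STORED AS ORC;

-- Either would typically be populated from the external table:
INSERT OVERWRITE TABLE pcaps_orc
SELECT * FROM pcaps_ext;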

Concrete steps

Source code of the net.ripe.hadoop.pcap.packet package:

package net.ripe.hadoop.pcap.packet;


import java.util.HashMap;
import java.util.Map;


// A decoded packet is a map from field name to value. The string constants
// below are the keys that PcapDeserializer exposes to Hive, so they must
// match the column names used in the table DDL (ts, src, src_port, ...).
public class Packet extends HashMap<String, Object> {
private static final long serialVersionUID = 8723206921174160146L;

public static final String TIMESTAMP = "ts";
public static final String TIMESTAMP_USEC = "ts_usec";
public static final String TIMESTAMP_MICROS = "ts_micros";
public static final String TTL = "ttl";
public static final String IP_VERSION = "ip_version";
public static final String IP_HEADER_LENGTH = "ip_header_length";
public static final String IP_FLAGS_DF = "ip_flags_df";
public static final String IP_FLAGS_MF = "ip_flags_mf";
public static final String IPV6_FLAGS_M = "ipv6_flags_m";
public static final String FRAGMENT_OFFSET = "fragment_offset";
public static final String FRAGMENT = "fragment";
public static final String LAST_FRAGMENT = "last_fragment";
public static final String PROTOCOL = "protocol";
public static final String SRC = "src";
public static final String DST = "dst";
public static final String ID = "id";
public static final String SRC_PORT = "src_port";
public static final String DST_PORT = "dst_port";
public static final String TCP_HEADER_LENGTH = "tcp_header_length";
public static final String TCP_SEQ = "tcp_seq";
public static final String TCP_ACK = "tcp_ack";
public static final String LEN = "len";
public static final String UDPSUM = "udpsum";
public static final String UDP_LENGTH = "udp_length";
public static final String TCP_FLAG_NS = "tcp_flag_ns";
public static final String TCP_FLAG_CWR = "tcp_flag_cwr";
public static final String TCP_FLAG_ECE = "tcp_flag_ece";
public static final String TCP_FLAG_URG = "tcp_flag_urg";
public static final String TCP_FLAG_ACK = "tcp_flag_ack";
public static final String TCP_FLAG_PSH = "tcp_flag_psh";
public static final String TCP_FLAG_RST = "tcp_flag_rst";
public static final String TCP_FLAG_SYN = "tcp_flag_syn";
public static final String TCP_FLAG_FIN = "tcp_flag_fin";
public static final String REASSEMBLED_TCP_FRAGMENTS = "reassembled_tcp_fragments";
public static final String REASSEMBLED_DATAGRAM_FRAGMENTS = "reassembled_datagram_fragments";

// ... remaining constants and methods omitted in this excerpt ...
}
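Any of the keys above can in principle be surfaced as a Hive column, not just the eight used in pcapserde.hql below. A hedged sketch of a wider table exposing TCP flags and IP metadata (the table name pcaps_flags is made up for illustration, and the flag columns are assumed here to deserialize as booleans):

-- Hypothetical wider table; column names must match the constants above
CREATE EXTERNAL TABLE pcaps_flags
(ts bigint, protocol string, src string, src_port int,
dst string, dst_port int, ttl int, ip_version int,
tcp_flag_syn boolean, tcp_flag_ack boolean, tcp_flag_fin boolean)
ROW FORMAT SERDE 'net.ripe.hadoop.pcap.serde.PcapDeserializer'
STORED AS INPUTFORMAT 'net.ripe.hadoop.pcap.io.PcapInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///input';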

Procedure

Edit the file pcapserde.hql:
[username@hostname]$ cat pcapserde.hql
CREATE DATABASE IF NOT EXISTS pcapfile;
USE pcapfile;
ADD JAR hdfs:///hadoop-pcap-serde-0.2-SNAPSHOT-jar-with-dependencies.jar;
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=104857600;
SET net.ripe.hadoop.pcap.io.header.class=net.ripe.hadoop.pcap.PcapReader;
create external table if not exists pcapfile.pcaps_ext
(ts bigint,protocol string,src string,src_port int,dst string,dst_port int,len int,ttl int)
row format serde 'net.ripe.hadoop.pcap.serde.PcapDeserializer'
stored as inputformat 'net.ripe.hadoop.pcap.io.PcapInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs:///<path>';
Run hive:
hive> show databases;
OK
default
pcapfile
Time taken: 0.018 seconds, Fetched: 3 row(s)

hive> show tables;
OK
pcaps_ext
Time taken: 0.022 seconds, Fetched: 2 row(s)

hive> source /home/clusteruser/opt/analyzer-pcap/tbl-description/pcapserde.hql;
OK
Time taken: 0.86 seconds
OK
Time taken: 0.014 seconds
Added [/tmp/925fd415-a1db-4712-a6fe-c9a7b722fb48_resources/hadoop-pcap-serde-0.2-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hadoop-pcap-serde-0.2-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 0.138 seconds
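At this point the table exists. Before querying it, the definition can be sanity-checked with standard Hive commands — this step is an addition for clarity, not part of the original transcript. DESCRIBE lists the column names and types; SHOW CREATE TABLE confirms that the SerDe and input format took effect:

hive> USE pcapfile;
hive> DESCRIBE pcaps_ext;
hive> SHOW CREATE TABLE pcaps_ext;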
hive> select * from pcaps_ext limit 1,10;
OK
linkTypeVal = 1
1449396088      NULL    NULL    NULL    NULL    NULL    NULL    NULL
1449396088      UDP     172.16.10.250   1985    224.0.0.2       1985    20      1
1449396088      UDP     172.16.10.251   1985    224.0.0.2       1985    20      1
1449396089      UDP     172.16.10.250   1985    224.0.0.2       1985    20      1
1449396089      UDP     172.16.10.251   1985    224.0.0.2       1985    20      1
1449396090      NULL    NULL    NULL    NULL    NULL    NULL    NULL
1449396090      UDP     172.16.10.250   1985    224.0.0.2       1985    20      1
1449396090      UDP     172.16.10.251   1985    224.0.0.2       1985    20      1
1449396091      UDP     172.16.10.250   1985    224.0.0.2       1985    20      1
1449396091      UDP     172.16.10.251   1985    224.0.0.2       1985    20      1

Implementing the data extraction
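With pcaps_ext in place, the fields named in the title can be pulled out with ordinary HiveQL. A minimal sketch follows; the WHERE filter and the table name pcap_flows are illustrative assumptions (the NULL rows seen in the sample output above are packets the reader could not decode):

-- Pull the source/destination addresses and ports for decodable packets:
SELECT src, src_port, dst, dst_port
FROM pcapfile.pcaps_ext
WHERE src IS NOT NULL;

-- Materialize the extracted fields for downstream processing
-- (table name pcap_flows is an assumption):
CREATE TABLE pcapfile.pcap_flows STORED AS TEXTFILE AS
SELECT ts, protocol, src, src_port, dst, dst_port
FROM pcapfile.pcaps_ext
WHERE src IS NOT NULL;

-- Example aggregation: the ten busiest (src, dst, dst_port) triples:
SELECT src, dst, dst_port, COUNT(*) AS pkts
FROM pcapfile.pcaps_ext
GROUP BY src, dst, dst_port
ORDER BY pkts DESC
LIMIT 10;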
