Let's first go through the relevant parts of the paper:
http://www.cs.virginia.edu/hendawi/materials/PID4565219.pdf
The CAIDA anonymized Internet traces [5] we used consist of traffic spanning 3 years (2012-2014).
The CAIDA anonymized Internet traces cover three years of network data.
"CAIDA anonymized Internet traces." http://www.caida.org/data/passive/passive_2014_dataset.xml
CAIDA: Center for Applied Internet Data Analysis
The datasets contain anonymized passive traffic traces from CAIDA's equinix-chicago and equinix-sanjose monitors (located in two Equinix data centers in Chicago and San Jose, respectively) on two high-speed (10 GigE) Internet backbone links.
The anonymized traffic was collected at Equinix data centers in Chicago and San Jose, on two high-speed (10 GigE) Internet backbone links.
Raw traces were taken on Endace DAG cards and stripped of payload, i.e., the resulting pcap files (libpcap binary format) only include layer 3 (IPv4 and IPv6) and layer 4 (e.g. TCP, UDP, ICMP) headers. The 2012 and 2013 datasets contain one one-hour trace per month, while the 2014 dataset only contains one one-hour trace per quarter due to the limited storage.
Endace DAG™ (Data Acquisition and Generation)
The raw data was captured by DAG cards and stripped of payload, so the pcap files contain only layer 3 (IPv4 and IPv6) and layer 4 (e.g. TCP, UDP, ICMP) headers. The 2012 and 2013 datasets have one one-hour trace per month; for 2014, because of storage limits, there is only one one-hour trace per quarter.
Each one-hour trace consists of 120 pcap files (each file contains one-minute traffic of a single direction), each around 1 GB.
Each one-hour trace has 120 pcap files (each holding one minute of traffic in a single direction), and each file is roughly 1 GB.
Therefore, the total size of the three-year CAIDA traces is about 4 TB.
So the full three years of trace data amount to about 4 TB.
Due to the scale of our cluster, we chose a one-hour trace from the equinix-chicago monitor in 2014 of size 54.3 GB.
Given the scale of our cluster, we selected one hour of 2014 data from the equinix-chicago monitor, about 54.3 GB.
With more computer nodes involved, our system would be able to support TB-scale traffic analysis.
As more compute nodes are added, the system will be able to support TB-scale traffic analysis.
We conducted our experiments using the Apt cluster, which is housed in the University of Utah's Downtown Data Center in Salt Lake City, Utah, through the CloudLab [32] user interface.
Our experiments ran on the Apt cluster, housed in the University of Utah's data center in Salt Lake City, Utah, and operated through the CloudLab user interface.
We built a Hadoop cluster consisting of 11 bare-metal nodes.
The Hadoop cluster we built consists of 11 bare-metal nodes.
Each of the bare-metal nodes had 8 cores, 16 GB RAM, 500 GB disk space, and a 1-GE NIC. The OS on these hosts was Ubuntu 14.04.4 LTS.
Each node has 8 cores, 16 GB of RAM, a 500 GB disk, and one Gigabit Ethernet NIC; the operating system is Ubuntu 14.04.4 LTS.
We wrote a shell script to automate the installation and configuration processes of deploying Hadoop (Release 2.7.2) on the cluster.
We wrote a shell script to automatically install, configure, and deploy the Hadoop cluster.
One of the hosts served as the master running YARN resource manager [33] and HDFS namenode daemons, while the remaining two were slaves (named slave1 and slave2), both running YARN nodemanager and HDFS datanode daemons.
One host acts as the master, running the YARN resource manager and the HDFS namenode daemons; the other two nodes run the nodemanager and datanode daemons.
We also installed Hive (Release 1.2.1) on top of Hadoop.
On top of Hadoop we installed Hive (Release 1.2.1).
In order to achieve repeatable and reliable experimental results, we configure the cluster to support CPU-based scheduling in YARN.
To get repeatable and reliable experimental results, we configured the cluster to support CPU-based scheduling in YARN.
By doing this, each map or reduce task is allocated a certain amount of CPU and memory resources, and each container can only use its allocated amount of resources without interfering with other containers.
This way, each map or reduce task is allocated a fixed amount of CPU and memory, and each container can use only the resources allocated to it, without interfering with other containers.
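As a minimal sketch of what this looks like on the Hive side (the vcore/memory values below are placeholders, not the settings from the paper; enabling CPU-based scheduling itself is a cluster-side change, typically switching the YARN capacity scheduler's resource calculator to DominantResourceCalculator), the per-task resource requests can be set before running a query:

-- Request explicit CPU (vcores) and memory per map/reduce task.
-- Example values only; the paper does not list its exact settings.
SET mapreduce.map.cpu.vcores=1;
SET mapreduce.map.memory.mb=2048;
SET mapreduce.reduce.cpu.vcores=1;
SET mapreduce.reduce.memory.mb=2048;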
B. Experiment Description
Multiple experiments were undertaken to evaluate different aspects of Hobbits: (i) the correctness of Hobbits, (ii) the split overhead incurred by the splitter, (iii) the performance of different table formats of Hive (external, text, ORC), (iv) the ORC format overhead, (v) the scalability, and (vi) a comparison between Hobbits and p3 performance.
Multiple experiments were run to evaluate the various aspects of Hobbits: correctness, splitter overhead, the different Hive table formats (external, text, ORC), the ORC overhead, scalability, and a performance comparison with p3.
//External Table
CREATE EXTERNAL TABLE pcaps_ext
(ts bigint, ts_usec double, protocol string,
src string, src_port int, dst string,
dst_port int, len int, ttl int)
ROW FORMAT SERDE
'net.ripe.hadoop.pcap.serde.PcapDeserializer'
STORED AS INPUTFORMAT
'net.ripe.hadoop.pcap.io.PcapInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///input';
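With the external table defined, the trace can be queried in plain HiveQL. The query below is only an illustrative example (it is not taken from the paper): it counts packets per protocol over the one-hour trace.

-- Illustrative query: packet count per protocol, largest first
SELECT protocol, COUNT(*) AS pkts
FROM pcaps_ext
GROUP BY protocol
ORDER BY pkts DESC;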
//Text Table
hive> show tables;
OK
pcaps_ext
Time taken: 0.022 seconds, Fetched: 2 row(s)
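For the table-format comparison (external vs. text vs. ORC), the ORC variant can be materialized from the external table. The statements below are a sketch only: the table name pcaps_orc and its columns are assumed to mirror pcaps_ext, and the paper's actual DDL may differ.

-- ORC-backed copy of the packet table (name and columns assumed)
CREATE TABLE pcaps_orc
(ts bigint, ts_usec double, protocol string,
src string, src_port int, dst string,
dst_port int, len int, ttl int)
STORED AS ORC;

-- Parse the pcap files once through the external table and store the rows as ORC
INSERT OVERWRITE TABLE pcaps_orc
SELECT * FROM pcaps_ext;

Queries against pcaps_orc then read columnar ORC data instead of re-parsing the raw pcap files on every run, which is essentially the trade-off the format-performance and ORC-overhead experiments examine.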