Data Processing

Records from the Censys database are stored using JSON documents with deeply nested fields, containing location and ownership (AS) properties, as well as attributes extracted from parsed responses, including headers, banners, certificate chains, and so on. However, while these documents contain a wide range of characteristics about Internet hosts, the information cannot be fed into a classification model out of the box, and we need to convert these documents to numerical feature vectors for analysis by a machine learning model.

JSON documents follow a tree-like structure, allowing different fields to be nested inside one another, e.g., properties regarding the location of a host, including country, city, latitude, and longitude. Therefore, simply extracting tokens from the string corresponding to a JSON document fails to recognize its structure, and does not provide any information about the field from which the token was extracted.
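The difference between flat and structure-aware tokenization can be illustrated with a minimal sketch. The record below and its field names (`location.country`, `autonomous_system.name`, `p80`) are hypothetical stand-ins, not the actual Censys schema.

```python
import json
import re

def naive_tokens(doc):
    """Tokenize the serialized document, ignoring its structure."""
    return set(re.findall(r"[a-z0-9]+", json.dumps(doc).lower()))

def path_tokens(doc, prefix=""):
    """Walk the JSON tree and pair every leaf token with its field path."""
    out = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            child = f"{prefix}.{key}" if prefix else key
            out |= path_tokens(value, child)
    elif isinstance(doc, list):
        for item in doc:
            out |= path_tokens(item, prefix)
    else:
        for tok in re.findall(r"[a-z0-9]+", str(doc).lower()):
            out.add((prefix, tok))
    return out

# Hypothetical record: the same word appears in two unrelated fields.
record = {
    "location": {"country": "Georgia"},
    "autonomous_system": {"name": "GEORGIA-TECH University Network"},
    "p80": {"http": {"headers": {"server": "nginx"}}},
}

flat = naive_tokens(record)        # "georgia" appears once, origin unknown
structured = path_tokens(record)   # one entry per (field path, token) pair
```

In the flat token set, "georgia" cannot be traced to either the country or the AS name, whereas the path-aware version keeps the two occurrences apart.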

To address the above problem, we use the approach developed by Sarabi and Liu (2018) to extract high-dimensional binary feature vectors from these documents. This feature extraction algorithm first learns the schema (JSON Schema 2021) of JSON documents in the Censys database by inspecting a number of sample documents and then extracts binary features from each field according to the learned schema. This produces features that can be attributed to fields of the original JSON documents and are extracted according to the data type of those fields (i.e., string, categorical, Boolean). Furthermore, for optional fields we can also generate features that reflect their existence in a document, e.g., open ports, or whether a host is returning headers/banners for different protocols.

Figure 21.1 shows an example of how a JSON document can be transformed into a binary vector representation using this approach. Note that each generated feature is assigned to a certain field of the original JSON document, allowing us to separate features extracted from location and ownership (AS) information, as well as features extracted from different port responses. This allows us to gradually add the information of scanned ports to our models when predicting the responses of the remaining ports.
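As a rough illustration of schema-driven extraction, the sketch below maps a document to named binary features. It is not the algorithm of Sarabi and Liu (2018): the schema is hard-coded rather than learned, and the field paths and the feature naming scheme (`:exists`, `=value`, `~token`) are assumptions made for this example.

```python
import re

# Hypothetical schema (field path -> data type), hard-coded for illustration.
SCHEMA = {
    "location.country": "categorical",
    "autonomous_system.name": "string",
    "p80.http.headers.server": "string",
    "p443.https.tls.valid": "boolean",
}

def get_path(doc, path):
    """Return the value at a dotted path, or None if a component is missing."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc

def extract_features(doc, schema=SCHEMA):
    """Return the names of the binary features that are active for `doc`."""
    active = set()
    for path, kind in schema.items():
        value = get_path(doc, path)
        if value is None:
            continue                        # optional field is absent
        active.add(f"{path}:exists")        # existence feature
        if kind == "categorical":
            active.add(f"{path}={value}")
        elif kind == "boolean" and value:
            active.add(f"{path}:true")
        elif kind == "string":
            for tok in re.findall(r"[a-z0-9]+", str(value).lower()):
                active.add(f"{path}~{tok}")
    return active

def to_vector(active, vocabulary):
    """Encode active feature names as a 0/1 vector over a fixed vocabulary."""
    return [1 if name in active else 0 for name in vocabulary]

doc = {
    "location": {"country": "US"},
    "autonomous_system": {"name": "Example University"},
    "p80": {"http": {"headers": {"server": "nginx"}}},
}
active = extract_features(doc)
vocabulary = ["location.country=US", "location.country=DE",
              "p80.http.headers.server:exists", "p443.https.tls.valid:exists"]
vector = to_vector(active, vocabulary)
port80_only = {name for name in active if name.startswith("p80.")}
```

Because every feature name starts with its field path, features from a given port (here `port80_only`) can be selected or withheld independently of the location and AS features.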

We train the feature extraction model from Sarabi and Liu (2018) on one million randomly drawn records from the 1/1/2019 Censys snapshot (chosen independently from the dataset detailed in Section 21.2.2), producing 14443 binary features extracted from 37 different ports. To control the number of generated features, we impose a limit of $0.05 \%$ on the sparsity of extracted features. These features are in the form of tags assigned to a host, e.g., if a host responds to probes on a certain port, if it belongs to a particular country, or if we observe certain tokens in fields inside the document, e.g., AS names, headers/banners, etc.
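One plausible reading of the sparsity limit is that a feature is kept only if it is active in at least 0.05% of the sampled records. The sketch below implements that reading; the interpretation and the function name `prune_sparse` are assumptions, not details confirmed by the text.

```python
from collections import Counter

def prune_sparse(feature_sets, min_fraction=0.0005):
    """Keep features that are active in at least `min_fraction` of records.

    `feature_sets` holds one set of active feature names per record.
    """
    counts = Counter()
    for active in feature_sets:
        counts.update(active)      # each record counts a feature at most once
    threshold = min_fraction * len(feature_sets)
    return sorted(name for name, n in counts.items() if n >= threshold)

# Toy sample of 4000 records: "b" is active in one record only (0.025%).
sample = [{"a", "b"}] + [{"a"} for _ in range(3999)]
kept = prune_sparse(sample)
```

With one million sampled records, the same threshold would require a feature to be active in at least 500 of them.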

We exclude features that are extracted from Censys documents’ metadata, which are added by Censys by processing the information gathered from all scanned ports and cannot be assigned to a certain port. We further remove 11 ports that have been observed on less than $0.3 \%$ of active IP addresses, since we cannot collect enough samples on these ports for training robust models. We also remove port 3389 (RDP protocol), observed on $1.9 \%$ of active IPs, due to poor prediction performance, indicating that our feature set is not effective in predicting responses for this port. After pruning the feature set we obtain 13679 features from 20 ports, as well as location and AS properties, for training/evaluating our framework. Table 21.1 contains the fields/ports used for our analysis, as well as the number of features extracted from each field, and frequencies of active/open ports among active IP addresses.
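The port-pruning step can be sketched as a frequency filter over active IP addresses. The sketch assumes an IP counts as active when at least one port responds; the toy data and the function name `select_ports` are illustrative only.

```python
from collections import Counter

def select_ports(open_ports_per_ip, min_fraction=0.003, excluded=(3389,)):
    """Keep ports open on at least `min_fraction` of active IPs, minus `excluded`."""
    active = [set(ports) for ports in open_ports_per_ip if ports]
    if not active:
        return []
    counts = Counter(port for ports in active for port in ports)
    return sorted(
        port for port, n in counts.items()
        if n / len(active) >= min_fraction and port not in excluded
    )

# Toy data: 1000 active IPs plus 50 inactive ones (no open ports).
hosts = [{80} for _ in range(1000)] + [set() for _ in range(50)]
for i in range(19):
    hosts[i].add(3389)   # 1.9% of active IPs, excluded explicitly as in the text
for i in range(2):
    hosts[i].add(9999)   # 0.2% of active IPs, below the 0.3% cut-off
kept_ports = select_ports(hosts)
```

Port 3389 passes the frequency filter here and is removed only through the explicit exclusion list, mirroring the separate justification given for it above.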

Features for Model Training

For training a model, we first produce labels for each port of an IP address by observing whether Censys has reported a response under said port for its record of that IP address. Note that for an inactive IP, all the produced labels are zero, meaning that no port is responding to requests. We then use different subsets of the binary features discussed in Section 21.2.3 for training binary classifiers, as detailed below.
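Label generation amounts to checking, for every port under study, whether the record contains a response for it. A minimal sketch, assuming that port responses are stored under top-level keys such as `p80` (a hypothetical layout chosen for illustration):

```python
def port_labels(record, ports):
    """Return {port: 0 or 1}; 1 means the record holds a response for that port."""
    return {port: int(f"p{port}" in record) for port in ports}

PORTS = [22, 80, 443]

active_host = {"p80": {"http": {}}, "p443": {"https": {}}}
inactive_host = {}   # no port responded, so every label is zero

labels_active = port_labels(active_host, PORTS)
labels_inactive = port_labels(inactive_host, PORTS)
```

Each port's label column then serves as the target of its own binary classifier.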

Prescan features. These include features extracted from location and AS properties, which are available before performing any scans. These features provide a priori information about each host, which can be used as initial attributes for predicting port responses. Location information can help detect patterns in the behavior of IPs in different regions, while AS properties can help predict labels based on the type/owner of the IP address. For instance, observing the word “university” in an AS name can indicate an educational network, while “cable” can help recognize residential/ISP networks.

Postscan features. Assuming that probes are performed sequentially, classifiers can also leverage features extracted from previous probes of an IP address for predicting the responses of the remaining ports. These then provide a posteriori features for classification. Note that with a stateless scanner such as ZMap (Durumeric et al. 2013), we only record whether a host has responded on a certain port, resulting in a single binary feature per probed port. However, with a stateful scanner such as ZGrab (The ZMap Project 2021), a full handshake is completed with the server, and subsequent classifiers can also make use of parsed responses, resulting in a richer feature set. We evaluate both of these cases to determine the improvement provided by machine learning for stateless and stateful scans.
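The feature set available to the classifier for the next port can be assembled as follows. This is a sketch under stated assumptions: the feature names, the `(port, responded, parsed_features)` tuple layout, and the function name are all hypothetical.

```python
def features_for_next_probe(prescan, scanned, mode):
    """Combine prescan features with features from ports already probed.

    prescan: set of feature names known before any scan
    scanned: list of (port, responded, parsed_features) in probe order
    mode:    "stateless" (open/closed bit only) or "stateful" (parsed responses too)
    """
    if mode not in ("stateless", "stateful"):
        raise ValueError(f"unknown mode: {mode}")
    features = set(prescan)
    for port, responded, parsed in scanned:
        if responded:
            features.add(f"p{port}:open")    # the single stateless bit
            if mode == "stateful":
                features |= parsed           # features from the parsed response
    return features

prescan = {"location.country=US", "autonomous_system.name~university"}
scanned = [
    (80, True, {"p80.http.headers.server~nginx"}),
    (22, False, set()),
]
stateless = features_for_next_probe(prescan, scanned, "stateless")
stateful = features_for_next_probe(prescan, scanned, "stateful")
```

The stateless feature set is always a subset of the stateful one, which is why comparing the two cases isolates the benefit of parsed responses.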

