On log analysis - ロード・トゥ・ザ・ホワイトハッカー

1. URI
2. Method identifier
3. Number of arguments
4. Length of the arguments
5. Number of digits in the arguments
6. Number of other char in the arguments
7. Number of letters in the arguments
8. Length of the Host
9. Length of the header “Accept-Encoding”
10. Length of the header “Accept”
11. Length of the header “Accept-Language”
12. Length of the header “Accept-Charset”
13. Length of the header “Referer”
14. Length of the header “User-Agent”
15. Number of cookies
16. Length of the header “Cookie”
17. Content Length
18. Request Resource Type
19. Received Bytes
20. Possibility
21. Pattern Result
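
As a rough illustration, several of the request-level features in the list above could be computed as follows. This is only a sketch: the function name `request_features`, the dictionary keys, and the choice to measure the raw (not percent-decoded) argument values are assumptions made here, not something the list itself specifies.

```python
from urllib.parse import urlsplit

def request_features(method, url, headers):
    """Compute a few request-level features (hypothetical helper, raw values)."""
    query = urlsplit(url).query
    # Split the query string by hand so percent-encoding stays untouched.
    pairs = [p for p in query.split("&") if p]
    arg_text = "".join(p.partition("=")[2] for p in pairs)
    digits = sum(c.isdigit() for c in arg_text)
    letters = sum(c.isalpha() for c in arg_text)
    cookie = headers.get("Cookie", "")
    return {
        "method": method.upper(),
        "num_args": len(pairs),
        "args_length": len(arg_text),
        "args_digits": digits,
        "args_letters": letters,
        "args_other": len(arg_text) - digits - letters,
        "host_length": len(headers.get("Host", "")),
        "user_agent_length": len(headers.get("User-Agent", "")),
        "num_cookies": len([c for c in cookie.split(";") if c.strip()]),
        "cookie_length": len(cookie),
    }
```

The remaining header lengths (Accept, Referer and so on) would follow the same `len(headers.get(...))` pattern.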

The article mentioned earlier,

https://www.scutum.jp/information/waf_tech_blog/2021/01/waf-blog-077.html

uses the following 29 items as features.

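Position of the first occurrence of %
Position of the first occurrence of :
Count of : (how many are contained)
Count of (
Count of ;
Count of %
Count of /
Count of '
Count of <
Count of ?
Count of .
Count of #
Count of %3d
Count of %2f
Count of %5c
Count of %25
Count of %20
Whether the method is POST
Count of non-alphanumeric characters in the path part of the URL
Count of non-alphanumeric characters in the query part
Length of the longest run of consecutive non-alphanumeric characters
Count of non-alphanumeric characters
Count of /%
Count of //
Count of /.
Count of ..
Count of =/
Count of ./
Count of /?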
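
A handful of these character-based features could be extracted along the following lines. This is a minimal sketch, not the implementation from the linked article: the function name `extract_features`, the key names, the use of -1 when a character is absent, and computing most counts over the whole URL string are all assumptions made here for illustration.

```python
import re
from urllib.parse import urlsplit

NON_ALNUM = re.compile(r"[^A-Za-z0-9]")
NON_ALNUM_RUN = re.compile(r"[^A-Za-z0-9]+")

def extract_features(method, url):
    """Return a dict with a subset of the listed features (hypothetical helper)."""
    parts = urlsplit(url)
    lowered = url.lower()
    runs = NON_ALNUM_RUN.findall(url)
    return {
        "first_percent_pos": url.find("%"),      # -1 when % does not occur
        "colon_count": url.count(":"),
        "percent_count": url.count("%"),
        "slash_count": url.count("/"),
        "enc_2f_count": lowered.count("%2f"),    # case-insensitive match
        "is_post": int(method.upper() == "POST"),
        "path_non_alnum": len(NON_ALNUM.findall(parts.path)),
        "query_non_alnum": len(NON_ALNUM.findall(parts.query)),
        "longest_non_alnum_run": max((len(r) for r in runs), default=0),
        "dotdot_count": url.count(".."),         # non-overlapping occurrences
    }
```

For a path-traversal style request such as `/a/../etc/passwd?x=%2Fbin`, the `..` count and the longest non-alphanumeric run (`/../`) are the values that stand out compared with an ordinary URL.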

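When the features are not based on domain knowledge, logs are converted to numbers by methods such as the following.

1. Computing the entropy

2. Vectorization using N-grams and tf-idf

3. Indexing the characters of a string

4. Converting a string to ASCII codes

Methods 1 and 2 were carried out on this blog in the previous post and the one before it.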
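
For reference, the entropy in method 1 can be computed per log line as the Shannon entropy of its character distribution. The earlier posts are not reproduced here; the following is a generic sketch, and the function name `shannon_entropy` is chosen only for this example.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the relative frequency p of each character
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A string made of one repeated character has entropy 0, while heavily encoded or randomized payloads tend to score higher than ordinary paths.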

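3. Indexing the characters of a string

This method fixes the set of characters to use and converts each character of the string into its position number within that set.

In code, it looks like the following.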

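import numpy as np
import tensorflow as tf

def data2char_index(X, max_len):
    # Characters to keep; each one is encoded as its position in this string.
    # Note that the space character gets index 0, the same value used for padding.
    alphabet = " abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>(){}"
    result = []
    for data in X:
        mat = []
        for ch in data:
            ch = ch.lower()
            if ch not in alphabet:
                continue  # characters outside the alphabet are dropped
            mat.append(alphabet.index(ch))
        result.append(mat)
    # Pad with 0 (or cut) at the end so that every row has length max_len.
    X_char = tf.keras.preprocessing.sequence.pad_sequences(
        result, padding='post', truncating='post', maxlen=max_len)
    return X_char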
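
In the function above, the space character has index 0, which is also the value `pad_sequences` uses for padding, so a model cannot tell a real space from padding. One possible variant, shown here as a dependency-free sketch rather than as the method used in this post, shifts every index by one and reserves 0 for padding. The names `ALPHABET`, `CHAR_TO_ID` and `encode_line` are hypothetical, and the duplicated `-` of the original alphabet is listed only once.

```python
ALPHABET = " abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>(){}"
# Index 0 is reserved for padding, so real characters start at 1.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

def encode_line(line, max_len):
    """Encode one log line as a fixed-length list of character ids."""
    ids = [CHAR_TO_ID[ch] for ch in line.lower() if ch in CHAR_TO_ID]
    ids = ids[:max_len]                       # truncate at the end
    return ids + [0] * (max_len - len(ids))   # pad at the end with 0
```

The dictionary lookup also avoids the repeated linear scan of `alphabet.index`, which matters when encoding large log files.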

4. Converting a string to ASCII codes

This uses Python's ord function to convert each character into its numeric character code.

def convert_to_ascii(sentence):
    # Typographic punctuation lies outside ASCII, so its Unicode code points
    # are mapped to spare values just above the ASCII range.
    special = {
        8217: 134,  # ’
        8221: 129,  # ”
        8220: 130,  # “
        8216: 131,  # ‘
        8211: 133,  # –
    }
    sentence_ascii = []
    for ch in sentence:
        code = ord(ch)
        if code in special:
            sentence_ascii.append(special[code])
        elif code <= 128:
            sentence_ascii.append(code)
        # every other character is dropped
    # Fixed-size buffer of 100 * 100 values; longer inputs are cut off.
    zer = np.zeros(10000)
    for i, code in enumerate(sentence_ascii[:10000]):
        zer[i] = code
    zer.shape = (100, 100)
    return zer