Happy learning starts with translation_Multivariate Time Series Forecasting with LSTMs in Keras_3_Multivariate LSTM Forecast_LabelEncoder and OneHotEncoder in feature engineering

3. Multivariate LSTM Forecast Model

In this section, we will fit an LSTM to the problem.

LSTM Data Preparation

The first step is to prepare the pollution dataset for the LSTM.

This involves framing the dataset as a supervised learning problem and normalizing the input variables.

We will frame the supervised learning problem as predicting the pollution at the current hour (t) given the pollution measurement and weather conditions at the prior time step. (Translator's note: "pollution" here simply means the PM2.5 value; reading it that way avoids confusion.)

This formulation is straightforward and is just for this demonstration. Some alternate formulations you could explore include:

  • Predict the pollution for the next hour based on the weather conditions and pollution over the last 24 hours.
  • Predict the pollution for the next hour as above, given the “expected” weather conditions for the next hour.

We can transform the dataset using the series_to_supervised() function developed in the earlier blog post:

First, the “pollution.csv” dataset is loaded. The wind direction feature is label encoded (integer encoded); the original post says “wind speed”, but the column at index 4 that gets encoded is the categorical wind direction, as the code comment below notes. This could further be one-hot encoded in the future if you are interested in exploring it. (Translator's note: label encoding and one-hot encoding are two distinct preprocessing techniques, so the English terms are kept as-is; see the LabelEncoder and OneHotEncoder notes at the end of this post.)

Next, all features are normalized, and the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed: with 8 variables and 1 lag step, columns 0 to 7 are the (t-1) inputs, column 8 is var1(t) (the PM2.5 target), and columns 9 to 15 are var2(t) through var8(t), which is why the listing drops columns [9, 10, 11, 12, 13, 14, 15].

The complete code listing is provided below.

from pandas import DataFrame
from pandas import concat
from pandas import read_csv
from pandas import set_option
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler


# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg


# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:, 4] = encoder.fit_transform(values[:, 4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9, 10, 11, 12, 13, 14, 15]], axis=1, inplace=True)
set_option('display.max_columns', None)
print(reframed.head())

Running the example prints the first 5 rows of the transformed dataset. We can see the 8 input variables (the input series) and the 1 output variable (the pollution level at the current hour).

   var1(t-1)  var2(t-1)  var3(t-1)  var4(t-1)  var5(t-1)  var6(t-1)  var7(t-1)  \
1   0.129779   0.352941   0.245902   0.527273   0.666667   0.002290   0.000000
2   0.148893   0.367647   0.245902   0.527273   0.666667   0.003811   0.000000
3   0.159960   0.426471   0.229508   0.545454   0.666667   0.005332   0.000000
4   0.182093   0.485294   0.229508   0.563637   0.666667   0.008391   0.037037
5   0.138833   0.485294   0.229508   0.563637   0.666667   0.009912   0.074074

   var8(t-1)   var1(t)
1        0.0  0.148893
2        0.0  0.159960
3        0.0  0.182093
4        0.0  0.138833
5        0.0  0.109658

This data preparation is simple and there is more we could explore. Some ideas you could look at include:

  • One-hot encoding the wind direction (sketched after the next paragraph).
  • Making all series stationary with differencing and seasonal adjustment.
  • Providing more than 1 hour of input time steps (also sketched below).

This last point is perhaps the most important, given the use of Backpropagation Through Time by LSTMs when learning sequence prediction problems.
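Both of the sketched ideas are small changes to the listing above. Here is a minimal sketch of my own (not from the original post), assuming the wind direction column in pollution.csv is named 'wnd_dir' and picking a 24-hour window; normalization with MinMaxScaler is omitted for brevity:

from pandas import get_dummies
from pandas import read_csv

dataset = read_csv('pollution.csv', header=0, index_col=0)
# one-hot encode wind direction: one 0/1 column per distinct value,
# instead of a single integer-coded column ('wnd_dir' is an assumed name)
dataset = get_dummies(dataset, columns=['wnd_dir'])
values = dataset.values.astype('float32')
# provide 24 hours of input time steps instead of 1
reframed = series_to_supervised(values, n_in=24, n_out=1)

With n_in=24 the reframed data holds 24 lag copies of every column, so the column dropping and the reshaping later in the post would need to change to match.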

Knowledge points used in this post:

pandas.DataFrame.astype

DataFrame.astype(dtype, copy=True, errors='raise', **kwargs) [source]

Cast a pandas object to a specified dtype dtype.

Parameters:

dtype : data type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copy : bool, default True.

Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

errors : {‘raise’, ‘ignore’}, default ‘raise’.

Control raising of exceptions on invalid data for provided dtype.

  • raise : allow exceptions to be raised
  • ignore : suppress exceptions. On error return original object

New in version 0.20.0.

raise_on_error : raise on invalid input

Deprecated since version 0.20.0: Use errors instead

kwargs  :  keyword arguments to pass on to the constructor
Returns:
casted  :  type of caller

See also

pandas.to_datetime
Convert argument to datetime.
pandas.to_timedelta
Convert argument to timedelta.
pandas.to_numeric
Convert argument to a numeric type.
numpy.ndarray.astype
Cast a numpy array to a specified type.

Examples

>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype('int64')
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> ser.astype('category', ordered=True, categories=[2, 1])
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1,2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1  # note that s1[0] has changed too
0    10
1     2
dtype: int64

pandas.concat

pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True) [source]

Concatenate pandas objects along a particular axis with optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

Parameters:

objs : a sequence or mapping of Series, DataFrame, or Panel objects

If a dict is passed, the sorted keys will be used as the keys argument, unless keys is passed explicitly, in which case the dict's values will be selected (see below). Any None objects will be dropped silently, unless they are all None, in which case a ValueError will be raised.

axis : {0/’index’, 1/’columns’}, default 0

The axis to concatenate along

join : {‘inner’, ‘outer’}, default ‘outer’

How to handle indexes on other axis(es)

join_axes : list of Index objects

Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic

ignore_index : boolean, default False

If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

keys : sequence, default None

If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level

levels : list of sequences, default None

Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys

names : list, default None

Names for the levels in the resulting hierarchical index

verify_integrity : boolean, default False

Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation

sort : boolean, default None

Sort non-concatenation axis if it is not already aligned when join is ‘outer’. The current default of sorting is deprecated and will change to not-sorting in a future version of pandas.

Explicitly pass sort=True to silence the warning and sort. Explicitly pass sort=False to silence the warning and not sort.

This has no effect when join='inner', which already preserves the order of the non-concatenation axis.

New in version 0.23.0.

copy : boolean, default True

If False, do not copy data unnecessarily

Returns:

concatenated : object, type of objs

When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.

Notes

The keys, levels, and names arguments are all optional.

A walkthrough of how this method fits in with other tools for combining pandas objects can be found here.

Examples

Combine two Series.

>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0    a
1    b
0    c
1    d
dtype: object

Clear the existing index and reset it in the result by setting the ignore_index option to True.

>>> pd.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object

Add a hierarchical index at the outermost level of the data with the keys option.

>>> pd.concat([s1, s2], keys=['s1', 's2',])
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object

Label the index keys you create with the names option.

>>> pd.concat([s1, s2], keys=['s1', 's2'],
...           names=['Series name', 'Row ID'])
Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object

Combine two DataFrame objects with identical columns.

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
...                    columns=['letter', 'number'])
>>> df1
  letter  number
0      a       1
1      b       2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
...                    columns=['letter', 'number'])
>>> df2
  letter  number
0      c       3
1      d       4
>>> pd.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.

>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
...                    columns=['letter', 'number', 'animal'])
>>> df3
  letter  number animal
0      c       3    cat
1      d       4    dog
>>> pd.concat([df1, df3])
  animal letter  number
0    NaN      a       1
1    NaN      b       2
0    cat      c       3
1    dog      d       4

Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.

>>> pd.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine DataFrame objects horizontally along the x axis by passing in axis=1.

>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
...                    columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george

Prevent the result from including duplicate index values with the verify_integrity option.

>>> df5 = pd.DataFrame([1], index=['a'])
>>> df5
   0
a  1
>>> df6 = pd.DataFrame([2], index=['a'])
>>> df6
   0
a  2
>>> pd.concat([df5, df6], verify_integrity=True)
Traceback (most recent call last):
    ...
ValueError: Indexes have overlapping values: ['a']

pandas.DataFrame.dropna

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) [source]

Remove missing values.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Determine if rows or columns which contain missing values are removed.

  • 0, or ‘index’ : Drop rows which contain missing values.
  • 1, or ‘columns’ : Drop columns which contain missing value.

Deprecated since version 0.23.0: Pass tuple or list to drop on multiple axes.

how : {‘any’, ‘all’}, default ‘any’

Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

  • ‘any’ : If any NA values are present, drop that row or column.
  • ‘all’ : If all values are NA, drop that row or column.

thresh : int, optional

Require that many non-NA values.

subset : array-like, optional

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

inplace : bool, default False

If True, do operation inplace and return None.

Returns:

DataFrame

DataFrame with NA entries dropped from it.

See also

DataFrame.isna
Indicate missing values.
DataFrame.notna
Indicate existing (non-missing) values.
DataFrame.fillna
Replace missing values.
Series.dropna
Drop missing values.
Index.dropna
Drop missing indices.

Examples

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'born'])
       name        toy       born
1    Batman  Batmobile 1940-04-25

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)
>>> df
     name        toy       born
1  Batman  Batmobile 1940-04-25

A side question that came up while studying: what is the difference between a Series and a DataFrame?

Series vs. DataFrame: a Series is a one-dimensional labeled array (a single column of values with an index), while a DataFrame is a two-dimensional labeled table whose columns are each a Series.
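A quick illustration (my own example, not from the original post):

>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])                              # 1-D: values plus an index
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})   # 2-D: labeled columns
>>> type(df['a'])                                         # each column is a Series
<class 'pandas.core.series.Series'>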

range(10, 0, -1) starts at 10 and counts down, stopping before it reaches 0 (the stop value is excluded), so it yields 10, 9, 8, 7, 6, 5, 4, 3, 2, 1; in other words, the reverse of range(1, 11). This is exactly how series_to_supervised() walks the lags from t-n down to t-1.
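To verify (a quick check of my own, not from the original post):

>>> list(range(10, 0, -1))
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
>>> list(range(1, 11))[::-1]   # the same sequence, built the other way round
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]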

header : int or list of ints, default ‘infer’

Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

index_col : int or sequence or False, default None

Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to _not_ use the first column as the index (row names)
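These two parameters are exactly how the listing above reads the data, taking the first row as the column names and the first column (the datetime) as the row index:

>>> from pandas import read_csv
>>> dataset = read_csv('pollution.csv', header=0, index_col=0)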

LabelEncoder and OneHotEncoder in feature engineering

In feature engineering you will sometimes reach for LabelEncoder and OneHotEncoder.

For example, in Kaggle data the Sex feature typically takes two values, male and female. The naive approach is to simply represent male as 0 and female as 1; as noted above, this is unreliable, because it imposes an artificial numeric ordering on an unordered category.

So we use one-hot encoding instead.

First we need to use LabelEncoder to represent the discrete values in the Sex column as numbers, i.e., to map string values such as male and female to integers.

Take the train dataset from the Titanic competition as an example.


Step 1 and step 2 do the following: first, fit the encoder on all samples' Sex values, so it learns how many distinct values there are (with male and female it uses 0 and 1; if there were three distinct values it would use 0, 1 and 2); the transform in step 2 then converts the column to that integer representation.
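A minimal sketch of those two steps (the variable names and the train.csv path are my assumptions about the Titanic data; note that fit() sorts the classes, so female maps to 0 and male to 1):

from pandas import read_csv
from sklearn.preprocessing import LabelEncoder

train = read_csv('train.csv')               # Titanic training data (assumed path)
encoder = LabelEncoder()
encoder.fit(train['Sex'])                   # step 1: learn the set of distinct values
sex_int = encoder.transform(train['Sex'])   # step 2: strings -> integers (female=0, male=1)
print(encoder.classes_)                     # ['female' 'male']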


But this integer representation is still not enough. As noted above, using the numbers directly is not appropriate (the scikit-learn documentation explains why), so we further convert these integers into one-hot form.

This is where OneHotEncoder comes in: two lines of code convert the integer representation into one-hot form.
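A sketch of those two lines (using the scikit-learn API of that era; sex_int is the integer-encoded array from the previous step):

from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder(sparse=False)
sex_onehot = onehot.fit_transform(sex_int.reshape(-1, 1))  # one 0/1 column per category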

If the printed output is truncated with ellipses, add the following options as needed:

pd.set_option('display.height', 1000)   # deprecated in newer pandas; the options below cover it
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


Reposted from blog.csdn.net/dreamscape9999/article/details/80706886