TensorFlow: tutorials/word2vec/word2vec_basic.py

Vectorizing text is the first step in doing NLP with deep learning. These notes walk through the code of word2vec_basic.py.

1. Building the raw dataset


# all the words in the corpus
vocabulary = read_data(filename)
# number of most common words to keep
vocabulary_size = 50000
# data: the corpus encoded as word IDs
# count: list of ("word", count) pairs
# dictionary: {"word": id}
# reverse_dictionary: {id: "word"}
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary, vocabulary_size)

count and the encoded data:

count:
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
data:
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
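
For reference, a minimal sketch of what build_dataset does, following the tutorial's logic (not the exact tutorial code):

```python
import collections

def build_dataset(words, n_words):
  """Keep the n_words most common words; everything else becomes UNK (id 0)."""
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = {word: i for i, (word, _) in enumerate(count)}
  data = [dictionary.get(word, 0) for word in words]  # 0 is the id of UNK
  count[0][1] = data.count(0)                         # fill in the real UNK count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary
```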

2. Generating training samples


data_index  # current read position in data
batch_size  # number of samples per loss computation
# How many words to consider left and right: skip_window words on each side
skip_window = 1
# How many times to reuse an input to generate a label: with one word on each
# side, at most 2 samples can be formed per input word
num_skips = 2

['anarchism', 'originated', 'as', 'a', 'term', 'of']

input              label  # samples
3084 originated -> 12 as
3084 originated -> 5239 anarchism
12 as           -> 3084 originated
12 as           -> 6 a
6 a             -> 12 as
6 a             -> 195 term
195 term        -> 2 of
195 term        -> 6 a
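
A simplified sketch of generate_batch that produces pairs like these (the real tutorial version uses a collections.deque as a sliding window, but the sampling logic is the same):

```python
import random
import numpy as np

def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window, target, skip_window ]
  for i in range(batch_size // num_skips):
    center = data_index + skip_window                        # the input word
    contexts = [j for j in range(span) if j != skip_window]  # its neighbors
    random.shuffle(contexts)
    for k in range(num_skips):
      batch[i * num_skips + k] = data[center]
      labels[i * num_skips + k, 0] = data[data_index + contexts[k]]
    data_index = (data_index + 1) % (len(data) - span)  # slide the window
  return batch, labels
```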

3. Samples, the loss function, an optimizer to minimize the loss, and the training loop


batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
num_sampled = 64      # Number of negative examples to sample.

# placeholders: must be fed at run time
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

# vocabulary_size words, each represented by an embedding_size-dimensional vector.
# embedding_lookup takes from the big [vocabulary_size, embedding_size] matrix the
# rows matching the batch inputs, e.g. [3084, 3084, 12, 12, 6, 6, 195, 195], giving
# a [batch_size, embedding_size] matrix: the input vectors
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss.
# Note the NCE weights are sized by vocabulary_size, not batch_size.

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Compute the average NCE loss for the batch.
loss = tf.reduce_mean(
        tf.nn.nce_loss(
            weights=nce_weights,
            biases=nce_biases,
            labels=train_labels,
            inputs=embed,
            num_sampled=num_sampled,
            num_classes=vocabulary_size))

# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

# Begin training.
num_steps = 100001
with tf.Session(graph=graph) as session:
  for step in xrange(num_steps):
    batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    _,loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
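
After training, the tutorial also evaluates the embeddings by cosine similarity against a few validation words; roughly as below (valid_dataset is the tf.constant shown later under tf.name_scope):

```python
# L2-normalize each row; a matmul of normalized vectors is then cosine similarity.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
```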

The rest of these notes collect the definitions of the specific Python functions encountered above.

collections: counting word occurrences in the dataset

Type:        module: collections
String form: <module 'collections' from '/usr/lib/python2.7/collections.pyc'>
File:        /usr/lib/python2.7/collections.py
Docstring:  
This module implements specialized container datatypes providing
alternatives to Python's general purpose built-in containers, dict,
list, set, and tuple.

* deque        list-like container with fast appends and pops on either end
* Counter      dict subclass for counting hashable objects
* OrderedDict  dict subclass that remembers the order entries were added


collections.Counter:
Dict subclass for counting hashable items.  Sometimes called a bag
or multiset.  Elements are stored as dictionary keys and their counts
are stored as dictionary values.

>>> c = Counter('abcdeabcdabcaba')  # count elements from a string

>>> c.most_common(3)                # three most common elements
[('a', 5), ('b', 4), ('c', 3)]
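
This is how the tutorial builds the count list shown in step 1 (sketch), and it explains why 'UNK' prints as a list while the rest are tuples:

```python
import collections

count = [['UNK', -1]]  # a mutable list, so the UNK count can be filled in later
count.extend(collections.Counter(vocabulary).most_common(vocabulary_size - 1))
```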

os: operating-system library, used for cross-platform portability

e.g. it resolves whether a file path uses "\" or "/"

Type:        module: os
String form: <module 'os' from '/usr/lib/python2.7/os.pyc'>
File:        /usr/lib/python2.7/os.py
Docstring:  
OS routines for NT or Posix depending on what system we're on.

This exports:
  - all functions from posix, nt, os2, or ce, e.g. unlink, stat, etc.
  - os.path is one of the modules posixpath, or ntpath
  - os.name is 'posix', 'nt', 'os2', 'ce' or 'riscos'
  - os.curdir is a string representing the current directory ('.' or ':')
  - os.pardir is a string representing the parent directory ('..' or '::')
  - os.sep is the (or a most common) pathname separator ('/' or ':' or '\\')
  - os.extsep is the extension separator ('.' or '/')
  - os.altsep is the alternate pathname separator (None or '/')
  - os.pathsep is the component separator used in $PATH etc
  - os.linesep is the line separator in text files ('\r' or '\n' or '\r\n')
  - os.defpath is the default search path for executables
  - os.devnull is the file path of the null device ('/dev/null', etc.)

Programs that import and use 'os' stand a better chance of being
portable between different platforms.  Of course, they must then
only use functions that are defined by all platforms (e.g., unlink
and opendir), and leave all pathname manipulation to os.path
(e.g., split and join).
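
For example, joining path components portably instead of hard-coding a separator:

```python
import os

# 'data/text8.zip' on POSIX, 'data\\text8.zip' on Windows
filepath = os.path.join('data', 'text8.zip')
```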

sys: objects and functions related to the interpreter

String form: <module 'sys' (built-in)>
Docstring:
This module provides access to some objects used or maintained by the
interpreter and to functions that interact strongly with the interpreter.

Dynamic objects:
argv -- command line arguments; argv[0] is the script pathname if known
path -- module search path; path[0] is the script directory, else ''
modules -- dictionary of loaded modules

argparse: command-line arguments for Python scripts

Type:        module: argparse
String form: <module 'argparse' from '/usr/lib/python2.7/argparse.pyc'>
File:        /usr/lib/python2.7/argparse.py
Docstring:  
Command-line parsing library

This module is an optparse-inspired command-line parsing library that:

    - handles both optional and positional arguments
    - produces highly informative usage messages
    - supports parsers that dispatch to sub-parsers

The following is a simple usage example that sums integers from the
command-line and writes the result to a file::
    parser = argparse.ArgumentParser(description='sum the integers at the command line')
    parser.add_argument('integers', metavar='int', nargs='+', type=int, help='an integer to be summed')
    parser.add_argument('--log', default=sys.stdout, type=argparse.FileType('w'), help='the file where the sum should be written')
    args = parser.parse_args()
    args.log.write('%s' % sum(args.integers))
    args.log.close()

The module contains the following public classes:
    - ArgumentParser -- The main entry point for command-line parsing. As the
        example above shows, the add_argument() method is used to populate
        the parser with actions for optional and positional arguments. Then
        the parse_args() method is invoked to convert the args at the
        command-line into an object with attributes.

tempfile and zipfile


Docstring:  Temporary files.

This module provides generic, low- and high-level interfaces for
creating temporary files and directories.

This module also provides some data items to the user:
  template - the default prefix for all temporary names.
             You may change this to control the default prefix.
  tempdir  - If this is set to a string before the first use of
             any routine from this module, it will be considered as
             another candidate location to store temporary files.
zipfile:
Docstring:   Read and write ZIP files.
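
The tutorial's read_data uses zipfile in essentially this way (sketch):

```python
import zipfile
import tensorflow as tf

def read_data(filename):
  """Extract the first file in the zip archive as a list of words."""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data
```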

six: Utilities for writing code that runs on Python 2 and 3


Docstring:   Utilities for writing code that runs on Python 2 and 3

from six.moves import urllib
Under Python 3 this imports urllib; under Python 2 it imports a mixture of urllib,
urllib2, and urlparse, mimicking the structure of Python 3's urllib.
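
The tutorial uses this to download the corpus in a 2/3-compatible way, roughly:

```python
from six.moves import urllib

# the text8 corpus URL used by the tutorial
url = 'http://mattmahoney.net/dc/text8.zip'
filename, _ = urllib.request.urlretrieve(url, 'text8.zip')
```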

numpy

Docstring:  
NumPy

Provides
  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation

array:

An array object represents a multidimensional, homogeneous array
of fixed-size items.  An associated data-type object describes the
format of each element in the array (its byte-order, how many bytes it
occupies in memory, whether it is an integer, a floating point number,
or something else, etc.)

Two ways to create an array: np.array / np.ndarray

>>> np.array([1, 2, 3])
    array([1, 2, 3])

>>> np.array([[1, 2], [3, 4]])
    array([[1, 2],
           [3, 4]])
>>> np.array([1, 2, 3], ndmin=2)
    array([[1, 2, 3]])
ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)
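
np.ndarray is the low-level constructor; the tutorial uses it to pre-allocate uninitialized batch buffers, e.g.:

```python
import numpy as np

batch = np.ndarray(shape=(8,), dtype=np.int32)     # uninitialized int32 buffer
labels = np.ndarray(shape=(8, 1), dtype=np.int32)  # one label per input
```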

TensorFlow

tf.placeholder

Signature: tf.placeholder(dtype, shape=None, name=None)
Docstring:
Inserts a placeholder for a tensor that will be always fed.

For example:

```python
x = tf.placeholder(tf.float32, shape=(1024, 1024))
y = tf.matmul(x, x)

with tf.Session() as sess:
  print(sess.run(y))  # ERROR: will fail because x was not fed.

  rand_array = np.random.rand(1024, 1024)
  print(sess.run(y, feed_dict={x: rand_array}))  # Will succeed.
```

@compatibility(eager)
Placeholders are not compatible with eager execution.
@end_compatibility

Args:
  dtype: The type of elements in the tensor to be fed.
  shape: The shape of the tensor to be fed (optional). If the shape is not
    specified, you can feed a tensor of any shape.
  name: A name for the operation (optional).

Returns:
  A `Tensor` that may be used as a handle for feeding a value, but not
  evaluated directly.

tf.Variable

Init signature: tf.Variable(cls, *args, **kwargs)
Docstring:     
See the [Variables Guide](https://tensorflow.org/guide/variables).

A variable maintains state in the graph across calls to `run()`. You add a
variable to the graph by constructing an instance of the class `Variable`.

The `Variable()` constructor requires an initial value for the variable,
which can be a `Tensor` of any type and shape. The initial value defines the
type and shape of the variable. After construction, the type and shape of
the variable are fixed. The value can be changed using one of the assign
methods.
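
A minimal usage sketch (TF 1.x graph mode):

```python
import tensorflow as tf

v = tf.Variable(tf.zeros([3]))      # the initial value fixes type and shape
update = v.assign([1.0, 2.0, 3.0])  # values change via assign ops

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())  # variables must be initialized
  print(sess.run(update))  # [1. 2. 3.]
```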

tf.constant

Docstring:
Creates a constant tensor.

# Constant 1-D Tensor populated with value list.
tensor = tf.constant([1, 2, 3, 4, 5, 6, 7]) => [1 2 3 4 5 6 7]
# Constant 2-D tensor populated with scalar value -1.
tensor = tf.constant(-1.0, shape=[2, 3]) => [[-1. -1. -1.]
                                             [-1. -1. -1.]]

tf.name_scope / context manager (with)

http://www.bjhee.com/python-context.html
("On Python's context managers")

  # name_scope means the tensors below share the common name prefix 'inputs'
  # placeholder/constant constructors also accept name="xxx"; otherwise a default is used
  with tf.name_scope('inputs'):
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
In [161]: with tf.Session() as sess:
     ...:     print train_inputs.name
     ...:     print train_labels.name
     ...:     print valid_dataset.name
     ...:     
inputs/Placeholder:0
inputs/Placeholder_1:0
inputs/Const:0

# if explicit names are given under a scope, e.g. tf.name_scope('test') with
# name='train' / name='label', the tensor names become:
test/train:0
test/label:0

tf.reduce_mean

Signature: tf.reduce_mean(input_tensor, axis=None, keepdims=None, name=None, reduction_indices=None, keep_dims=None)
Docstring:  # computes the mean along the given dimension(s)
Computes the mean of elements across dimensions of a tensor. (deprecated arguments)

SOME ARGUMENTS ARE DEPRECATED. They will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead

Reduces `input_tensor` along the dimensions given in `axis`.
Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
entry in `axis`. If `keepdims` is true, the reduced dimensions
are retained with length 1.

If `axis` is None, all dimensions are reduced, and a
tensor with a single element is returned.

For example:
```python
x = tf.constant([[1., 1.], [2., 2.]])
tf.reduce_mean(x)  # 1.5
tf.reduce_mean(x, 0)  # [1.5, 1.5]
tf.reduce_mean(x, 1)  # [1.,  2.]
```

tf.nn

Docstring:   Wrappers for primitive Neural Net (NN) Operations.

tf.nn.embedding_lookup  # i.e. it extracts from params the rows given by ids

Signature: tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)
Docstring:
Looks up `ids` in a list of embedding tensors.

This function is used to perform parallel lookups on the list of tensors in `params`.

In [180]: a = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3], [2.1, 2.2, 2.3], [3.1, 3.2, 3.3], [4.1, 4.2, 4.3]]
     ...: a = np.asarray(a)
     ...: idx1 = tf.Variable([0, 2, 3, 1], tf.int32)
     ...: idx2 = tf.Variable([[0, 2, 3, 1], [4, 0, 2, 2]], tf.int32)
     ...: out1 = tf.nn.embedding_lookup(a, idx1)
     ...: out2 = tf.nn.embedding_lookup(a, idx2)
     ...: init = tf.global_variables_initializer()
     ...: 
print a:
[[0.1 0.2 0.3]
 [1.1 1.2 1.3]
 [2.1 2.2 2.3]
 [3.1 3.2 3.3]
 [4.1 4.2 4.3]]

In [181]: with tf.Session() as sess:
     ...:     sess.run(init)
     ...:     print sess.run(out1)
     ...:     print out1
     ...:     print '=================='
     ...:     print sess.run(out2)
     ...:     print out2
     ...:
# idx1 = tf.Variable([0, 2, 3, 1]): takes rows [0, 2, 3, 1] of the params matrix; columns are untouched
[[0.1 0.2 0.3]
 [2.1 2.2 2.3]
 [3.1 3.2 3.3]
 [1.1 1.2 1.3]]
Tensor("embedding_lookup/Identity:0", shape=(4, 3), dtype=float64)
==================
# idx2 = tf.Variable([[0, 2, 3, 1], [4, 0, 2, 2]]): the lookup is applied entry by entry; the result just gains one more dimension
[[[0.1 0.2 0.3]
  [2.1 2.2 2.3]
  [3.1 3.2 3.3]
  [1.1 1.2 1.3]]

 [[4.1 4.2 4.3]
  [0.1 0.2 0.3]
  [2.1 2.2 2.3]
  [2.1 2.2 2.3]]]
Tensor("embedding_lookup_1/Identity:0", shape=(2, 4, 3), dtype=float64)

tf.truncated_normal

In [186]: tf.truncated_normal?
Signature: tf.truncated_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32, seed=None, name=None)
Docstring:
Outputs random values from a truncated normal distribution.

The generated values follow a normal distribution with specified mean and
standard deviation, except that values whose magnitude is more than 2 standard
deviations from the mean are dropped and re-picked.
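
So with stddev = 1.0 / math.sqrt(embedding_size), as in the NCE weights above, every sampled value lies within 2 standard deviations of 0:

```python
import math
import tensorflow as tf

w = tf.truncated_normal([4], stddev=1.0 / math.sqrt(128))
with tf.Session() as sess:
  print(sess.run(w))  # every value falls in (-2/sqrt(128), 2/sqrt(128))
```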

tf.nn.nce_loss: negative sampling

Signature: tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1, sampled_values=None, remove_accidental_hits=False, partition_strategy='mod', name='nce_loss')
Docstring:
Computes and returns the noise-contrastive estimation training loss.

See [Noise-contrastive estimation: A new estimation principle for
unnormalized statistical
models](http://www.jmlr.org/proceedings/papers/v9/gutmann10a/gutmann10a.pdf).
Also see our [Candidate Sampling Algorithms
Reference](https://www.tensorflow.org/extras/candidate_sampling.pdf)

A common use case is to use this method for training, and calculate the full
sigmoid loss for evaluation or inference.

tf.nn.nce_loss(
    weights=nce_weights,
    biases=nce_biases,
    labels=train_labels,
    inputs=embed,
    num_sampled=num_sampled,
    num_classes=vocabulary_size)
Args:
  weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
      objects whose concatenation along dimension 0 has shape
      [num_classes, dim].  The (possibly-partitioned) class embeddings.
  biases: A `Tensor` of shape `[num_classes]`.  The class biases.
  labels: A `Tensor` of type `int64` and shape `[batch_size, num_true]`. The target classes.
  inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward activations of the input network.
  num_sampled: An `int`.  The number of negative classes to randomly sample
      per batch. This single sample of negative classes is evaluated for each
      element in the batch.
  num_classes: An `int`. The number of possible classes.
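
A toy invocation with matching shapes (a sketch; the sizes are arbitrary, not the tutorial's):

```python
import tensorflow as tf

num_classes, dim, batch = 10, 4, 3
weights = tf.Variable(tf.truncated_normal([num_classes, dim]))  # [num_classes, dim]
biases = tf.Variable(tf.zeros([num_classes]))                   # [num_classes]
inputs = tf.random_uniform([batch, dim])                        # [batch_size, dim]
labels = tf.constant([[1], [2], [3]], dtype=tf.int64)           # [batch_size, num_true]

loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=weights, biases=biases, labels=labels,
    inputs=inputs, num_sampled=5, num_classes=num_classes))
```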

tf.Session

A `Session` object encapsulates the environment in which `Operation`
objects are executed, and `Tensor` objects are evaluated. For
example:

```python
# Build a graph.
a = tf.constant(5.0)
b = tf.constant(6.0)
c = a * b

# Launch the graph in a session.
sess = tf.Session()

# Evaluate the tensor `c`.
print(sess.run(c))
```

A session may own resources, such as
`tf.Variable`, `tf.QueueBase`,
and `tf.ReaderBase`. It is important to release
these resources when they are no longer required. To do this, either
invoke the `tf.Session.close` method on the session, or use
the session as a context manager. The following two examples are
equivalent:

```python
# 1. Using the `close()` method.
sess = tf.Session()
sess.run(...)
sess.close()

# 2. Using the session as a context manager
#    (see the context-manager link above).
with tf.Session() as sess:
  sess.run(...)
```

tf.Session.run

Signature: tf.Session.run(self, fetches, feed_dict=None, options=None, run_metadata=None)
Docstring:
Runs operations and evaluates tensors in `fetches`.

This method runs one "step" of TensorFlow computation, by
running the necessary graph fragment to execute every `Operation`
and evaluate every `Tensor` in `fetches`, substituting the values in
`feed_dict` for the corresponding input values.

The `fetches` argument may be a single graph element, or an arbitrarily
nested list, tuple, namedtuple, dict, or OrderedDict containing graph
elements at its leaves.  A graph element can be one of the following types:

* An `tf.Operation`.
  The corresponding fetched value will be `None`.
* A `tf.Tensor`.
  The corresponding fetched value will be a numpy ndarray containing the
  value of that tensor.

The value returned by `run()` has the same shape as the `fetches` argument,
where the leaves are replaced by the corresponding values returned by
TensorFlow.
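
For example, fetching a dict returns a dict of the same shape:

```python
import tensorflow as tf

x = tf.constant(3.0)
y = tf.constant(4.0)
with tf.Session() as sess:
  out = sess.run({'sum': x + y, 'prod': x * y})
  # out == {'sum': 7.0, 'prod': 12.0}
```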

Reposted from blog.csdn.net/u011279649/article/details/88575277