70 frequently used NumPy operations for data analysts (part 2)

36. Find the correlation coefficient of two columns of a numpy.ndarray

import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

# Method 1
np.corrcoef(iris[:, 0], iris[:, 2])[0, 1]

# Method 2
from scipy.stats import pearsonr
corr, p_value = pearsonr(iris[:, 0], iris[:, 2])
print(corr)

37. Check whether a numpy.ndarray contains missing (NaN) values

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

np.isnan(iris_2d).any()

38. Replace missing values in a numpy.ndarray with a specified value

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan


iris_2d[np.isnan(iris_2d)] = 0  # replace missing values with 0
iris_2d[:4]

39. Calculate the frequency of numpy.ndarray elements

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

species = np.array([row.tolist()[4] for row in iris])

# Get the unique values and the counts
np.unique(species, return_counts=True)

40. Convert numeric numpy.ndarray elements to categorical labels

'''
Requirement:
less than 3 --> 'small'
3 to 5      --> 'medium'
5 or more   --> 'large'
'''

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Bin petallength 
petal_length_bin = np.digitize(iris[:, 2].astype('float'), [0, 3, 5, 10])

# Map it to respective category
label_map = {1: 'small', 2: 'medium', 3: 'large', 4: np.nan}
petal_length_cat = [label_map[x] for x in petal_length_bin]

# View
petal_length_cat[:4]
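
np.digitize returns, for each value, the index i of the bin such that bins[i-1] <= x < bins[i]. The following is a minimal sketch of that rule on made-up values; the array `lengths` is illustrative and is not taken from the iris data.

```python
import numpy as np

# Hypothetical petal lengths chosen to land on and around the bin edges.
lengths = np.array([1.4, 3.0, 4.9, 5.0, 6.7])
bins = [0, 3, 5, 10]

# Index i means bins[i-1] <= x < bins[i], so an edge value goes to the upper bin.
idx = np.digitize(lengths, bins)

label_map = {1: 'small', 2: 'medium', 3: 'large'}
labels = [label_map[i] for i in idx]
print(idx, labels)
```

Note that 3.0 is labelled 'medium' and 5.0 is labelled 'large', which matches the requirement stated at the top of this section.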

41. Compute a new column from existing columns of a numpy.ndarray

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')

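# Compute the new column
sepallength = iris_2d[:, 0].astype('float')
petallength = iris_2d[:, 2].astype('float')
volume = (np.pi * petallength * (sepallength**2))/3

# Add an axis so that volume is a column with the same number of rows as iris_2d
volume = volume[:, np.newaxis]

# Append the new column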
out = np.hstack([iris_2d, volume])
out[:4]

42. numpy.ndarray probability sampling

# Requirement: sample species so that setosa appears twice as often as versicolor and virginica
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Get the species column
species = iris[:, 4]

# Method 1
np.random.seed(100)
a = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
species_out = np.random.choice(a, 150, p=[0.5, 0.25, 0.25])

# Method 2
np.random.seed(100)
probs = np.r_[np.linspace(0, 0.500, num=50), np.linspace(0.501, .750, num=50), np.linspace(.751, 1.0, num=50)]
index = np.searchsorted(probs, np.random.random(150))
species_out = species[index]
print(np.unique(species_out, return_counts=True))

43. Find the second-largest value within one group of a numpy.ndarray

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Get the species and petal length columns
petal_len_setosa = iris[iris[:, 4] == b'Iris-setosa', [2]].astype('float')

# Get the second-largest unique value
np.unique(np.sort(petal_len_setosa))[-2]

44. Sort by a column of numpy.ndarray

import numpy as np
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')
print(iris[iris[:, 0].argsort()][:20])  # sort by the first column

45. Pick the element with the highest frequency in numpy.ndarray

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

vals, counts = np.unique(iris[:, 2], return_counts=True)
print(vals[np.argmax(counts)])

46. Find the position of the first element of a numpy.ndarray that is greater than a given value

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

np.argwhere(iris[:, 3].astype(float) > 1.0)[0]
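
Indexing the result of np.argwhere with [0] raises an IndexError when no element satisfies the condition. The following is a hedged sketch of an alternative built on np.argmax, using made-up data (`values` is illustrative, not the iris column).

```python
import numpy as np

values = np.array([0.2, 0.4, 1.0, 1.3, 0.9, 1.5])  # illustrative data
mask = values > 1.0

# np.argmax on a boolean array returns the index of the first True.
# It also returns 0 when nothing matches, so check mask.any() first.
first_pos = int(np.argmax(mask)) if mask.any() else None
print(first_pos)
```

Returning None for the no-match case is a choice made for this sketch; raising an exception would work equally well.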

47. Replace the elements that meet the conditions in numpy.ndarray with the given value

# Requirement: replace values greater than 30 with 30 and values less than 10 with 10
np.set_printoptions(precision=2)
np.random.seed(100)
a = np.random.uniform(1,50, 20)

# Method 1
np.clip(a, a_min=10, a_max=30)

# Method 2
print(np.where(a < 10, 10, np.where(a > 30, 30, a)))

48. Get the positions and values of the top n elements of a numpy.ndarray

np.random.seed(100)
a = np.random.uniform(1,50, 20)

## Get the positions of the 5 largest elements
# Method 1
print(a.argsort()[-5:])

# Method 2
np.argpartition(-a, 5)[:5]

## Get the 5 largest elements
# Method 1
a[a.argsort()][-5:]

# Method 2
np.sort(a)[-5:]

# Method 3
np.partition(a, kth=-5)[-5:]

# Method 4
a[np.argpartition(-a, 5)][:5]
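
The partition-based methods above return the top n elements in no guaranteed order. If an ordered result is needed, one option is to sort only the n selected positions afterwards. This is a minimal sketch; the seed and the name `top_idx` are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
a = rng.uniform(1, 50, 20)
n = 5

# argpartition places the n largest values in the last n slots, unordered.
top_idx = np.argpartition(a, -n)[-n:]

# Sort just those n positions by value, largest first.
top_idx = top_idx[np.argsort(a[top_idx])[::-1]]
top_vals = a[top_idx]
print(top_idx, top_vals)
```

For large arrays and small n this avoids a full sort of the array.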

49. Compute row-wise value counts of a numpy.ndarray

np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
print(arr)
def counts_of_all_values_rowwise(arr2d):
    # Unique values and its counts row wise
    num_counts_array = [np.unique(row, return_counts=True) for row in arr2d]

    # Counts of all values row wise
    return [[int(b[a == i][0]) if i in a else 0 for i in np.unique(arr2d)] for a, b in num_counts_array]

print(np.arange(1,11))
counts_of_all_values_rowwise(arr)
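
The same table can also be produced without a Python loop by comparing every element against every candidate value through broadcasting. This is a sketch of that alternative, not the method used above; it trades memory (a rows x cols x n_values boolean array) for brevity.

```python
import numpy as np

np.random.seed(100)
arr = np.random.randint(1, 11, size=(6, 10))
values = np.unique(arr)

# Shape (rows, cols, n_values): True where an element equals a candidate value.
matches = arr[:, :, None] == values[None, None, :]

# Summing over the column axis counts the matches per row and per value.
counts = matches.sum(axis=1)
print(counts)
```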

50. Flatten multiple numpy.ndarrays into one

arr1 = np.arange(3)
arr2 = np.arange(3,7)
arr3 = np.arange(7,10)

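# The arrays have different lengths, so recent NumPy versions require dtype=object here
array_of_arrays = np.array([arr1, arr2, arr3], dtype=object)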
print('array_of_arrays: ', array_of_arrays)

# Method 1
arr_2d = np.array([a for arr in array_of_arrays for a in arr])

# Method 2
arr_2d = np.concatenate(array_of_arrays)
print(arr_2d)

51. Compute the one-hot encoding of a numpy.ndarray

np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
print(arr)

# Solution:
def one_hot_encodings(arr):
    uniqs = np.unique(arr)
    out = np.zeros((arr.shape[0], uniqs.shape[0]))
    for i, k in enumerate(arr):
        out[i, k-1] = 1
    return out

one_hot_encodings(arr)
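
The loop above relies on the values being exactly 1..n (it writes to column k-1). A broadcasting comparison against the unique values removes that assumption. This is a minimal sketch on an illustrative input.

```python
import numpy as np

arr = np.array([2, 3, 2, 2, 2, 1])  # illustrative input
uniqs = np.unique(arr)

# Row i gets a 1 in the column whose unique value equals arr[i].
one_hot = (arr[:, None] == uniqs[None, :]).astype(float)
print(one_hot)
```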

52. Create row numbers grouped by a categorical variable

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
species = np.genfromtxt(url, delimiter=',', dtype='str', usecols=4)
np.random.seed(100)
species_small = np.sort(np.random.choice(species, size=20))
print(species_small)

print([i for val in np.unique(species_small) for i, grp in enumerate(species_small[species_small==val])])

53. Create group ids based on a given categorical variable

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
species = np.genfromtxt(url, delimiter=',', dtype='str', usecols=4)
np.random.seed(100)
species_small = np.sort(np.random.choice(species, size=20))
print(species_small)
output = [np.argwhere(np.unique(species_small) == s).tolist()[0][0] for val in np.unique(species_small) for s in species_small[species_small==val]]
output
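
np.unique can produce the same group ids directly through its return_inverse option, which gives the index of each element's value in the sorted array of unique values. The following is a sketch on a small hand-written sample rather than the downloaded data.

```python
import numpy as np

# Illustrative sample, already sorted as in the example above.
species_small = np.array(['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                          'Iris-virginica', 'Iris-virginica'])

# group_ids[i] is the position of species_small[i] within uniques.
uniques, group_ids = np.unique(species_small, return_inverse=True)
print(group_ids)
```

Unlike the list comprehension, this also works when the input is not sorted by group.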

54. Rank the elements of a one-dimensional numpy.ndarray

np.random.seed(10)
a = np.random.randint(20, size=10)
print('Array: ', a)


print(a.argsort().argsort())
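
The first argsort gives the positions that would sort the array; applying argsort to that result inverts the permutation, which yields each element's rank. This sketch shows the two steps on illustrative values, along with an equivalent construction that avoids the second sort.

```python
import numpy as np

a = np.array([9, 4, 15, 0, 17])  # illustrative values
order = a.argsort()   # positions that would sort a
ranks = order.argsort()  # rank of each element, in original order

# Equivalent: write rank r at the position of the r-th smallest element.
ranks_alt = np.empty_like(order)
ranks_alt[order] = np.arange(a.size)
print(ranks, ranks_alt)
```

Ties are broken by the sort order rather than being assigned equal ranks.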

55. Rank the elements of a multi-dimensional numpy.ndarray

np.random.seed(10)
a = np.random.randint(20, size=[2,5])
print(a)

print(a.ravel().argsort().argsort().reshape(a.shape))

56. Output the largest element of each row of numpy.ndarray

np.random.seed(100)
a = np.random.randint(1,10, [5,3])
print(a)

# Method 1
np.amax(a, axis=1)

# Method 2
np.apply_along_axis(np.max, arr=a, axis=1)

57. Output the ratio of the minimum value to the maximum value of each row of numpy.ndarray

np.random.seed(100)
a = np.random.randint(1,10, [5,3])
print(a)

np.apply_along_axis(lambda x: np.min(x)/np.max(x), arr=a, axis=1)

58. Mark the elements of a numpy.ndarray that repeat an earlier element (True for repeats, False for first occurrences)

np.random.seed(100)
a = np.random.randint(0, 5, 10)

# There is no direct function to do this as of NumPy 1.13.3

# Create an all True array
out = np.full(a.shape[0], True)

# Find the index positions of unique elements
unique_positions = np.unique(a, return_index=True)[1]

# Mark those positions as False
out[unique_positions] = False

print(out)

59. Find the mean of each group of elements in a numpy.ndarray

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')


# No direct way to implement this. Just a version of a workaround.
numeric_column = iris[:, 1].astype('float')  # sepalwidth
grouping_column = iris[:, 4]  # species

# List comprehension version
[[group_val, numeric_column[grouping_column==group_val].mean()] for group_val in np.unique(grouping_column)]

# For Loop version
output = []
for group_val in np.unique(grouping_column):
    output.append([group_val, numeric_column[grouping_column==group_val].mean()])

output
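
A loop-free variant can be assembled from np.unique with return_inverse and np.bincount with weights. This is a sketch of that workaround on made-up data; `values` and `groups` are illustrative stand-ins for the numeric and grouping columns.

```python
import numpy as np

values = np.array([3.5, 3.0, 3.2, 2.8, 3.3, 2.7])  # illustrative numeric column
groups = np.array(['a', 'a', 'b', 'b', 'c', 'c'])  # illustrative grouping column

# inverse maps each row to the integer id of its group.
names, inverse = np.unique(groups, return_inverse=True)

# Per-group sum divided by per-group count gives the mean.
sums = np.bincount(inverse, weights=values)
counts = np.bincount(inverse)
means = sums / counts
print(list(zip(names, means)))
```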

60. Convert PIL image to numpy.ndarray

from io import BytesIO
from PIL import Image
import PIL, requests

# Import image from URL
URL = 'https://upload.wikimedia.org/wikipedia/commons/8/8b/Denali_Mt_McKinley.jpg'
response = requests.get(URL)

# Read it as Image
I = Image.open(BytesIO(response.content))

# Optionally resize
I = I.resize([150,150])

# Convert to numpy array
arr = np.asarray(I)

# Optionally convert it back to an image and show it
im = Image.fromarray(np.uint8(arr))
im.show()

61. Drop all missing values from a numpy.ndarray

a = np.array([1,2,3,np.nan,5,6,7,np.nan])
print(a)
a[~np.isnan(a)]

62. Calculate the Euclidean distance of two numpy.ndarrays

a = np.array([1,2,3,4,5])
b = np.array([4,5,6,7,8])

# Solution
dist = np.linalg.norm(a-b)
dist

63. Find the positions of the local maxima in a numpy.ndarray

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
doublediff = np.diff(np.sign(np.diff(a)))
peak_locations = np.where(doublediff == -2)[0] + 1
peak_locations
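
The double-difference trick works because the sign of the first difference flips from +1 to -1 at a peak, so the difference of the signs is -2 there. The sketch below spells out the intermediate arrays for the same input; the intermediate names are chosen for this illustration.

```python
import numpy as np

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])

slopes = np.diff(a)            # [ 2  4 -6  1  4 -6  1]
directions = np.sign(slopes)   # [ 1  1 -1  1  1 -1  1]
turns = np.diff(directions)    # [ 0 -2  2  0 -2  2]

# A value of -2 marks rising-then-falling; +1 shifts back to the index in a.
peaks = np.where(turns == -2)[0] + 1
print(peaks, a[peaks])
```

Plateaus (equal neighbouring values) produce a sign of 0 and are not detected as peaks by this approach.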

64. Subtract a 1-D numpy.ndarray from each row of a 2-D numpy.ndarray

# Requirement: subtract the 1-D array b_1d from the 2-D array a_2d so that each item of b_1d is subtracted from the corresponding row of a_2d.
a_2d = np.array([[3,3,3],[4,4,4],[5,5,5]])
b_1d = np.array([1,2,3])

print(a_2d - b_1d[:,None])
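
The [:, None] index matters because broadcasting aligns trailing axes: without it, b_1d lines up with the columns instead of the rows. A short sketch contrasting the two results on the same inputs:

```python
import numpy as np

a_2d = np.array([[3, 3, 3], [4, 4, 4], [5, 5, 5]])
b_1d = np.array([1, 2, 3])

# b_1d[:, None] has shape (3, 1): item i is subtracted from all of row i.
per_row = a_2d - b_1d[:, None]

# Plain b_1d has shape (3,): item j is subtracted from all of column j.
per_col = a_2d - b_1d

print(per_row)
print(per_col)
```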

65. Find the position of the nth occurrence of an element in a numpy.ndarray

x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2])
print(x)
n = 5

# Method 1: list comprehension
[i for i, v in enumerate(x) if v == 1][n-1]  # position of the 5th occurrence of the element 1

# Method 2
np.where(x == 1)[0][n-1]

66. Convert a numpy.datetime64 value to a datetime.datetime object

dt64 = np.datetime64('2018-02-25 22:10:10')

from datetime import datetime

# Method 1
dt64.tolist()

# Method 2
dt64.astype(datetime)

67. Compute the moving average of a numpy.ndarray for a given window size

def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

np.random.seed(100)
Z = np.random.randint(10, size=10)
print('array: ', Z)

# Method 1
moving_average(Z, n=3).round(2)

# Method 2
np.convolve(Z, np.ones(3)/3, mode='valid')
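
Both methods produce one value per full window, so the output has len(a) - n + 1 entries. The sketch below checks on illustrative values that the cumulative-sum version and the convolution version agree.

```python
import numpy as np

def moving_average(a, n=3):
    # Difference of cumulative sums n apart gives each window's sum.
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

z = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])  # illustrative values

by_cumsum = moving_average(z, n=3)
by_convolve = np.convolve(z, np.ones(3) / 3, mode='valid')
print(by_cumsum.round(2))
print(by_convolve.round(2))
```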

68. Build a numpy.ndarray from a given start, length, and step

length = 10
start = 5
step = 3

def seq(start, length, step):
    end = start + (step*length)
    return np.arange(start, end, step)

seq(start, length, step)

69. Fill in the missing dates of a non-continuous date series in a numpy.ndarray

dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-25'), 2)
print(dates)

# Method 1
filled_in = np.array([
    np.arange(date, (date + d)) for date, d in zip(dates, np.diff(dates))
]).reshape(-1)

output = np.hstack([filled_in, dates[-1]])
output

# Method 2
out = []
for date, d in zip(dates, np.diff(dates)):
    out.append(np.arange(date, (date + d)))

filled_in = np.array(out).reshape(-1)
output = np.hstack([filled_in, dates[-1]])
output
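
When the goal is simply every day between the first and last date, a single np.arange over that range gives the same result. This is a sketch under that assumption; it would not suit a case where only some of the gaps should be filled.

```python
import numpy as np

dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-25'), 2)

# The stop value is exclusive, so add one day to include the last date.
filled = np.arange(dates[0], dates[-1] + 1)
print(filled)
```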

70. Construct a numpy.ndarray with a sliding window according to the specified step length

import numpy as np


def gen_strides(a, stride_len=5, window_len=5):
    n_strides = ((a.size - window_len) // stride_len) + 1
    # return np.array([a[s:(s+window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])
    return np.array([
        a[s:(s + window_len)]
        for s in np.arange(0, n_strides * stride_len, stride_len)
    ])


print(gen_strides(np.arange(15), stride_len=2, window_len=4))
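
NumPy 1.20 and later also provide numpy.lib.stride_tricks.sliding_window_view, which builds every window of a given length as a read-only view. Taking every stride_len-th window then gives the same rows as gen_strides. A sketch, assuming a NumPy version that includes this function:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(15)
window_len = 4
stride_len = 2

# All windows of length 4 (shape (12, 4)), then keep every 2nd starting position.
windows = sliding_window_view(a, window_shape=window_len)[::stride_len]
print(windows)
```

The result is a view onto the original array, so call .copy() before modifying it.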
