Python3 data science package series (3): Practical data analysis







1: Data analysis and mining: building a deeper understanding


In data analysis and data mining, data processing is complex and tedious work, and it is also the most important link in the whole analysis process. On the one hand, data processing improves the quality of the data; on the other hand, it lets the data work better with data analysis tools.

The main contents of data processing include:

(1) Data cleaning

       1.1 Duplicate value processing

       1.2 Missing value processing

(2) Data extraction

       2.1 Field extraction

       2.2 Field splitting

       2.3 Reset index

       2.4 Record extraction

       2.5 Random sampling

       2.6 Extract data through index

       2.7 Dictionary data extraction

       2.8 Insert data

       2.9 Modify data records

(3) Data exchange

       3.1 Swap rows and columns

       3.2 Ranking index

       3.3 Data merging

(4) Data calculation

       4.1 Simple calculations (addition, subtraction, multiplication, and division)

       4.2 Data standardization

       4.3 Data grouping

       4.4 Date processing

....................

(5) Data visualization

       5.1 Charts and plots

       5.2 Export to Excel | Word | PPT


2: Data processing

Data cleaning: a deeper understanding
   During data analysis, massive raw data sets contain large amounts of incomplete, inconsistent, and abnormal data, which seriously affects the results of the analysis,
   so data cleaning is very important. Data cleaning is the most critical step in the data value chain. Junk data, even with the best analysis,
   produces erroneous results and misleads the business itself. As a result, data cleaning accounts for a large share of the workload in data analysis.
   Data cleaning means handling missing data and removing meaningless information: deleting irrelevant and duplicate data from the original data set, smoothing noisy data,
   filtering out data unrelated to the analysis topic, and handling missing values, outliers, and so on.
Data cleaning:
   1: Handling of duplicate values
   2: Handling of missing values
Example 1: Duplicate data processing


# -*- coding:utf-8 -*-

import pandas as pd
from pandas import Series

"""
    Data cleaning handles missing data and removes meaningless information,
    such as irrelevant rows, duplicate rows, and noise in the original data set.
    Data cleaning:
       1: Handling of duplicate values (demonstrated in this script)
       2: Handling of missing values
"""

print("""
    (1) Handling of duplicate values
          The duplicated method of DataFrame returns a Boolean Series showing which rows are duplicates. Rows that are not duplicates show False;
          when a row is repeated, the second and later occurrences show True.
    (2) The drop_duplicates method removes identical rows from the data structure (keeping only one of them). It returns a DataFrame.
""")
dataFrame = pd.DataFrame({
    'age': Series([26, 85, 64, 85, 85]),
    'name': Series(['Ben', 'John', 'Jerry', 'John', 'John'])
})
print(dataFrame)
# Flag every row that repeats an earlier row (all columns compared)
repeatableDataFrame = dataFrame.duplicated()
print()
print(repeatableDataFrame)
print("""
    Remove duplicate rows
""")
print("View duplicate rows in name column")
print(dataFrame.duplicated('name'))
print()
print("View duplicate rows in age column")
print(dataFrame.duplicated('age'))
print()
print("Remove duplicate rows based on age column")
print(dataFrame.drop_duplicates('age'))

print()
print("Remove duplicate rows based on name column")
print(dataFrame.drop_duplicates('name'))

running result:


D:\program_file_worker\anaconda\python.exe D:\program_file_worker\python_source_work\SSO\grammar\dataanalysis\DataAnalysisDataCleaning.py 

    (1) Handling of duplicate values
          The duplicated method of DataFrame returns a Boolean Series showing which rows are duplicates. Rows that are not duplicates show False;
          when a row is repeated, the second and later occurrences show True.
    (2) The drop_duplicates method removes identical rows from the data structure (keeping only one of them). It returns a DataFrame.

   age   name
0   26    Ben
1   85   John
2   64  Jerry
3   85   John
4   85   John

0    False
1    False
2    False
3     True
4     True
dtype: bool

    Remove duplicate rows

View duplicate rows in name column
0    False
1    False
2    False
3     True
4     True
dtype: bool

View duplicate rows in age column
0    False
1    False
2    False
3     True
4     True
dtype: bool

Remove duplicate rows based on age column
   age   name
0   26    Ben
1   85   John
2   64  Jerry

Remove duplicate rows based on name column
   age   name
0   26    Ben
1   85   John
2   64  Jerry

Process finished with exit code 0
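
The example above always keeps the first occurrence of a repeated row. As a minimal sketch (not part of the original script), the `keep` and `subset` parameters of `duplicated` and `drop_duplicates` change which rows are flagged or retained. The variable names below are illustrative only; the data is the same toy frame used above.

```python
import pandas as pd

# Same toy data as the example above.
df = pd.DataFrame({
    'age': [26, 85, 64, 85, 85],
    'name': ['Ben', 'John', 'Jerry', 'John', 'John'],
})

# keep='first' (the default) flags every repeat after the first occurrence.
first_flags = df.duplicated().tolist()

# keep=False flags every member of a duplicated group, including the first.
all_flags = df.duplicated(keep=False).tolist()

# subset limits the comparison to the listed columns;
# keep='last' retains the last occurrence instead of the first.
deduped_last = df.drop_duplicates(subset=['age', 'name'], keep='last')

# drop_duplicates keeps the original index labels;
# reset_index(drop=True) renumbers the surviving rows from 0.
deduped_clean = deduped_last.reset_index(drop=True)

print(first_flags)
print(all_flags)
print(deduped_last)
print(deduped_clean)
```

Using `keep=False` is handy when the goal is to inspect every copy of a repeated record before deciding which one to keep.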
 

3: Missing value processing

Building a deeper understanding
   Statistically speaking, missing data can produce biased estimates, so the sample no longer represents the population well.
   In practice, most data sets contain missing values, so knowing how to deal with them is very important.
   Generally speaking, handling missing values takes two steps:
   (1) Identification of missing data
   (2) Handling of missing data
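
The two steps can be sketched on a small in-memory frame before turning to the Excel data. This is only an illustration: the frame `scores` and its column names are invented here and do not come from the article's spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sheet; names and values are invented.
scores = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'math': [40.0, np.nan, 72.0, 61.0],
    'algebra': [23.0, 45.0, np.nan, np.nan],
})

# Step 1: identification. Summing the Boolean mask counts NaN per column.
missing_per_column = scores.isnull().sum()

# Rows that contain at least one missing value.
rows_with_missing = scores[scores.isnull().any(axis=1)]

# Step 2: handling. Here each numeric column is filled with its own mean.
filled = scores.fillna(scores.mean(numeric_only=True))

print(missing_per_column)
print(rows_with_missing)
print(filled)
```

Counting missing values per column first helps decide between deleting rows and filling them: a column that is mostly empty is usually a poor candidate for mean filling.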

# -*- coding:utf-8 -*-

import pandas as pd
from pandas import Series

"""
   Missing data can bias estimates, so missing values have to be handled in two steps:
      (1) Identification of missing data
      (2) Handling of missing data
"""

print("Read data source: ")
dataFrame = pd.read_excel(r'./file/rz.xlsx', sheet_name='Sheet2')
print(dataFrame)

print("""
    1) Identification of missing values
    Pandas uses the floating point value NaN to represent missing data in floating point numbers and non-floating point arrays, and uses the .isnull and .notnull functions to determine missing data.
""")
print()
print("Missing value judgment; True means missing, False means not missing")
print(dataFrame.isnull())

print()
print("Missing value judgment; True means not missing, False means missing")
print(dataFrame.notnull())

print("""
    2) Missing value processing
    Methods for handling missing data include filling in the data, deleting the corresponding rows, and leaving the data as is.
      2.1 dropna() removes rows that contain empty values
      2.2 fillna() replaces NaN with another value; deleting rows with empty data outright can distort the analysis, so the data can be filled instead.
      2.3 ffill() replaces NaN with the previous value (older code writes this as fillna(method='pad'))
""")

print("Delete rows corresponding to empty data: ")
print(dataFrame.dropna())
print()
print("Replace missing values with numeric values or any characters:")
print(dataFrame.fillna("$"))
print()
print("Replace missing value with previous value: ")
# ffill() is the dedicated forward-fill method; fillna(method='pad') is deprecated in recent pandas
print(dataFrame.ffill())

print()
print("Replace missing value with next value:")
# bfill() is the dedicated backward-fill method; fillna(method='bfill') is deprecated in recent pandas
print(dataFrame.bfill())

print()
print("Replace NaN with average or other descriptive statistics")
print(dataFrame.fillna(dataFrame.mean(numeric_only=True)))

print()
print("""
   dataFrame.mean()['first column':'last column']: slice the means by column label so that only the selected columns are filled with their mean.
""")

# Label slices are inclusive: '高代' (Advanced Algebra) through '解几' (Analytic Geometry)
print(dataFrame.fillna(dataFrame.mean(numeric_only=True)['高代':'解几']))
print()
print("dataFrame.fillna({'Column name 1': value 1, 'Column name 2': value 2}): You can pass in a dictionary to fill different columns with different values")
# Keys must match real column names: '数分' (Math Analysis) and '高代' (Advanced Algebra)
print(dataFrame.fillna({'数分': 100, '高代': 0}))

print("Use strip() to remove the specified characters from the left, right, or both ends of a string. The default is whitespace; characters in the middle are not removed")
dataFrameStrip = pd.DataFrame({
    'age': Series([26, 85, 64, 85, 85]),
    'name': Series(['Ben', 'John   ', 'Jerry', 'John     ', '  John'])
})
print(dataFrameStrip)
print()
print(dataFrameStrip['name'].str.strip())

print()
print("Remove only the character n on the right; if no characters are specified, whitespace is removed by default")
print(dataFrameStrip['name'].str.rstrip())
print()
print(dataFrameStrip['name'].str.rstrip('n'))

print()
print("Remove only the character J on the left; if no characters are specified, whitespace is removed by default")
print(dataFrameStrip['name'].str.lstrip())
print()
print(dataFrameStrip['name'].str.lstrip('J'))

running result:

D:\program_file_worker\anaconda\python.exe D:\program_file_worker\python_source_work\SSO\grammar\dataanalysis\DataAnalysisDataMissing.py 
Read data source: 
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0  2308024241  Jackie Chan       76           40.0              23.0                 60
1  2308024244      Zhou Yi       66           47.0              47.0                 44
2  2308024251     Zhang Bo       85            NaN              45.0                 60
3  2308024249      Zhu Hao       65           72.0              62.0                 71
4  2308024219         Seal       73           61.0              47.0                 46
5  2308024201      Chi Pei       60           71.0              76.0                 71
6  2308024347       Li Hua       67           61.0              65.0                 78
7  2308024307    Chen Tian       76           69.0               NaN                 69
8  2308024326       Yu Hao       66           65.0              61.0                 71
9  2308024219         Seal       73           61.0              47.0                 46

    1) Identification of missing values
    Pandas uses the floating point value NaN to represent missing data in floating point numbers and non-floating point arrays, and uses the .isnull and .notnull functions to determine missing data.


Missing value judgment; True means missing, False means not missing
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0       False        False    False          False             False              False
1       False        False    False          False             False              False
2       False        False    False           True             False              False
3       False        False    False          False             False              False
4       False        False    False          False             False              False
5       False        False    False          False             False              False
6       False        False    False          False             False              False
7       False        False    False          False              True              False
8       False        False    False          False             False              False
9       False        False    False          False             False              False

Missing value judgment; True means not missing, False means missing
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0        True         True     True           True              True               True
1        True         True     True           True              True               True
2        True         True     True          False              True               True
3        True         True     True           True              True               True
4        True         True     True           True              True               True
5        True         True     True           True              True               True
6        True         True     True           True              True               True
7        True         True     True           True             False               True
8        True         True     True           True              True               True
9        True         True     True           True              True               True

    2) Missing value processing
    Methods for handling missing data include filling in the data, deleting the corresponding rows, and leaving the data as is.
      2.1 dropna() removes rows that contain empty values
      2.2 fillna() replaces NaN with another value; deleting rows with empty data outright can distort the analysis, so the data can be filled instead.
      2.3 ffill() replaces NaN with the previous value (older code writes this as fillna(method='pad'))

Delete rows corresponding to empty data: 
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0  2308024241  Jackie Chan       76           40.0              23.0                 60
1  2308024244      Zhou Yi       66           47.0              47.0                 44
3  2308024249      Zhu Hao       65           72.0              62.0                 71
4  2308024219         Seal       73           61.0              47.0                 46
5  2308024201      Chi Pei       60           71.0              76.0                 71
6  2308024347       Li Hua       67           61.0              65.0                 78
8  2308024326       Yu Hao       66           65.0              61.0                 71
9  2308024219         Seal       73           61.0              47.0                 46

Replace missing values with numeric values or any characters:
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0  2308024241  Jackie Chan       76           40.0              23.0                 60
1  2308024244      Zhou Yi       66           47.0              47.0                 44
2  2308024251     Zhang Bo       85              $              45.0                 60
3  2308024249      Zhu Hao       65           72.0              62.0                 71
4  2308024219         Seal       73           61.0              47.0                 46
5  2308024201      Chi Pei       60           71.0              76.0                 71
6  2308024347       Li Hua       67           61.0              65.0                 78
7  2308024307    Chen Tian       76           69.0                 $                 69
8  2308024326       Yu Hao       66           65.0              61.0                 71
9  2308024219         Seal       73           61.0              47.0                 46

Replace missing value with previous value: 
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0  2308024241  Jackie Chan       76           40.0              23.0                 60
1  2308024244      Zhou Yi       66           47.0              47.0                 44
2  2308024251     Zhang Bo       85           47.0              45.0                 60
3  2308024249      Zhu Hao       65           72.0              62.0                 71
4  2308024219         Seal       73           61.0              47.0                 46
5  2308024201      Chi Pei       60           71.0              76.0                 71
6  2308024347       Li Hua       67           61.0              65.0                 78
7  2308024307    Chen Tian       76           69.0              65.0                 69
8  2308024326       Yu Hao       66           65.0              61.0                 71
9  2308024219         Seal       73           61.0              47.0                 46

Replace missing value with next value:
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0  2308024241  Jackie Chan       76           40.0              23.0                 60
1  2308024244      Zhou Yi       66           47.0              47.0                 44
2  2308024251     Zhang Bo       85           72.0              45.0                 60
3  2308024249      Zhu Hao       65           72.0              62.0                 71
4  2308024219         Seal       73           61.0              47.0                 46
5  2308024201      Chi Pei       60           71.0              76.0                 71
6  2308024347       Li Hua       67           61.0              65.0                 78
7  2308024307    Chen Tian       76           69.0              61.0                 69
8  2308024326       Yu Hao       66           65.0              61.0                 71
9  2308024219         Seal       73           61.0              47.0                 46

Replace NaN with average or other descriptive statistics
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0  2308024241  Jackie Chan       76      40.000000         23.000000                 60
1  2308024244      Zhou Yi       66      47.000000         47.000000                 44
2  2308024251     Zhang Bo       85      60.777778         45.000000                 60
3  2308024249      Zhu Hao       65      72.000000         62.000000                 71
4  2308024219         Seal       73      61.000000         47.000000                 46
5  2308024201      Chi Pei       60      71.000000         76.000000                 71
6  2308024347       Li Hua       67      61.000000         65.000000                 78
7  2308024307    Chen Tian       76      69.000000         52.555556                 69
8  2308024326       Yu Hao       66      65.000000         61.000000                 71
9  2308024219         Seal       73      61.000000         47.000000                 46


   dataFrame.mean()['first column':'last column']: slice the means by column label so that only the selected columns are filled with their mean.

   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0  2308024241  Jackie Chan       76           40.0         23.000000                 60
1  2308024244      Zhou Yi       66           47.0         47.000000                 44
2  2308024251     Zhang Bo       85            NaN         45.000000                 60
3  2308024249      Zhu Hao       65           72.0         62.000000                 71
4  2308024219         Seal       73           61.0         47.000000                 46
5  2308024201      Chi Pei       60           71.0         76.000000                 71
6  2308024347       Li Hua       67           61.0         65.000000                 78
7  2308024307    Chen Tian       76           69.0         52.555556                 69
8  2308024326       Yu Hao       66           65.0         61.000000                 71
9  2308024219         Seal       73           61.0         47.000000                 46

dataFrame.fillna({'Column name 1': value 1, 'Column name 2': value 2}): You can pass in a dictionary to fill different columns with different values
   Student ID         Name  English  Math Analysis  Advanced Algebra  Analytic Geometry
0  2308024241  Jackie Chan       76           40.0              23.0                 60
1  2308024244      Zhou Yi       66           47.0              47.0                 44
2  2308024251     Zhang Bo       85          100.0              45.0                 60
3  2308024249      Zhu Hao       65           72.0              62.0                 71
4  2308024219         Seal       73           61.0              47.0                 46
5  2308024201      Chi Pei       60           71.0              76.0                 71
6  2308024347       Li Hua       67           61.0              65.0                 78
7  2308024307    Chen Tian       76           69.0               0.0                 69
8  2308024326       Yu Hao       66           65.0              61.0                 71
9  2308024219         Seal       73           61.0              47.0                 46
Use strip() to remove the specified characters from the left, right, or both ends of a string. The default is whitespace; characters in the middle are not removed
   age       name
0   26        Ben
1   85    John   
2   64      Jerry
3   85  John     
4   85       John

0      Ben
1     John
2    Jerry
3     John
4     John
Name: name, dtype: object

Remove only the character n on the right; if no characters are specified, whitespace is removed by default
0       Ben
1      John
2     Jerry
3      John
4      John
Name: name, dtype: object

0           Be
1      John   
2        Jerry
3    John     
4          Joh
Name: name, dtype: object

Remove only the character J on the left; if no characters are specified, whitespace is removed by default
0          Ben
1      John   
2        Jerry
3    John     
4         John
Name: name, dtype: object

0         Ben
1      ohn   
2        erry
3    ohn     
4        John
Name: name, dtype: object

Process finished with exit code 0
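
Two details from this run are worth a compact, self-contained sketch. First, forward and backward filling can be written with the dedicated `ffill()` and `bfill()` methods, which recent pandas releases prefer over `fillna(method=...)`. Second, `rstrip('n')` leaves a value such as `'John   '` unchanged because the string ends with spaces, so whitespace has to be stripped first. The frame `df`, the column name `score`, and the series `names` are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column name 'score' is invented for illustration.
df = pd.DataFrame({'score': [40.0, np.nan, 72.0, np.nan, 61.0]})

# Forward fill: each NaN takes the previous valid value.
forward = df.ffill()

# Backward fill: each NaN takes the next valid value.
backward = df.bfill()

# dropna(subset=...) only drops rows where the listed columns are missing.
kept = df.dropna(subset=['score'])

names = pd.Series(['Ben', 'John   ', '  John'])

# rstrip('n') alone does not touch 'John   ' because it ends with spaces.
only_rstrip = names.str.rstrip('n')

# Strip the whitespace first, then remove the trailing character.
trimmed = names.str.strip().str.rstrip('n')

print(forward)
print(backward)
print(kept)
print(only_rstrip.tolist())
print(trimmed.tolist())
```

Chaining `.str` methods this way keeps each cleanup step explicit, which makes it easier to see why a value did or did not change.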
 


Origin blog.csdn.net/u014635374/article/details/133489206