A detailed explanation of the visual data analysis workflow in Python

This article is shared from the Huawei Cloud Community article "A Comprehensive Guide to Python Visual Data Analysis from Data Acquisition to Insight Discovery" by Lemony Hug.

In the world of data science and analytics, visualization is a powerful tool that helps us understand data, discover patterns, and derive insights. Python provides a wealth of libraries and tools to make the visual data analysis workflow efficient and flexible. This article will introduce the workflow of visual data analysis in Python, from data acquisition to final visual display of insights.

1. Data acquisition

Before starting any data analysis work, you first need to obtain the data. Python provides various libraries for handling data from different sources, such as pandas for processing structured data, requests for fetching data over the network, or specialized libraries for connecting to databases. Let's start with a simple example, loading data from a CSV file:

import pandas as pd

# Load data from CSV file
data = pd.read_csv('data.csv')

# View the first few rows of data
print(data.head())
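
The same pattern applies to the other sources mentioned above. As a hedged sketch (the URL, table name, and connection string below are placeholders, and requests and SQLAlchemy are assumed to be installed), data can also be pulled from a REST API or a database into a DataFrame:

import pandas as pd
import requests
from sqlalchemy import create_engine

# Fetch JSON records from a hypothetical REST endpoint
response = requests.get('https://api.example.com/records')
response.raise_for_status()
api_data = pd.DataFrame(response.json())

# Read a table from a database (the connection string is a placeholder)
engine = create_engine('sqlite:///example.db')
db_data = pd.read_sql('SELECT * FROM records', engine)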

2. Data cleaning and preprocessing

Once the data is loaded, the next step is data cleaning and preprocessing. This includes handling missing values, outliers, data transformations, etc. Visualization also often plays an important role at this stage, helping us identify problems in the data. For example, we can use matplotlib or seaborn to draw various charts to examine the distribution and relationships of the data:

import matplotlib.pyplot as plt
import seaborn as sns

# Draw histogram
plt.hist(data['column_name'], bins=20)
plt.title('Distribution of column_name')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Draw a scatter plot
sns.scatterplot(x='column1', y='column2', data=data)
plt.title('Scatter plot of column1 vs column2')
plt.show()
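
Once these charts have revealed problems, they usually need to be fixed before moving on. A minimal cleaning sketch, assuming the same placeholder column names as above, might handle missing values and clip outliers like this:

# Drop rows where the key column is missing, fill remaining gaps with the median
data = data.dropna(subset=['column_name'])
data['column1'] = data['column1'].fillna(data['column1'].median())

# Clip extreme values to the 1st and 99th percentiles
lower, upper = data['column1'].quantile([0.01, 0.99])
data['column1'] = data['column1'].clip(lower, upper)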

3. Data analysis and modeling

After data cleaning and preprocessing, we usually perform data analysis and modeling. This may involve techniques such as statistical analysis and machine learning. At this stage, visualization can help us better understand the relationships between data and evaluate the performance of the model. For example, using seaborn to draw a correlation matrix can help us understand the correlation between features:

# Draw correlation matrix (numeric columns only)
correlation_matrix = data.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
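
As a hedged illustration of the modeling side (assuming scikit-learn is installed and the dataset has a numeric target column, here called 'target', which is not part of the original example), a simple baseline model can be fitted and checked visually:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split features and a hypothetical numeric target column
X = data[['column1', 'column2']]
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple linear regression as a baseline
model = LinearRegression().fit(X_train, y_train)

# Visual evaluation: predicted vs. actual values
plt.scatter(y_test, model.predict(X_test))
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Predicted vs. Actual')
plt.show()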

4. Results presentation and insight discovery

Finally, by visually displaying the results of data analysis, we can communicate insights and conclusions more clearly. This can be a simple statistical summary or a complex interactive visualization. For example, use Plotly to create interactive charts:

import plotly.express as px

# Create an interactive scatter plot
fig = px.scatter(data, x='column1', y='column2', color='category', hover_data=['additional_info'])
fig.show()
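
To share the interactive chart outside of a notebook, a Plotly figure can also be written to a standalone HTML file using Plotly's standard export (the file name here is a placeholder):

fig.write_html('scatter_report.html')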

5. Advanced techniques and optimization

In addition to basic visualization techniques, there are many advanced techniques and optimization methods in Python that can make the data analysis workflow more powerful and efficient.

5.1 Customize charts using Plotly Express

Plotly Express provides many easy-to-use functions for creating various types of charts, but sometimes we need more customization options. By combining Plotly Express with Plotly's graph objects, we can achieve more advanced customization, for example adding annotations or adjusting the chart style:

import plotly.express as px
import plotly.graph_objects as go

# Create a scatter plot
fig = px.scatter(data, x='column1', y='column2', color='category', hover_data=['additional_info'])

# Add an annotation at a point of interest
fig.add_annotation(x=5, y=5, text="Important Point", showarrow=True, arrowhead=1)

# Adjust the marker style
fig.update_traces(marker=dict(size=10, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))

fig.show()

5.2 Visual interaction using ipywidgets' interact

In environments such as Jupyter Notebook, ipywidgets' interact can make data analysis more dynamic and intuitive. For example, create interactive controls that drive a chart's parameters:

import matplotlib.pyplot as plt
from ipywidgets import interact

# Dropdown for the column to plot, slider for the number of bins
@interact(column=['column1', 'column2'], bins=(5, 20, 1))
def plot_histogram(column, bins):
    plt.hist(data[column], bins=bins)
    plt.title(f'Distribution of {column}')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.show()

5.3 Using visualization library extensions

In addition to common visualization libraries such as matplotlib, seaborn, and Plotly, there are many other visualization libraries that can extend our toolbox. For example, libraries such as Altair and Bokeh provide charts with different styles and functions, and you can choose the appropriate tool according to your needs.

import altair as alt

alt.Chart(data).mark_bar().encode(
    x='category',
    y='count()'
).interactive()
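
For comparison, a roughly equivalent interactive chart in Bokeh (a sketch using the same placeholder column names, assuming Bokeh is installed) could look like this:

from bokeh.plotting import figure, show

# Scatter plot with Bokeh's default pan/zoom toolbar
p = figure(title='column1 vs column2', x_axis_label='column1', y_axis_label='column2')
p.scatter(data['column1'], data['column2'], size=8, alpha=0.6)
show(p)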

6. Automation and batch processing

Automation and batch processing are crucial when dealing with large amounts of data or when repetitive analysis is required. Python provides a wealth of libraries and tools to achieve this, for example using loops, functions, or more advanced tools like Dask or Apache Spark.

6.1 Batch processing of data using loops

Suppose we have multiple data files that require the same analysis operation, we can use a loop to batch process these files and combine the results together:

import os
import pandas as pd

data_files = os.listdir('data_folder')

results = []

for file in data_files:
    data = pd.read_csv(os.path.join('data_folder', file))
    # Perform data analysis operations (placeholder: summary statistics)
    result = data.describe()
    results.append(result)

6.2 Use functions to encapsulate repeatability analysis steps

If we have a series of data analysis steps that need to be performed repeatedly, we can encapsulate them as functions so that they can be reused on different data:

def analyze_data(data):
    # Data cleaning and preprocessing
    # ...
    # Data analysis and modeling
    # ...
    # Results presentation and insight discovery
    # ...
    return insights

# Apply the function to each dataset (data_sets is a list of DataFrames)
results = [analyze_data(data) for data in data_sets]

6.3 Use Dask or Apache Spark to implement distributed computing

For large-scale datasets, computation on a single machine may not be sufficient. In this case, distributed computing frameworks such as Dask or Apache Spark can process data in parallel and improve efficiency:

import dask.dataframe as dd

# Create a Dask DataFrame from multiple CSV files
ddf = dd.read_csv('data*.csv')

# Execute data analysis operations in parallel
result = ddf.groupby('column').mean().compute()

7. Best practices and optimization suggestions

When performing visual data analysis, there are also some best practices and optimization suggestions that can help us make better use of Python tools:

  • Choose the appropriate chart type:  Match the chart type to the data type and the goal of the analysis, e.g. bar chart, line chart, or box plot.
  • Keep charts simple and clear:  Avoid excessive decoration and overly complex graphics; keep charts easy to read and highlight the key points.
  • Comments and documentation:  Add comments and documentation to your code to make it easier to understand, maintain, share, and collaborate on.
  • Performance optimization:  For large-scale datasets, consider parallel computing and memory optimization to improve code performance (a small sketch follows this list).
  • Interactive visualization:  Use interactive visualization tools to make data exploration more flexible and intuitive and to improve analysis efficiency.
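
On the performance optimization point, here is a small sketch of common pandas memory optimizations (assuming a large CSV with the placeholder columns used throughout this article, where 'category' has few distinct values):

import pandas as pd

# Read only the columns that are needed and store the low-cardinality column as a categorical
data = pd.read_csv(
    'data.csv',
    usecols=['column1', 'column2', 'category'],
    dtype={'category': 'category'},
)

# Downcast numeric columns to smaller types where possible
data['column1'] = pd.to_numeric(data['column1'], downcast='float')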

8. Deploy and share results

Once you have completed your data analysis and gained insights, the next step is to deploy and share the results with relevant stakeholders. Python offers a variety of ways to achieve this, including generating static reports, creating interactive applications, and even integrating the results into automated workflows.

8.1 Generate static reports

Use Jupyter Notebook or Jupyter Lab to easily create interactive data analysis reports that combine code, visualizations, and explanatory text. These notebooks can be exported to HTML, PDF, or Markdown format to share with others.

jupyter nbconvert --to html notebook.ipynb
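
The other export targets mentioned above follow the same pattern; note that PDF export additionally requires a LaTeX toolchain to be installed:

jupyter nbconvert --to pdf notebook.ipynb
jupyter nbconvert --to markdown notebook.ipynb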

8.2 Creating interactive applications

Data analysis results can be deployed as interactive web applications using frameworks such as Dash, Streamlit, or Flask, allowing users to interact with data and explore insights through a web interface.

from dash import Dash, dcc, html

app = Dash(__name__)

# Define layout
app.layout = html.Div(children=[
    html.H1(children='Data Analysis Dashboard'),
    dcc.Graph(
        id='example-graph',
        figure={
            'data': [
                {'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'Category 1'},
                {'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': 'Category 2'},
            ],
            'layout': {
                'title': 'Bar Chart'
            }
        }
    )
])

if __name__ == '__main__':
    app.run_server(debug=True)

8.3 Integration into automated workflows

Use a task scheduler such as Airflow or Celery to automate the data analysis process and regularly generate reports or update the application. This ensures that data analysis results are always up to date and can be automatically adjusted and updated as needed.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

# Define the task callable
def data_analysis():
    # Data analysis code
    pass

# Define the DAG
dag = DAG(
    'data_analysis_workflow',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime(2024, 1, 1),
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    schedule_interval=timedelta(days=1),
)

# Register the task with the DAG
task = PythonOperator(
    task_id='data_analysis_task',
    python_callable=data_analysis,
    dag=dag,
)

9. Data security and privacy protection

Data security and privacy protection are crucial during data analysis and visualization. Python provides technologies and best practices that can help us ensure that data is fully protected and secure during processing.

9.1 Data encryption and secure transmission

Ensure that secure encryption is used when data is transmitted and stored, for example HTTPS for data in transit and encryption at rest for stored data. Python encryption libraries such as cryptography can help us implement data encryption and decryption.

from cryptography.fernet import Fernet

# Generate key
key = Fernet.generate_key()
cipher_suite = Fernet(key)

# Encrypt data
cipher_text = cipher_suite.encrypt(b"Hello, world!")

# Decrypt data
plain_text = cipher_suite.decrypt(cipher_text)

9.2 Data access control and authentication

Ensure that only authorized users can access sensitive data by implementing data access control and authentication mechanisms. You can use Python's authentication libraries such as Flask-Login, Django-Auth, etc. to implement user authentication and permission management.

from flask import Flask, request, redirect, url_for
from flask_login import LoginManager, login_user, current_user, login_required, UserMixin

app = Flask(__name__)
app.secret_key = 'replace-with-a-random-secret-key'  # Flask-Login needs a session secret
login_manager = LoginManager()
login_manager.init_app(app)

# User model
class User(UserMixin):
    def __init__(self, id):
        self.id = id

# User loader callback function
@login_manager.user_loader
def load_user(user_id):
    return User(user_id)

# Login route
@app.route('/login', methods=['POST'])
def login():
    user_id = request.form['user_id']
    user = User(user_id)
    login_user(user)
    return redirect(url_for('secure_page'))

# A route that requires login to access
@app.route('/secure')
@login_required
def secure_page():
    return 'This is a secure page'

if __name__ == '__main__':
    app.run(debug=True)

9.3 Anonymization and data masking

During analysis, anonymization and data masking can protect user privacy when working with sensitive data. Python provides libraries such as Faker that can generate synthetic data to stand in for real data during analysis.

from faker import Faker

faker = Faker()

# Generate a fake name
name = faker.name()

# Generate a fake email address
email = faker.email()

# Generate a fake address
address = faker.address()

Summary

This article takes a deep dive into a comprehensive workflow for visual data analysis in a Python environment and introduces a series of key steps, technical tools, and best practices. We started with data acquisition, using libraries such as pandas to load and process data; then performed data cleaning and preprocessing, using matplotlib, seaborn, and other libraries for visual exploration to identify problems and patterns in the data; next, in the data analysis and modeling stage, we applied statistical analysis and machine learning techniques to uncover the patterns inherent in the data; finally, we presented the analysis results in various ways to surface insights and support business decisions.

We then explored advanced techniques and optimizations, including customizing charts with Plotly Express, using interactive visualizations, and choosing appropriate visualization libraries. We also covered the importance of automation and batch processing, and how to use loops, functions, and distributed computing frameworks to improve efficiency. Under best practices and optimization suggestions, we emphasized choosing the right chart type, keeping charts simple and clear, comments and documentation, performance optimization, and interactive visualization.

Finally, we turned to data security and privacy protection, emphasizing key measures such as data encryption and secure transmission, data access control and authentication, and anonymization and data masking. By applying these technologies and best practices appropriately, we can keep the data analysis process safe and reliable and provide credible data support for business decisions.

To sum up, this article comprehensively explains the workflow and key technologies of visual data analysis in Python, aiming to help readers understand the entire data analysis process in depth and master effective tools and methods for tackling complex real-world data challenges, thereby obtaining better analysis results and insights.

 

Click to follow and learn about Huawei Cloud’s new technologies as soon as possible~

 
