Data Science Essentials: 9 Data Engineer Interview Questions and Answers, with Python Examples

Going for an interview can be a time-consuming and tiring process, while technical interviews can be even more stressful (with all kinds of unexpected written exams)! Here are some common questions you will be asked during a data engineer interview. You'll learn how to answer interview questions about databases, Python, and SQL.


What does a data engineer do?

The data engineering role is a broad and varied one that requires working knowledge of many technologies and concepts. Data engineers need flexible minds, so they are often proficient in several areas at once, such as databases, software development, DevOps, and big data.

Given this broad skill set, data engineering roles can span many different job descriptions. A data engineer can be responsible for database design, schema design, and creating multiple database solutions. The job may also involve database administration.

As a data engineer, you may act as a bridge between the database and the data science teams. In that case, data cleaning and preparation are also part of the job. If big data is involved, the job becomes designing an efficient solution for storing and processing that data.

The role may also call for efficient data querying for reporting and analysis, interacting with multiple databases, or writing stored procedures. For many solutions, such as high-traffic websites or services, there may be more than one database. In these cases, the data engineer is responsible for setting up the databases, maintaining them, and transferring data between them.

Why data engineers prefer Python

Python is especially useful in data science, backend systems, and server-side scripting. This is because Python has strong typing, a simple syntax, and an abundance of third-party libraries. Pandas, SciPy, TensorFlow, SQLAlchemy, and NumPy are among the most widely used libraries in production across different industries.

On top of that, Python reduces development time, which means lower costs for companies. For data engineers, most code execution is database-bound rather than CPU-bound. Because of this, it makes sense to take advantage of Python's simplicity, even at the cost of slower performance compared with compiled languages such as C# and Java.

Questions about relational databases

Databases are among the most important components of any system. Large companies have introduced many new tools and technologies in this space, including NoSQL databases, cache databases, graph databases, and even NoSQL support within SQL databases.

Q1: Relational and non-relational databases

A relational database is a database that stores data in tabular form. Every table has a schema, which describes the columns and types that a record must have. Every schema must have at least one primary key that uniquely identifies a record; in other words, there are no duplicate rows in the database. In addition, each table can be related to other tables using foreign keys.
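As a minimal sketch using Python's built-in sqlite3 module (the table and column names here are invented for illustration), a schema with a primary key and a foreign key looks like this:

```python
import sqlite3

# In-memory database, just for illustration.
conn = sqlite3.connect(":memory:")

# Each table declares a schema: column names, types, and keys.
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,  -- uniquely identifies each record
        name TEXT NOT NULL
    );

    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        total       REAL,
        FOREIGN KEY (customer_id) REFERENCES customers (id)  -- relates orders to customers
    );
""")
conn.close()
```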

An important aspect of relational databases is that changes in the schema must be applied to all records. This can sometimes lead to corruption and major headaches during the migration process. Non-relational databases handle things differently. They are schemaless in nature, which means that records can be kept using different schemas and different nesting structures. Records can still have primary keys, but schema changes are done on an entry-by-entry basis.

Speed comparisons between the two need to be performed per type of operation (INSERT, UPDATE, DELETE, and so on). Schema design, indexes, the number of aggregations, and the number of records also affect the analysis, so it needs to be tested thoroughly.

The scalability of the database is also different. Distribution for non-relational databases may be less of a headache. This is because a collection of related records can easily be stored on a specific node. Relational databases, on the other hand, require more thinking and often use a master-slave system.

Q2: SQL aggregate functions

Aggregate functions are functions that perform mathematical operations on result sets. Some examples include AVG, COUNT, MIN, MAX, and SUM. GROUP BY and HAVING clauses are often required to complement these aggregates. A useful aggregation function is AVG, which can be used to calculate the average for a given set of results.
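For example, here is a minimal sketch with Python's sqlite3 module and an invented orders table, combining AVG with GROUP BY and HAVING:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("alice", 30.0), ("bob", 12.0)],
)

# AVG aggregates the result set, GROUP BY buckets it per customer,
# and HAVING filters on the aggregated value.
rows = conn.execute("""
    SELECT customer, AVG(total) AS avg_total
    FROM orders
    GROUP BY customer
    HAVING AVG(total) > 15
""").fetchall()

print(rows)  # [('alice', 20.0)]
conn.close()
```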

Q3: Speed up SQL queries

Query speed depends on various factors, but it is mainly affected by the number of joins, aggregations, traversals, and records.

The more joins, the higher the complexity and the larger the number of traversals over the tables. Performing multiple joins on thousands of records involving several tables is expensive, because the database also needs to cache the intermediate results! At this point, you might start thinking about how to increase memory.

Speed is also affected by whether or not there are indexes in the database. Indexes are extremely important: they allow the database to quickly search a table and find matches for the columns specified in the query.

Indexes sort records at the cost of higher insertion time and some extra storage. Multiple columns can be combined to create a single index. For example, a date column and a user-name column might be combined into one index because a query filters on both conditions.
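As a rough illustration (again with sqlite3 and invented table and column names), a composite index over the two columns a query filters on might be created like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id        INTEGER PRIMARY KEY,
        user_name TEXT,
        date      TEXT,
        payload   TEXT
    )
""")

# One composite index over both columns the query filters on.
# Inserts get slightly slower and storage grows, but lookups on
# (user_name, date) no longer have to scan the whole table.
conn.execute("CREATE INDEX idx_events_user_date ON events (user_name, date)")

conn.execute(
    "SELECT payload FROM events WHERE user_name = ? AND date = ?",
    ("alice", "2023-01-01"),
)
conn.close()
```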

Q4: Debug SQL queries

Most databases offer something like an EXPLAIN QUERY PLAN that describes the steps the database takes to execute a query. In SQLite, you can enable it by prefixing a SELECT statement with EXPLAIN QUERY PLAN.
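For instance, a minimal sketch with sqlite3 (table and index names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_name TEXT, date TEXT)")
conn.execute("CREATE INDEX idx_user_date ON events (user_name, date)")

# Prefix the statement with EXPLAIN QUERY PLAN to see how SQLite
# intends to execute it, e.g. whether it uses an index or scans.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM events WHERE user_name = ? AND date = ?
""", ("alice", "2023-01-01")).fetchall()

for row in plan:
    print(row)  # e.g. (..., 'SEARCH events USING COVERING INDEX idx_user_date ...')
conn.close()
```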

Questions about non-relational databases

These questions cover NoSQL (non-relational) databases, with the goal of highlighting their advantages over, and differences from, relational databases.

Q5: Query data using MongoDB

Similar to SQL, document-based databases also allow queries and aggregations to be performed. However, the functions may differ both in syntax and in the underlying execution. You may have noticed that MongoDB reserves the $ character to specify commands or aggregations on records, such as $group.
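A minimal sketch of a $group aggregation, assuming a MongoDB server running locally and the third-party pymongo driver (database, collection, and field names are invented):

```python
from pymongo import MongoClient  # third-party driver: pip install pymongo

# Assumes a MongoDB server running locally on the default port.
client = MongoClient("mongodb://localhost:27017/")
orders = client["shop"]["orders"]

# $group works like SQL's GROUP BY plus an aggregate function:
# group documents by customer and average their totals.
pipeline = [
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$customer", "avg_total": {"$avg": "$total"}}},
]

for doc in orders.aggregate(pipeline):
    print(doc)  # e.g. {'_id': 'alice', 'avg_total': 20.0}
```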

Even though the syntax may differ only slightly, there is a huge difference in how the queries are executed under the hood, because query structures and use cases differ greatly between SQL and NoSQL databases.

Q6: NoSQL vs. SQL

If the schema is constantly changing, as with financial regulatory information, NoSQL lets you modify records and nest related information. Imagine the number of joins you would have to perform in SQL if the data were nested eight levels deep, and this kind of situation happens more often than you might think!
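As a rough, hypothetical illustration of that nesting (all field names invented), a single MongoDB document inserted with pymongo can hold what SQL would spread across several joined tables:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
filings = client["finance"]["filings"]

# One document nests related information that SQL would spread across
# several joined tables, and the schema can change per record.
filings.insert_one({
    "company": "Acme Corp",
    "year": 2023,
    "reports": [
        {
            "quarter": "Q1",
            "sections": [
                {"name": "revenue", "items": [{"label": "product", "amount": 1_000_000}]},
            ],
        },
    ],
})
```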

Now what if you want to run a report, extract information about that financial data, and infer conclusions? In this case, complex queries need to be run, and SQL tends to be faster in this regard.

Speed isn't the only metric, though. Factors such as transactions, atomicity, durability, and scalability also need to be considered. Transactions are important in financial applications, and such features take precedence.

Questions about caching databases

A cache database holds frequently accessed data. It lives alongside the main SQL or NoSQL database, and its goal is to lighten the load on the main database and serve requests faster.

Q7: How to use a cache database

A cache database is a fast storage solution for short-term, structured or unstructured data. It can be partitioned and scaled as needed, but it is usually much smaller than the main database. Because of this, the cache database can reside in memory, eliminating the need to read from disk.

When a request comes in, the cache database is checked first, and only then the main database. This prevents unnecessary, repeated requests from reaching the main database's server. You also get a performance boost from the cache database's lower read times!
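Here is a minimal cache-aside sketch, assuming a Redis server running locally and the third-party redis package; the key naming and the fetch_user_from_main_db helper are invented placeholders:

```python
import json

import redis  # third-party package: pip install redis

# Assumes a Redis server running locally on the default port.
cache = redis.Redis(host="localhost", port=6379)

def fetch_user_from_main_db(user_id):
    # Placeholder for the slow query against the main SQL/NoSQL database.
    return {"id": user_id, "name": "alice"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)  # 1. check the cache first
    if cached is not None:
        return json.loads(cached)
    user = fetch_user_from_main_db(user_id)    # 2. fall back to the main database
    cache.set(key, json.dumps(user), ex=300)   # 3. cache the result with a 5-minute TTL
    return user

print(get_user(42))
```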

Questions about Design Patterns and ETL Concepts

In large applications, more than one type of database is often used. In fact it is possible to use PostgreSQL, MongoDB and Redis in one application! A challenging problem is dealing with state changes between databases, which exposes developers to consistency issues.

Imagine querying right after an update: you can get an inconsistent, stale result, because the values returned from the second database do not yet reflect the updated values in the first. This can happen with any two databases, but it is especially common when the main database is a NoSQL database and the information is transformed into SQL for querying.

Background workers are often introduced to solve these problems. These workers extract data from one database, transform it in some way, and load it into the target database. When converting from a NoSQL database to a SQL database, the Extract, Transform, Load (ETL) process performs the following steps (a minimal sketch follows the list):

  • Extract: a MongoDB trigger fires whenever a record is created, updated, and so on, and a callback function is called asynchronously on a separate thread.
  • Transform: parts of the record are extracted, normalized, and put into the correct data structure (or row) to be inserted into SQL.
  • Load: the SQL database is updated in batches, or as single records when write volume is high.
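
Here is the promised sketch, highly simplified and with all names invented: the callback stands in for a MongoDB trigger, and SQLite stands in for the target SQL database.

```python
import sqlite3

# Target SQL database used for reporting.
sql_conn = sqlite3.connect("reporting.db")
sql_conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, customer TEXT, total REAL)"
)

def transform(document):
    # Normalize the nested NoSQL document into a flat row for SQL.
    return (
        str(document["_id"]),
        document["customer"]["name"],
        sum(item["amount"] for item in document["items"]),
    )

def load(rows):
    # Batch upsert into the SQL database.
    sql_conn.executemany(
        "INSERT OR REPLACE INTO orders (id, customer, total) VALUES (?, ?, ?)", rows
    )
    sql_conn.commit()

def on_record_changed(document):
    # Extract: in production this callback would be fired asynchronously
    # by a MongoDB trigger or change stream on a separate thread.
    load([transform(document)])

on_record_changed({
    "_id": "abc123",
    "customer": {"name": "alice"},
    "items": [{"amount": 10.0}, {"amount": 5.5}],
})
```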

This workflow is very common in financial, gaming, and reporting applications. In these cases, the changing schema requires a NoSQL database, but a SQL database is required for reporting, analysis, and aggregation.

Q8: Design Patterns in Big Data

Imagine that Alibaba needs to build a recommendation system to suggest suitable products to users. The data science team needs lots of data! They come to you, the data engineer, and ask you to create a separate staging data warehouse where they will clean and transform the data.

Such a request can feel overwhelming. When you have terabytes of data, you need multiple machines to process all that information, and database aggregation functions can become very complex operations. How do you query, aggregate, and make use of relatively big data efficiently?

MapReduce, originally introduced by Google and popularized by Apache Hadoop, follows the map, shuffle, reduce workflow. The idea is to map different pieces of data onto different machines, known as a cluster. Work can then be performed on the data, grouped by key, and finally aggregated in the final stage. This design pattern forms the basis of most big data workflows.
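A toy, single-machine sketch of the three stages using the classic word-count example (a real framework such as Hadoop or Spark would distribute these steps across a cluster):

```python
from collections import defaultdict

documents = ["big data is big", "data engineers love data"]

# Map: emit (key, value) pairs from each piece of input.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key, as if routed to the same node.
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce: aggregate the values for each key.
reduced = {word: sum(counts) for word, counts in shuffled.items()}

print(reduced)  # {'big': 2, 'data': 3, 'is': 1, 'engineers': 1, 'love': 1}
```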

Q9: What ETL Processes and Big Data Workflows Have in Common

Both workflows follow the producer-consumer pattern. A worker (the producer) produces some kind of data and outputs it to a pipeline. This pipeline can take many forms, including network messages and triggers. After the producer outputs the data, the consumer consumes and uses it. These workers usually run asynchronously and execute in separate processes.

The producer can be likened to the extract and transform steps of the ETL process. Similarly, in big data the mapper can be seen as the producer, while the reducer is effectively the consumer. This separation of concerns is extremely important and effective in application development and architecture design.
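A minimal producer-consumer sketch using Python's standard queue and threading modules; the in-process queue stands in for the pipeline, which in a real system might be a message broker or a database trigger:

```python
import queue
import threading

pipeline = queue.Queue()
SENTINEL = None  # signals that the producer is done

def producer():
    # Plays the role of extract/transform (or the mapper in big data).
    for record in ({"id": i, "value": i * 10} for i in range(5)):
        pipeline.put(record)
    pipeline.put(SENTINEL)

def consumer():
    # Plays the role of load (or the reducer).
    while True:
        record = pipeline.get()
        if record is SENTINEL:
            break
        print("loaded", record)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```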
