Importance of Configuration Files in Python: A Focus on ETL Pipelines
This is one of the most important concepts in Python project development; parts of this page are adapted from an external reference.
Introduction
When building complex Python projects like ETL (Extract, Transform, Load) pipelines, managing configuration effectively becomes crucial. Configuration files are an essential tool in this process, providing a centralized way to manage settings, improve maintainability, and enhance flexibility. Let's dive into what configuration files are, how they are helpful, and how to use them in your ETL projects.
What Are Configuration Files?
Configuration files store settings and parameters that a program needs to run. Instead of hardcoding values into your code, which can lead to maintenance challenges and inflexibility, configuration files allow you to separate configuration from code. These files can be in various formats, including JSON, YAML, INI, TOML, and environment variables.
- JSON: Easy to read and widely used, especially in web applications.
- YAML (YAML Ain't Markup Language): More human-readable than JSON, often used in DevOps and configuration management.
- INI: Simple format for key-value pairs, commonly used in older applications.
- TOML: Like INI files, TOML files define variables via key/value pairs. Unlike INI files, however, TOML stores the values of keys as the data type they are intended to be used as (integers, booleans, dates, and so on).
- Environment Variables: Managed by the operating system, suitable for sensitive information (for more details, check out my blog post on best habits of Python programming environment variables).
Why Use Configuration Files?
- Centralized Management: Configuration files allow you to manage settings in one place, making it easier to update and maintain.
- Flexibility: You can change configurations without modifying the codebase, facilitating easier adjustments across different environments (development, testing, production).
- Security: Sensitive information, like database credentials and API keys, can be managed securely using environment variables or encrypted configuration files.
- Scalability: As projects grow, configuration files help in managing complexity by organizing settings logically.
- Portability: Configuration files make it easier to share and deploy your code across different environments and systems.
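The flexibility point above can be illustrated with a small sketch: the same script serves different environments by switching one environment variable (the APP_ENV name and the settings shown are illustrative assumptions, not a convention from this article):

```python
import os

# Illustrative per-environment settings; in practice these would live in config files
configs = {
    "development": {"db_host": "localhost", "debug": True},
    "production": {"db_host": "db.internal", "debug": False},
}

# Pick the environment from APP_ENV without touching the code
env = os.getenv("APP_ENV", "development")
config = configs[env]
print(config["db_host"])
```

Deploying to production then only requires setting `APP_ENV=production`; no line of code changes.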
How to Use JSON Configuration Files in ETL Projects?
Let's consider an example ETL pipeline where we need to extract data from a source, transform it, and load it into a destination. We'll use a JSON configuration file to manage our settings.
- Step 1: Define the Configuration File: Create a config.json file with the necessary settings, such as source and destination details, transformation rules, and logging configurations (note that JSON does not allow comments):
{
  "source": {
    "type": "database",
    "host": "localhost",
    "port": 5432,
    "database": "source_db",
    "user": "source_user",
    "password": "source_password"
  },
  "destination": {
    "type": "database",
    "host": "localhost",
    "port": 5432,
    "database": "destination_db",
    "user": "dest_user",
    "password": "dest_password"
  },
  "transformations": [
    {"name": "remove_nulls", "fields": ["field1", "field2"]},
    {"name": "convert_types", "fields": {"field3": "int", "field4": "float"}}
  ],
  "logging": {
    "level": "INFO",
    "file": "etl.log"
  }
}
- Step 2: Load the Configuration File in Python: In the ETL script, load the configuration file to access the settings.
# Main python script
import json

def load_config(config_file):
    with open(config_file, 'r') as file:
        config = json.load(file)
    return config

config = load_config('config.json')
- Step 3: Use Configurations in ETL Pipeline: Utilize the loaded configurations in your ETL process.
# Main python script
import logging
import psycopg2
import psycopg2.extras

# Function to set up logging
def setup_logging(log_config):
    logging.basicConfig(filename=log_config['file'], level=log_config['level'])
    logging.info("Logging is set up.")

# Function to extract data from the source
def extract_data(source_config):
    connection = psycopg2.connect(
        host=source_config['host'],
        port=source_config['port'],
        database=source_config['database'],
        user=source_config['user'],
        password=source_config['password']
    )
    # Use a dict cursor so rows can be addressed by field name during transformation
    cursor = connection.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    cursor.execute("SELECT * FROM source_table")
    data = cursor.fetchall()
    connection.close()
    return data

# Function to transform the data
def transform_data(data, transformations):
    for transformation in transformations:
        if transformation['name'] == 'remove_nulls':
            data = [row for row in data if None not in row.values()]
        elif transformation['name'] == 'convert_types':
            for row in data:
                for field, dtype in transformation['fields'].items():
                    if dtype == 'int':
                        row[field] = int(row[field])
                    elif dtype == 'float':
                        row[field] = float(row[field])
    return data

# Function to load data into the destination
def load_data(destination_config, data):
    connection = psycopg2.connect(
        host=destination_config['host'],
        port=destination_config['port'],
        database=destination_config['database'],
        user=destination_config['user'],
        password=destination_config['password']
    )
    cursor = connection.cursor()
    for row in data:
        # Build the column list and placeholders from the row's keys
        columns = ', '.join(row.keys())
        placeholders = ', '.join(['%s'] * len(row))
        cursor.execute(
            f"INSERT INTO destination_table ({columns}) VALUES ({placeholders})",
            list(row.values())
        )
    connection.commit()
    connection.close()

# Main function to run the ETL pipeline
def main():
    config = load_config('config.json')
    setup_logging(config['logging'])
    logging.info("Starting ETL process.")
    data = extract_data(config['source'])
    transformed_data = transform_data(data, config['transformations'])
    load_data(config['destination'], transformed_data)
    logging.info("ETL process completed.")

if __name__ == '__main__':
    main()
- Step 4: Managing Sensitive Information: For sensitive information like passwords, consider using environment variables or secret management tools.
# Example with Environment Variables:
import os

source_password = os.getenv('SOURCE_PASSWORD')
destination_password = os.getenv('DEST_PASSWORD')

# Use these variables in the connection setup
source_config['password'] = source_password
destination_config['password'] = destination_password
Note: If we need to update the main function, then we can do it as follows:
def main():
    config = load_config('config.json')
    # Overwrite passwords with environment variables
    config['source']['password'] = os.getenv('SOURCE_PASSWORD', config['source']['password'])
    config['destination']['password'] = os.getenv('DEST_PASSWORD', config['destination']['password'])
    setup_logging(config['logging'])
    logging.info("Starting ETL process.")
    data = extract_data(config['source'])
    transformed_data = transform_data(data, config['transformations'])
    load_data(config['destination'], transformed_data)
    logging.info("ETL process completed.")
- Conclusion: Configuration files are invaluable in managing complex Python projects, especially ETL pipelines. They provide a centralized, flexible, and secure way to handle settings, making your codebase cleaner and easier to maintain. By separating configuration from code, you enhance the scalability and portability of your projects, ensuring they can adapt to various environments and requirements with minimal effort.
Setting Configuration Using Environment Variables
In the world of data engineering and machine learning, maintaining clean and secure code is paramount. One key practice that enhances both security and efficiency is the use of environment variables stored in a .env file in the project directory. The .env file is a simple text file used to store environment variables. These variables can include sensitive information like API keys, database credentials, and configuration settings.
- Security: Storing sensitive data in `.env` files keeps it out of your main codebase, reducing the risk of exposing secrets when sharing or uploading your code to platforms like GitHub.
- Configuration Management: `.env` files allow for easy management of different configurations for development, testing, and production environments. This makes your code more adaptable and easier to maintain.
- Simplicity: Loading environment variables from a `.env` file is straightforward, using libraries like `python-dotenv` in Python.
Loading the .env file
Create `.env` in the root directory of the project and save the environment variables, such as database URLs, API keys, and secret keys, which are sensitive information that you don't want to expose in your code. For example:
DATABASE_URL=your_database_url
API_KEY=your_api_key
SECRET_KEY=your_secret_key
To avoid sharing your secrets, add `.env` to the `.gitignore` file; this protects your sensitive information from being exposed publicly or shared unintentionally. To load the environment variables in your Python scripts, use:
from dotenv import load_dotenv
import os
load_dotenv()
database_url = os.getenv("DATABASE_URL")
api_key = os.getenv("API_KEY")
secret_key = os.getenv("SECRET_KEY")
where:
load_dotenv() loads the environment variables from the .env file, and
os.getenv() retrieves the value of the specified environment variable.
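Note that os.getenv() returns None when a variable is unset, which can surface as confusing errors much later. A minimal fail-fast sketch (require_env is a hypothetical helper name, not part of python-dotenv):

```python
import os

def require_env(name):
    """Return the environment variable's value, or raise a clear error if unset."""
    value = os.getenv(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

os.environ["DATABASE_URL"] = "postgres://localhost/demo"  # simulated for the example
print(require_env("DATABASE_URL"))
```

Calling the helper for a variable that is not set raises immediately, with the variable's name in the message, instead of failing later with a cryptic NoneType error.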
Configure from .ini
ini files are perhaps the most straightforward configuration files available to us. ini files are highly suitable for smaller projects, mostly because these files only support hierarchies one level deep. ini files are essentially flat files, with the exception that variables can belong to groups. The example below demonstrates how variables sharing a common theme may fall under a common title, such as [DATABASE] or [LOGS] (the config.ini is):
[APP]
ENVIRONMENT = development
DEBUG = False
[DATABASE]
USERNAME = root
PASSWORD = p@ssw0rd
HOST = 127.0.0.1
PORT = 5432
DB = my_database
[LOGS]
ERRORS = logs/errors.log
INFO = data/info.log
[FILES]
STATIC_FOLDER = static
TEMPLATES_FOLDER = templates
This structure surely makes things easier for humans to understand, but the practicality of this structure goes beyond aesthetics. Let's parse this file with Python's configparser library to see what's really happening. We get started by saving the contents of config.ini to a variable called config in a Python script:
"""Load configuration from .ini file."""
import configparser
# Read local file `config.ini`.
config = configparser.ConfigParser()
config.read('settings/config.ini')  # suppose our config.ini is in the 'settings' folder of the home directory
Calling read() on an ini file does much more than store plain data; our config variable is now a unique data structure, offering various methods for reading and writing values. Try running print(config) to see for yourself: <configparser.ConfigParser object at 0x10e58c390>. Now, using the config variable, we can get various values from config.ini (here the get() method returns the value as a string):
# Get values from our .ini file
config.get('DATABASE', 'HOST')     # retrieves the value associated with the 'HOST' key from the 'DATABASE' section
config['DATABASE']['HOST']         # alternative way to access the same value
config.getboolean('APP', 'DEBUG')  # retrieves the value associated with the 'DEBUG' key from the 'APP'
                                   # section and converts it to a boolean
The main difference between the first and second forms is that, if the key doesn't exist, the bracket-style access raises a KeyError instead of a configparser.NoOptionError. The getboolean() method retrieves the value associated with the 'DEBUG' key from the 'APP' section and converts it to a boolean.
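configparser also offers typed getters and fallback values, which keep missing keys from raising at all. A short sketch (it parses an inline string mirroring the config.ini above, rather than reading a file):

```python
import configparser

ini_text = """
[APP]
ENVIRONMENT = development
DEBUG = False

[DATABASE]
PORT = 5432
"""

config = configparser.ConfigParser()
config.read_string(ini_text)  # parse from a string instead of a file

print(config.getint("DATABASE", "PORT"))                # 5432, as an int
print(config.getboolean("APP", "DEBUG"))                # False, as a bool
print(config.get("APP", "LOG_LEVEL", fallback="INFO"))  # INFO; the default when the key is missing
```

The fallback parameter is the configparser equivalent of os.getenv()'s default argument: a way to make optional settings truly optional.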
Configuring Applications from .yaml Files
YAML is a human-readable data serialization standard that can be used in conjunction with all programming languages and is often used to write configuration files. Its syntax is minimal and straightforward, which makes it easier for humans to read and write compared to JSON or XML.
- Why Use YAML for Configuration?
- Readability: YAML’s syntax is clean and easy to understand.
- Hierarchical Data: It naturally represents hierarchical data, making it suitable for complex configurations.
- Language-agnostic: YAML can be used with any programming language, making it versatile.
- Basic Structure of a .yaml File: YAML uses indentation to express nesting and key: value pairs for individual settings (a basic example appears further below).
- Configuring Applications with YAML: Let’s walk through configuring an application using a YAML file. We’ll use a Python application as an example.
- Step 1: Create a YAML Configuration File: First, create a file named config.yaml:
app:
  name: MyApp
  version: 1.0
database:
  host: localhost
  port: 5432
  user: admin
  password: secret
logging:
  level: INFO
  file: /var/log/myapp.log
- Step 2: Load the YAML Configuration in Your Application: To read and use this configuration in your application, you'll need a YAML parser. In Python, PyYAML is a popular library for this purpose. Load the configuration file in your Python application (the loading code is shown further below).
- Step 3: Using the Configuration: With the configuration loaded, you can now use it to set up your application. For example, configuring a database connection and logging:
import logging
import psycopg2

# Configure logging
logging.basicConfig(level=config['logging']['level'],
                    filename=config['logging']['file'],
                    format='%(asctime)s %(levelname)s:%(message)s')

# Connect to the database
conn = psycopg2.connect(
    host=config['database']['host'],
    port=config['database']['port'],
    user=config['database']['user'],
    password=config['database']['password']
)

logging.info('Application started')
logging.info('Connected to the database')
- Advanced Configurations: YAML supports more complex data structures, which can be very useful for advanced configurations.
- Nested Configurations: You can nest configurations to represent complex settings:
server:
  host: localhost
  ports:
    http: 80
    https: 443
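Once a nested configuration is loaded into a Python dict, deeply nested lookups get verbose. A hedged sketch of a small dotted-path helper (get_nested is a hypothetical helper, not a PyYAML function):

```python
def get_nested(config, path, default=None):
    """Walk a nested dict using a dotted path like 'server.ports.https'."""
    node = config
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# The dict below mirrors the nested server configuration above
config = {"server": {"host": "localhost", "ports": {"http": 80, "https": 443}}}
print(get_nested(config, "server.ports.https"))               # 443
print(get_nested(config, "server.tls.cert", default="none"))  # none
```

Returning a default instead of raising makes the helper convenient for optional settings, at the cost of silently masking typos in the path.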
- Lists: YAML also supports lists, which can be useful for configurations like allowed hosts or user roles (see the example below).
- Anchors and Aliases: YAML supports anchors and aliases to reuse parts of the configuration, which helps avoid redundancy (see the example below).
- Error Handling: When working with configurations, it's essential to handle errors gracefully; for example, checking that required fields are present and logging errors (see the example below).
Basic structure example (config.yaml):
database:
  host: localhost
  port: 5432
  user: admin
  password: secret
features:
  enable_feature_x: true
  max_connections: 100
Loading the YAML configuration (Step 2):
import yaml

def load_config(file_path):
    with open(file_path, 'r') as file:
        config = yaml.safe_load(file)
    return config

config = load_config('config.yaml')

# Accessing configuration values
app_name = config['app']['name']
db_host = config['database']['host']
log_level = config['logging']['level']

print(f"App Name: {app_name}")
print(f"Database Host: {db_host}")
print(f"Log Level: {log_level}")
Lists example:
allowed_hosts:
  - localhost
  - 127.0.0.1
  - example.com
user_roles:
  - admin
  - user
  - guest
Anchors and aliases example:
defaults: &defaults
  adapter: postgres
  host: localhost

development:
  <<: *defaults
  database: dev_db

production:
  <<: *defaults
  database: prod_db
  host: prod_host
Error handling example:
try:
    db_host = config['database']['host']
except KeyError as e:
    logging.error(f'Missing configuration for {e}')
    raise
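Beyond catching a single KeyError, it can help to validate all required settings up front and report everything that is missing at once. A minimal sketch (validate_config is a hypothetical helper, and the config dict is an illustrative stand-in for a loaded YAML file):

```python
import logging

def validate_config(config, required):
    """Return a list of 'section.key' entries that are absent from the config."""
    missing = []
    for section, key in required:
        if section not in config or key not in config[section]:
            missing.append(f"{section}.{key}")
    return missing

config = {"database": {"host": "localhost"}, "logging": {}}
required = [("database", "host"), ("database", "port"), ("logging", "level")]

problems = validate_config(config, required)
if problems:
    logging.error("Missing configuration keys: %s", ", ".join(problems))
print(problems)  # ['database.port', 'logging.level']
```

Collecting every missing key before failing saves the fix-one-error, rerun, hit-the-next-error cycle.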
A more extensive config.yaml example:
appName: appName
logLevel: WARN

AWS:
  Region: us-east-1
  Resources:
    EC2:
      Type: "AWS::EC2::Instance"
      Properties:
        ImageId: "ami-0ff8a91507f77f867"
        InstanceType: t2.micro
        KeyName: testkey
        BlockDeviceMappings:
          - DeviceName: /dev/sdm
            Ebs:
              VolumeType: io1
              Iops: 200
              DeleteOnTermination: false
              VolumeSize: 20
    Lambda:
      Type: "AWS::Lambda::Function"
      Properties:
        Handler: "index.handler"
        Role:
          Fn::GetAtt:
            - "LambdaExecutionRole"
            - "Arn"
        Runtime: "python3.7"
        Timeout: 25
        TracingConfig:
          Mode: "Active"

routes:
  admin:
    url: /admin
    template: admin.html
    assets:
      templates: /templates
      static: /static
  dashboard:
    url: /dashboard
    template: dashboard.html
    assets:
      templates: /templates
      static: /static
  account:
    url: /account
    template: account.html
    assets:
      templates: /templates
      static: /static

databases:
  cassandra:
    host: example.cassandra.db
    username: user
    password: password
  redshift:
    jdbcURL: jdbc:redshift://<IP>:<PORT>/file?user=username&password=pass
    tempS3Dir: s3://path/to/redshift/temp/dir/
  redis:
    host: hostname
    port: port-number
    auth: authentication
    db: database
In this case, we can read the YAML file using the confuse library as:
# yaml_config.py
"""Load configuration from .yaml file."""
import confuse
config = confuse.Configuration('MyApp', __name__)
runtime = config['AWS']['Lambda']['Runtime'].get()
print(runtime)
Confuse also gets into the realm of building CLIs, allowing us to use our YAML file to inform arguments which can be passed to a CLI and their potential values:
# cli_config.py
import argparse
import confuse

config = confuse.Configuration('myapp')
parser = argparse.ArgumentParser()
parser.add_argument('--foo', help='a parameter')
args = parser.parse_args()
config.set_args(args)
print(config['foo'].get())