Announcement

👇Official Account👇

Welcome to join the group & private message

Article first/tail QR code

Skip to content

Common Encryption and Decryption Algorithm Feature Collection for Crawlers

Introduction

When developing web crawlers, encountering encrypted data is inevitable. Modern websites use various encryption algorithms to protect their data, making reverse engineering an essential skill for crawler developers. This article provides a comprehensive collection of common encryption algorithm features and their identification methods.

Why Understanding Encryption Matters

Encryption algorithms are widely used in web applications for:

  • Data Protection: Securing sensitive information during transmission
  • Anti-Crawling: Preventing automated data extraction
  • Authentication: Verifying user identity and requests
  • Data Integrity: Ensuring data hasn't been tampered with

As a crawler developer, understanding these algorithms helps you:

  • Identify encryption types quickly
  • Implement appropriate decryption methods
  • Debug and troubleshoot encrypted requests
  • Build more robust crawling solutions

Common Encryption Algorithms and Their Features

1. Base64 Encoding

Characteristics:

  • Output contains only A-Z, a-z, 0-9, +, /, and = (padding)
  • Length is always a multiple of 4
  • Ends with 0-2 padding characters (=)

Python Implementation:

python
import base64

# Encoding
def encode_base64(text):
    """Encode text to Base64"""
    encoded = base64.b64encode(text.encode('utf-8'))
    return encoded.decode('utf-8')

# Decoding
def decode_base64(encoded_text):
    """Decode Base64 to text"""
    decoded = base64.b64decode(encoded_text)
    return decoded.decode('utf-8')

# Example
original = "Hello, Crawler!"
encoded = encode_base64(original)
print(f"Encoded: {encoded}")  # SGVsbG8sIENyYXdsZXIh

decoded = decode_base64(encoded)
print(f"Decoded: {decoded}")  # Hello, Crawler!

Identification Tips:

  • Look for strings ending with = or ==
  • Check if characters match Base64 alphabet
  • Length divisible by 4

2. MD5 Hash

Characteristics:

  • Always produces 32-character hexadecimal string
  • Fixed output length (128 bits)
  • One-way function (cannot be reversed)

Python Implementation:

python
import hashlib

def md5_hash(text):
    """Generate MD5 hash"""
    return hashlib.md5(text.encode('utf-8')).hexdigest()

# Example
text = "crawler_secret_key"
hash_value = md5_hash(text)
print(f"MD5 Hash: {hash_value}")  # 32-character hex string

Identification Tips:

  • 32-character hexadecimal string
  • Lowercase letters and numbers (0-9, a-f)
  • Used for checksums, signatures, and password hashing

3. SHA-256 Hash

Characteristics:

  • 64-character hexadecimal string
  • Fixed output length (256 bits)
  • More secure than MD5

Python Implementation:

python
import hashlib

def sha256_hash(text):
    """Generate SHA-256 hash"""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Example
text = "crawler_data"
hash_value = sha256_hash(text)
print(f"SHA-256 Hash: {hash_value}")  # 64-character hex string

Identification Tips:

  • 64-character hexadecimal string
  • Commonly used in modern applications
  • More secure alternative to MD5

4. AES Encryption

Characteristics:

  • Block cipher (128-bit blocks)
  • Key sizes: 128, 192, or 256 bits
  • Output is binary (often Base64 encoded)

Python Implementation:

python
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad, unpad
import base64

def aes_encrypt(plaintext, key):
    """Encrypt using AES-256"""
    cipher = AES.new(key, AES.MODE_CBC)
    padded = pad(plaintext.encode('utf-8'), AES.block_size)
    ciphertext = cipher.encrypt(padded)
    iv = cipher.iv
    return base64.b64encode(iv + ciphertext).decode('utf-8')

def aes_decrypt(ciphertext, key):
    """Decrypt AES-256 encrypted data"""
    data = base64.b64decode(ciphertext)
    iv = data[:16]
    ciphertext = data[16:]
    cipher = AES.new(key, AES.MODE_CBC, iv)
    plaintext = unpad(cipher.decrypt(ciphertext), AES.block_size)
    return plaintext.decode('utf-8')

# Example
key = b'12345678901234567890123456789012'  # 32 bytes for AES-256
plaintext = "Crawler secret data"
encrypted = aes_encrypt(plaintext, key)
print(f"Encrypted: {encrypted}")

decrypted = aes_decrypt(encrypted, key)
print(f"Decrypted: {decrypted}")

Identification Tips:

  • Often Base64 encoded
  • Requires key and IV (initialization vector)
  • Block size is 16 bytes (128 bits)

5. RSA Encryption

Characteristics:

  • Asymmetric encryption (public/private key pair)
  • Variable output length
  • Commonly used for key exchange

Python Implementation:

python
from Crypto.PublicKey import RSA
from Crypto.Cipher import PKCS1_v1_5
import base64

def rsa_encrypt(plaintext, public_key_pem):
    """Encrypt using RSA public key"""
    key = RSA.import_key(public_key_pem)
    cipher = PKCS1_v1_5.new(key)
    ciphertext = cipher.encrypt(plaintext.encode('utf-8'))
    return base64.b64encode(ciphertext).decode('utf-8')

def rsa_decrypt(ciphertext, private_key_pem):
    """Decrypt using RSA private key"""
    key = RSA.import_key(private_key_pem)
    cipher = PKCS1_v1_5.new(key)
    data = base64.b64decode(ciphertext)
    plaintext = cipher.decrypt(data, None)
    return plaintext.decode('utf-8') if plaintext else None

Identification Tips:

  • Requires public/private key pair
  • Used for secure key exchange
  • Often combined with symmetric encryption

Practical Crawler Examples

Example 1: Decoding Base64 API Responses

python
import requests
import base64
import json

def crawl_encrypted_api(url):
    """Crawl API with Base64 encoded responses"""
    response = requests.get(url)
    data = response.json()
    
    # Decode Base64 encoded content
    if 'encrypted_data' in data:
        decoded = base64.b64decode(data['encrypted_data'])
        content = json.loads(decoded.decode('utf-8'))
        return content
    
    return data

Example 2: Handling MD5 Signed Requests

python
import requests
import hashlib
import time

def create_signed_request(url, params, secret_key):
    """Create request with MD5 signature"""
    # Add timestamp
    params['timestamp'] = int(time.time())
    
    # Sort parameters
    sorted_params = sorted(params.items())
    
    # Create signature string
    sign_str = '&'.join([f"{k}={v}" for k, v in sorted_params])
    sign_str += f"&key={secret_key}"
    
    # Generate MD5 signature
    signature = hashlib.md5(sign_str.encode('utf-8')).hexdigest()
    params['sign'] = signature
    
    # Make request
    response = requests.get(url, params=params)
    return response.json()

Algorithm Identification Checklist

When analyzing encrypted data in crawlers:

  1. Check Length:

    • Base64: Multiple of 4
    • MD5: 32 characters
    • SHA-256: 64 characters
  2. Check Character Set:

    • Base64: A-Z, a-z, 0-9, +, /, =
    • Hex: 0-9, a-f (lowercase)
  3. Check Patterns:

    • Padding characters (=)
    • Fixed length outputs
    • Encoding indicators
  4. Use Tools:

    • Browser DevTools (Network tab)
    • Python hashlib and base64 modules
    • Online decoders (for testing)

Best Practices

  1. Always Test Decryption: Verify your decryption works before implementing
  2. Handle Errors Gracefully: Encryption can fail - add error handling
  3. Respect Rate Limits: Don't abuse decryption endpoints
  4. Cache Results: Store decrypted data to avoid repeated operations
  5. Use Libraries: Leverage established libraries like pycryptodome


Conclusion

Understanding encryption algorithms is crucial for modern web crawlers. By recognizing algorithm features and implementing appropriate decryption methods, you can build more robust and effective crawling solutions. Remember to always respect website terms of service and use these techniques responsibly.

Key Takeaways:

  • Base64 is encoding, not encryption
  • MD5 and SHA are one-way hashes
  • AES is symmetric encryption (needs key)
  • RSA is asymmetric encryption (needs key pair)
  • Always test your decryption implementations

"The best crawler is one that can adapt to different encryption schemes." — PFinal南丞

Last updated: