Converting Weather XML to JSON: Tools & Techniques

How to Parse Weather XML Feeds in Python

Weather data is commonly distributed as XML feeds by meteorological services, government agencies, and private APIs. XML is structured, predictable, and well-suited for representing hierarchical weather information such as forecasts, observation stations, timestamps, and nested elements like temperature, wind, and precipitation. This article shows practical, production-minded ways to parse weather XML feeds in Python, from quick scripts to robust solutions that handle namespaces, stream large feeds, and convert to JSON or pandas for analysis.


When to expect XML weather feeds

Many providers still use XML (e.g., NOAA, Met Office, some WMS/WFS services, custom SOAP-based APIs). XML feeds are common when:

  • the data provider offers rich hierarchical metadata,
  • strict schemas (XSD) ensure consistency,
  • legacy systems remain in place.

If you have the choice, JSON is often simpler to work with, but XML remains prevalent in official meteorological services.


Libraries to consider

  • xml.etree.ElementTree (stdlib) — convenient, minimal dependencies, good for small to medium documents.
  • lxml — faster, supports XPath, XSLT, robust namespace handling.
  • xmltodict — converts XML to Python dicts (convenient for quick conversion to JSON).
  • BeautifulSoup (with “xml” parser) — forgiving parser for messy XML.
  • defusedxml — security-hardened variants (DefusedXML) — use when parsing untrusted XML to prevent XML external entity (XXE) attacks.
  • requests — for fetching feeds over HTTP(S).
  • aiohttp / httpx (async) — for asynchronous fetching.

Use defusedxml or lxml with secure parser settings when you cannot trust the feed source.


Example feeds and formats

A typical weather XML snippet may look like:

    <weatherstation>
      <name>Downtown</name>
      <observation_time>2025-09-03T10:00:00Z</observation_time>
      <temperature unit="C">18.3</temperature>
      <wind>
        <speed unit="m/s">5.2</speed>
        <direction degrees="270">W</direction>
      </wind>
      <precipitation last_hour="0.0" />
    </weatherstation>

Other feeds use more complex schemas, namespaces, or nested forecast periods:

    <product xmlns="http://www.example.org/schema">
      <time from="2025-09-03T12:00:00Z" to="2025-09-03T18:00:00Z">
        <location>
          <temperature unit="C" value="20"/>
          <windDirection deg="260" code="W"/>
        </location>
      </time>
    </product>

Namespaces and attributes are common; robust parsing must handle them.


Basic parsing with ElementTree (stdlib)

This is sufficient for well-formed, reasonably sized XML.

  1. Fetching the feed:

      import requests
      from xml.etree import ElementTree as ET

      url = "https://example.com/weather.xml"
      resp = requests.get(url, timeout=10)
      resp.raise_for_status()
      root = ET.fromstring(resp.content)

  2. Extracting simple values:

      name = root.findtext('name')
      temp = root.findtext('temperature')
      wind_speed = root.find('wind/speed').text  # or root.findtext('wind/speed')

  3. Handling attributes:

      temp_elem = root.find('temperature')
      temp_value = float(temp_elem.text)
      temp_unit = temp_elem.get('unit')

ElementTree does not handle XML namespaces automatically; you’ll need to supply namespace-aware paths.
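
The three steps above can be combined into one small function. The following is a minimal sketch that parses the sample station document from a string instead of fetching it, so it runs without network access. The helper names (parse_station, _to_float, _attr) and the output keys are illustrative choices, not part of any library.

```python
from xml.etree import ElementTree as ET

SAMPLE = """<weatherstation>
  <name>Downtown</name>
  <observation_time>2025-09-03T10:00:00Z</observation_time>
  <temperature unit="C">18.3</temperature>
  <wind>
    <speed unit="m/s">5.2</speed>
    <direction degrees="270">W</direction>
  </wind>
  <precipitation last_hour="0.0" />
</weatherstation>"""


def _to_float(text):
    """Convert text to float, returning None for missing or empty values."""
    return float(text) if text not in (None, "") else None


def _attr(elem, name):
    """Read an attribute from an element that may be absent (None)."""
    return elem.get(name) if elem is not None else None


def parse_station(xml_text):
    """Return a flat dict of the fields in the sample station document."""
    root = ET.fromstring(xml_text)
    temp = root.find("temperature")
    return {
        "name": root.findtext("name"),
        "observed_at": root.findtext("observation_time"),
        # find/findtext return None for missing elements, so guard each read
        "temperature": _to_float(temp.text if temp is not None else None),
        "temperature_unit": _attr(temp, "unit"),
        "wind_speed": _to_float(root.findtext("wind/speed")),
        "wind_degrees": _to_float(_attr(root.find("wind/direction"), "degrees")),
        "precip_last_hour": _to_float(_attr(root.find("precipitation"), "last_hour")),
    }


station = parse_station(SAMPLE)
print(station)
```

Because every lookup is guarded, a feed that omits an element yields None for that field instead of raising AttributeError.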


Handling namespaces

If the document uses namespaces, ElementTree requires you to include them in search paths:

    <product xmlns="http://www.example.org/schema">
      <time>...</time>
    </product>

Use a namespace map:

    ns = {'m': 'http://www.example.org/schema'}
    times = root.findall('m:time', ns)
    for t in times:
        loc = t.find('m:location', ns)
        temp = loc.find('m:temperature', ns).get('value')

lxml provides better XPath and namespace support:

    from lxml import etree

    doc = etree.fromstring(resp.content)
    temps = doc.xpath('//m:temperature/@value', namespaces={'m': 'http://www.example.org/schema'})
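
As a self-contained illustration of the namespace map approach, the sketch below parses the namespaced forecast sample with the standard library only. The function name forecast_periods and the output keys are hypothetical. Note that the from/to attributes carry no prefix, so they are read without the namespace.

```python
from xml.etree import ElementTree as ET

NS = {"m": "http://www.example.org/schema"}

SAMPLE = """<product xmlns="http://www.example.org/schema">
  <time from="2025-09-03T12:00:00Z" to="2025-09-03T18:00:00Z">
    <location>
      <temperature unit="C" value="20"/>
      <windDirection deg="260" code="W"/>
    </location>
  </time>
</product>"""


def forecast_periods(xml_text):
    """Return one dict per <time> element in the namespaced document."""
    root = ET.fromstring(xml_text)
    periods = []
    for t in root.findall("m:time", NS):
        temp = t.find("m:location/m:temperature", NS)
        value = temp.get("value") if temp is not None else None
        periods.append({
            "from": t.get("from"),  # unprefixed attributes are not namespaced
            "to": t.get("to"),
            "temp": float(value) if value is not None else None,
            "unit": temp.get("unit") if temp is not None else None,
        })
    return periods


periods = forecast_periods(SAMPLE)
print(periods)

# ElementTree stores namespaced tags in "{uri}local" form, which is why a
# plain root.find("time") finds nothing in this document.
root_tag = ET.fromstring(SAMPLE).tag
print(root_tag)
```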

Converting XML to dict/JSON quickly

For many applications, converting the XML to a Python dict or JSON simplifies downstream processing.

xmltodict example:

    import json
    import xmltodict

    doc = xmltodict.parse(resp.content)
    json_str = json.dumps(doc)

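Note: xmltodict prefixes attribute names with “@” and stores element text under “#text”, e.g., {"temperature": {"@unit": "C", "#text": "18.3"}}. Depending on the version, the result is an OrderedDict or a plain dict. All values are strings, so post-processing is usually needed to simplify that shape and convert numbers.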
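
One possible post-processing step is sketched below. To keep it runnable without third-party packages, it works on a hand-written dict in the shape described above rather than on real xmltodict output. The simplify function and its conventions (dropping the “@” prefix, renaming “#text” to “value”, converting numeric-looking strings to floats) are assumptions made for illustration.

```python
import json


def _coerce(text):
    """Return a float when the string looks numeric, else the input unchanged."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return text


def simplify(node):
    """Recursively strip '@' prefixes and rename '#text' to 'value'.

    Assumption: no element has an attribute literally named 'value', since it
    would collide with the renamed '#text' key. Strings such as 'nan' or 'inf'
    would also be turned into floats, so adjust _coerce if names can look
    like numbers.
    """
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "#text":
                out["value"] = _coerce(value)
            elif key.startswith("@"):
                out[key[1:]] = _coerce(value)
            else:
                out[key] = simplify(value)
        return out
    if isinstance(node, list):
        return [simplify(item) for item in node]
    return _coerce(node)


# Hand-written stand-in for xmltodict output of the sample station document.
raw = {
    "weatherstation": {
        "name": "Downtown",
        "temperature": {"@unit": "C", "#text": "18.3"},
        "wind": {
            "speed": {"@unit": "m/s", "#text": "5.2"},
            "direction": {"@degrees": "270", "#text": "W"},
        },
        "precipitation": {"@last_hour": "0.0"},
    }
}

clean = simplify(raw)
print(json.dumps(clean, indent=2))
```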


Streaming large feeds

For very large feeds, avoid loading the entire document into memory. Use iterative parsing:

ElementTree iterparse:

    import xml.etree.ElementTree as ET

    context = ET.iterparse('huge_weather.xml', events=('end',))
    for event, elem in context:
        if elem.tag == 'weatherstation':
            # process station
            station_name = elem.findtext('name')
            # ... store or stream out
            elem.clear()  # free memory

lxml.etree.iterparse is similar but faster and supports clean-up patterns for namespaces and parent clearing.
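
The iterparse pattern fits naturally into a generator, so callers can consume stations one at a time. The sketch below uses an in-memory byte stream as a stand-in for a large file. The iter_stations name, the weatherdata wrapper element, and the second station are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_stations(source, tag="weatherstation"):
    """Yield one dict per station element, clearing each element after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy out what we need before clearing the element.
            yield {
                "name": elem.findtext("name"),
                "temperature": elem.findtext("temperature"),
            }
            elem.clear()  # drop the element's children, text, and attributes


# Stand-in for a large file; iterparse accepts a filename or a file object.
feed = io.BytesIO(
    b"<weatherdata>"
    b"<weatherstation><name>Downtown</name>"
    b"<temperature unit='C'>18.3</temperature></weatherstation>"
    b"<weatherstation><name>Airport</name>"
    b"<temperature unit='C'>16.9</temperature></weatherstation>"
    b"</weatherdata>"
)

# list() is used here only to show the result; in a real pipeline, iterate
# over the generator so that only one station is held at a time.
stations = list(iter_stations(feed))
print(stations)
```

With the standard library, cleared elements stay attached to the root as empty shells, so memory use still grows slowly on very long feeds.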


Robust production tips

  • Validate against a provided XSD when available to catch schema issues early.
  • Handle missing or malformed numeric fields with try/except and sensible defaults.
  • Respect rate limits and use caching for frequently-requested resources.
  • Log parsing errors with enough context (feed URL, element path, sample content).
  • Use timeouts and retries with exponential backoff when fetching feeds.
  • Normalize units (e.g., always convert temperatures to Celsius or Kelvin) and record original units.
  • Build JSON with a serializer such as json.dumps, which escapes special characters correctly, rather than assembling JSON strings by hand.
  • When exposing parsed data to clients, explicitly control which fields are included.
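
Two of the tips above, tolerant numeric parsing and unit normalization, can be sketched as follows. The function names, the logger name, and the choice to treat a missing unit as Celsius are assumptions for illustration. A real feed's documentation should decide the default.

```python
import logging

logger = logging.getLogger("weather_feed")  # hypothetical logger name


def parse_float(text, field, default=None):
    """Convert text to float, logging and returning a default on failure."""
    try:
        return float(text)
    except (TypeError, ValueError):
        logger.warning("Bad numeric value for %s: %r", field, text)
        return default


def to_celsius(value, unit):
    """Convert a temperature to Celsius; a missing unit is assumed to be C."""
    if value is None:
        return None
    unit = (unit or "C").strip().upper()
    if unit in ("C", "CELSIUS"):
        return value
    if unit in ("F", "FAHRENHEIT"):
        return (value - 32.0) * 5.0 / 9.0
    if unit in ("K", "KELVIN"):
        return value - 273.15
    raise ValueError(f"Unknown temperature unit: {unit!r}")


def normalize_temperature(text, unit):
    """Return the Celsius value alongside the original reading and unit."""
    value = parse_float(text, "temperature")
    return {
        "temp_c": to_celsius(value, unit),
        "original_value": text,
        "original_unit": unit,
    }


print(normalize_temperature("212", "F"))
print(normalize_temperature("n/a", "C"))
```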

Example — end-to-end script: fetch, parse, convert to pandas

    import requests
    import xmltodict
    import pandas as pd

    url = "https://example.com/weather.xml"
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    doc = xmltodict.parse(r.content)

    # This depends on feed structure. Example adaptation:
    stations = doc['weatherdata']['station']  # might be a list or dict
    rows = []
    for s in (stations if isinstance(stations, list) else [stations]):
        rows.append({
            'id': s.get('id'),
            'name': s.get('name'),
            'temp_C': float(s['temperature']['#text']),
            'temp_unit': s['temperature'].get('@unit', 'C'),
            'obs_time': s.get('observation_time')
        })
    df = pd.DataFrame(rows)
    print(df.head())

Adjust paths to match the actual XML structure. Use pandas for time-series resampling, grouping by station, or merging with geospatial data.


Handling edge cases

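  • Mixed content and CDATA: XML parsers handle CDATA transparently; read it through .text or an XPath text() expression. For mixed content in ElementTree, text that follows a child element is stored in that child's .tail.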
  • Multiple elements vs singletons: many XML-to-dict conversions produce a list for repeated elements and a dict for a single occurrence. Normalize by coercing non-list to list when iterating.
  • Time zones and timestamps: parse with dateutil.parser.parse and normalize to UTC, or use Pendulum for better timezone handling.
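  • XML entities and encodings: pass the raw bytes (resp.content) to the parser so it can honor the encoding declared in the XML prolog. If you decode to text first, requests chooses the encoding from the HTTP headers, which can produce garbled text when the headers and the XML declaration disagree.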
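
Two of these edge cases, singleton-versus-list values and timestamp normalization, can be handled with small helpers. The sketch below uses only the standard library in place of dateutil or Pendulum, so it covers ISO 8601 timestamps only. The helper names, and the decision to treat timestamps without an offset as UTC, are assumptions.

```python
from datetime import datetime, timezone


def as_list(value):
    """Wrap a single item in a list so callers can always iterate."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def to_utc(timestamp):
    """Parse an ISO 8601 timestamp and return an aware datetime in UTC."""
    text = timestamp.strip()
    # Older Python versions reject a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit offset.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(as_list({"name": "Downtown"}))
print(to_utc("2025-09-03T10:00:00Z"))
print(to_utc("2025-09-03T12:00:00+02:00"))
```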

Security considerations

  • Disable external entity resolution to prevent XXE. Use defusedxml or configure lxml’s parser to forbid DTDs/external entities.
  • Treat feeds as untrusted input — validate and limit sizes to avoid denial-of-service via huge payloads.
  • Prefer HTTPS to avoid man-in-the-middle tampering.
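
As a rough illustration of the size limit and the "no DTD" policy, the sketch below applies two cheap checks before handing bytes to the standard-library parser. It is a crude pre-filter, not a substitute for defusedxml: the byte scan assumes an ASCII-compatible encoding such as UTF-8, and it will also reject harmless documents that merely contain the text "<!DOCTYPE" in a comment or CDATA section. The size cap, the exception class, and the function name are hypothetical.

```python
from xml.etree import ElementTree as ET

MAX_FEED_BYTES = 5 * 1024 * 1024  # hypothetical cap; tune to your feeds


class UnsafeFeedError(ValueError):
    """Raised when a feed is rejected before parsing."""


def parse_untrusted(xml_bytes, max_bytes=MAX_FEED_BYTES):
    """Apply a size cap and a naive DTD check, then parse the document."""
    if len(xml_bytes) > max_bytes:
        raise UnsafeFeedError(f"feed is {len(xml_bytes)} bytes, limit is {max_bytes}")
    # Entity-based attacks need a DTD, and weather feeds rarely declare one,
    # so this sketch refuses any document that appears to contain one.
    if b"<!DOCTYPE" in xml_bytes or b"<!ENTITY" in xml_bytes:
        raise UnsafeFeedError("feed declares a DTD or entity; refusing to parse")
    return ET.fromstring(xml_bytes)


root = parse_untrusted(b"<weatherstation><name>Downtown</name></weatherstation>")
print(root.findtext("name"))
```

When fetching over HTTP, the size check is most useful if it is applied while reading the response, before the whole body is in memory.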

Converting parsed data to useful outputs

  • GeoJSON for mapping station points (include lat/lon attributes).
  • CSV or Parquet for storage and analytics.
  • JSON REST endpoints for downstream apps.
  • Time-series databases (InfluxDB, TimescaleDB) for long-term storage and visualization.
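
For the GeoJSON case, a minimal conversion might look like the sketch below. It assumes each parsed station is a dict with "lat" and "lon" keys, which are hypothetical field names, and the coordinates shown are made up. GeoJSON positions are ordered longitude first, then latitude.

```python
import json


def stations_to_geojson(stations):
    """Build a GeoJSON FeatureCollection of Point features from station dicts."""
    features = []
    for s in stations:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON order is [longitude, latitude]
                "coordinates": [float(s["lon"]), float(s["lat"])],
            },
            # Everything except the coordinates becomes a feature property.
            "properties": {k: v for k, v in s.items() if k not in ("lat", "lon")},
        })
    return {"type": "FeatureCollection", "features": features}


# Invented example data; real values come from the parsed feed.
stations = [
    {"name": "Downtown", "lat": "40.71", "lon": "-74.01", "temp_c": 18.3},
]
collection = stations_to_geojson(stations)
print(json.dumps(collection, indent=2))
```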

Quick reference snippets

  • Namespace-aware ElementTree find:
    ns = {'m': 'http://www.example.org/schema'}
    elem = root.find('m:location/m:temperature', ns)
  • Safe parsing with defusedxml:
    from defusedxml.ElementTree import fromstring

    root = fromstring(xml_bytes)
  • Iterative parsing pattern with clearing:
    for event, elem in ET.iterparse('file.xml', events=('end',)):
        if elem.tag == 'weatherstation':
            process(elem)
            elem.clear()

Conclusion

Parsing weather XML feeds in Python ranges from simple scripts using ElementTree to robust systems using lxml, streaming parsers, validation, and secure parsing practices. Choose tools based on feed size, complexity (namespaces/XSD), and security requirements. Converting XML into normalized JSON or pandas DataFrames often simplifies analysis and downstream usage.


