At this point you know how to install Julia, run code in VS Code or the REPL, write functions, and create reproducible environments with Project.toml.
In this chapter we will put those skills to work on something you will do constantly in geoscience: reading data from files, reshaping it, and writing results back out. We start with plain text files, move on to delimited numeric data, then to CSV tables with the DataFrames ecosystem, and finish with a tour of specialised geoscientific formats (NetCDF, HDF5, SEG-Y, Shapefiles). We create small example files right here in the code so you can run every block and see the results yourself.
If some of the syntax looks unfamiliar at first, especially the do ... end pattern or the => arrow, don’t worry. These are common Julia patterns that will become second nature with practice. The important thing is to understand what each block does, not to memorise the syntax right now.
2.1 Writing and reading text files
The simplest kind of file is plain text. Let’s create one:
# Create a small text file
open("stations.txt", "w") do f
    write(f, "Station Latitude Longitude Elevation_m\n")
    write(f, "AAL 64.35 25.75 120\n")
    write(f, "BBR 61.50 23.80 95\n")
    write(f, "CCK 60.17 24.94 15\n")
end
What happened here? open("stations.txt", "w") creates (or overwrites) a file named stations.txt. The "w" means “write mode.” The do f ... end block is Julia’s way of saying “while the file is open, call it f and do these things, then close it automatically.” Each write call puts a line of text into the file.
Station Latitude Longitude Elevation_m
AAL 64.35 25.75 120
BBR 61.50 23.80 95
CCK 60.17 24.94 15
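The listing above comes from reading the whole file back in one call and printing it:

```julia
# Read the entire file into a single String and print it
content = read("stations.txt", String)
print(content)
```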
read("stations.txt", String) loads the entire file into a single string. For a small file like this, that is perfectly fine.
2.1.1 Reading line by line
Sometimes you want to process a file one line at a time, for instance to skip a header or to parse each row:
open("stations.txt", "r") do f
    for line in eachline(f)
        println("→ ", line)
    end
end
→ Station Latitude Longitude Elevation_m
→ AAL 64.35 25.75 120
→ BBR 61.50 23.80 95
→ CCK 60.17 24.94 15
The "r" means “read mode.” eachline(f) gives you one line at a time, which is memory-efficient even for very large files.
2.1.2 Splitting lines into columns
In geoscience data files, columns are usually separated by spaces, tabs, or commas. You can split a line into pieces with split():
open("stations.txt", "r") do f
    header = readline(f)   # read and skip the header
    for line in eachline(f)
        parts = split(line)   # splits on whitespace by default
        name = parts[1]
        lat  = parse(Float64, parts[2])
        lon  = parse(Float64, parts[3])
        elev = parse(Float64, parts[4])
        println("$name is at ($lat, $lon), elevation $elev m")
    end
end
AAL is at (64.35, 25.75), elevation 120.0 m
BBR is at (61.5, 23.8), elevation 95.0 m
CCK is at (60.17, 24.94), elevation 15.0 m
split(line) breaks the string into a vector of substrings. parse(Float64, ...) converts a text string like "64.35" into an actual number. This pattern (read a line, split it, parse the pieces) is the bread and butter of working with geoscience text files.
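The same pattern can be wrapped into a small helper that returns the parsed columns. The function name `read_stations` and its return layout are invented here for illustration; they are not part of the chapter's code:

```julia
# Parse a whitespace-delimited station file into three vectors
# (helper name and return layout are illustrative)
function read_stations(path)
    names = String[]
    lats  = Float64[]
    lons  = Float64[]
    open(path, "r") do f
        readline(f)   # skip the header line
        for line in eachline(f)
            parts = split(line)
            push!(names, parts[1])
            push!(lats, parse(Float64, parts[2]))
            push!(lons, parse(Float64, parts[3]))
        end
    end
    return names, lats, lons
end
```

Usage: `names, lats, lons = read_stations("stations.txt")`.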
2.2 Delimited data with DelimitedFiles
For files with a regular grid of numbers (like gravity measurements or temperature readings), Julia has a built-in module called DelimitedFiles that reads the whole file into a matrix in one step.
Let’s create a small data file and read it:
# Create a space-delimited data file (no header, just numbers)
open("gravity.txt", "w") do f
    write(f, "64.35 25.75 -12.3\n")
    write(f, "64.40 25.80 -11.8\n")
    write(f, "64.45 25.85 -13.1\n")
    write(f, "64.50 25.90 -12.7\n")
    write(f, "64.55 25.95 -14.0\n")
end
using DelimitedFiles

data = readdlm("gravity.txt")   # reads into a matrix
println("Size: ", size(data))   # 5 rows, 3 columns
Size: (5, 3)
Now you can pull out individual columns:
lat = data[:, 1]   # all rows, first column
lon = data[:, 2]   # all rows, second column
gz  = data[:, 3]   # all rows, third column
println("Latitudes: ", lat)
println("Gravity anomalies: ", gz)
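To write a corrected version of the data back to disk, the columns can be glued together with hcat and saved with writedlm. The +0.5 shift applied here is purely illustrative, not a real correction:

```julia
# Apply an (illustrative) correction and write the result to a new file
gz_corr = gz .+ 0.5                    # hypothetical correction
corrected = hcat(lat, lon, gz_corr)    # stick the columns into a matrix
writedlm("gravity_corrected.txt", corrected, '\t')
```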
hcat stands for “horizontal concatenate”. It sticks columns together into a matrix. writedlm writes the matrix to a file, using the tab character '\t' as the separator.
2.3 CSV files and DataFrames
CSV (Comma-Separated Values) is probably the most common data format in science. Julia has excellent CSV support through the CSV.jl and DataFrames.jl packages.
Note: Installing packages
If you haven’t installed these packages yet, press ] in the REPL to enter Pkg mode and type:
add CSV DataFrames
Or from normal Julia code: using Pkg; Pkg.add(["CSV", "DataFrames"]). You only need to do this once.
2.3.1 Creating a DataFrame
A DataFrame is a table: rows and columns, like a spreadsheet. Each column has a name and all values in a column have the same type. If you have used pandas in Python, R’s data.frame, or even Excel, the concept is the same.
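A table of borehole samples, matching the values used in the examples of this section, can be built by passing named columns to the DataFrame constructor:

```julia
using DataFrames

samples = DataFrame(
    borehole  = ["BH-01", "BH-01", "BH-01", "BH-02", "BH-02"],
    depth_m   = [10.0, 20.0, 30.0, 15.0, 25.0],
    rock_type = ["granite", "granite", "gneiss", "granite", "schist"],
    density   = [2.65, 2.68, 2.75, 2.63, 2.71]
)
```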
# Multiple columns - returns a new DataFrame
select(samples, :borehole, :density)
5×2 DataFrame
 Row │ borehole  density
     │ String    Float64
─────┼───────────────────
   1 │ BH-01        2.65
   2 │ BH-01        2.68
   3 │ BH-01        2.75
   4 │ BH-02        2.63
   5 │ BH-02        2.71
The : before a column name makes it a Symbol, Julia’s way of referring to names. You will see this pattern everywhere in DataFrames.
2.3.4 Filtering rows
Keep only rows that match a condition:
# Samples deeper than 20 m
filter(row -> row.depth_m > 20, samples)
2×4 DataFrame
 Row │ borehole  depth_m  rock_type  density
     │ String    Float64  String     Float64
─────┼───────────────────────────────────────
   1 │ BH-01        30.0  gneiss        2.75
   2 │ BH-02        25.0  schist        2.71
The row -> row.depth_m > 20 part is an anonymous function, a small throwaway function that takes one argument (row) and returns true or false. Don’t worry about the syntax too much; think of it as saying “keep rows where depth is greater than 20.”
# Only granite samples
filter(row -> row.rock_type == "granite", samples)
3×4 DataFrame
 Row │ borehole  depth_m  rock_type  density
     │ String    Float64  String     Float64
─────┼───────────────────────────────────────
   1 │ BH-01        10.0  granite       2.65
   2 │ BH-01        20.0  granite       2.68
   3 │ BH-02        15.0  granite       2.63
2.3.5 Adding and transforming columns
# Add a new column
samples.weight_kg = samples.density .* 1000.0   # density × 1000
samples
5×5 DataFrame
 Row │ borehole  depth_m  rock_type  density  weight_kg
     │ String    Float64  String     Float64  Float64
─────┼──────────────────────────────────────────────────
   1 │ BH-01        10.0  granite       2.65     2650.0
   2 │ BH-01        20.0  granite       2.68     2680.0
   3 │ BH-01        30.0  gneiss        2.75     2750.0
   4 │ BH-02        15.0  granite       2.63     2630.0
   5 │ BH-02        25.0  schist        2.71     2710.0
The .* (dot-star) applies the multiplication to every row. You saw this broadcasting pattern in Chapter 1.
2.3.6 Grouping and summarising
This is where DataFrames really shine. Group your data by a category and compute summaries:
using Statistics

# Average density per borehole
gdf = groupby(samples, :borehole)
combine(gdf, :density => mean => :avg_density)
2×2 DataFrame
 Row │ borehole  avg_density
     │ String    Float64
─────┼───────────────────────
   1 │ BH-01         2.69333
   2 │ BH-02         2.67
Read this as: “group the table by :borehole, then for the :density column compute the mean and call the result :avg_density.” The => arrows connect input → function → output name.
# Multiple summaries at once
combine(groupby(samples, :rock_type),
    :density => mean    => :avg_density,
    :density => minimum => :min_density,
    :depth_m => maximum => :deepest)
3×4 DataFrame
 Row │ rock_type  avg_density  min_density  deepest
     │ String     Float64      Float64      Float64
─────┼──────────────────────────────────────────────
   1 │ granite        2.65333         2.63     20.0
   2 │ gneiss         2.75            2.75     30.0
   3 │ schist         2.71            2.71     25.0
2.3.7 Joining tables
In real projects your data is often spread across multiple files. Joining brings them together:
# A second table with borehole coordinates
locations = DataFrame(
    borehole  = ["BH-01", "BH-02", "BH-03"],
    latitude  = [64.35, 64.40, 64.45],
    longitude = [25.75, 25.80, 25.85]
)

# Inner join - only boreholes that appear in both tables
innerjoin(samples, locations, on = :borehole)
5×7 DataFrame
 Row │ borehole  depth_m  rock_type  density  weight_kg  latitude  longitude
     │ String    Float64  String     Float64  Float64    Float64   Float64
─────┼───────────────────────────────────────────────────────────────────────
   1 │ BH-01        10.0  granite       2.65     2650.0     64.35      25.75
   2 │ BH-01        20.0  granite       2.68     2680.0     64.35      25.75
   3 │ BH-01        30.0  gneiss        2.75     2750.0     64.35      25.75
   4 │ BH-02        15.0  granite       2.63     2630.0     64.4       25.8
   5 │ BH-02        25.0  schist        2.71     2710.0     64.4       25.8
innerjoin keeps only rows where the :borehole value exists in both tables. BH-03 has no samples, so it is dropped. If you want to keep all locations even when there are no matching samples, use leftjoin or outerjoin instead.
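To see the difference, put locations on the left of a leftjoin: BH-03 is kept, and its sample columns are filled with missing:

```julia
# Keep every location, even boreholes with no samples
leftjoin(locations, samples, on = :borehole)
```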
2.3.8 Handling missing data
Real data has gaps. Julia represents missing values with a special value called missing:
# Create data with a gap
obs = DataFrame(
    station = ["A", "B", "C", "D"],
    temp_C  = [12.3, missing, 14.1, 13.7]
)
obs
4×2 DataFrame
 Row │ station  temp_C
     │ String   Float64?
─────┼───────────────────
   1 │ A            12.3
   2 │ B         missing
   3 │ C            14.1
   4 │ D            13.7
# Drop rows with missing values
dropmissing(obs, :temp_C)
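To compute a statistic while keeping the gap in place, wrap the column in skipmissing:

```julia
using Statistics

# mean over the non-missing values only
mean(skipmissing(obs.temp_C))
```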
skipmissing is a wrapper that tells functions to ignore the gaps. Without it, mean would return missing because any operation involving missing propagates the unknown.
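DataFrames can be written to and read from CSV files with CSV.jl. A minimal round-trip, assuming the samples table from earlier in this section:

```julia
using CSV, DataFrames

CSV.write("samples.csv", samples)                  # table → file
samples_back = CSV.read("samples.csv", DataFrame)  # file → table
println(size(samples_back))
```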
2.4 Working with file paths
When your project has many data files, you need to build file paths correctly. Julia’s joinpath function handles this across operating systems (so you don’t have to worry about / vs \):
# Build a path
path = joinpath("data", "field_campaign", "gravity.txt")
println(path)
# Check if a file exists before reading
if isfile("samples.csv")
    println("samples.csv exists!")
else
    println("File not found")
end
samples.csv exists!
Tip: Organise your data early
Create a data/ folder in your project from the start. Put raw data in data/raw/, processed data in data/processed/. This habit will save you from chaos later, especially when you have 50 stations and 3 field campaigns.
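That layout takes two lines to set up (mkpath does nothing if the folders already exist):

```julia
# Create the recommended folder layout
mkpath(joinpath("data", "raw"))
mkpath(joinpath("data", "processed"))
```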
2.5 Searching, copying, and renaming files
In practice, geoscientific data rarely arrives in one tidy folder. You might receive a USB drive with hundreds of files scattered across nested directories, including raw measurements from field instruments, GPS logs, photos, metadata files, all mixed together. A common first task is: find all files of a certain type, rename them consistently, and copy them into a single clean folder.
Julia has everything you need for this built into the standard library, no extra packages required.
2.5.1 Listing files in a directory
# readdir returns the names of everything in a folder
for name in readdir(".")
    println(name)
end
This example is more useful when you run it in your own project than with frozen output here: the exact listing depends on whatever happens to be in the working directory at render time.
By default readdir gives you the names without the full path. Pass join = true to get complete paths:
for path in readdir("data/raw", join = true)
    println(path)
end
2.5.2 Searching recursively with walkdir
walkdir is the key function for searching through directories and all their subdirectories. It yields a tuple (root, dirs, files) at each level:
# Find every .csv file anywhere inside data/
for (root, dirs, files) in walkdir("data")
    for f in files
        if endswith(f, ".csv")
            full_path = joinpath(root, f)
            println(full_path)
        end
    end
end
You can wrap this into a reusable function:
"""
    find_files(dir, ext)

Recursively find all files with extension `ext` in `dir` and its subdirectories.
Returns a vector of full paths.
"""
function find_files(dir, ext)
    results = String[]
    for (root, dirs, files) in walkdir(dir)
        for f in files
            if endswith(f, ext)
                push!(results, joinpath(root, f))
            end
        end
    end
    return results
end
# Example: find all .dat files under a field campaign folder
dat_files = find_files("data/raw", ".dat")
println("Found $(length(dat_files)) .dat files")
2.5.3 Extracting parts of a file path
Julia has several functions for splitting a path into its components:
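The ones you will reach for most often are basename, dirname, and splitext, all in Base:

```julia
path = joinpath("data", "raw", "meas_001.dat")

println(basename(path))            # file name only: meas_001.dat
println(dirname(path))             # directory part (uses the OS path separator)
println(splitext(basename(path)))  # name and extension: ("meas_001", ".dat")
```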
Suppose you have collected EM data from multiple campaigns and the instrument saved files with unhelpful names like meas_001.dat. You want to copy them into a single processed/ folder, renaming each file to include the station name and date.
Here is a complete worked example. We first create a fake directory tree so you can run everything:
# --- Set up a fake directory tree ---
for station in ["EM01", "EM02", "EM03"]
    dir = joinpath("fake_survey", station)
    mkpath(dir)   # like mkdir -p: creates all intermediate directories
    for i in 1:3
        filepath = joinpath(dir, "meas_$(lpad(i, 3, '0')).dat")
        write(filepath, "fake data for $station measurement $i\n")
    end
end

println("Created fake survey tree:")
for (root, dirs, files) in walkdir("fake_survey")
    for f in files
        println("  ", joinpath(root, f))
    end
end
Now let’s search for all .dat files, rename them to include the station name, and copy them into a clean output folder:
# --- Find, rename, and copy ---
output_dir = "collected_data"
mkpath(output_dir)

for (root, dirs, files) in walkdir("fake_survey")
    for f in files
        if endswith(f, ".dat")
            # The station name is the immediate parent folder
            station = basename(root)
            # Build a new name: EM01_meas_001.dat
            new_name = station * "_" * f
            src = joinpath(root, f)
            dst = joinpath(output_dir, new_name)
            cp(src, dst, force = true)   # copy; force=true overwrites if exists
        end
    end
end

# Check the result
println("Files in $(output_dir):")
for f in sort(readdir(output_dir))
    println("  ", f)
end
2.6 Specialised geoscientific formats
Geoscience uses several specialised binary formats to store large, multidimensional datasets efficiently. Julia has packages for all the common ones. Here is a brief overview; detailed usage will be added in future updates of this chapter.
2.6.1 NetCDF
NetCDF (Network Common Data Form) is the standard format for climate, ocean, and atmospheric data. Use the NCDatasets.jl package:
using NCDatasets

# Read a NetCDF file
ds = NCDataset("temperature.nc")
temp = ds["temperature"][:, :, :]   # read the full 3D array
lat = ds["latitude"][:]
lon = ds["longitude"][:]
close(ds)

# Or use the do-block pattern (auto-closes)
NCDataset("temperature.nc") do ds
    temp = ds["temperature"][:, :, :]
    println("Shape: ", size(temp))
end
# Write a NetCDF file
NCDataset("output.nc", "c") do ds
    defDim(ds, "x", 10)
    defDim(ds, "y", 20)
    v = defVar(ds, "elevation", Float64, ("x", "y"))
    v[:, :] = rand(10, 20)
end
2.6.2 HDF5
HDF5 (Hierarchical Data Format) is widely used in geophysics, remote sensing, and seismology. Use HDF5.jl:
using HDF5

# Write
h5open("model.h5", "w") do f
    f["resistivity"] = rand(100, 100, 50)
    f["depth"] = collect(0.0:0.5:24.5)
end

# Read
h5open("model.h5", "r") do f
    rho = read(f, "resistivity")
    depth = read(f, "depth")
    println("Model size: ", size(rho))
end
2.6.3 SEG-Y
SEG-Y is the seismic industry standard. Use SegyIO.jl:
using SegyIO

# Read a SEG-Y file
block = segy_read("seismic_line.segy")
traces = block.data           # matrix of traces
headers = block.traceheaders  # trace header information
2.6.4 Shapefiles and GeoJSON
For vector geospatial data (points, polygons, map boundaries), use Shapefile.jl or GeoJSON.jl:
using Shapefile

table = Shapefile.Table("geology_map.shp")

# Access as a DataFrame
using DataFrames
df = DataFrame(table)
Warning: These examples are not executed
The geoscientific format examples above are not run when this chapter is built (eval: false). They show the code patterns but require actual data files to execute. When you have your own NetCDF, HDF5, or SEG-Y files, copy the pattern and replace the file name.
Important: Want more formats or deeper coverage?
This section will grow as the book develops. If there is a specific geoscience format or workflow you need (GRIB, GeoTIFF, ASDF, miniSEED, EDI, ModEM, or anything else) please open an issue on GitHub and describe your use case. Your request will help us prioritise what to add next.
2.7 Clean up
Let’s remove the temporary files we created in this chapter:
for f in ["stations.txt", "gravity.txt", "gravity_corrected.txt", "samples.csv"]
    isfile(f) && rm(f)
end
for d in ["fake_survey", "collected_data"]
    isdir(d) && rm(d, recursive = true)
end
println("Cleaned up.")