How to format code for publication
I review a lot of geoscience papers that do something computational. Here are some suggestions for writing a paper that has a strong computational component.
It has become common across geoscience, and much of the rest of science, to write papers that include some code. This could be as simple as a few lines of Python in a script, a notebook that produces all the visualizations, or a full library that processes data, trains a neural network, or runs a numerical model. Across the papers I review, there is wide variability in the quality of the code attached to a paper. In this newsletter I offer practical advice on how to write code that will not get you hung up in review, so that reviewers can focus on the science. I use Python examples, but the advice applies to any programming language you use to do science.
Code should work
Code attached to a paper should run out of the box. The code represents a result of your work; if reviewers (and subsequently readers) cannot even run the code attached to your paper, you have already failed to get buy-in for your ideas. In some cases I have even wondered whether authors were attempting to hide a result by attaching code that does not work out of the box. Usually these problems are simple to solve. I suspect they occur because the authors wrote their code in one location and then copied and pasted it into another while preparing a submission for a journal that requires a code contribution.
Here are some (anonymized) examples from various papers I have reviewed:
Using functions that don’t exist
import numpy as np
x = np.random.random(200)
x.sums()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 x.sums()
AttributeError: 'numpy.ndarray' object has no attribute 'sums'
A function is used that does not exist: in this particular case, a NumPy array has the method sum() but not sums(). This happens when authors write code but never run it themselves before submitting to a journal.
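The fix is a one-character change. A corrected version of the snippet:

```python
import numpy as np

x = np.random.random(200)  # 200 uniform samples in [0, 1)
total = x.sum()            # correct method name: sum(), not sums()
```

Running your code top to bottom one final time before submission catches this entire class of error.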
Declaring variables without assignment
x = # typical values are between 5 and 10
In [4]: x = # typical values are between 5 and 10
Cell In[4], line 1
x = # typical values are between 5 and 10
^
SyntaxError: invalid syntax
An assignment operator is used without anything actually being assigned, and an inline comment prompts the reader to supply a value themselves. This is a poor design pattern: it asks the reader of the code to reconstruct what the scientist was thinking when the code was written. It also prevents reproducibility, since we have no record of the values the authors used in their original study. Finally, this code will not run at all; it produces a syntax error.
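The remedy is to assign the value you actually used and keep the range as context in the comment. A minimal sketch (the value 7 here is illustrative, not the original authors' choice, which is unrecoverable from the submitted code):

```python
# Record the value actually used in the study; keep the valid range as context.
# The value 7 is illustrative here, not the original authors' choice.
x = 7  # typical values are between 5 and 10
```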
Using imports that have not been imported
x = np.random.random(200)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[1], line 1
----> 1 x = np.random.random(200)
NameError: name 'np' is not defined
In this example, the submitted code uses a library that was never imported. The missing line is:
import numpy as np
Code should be documented
When writing code, especially functions, document it. Comments are especially helpful for reviewers and readers who may be unfamiliar with the particular data-processing procedure but are familiar with the physics or geology being investigated in the paper. Your goal is to win acceptance of your ideas from other scientists, and well-documented code helps you do this by making your code readily interpretable.
Python, like many other languages, lets you write docstrings. Docstrings can be converted automatically into documentation when building a larger library using tools like Sphinx or MkDocs, or they can simply describe what the code is doing in a script or notebook environment (you can also document code using Markdown cells in a Jupyter notebook).
Below is an example of a docstring that explains a simple function to calculate a spectrogram using PyTorch's torchaudio. It also includes inline comments.
import torch
import torchaudio

def calculate_spectrogram(waveform, sample_rate, win_length=None, hop_length=None):
    """
    Calculate a spectrogram using PyTorch's torchaudio.

    Parameters:
        waveform (torch.Tensor or array-like): The raw audio waveform.
        sample_rate (int): The sample rate of the audio waveform.
        win_length (int or None): Window size (default: None, uses n_fft).
        hop_length (int or None): Hop length between STFT windows
            (default: None, uses win_length // 2).

    Returns:
        torch.Tensor: Spectrogram tensor.
    """
    # Convert the data to a PyTorch tensor if it is not one already
    waveform = torch.as_tensor(waveform)

    # Estimate n_fft from the sample rate
    n_fft = int(sample_rate // 4)

    # Default win_length and hop_length if not provided
    if win_length is None:
        win_length = n_fft
    if hop_length is None:
        hop_length = win_length // 2  # 50% overlap

    # Create the Spectrogram transform
    transform = torchaudio.transforms.Spectrogram(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        power=2.0,  # power=2 gives a power spectrogram
    )

    # Apply the transform to the waveform and return the tensor
    return transform(waveform)
Avoid arbitrarily long notebooks
Often in science you write code to explore data, to probe a concept with a new model, or to build a bespoke data pipeline that may only be used once. Across all of these activities you may simply not know what you will do next, because you are testing and exploring ideas against your mental model of how your system of study evolves in time and space. In practice, this often produces monolithic notebooks that contain everything: data processing, function and class declarations, and so on. I know this because I have done it myself! And in some cases it makes sense to have an ever-growing notebook that is developed quite differently from the way one would develop software.
Scientific coding is different from software development (a topic that deserves its own post!). I like to think of science as trying to hold onto a hose with a lot of water running through it. The hose represents your ability to explore the ideas and concepts that interest you, and the chaotic behavior at the end of the hose, where the water comes out, represents the high entropy of ideas that you are trying to channel down into a publon. Your goal as a scientist is to control the hose, wash away the dirt and grime from your ideas, and find a gem. From my point of view, this metaphor maps directly onto building a monolithic Jupyter notebook during scientific exploration.
The solution is to spend some time developing a module that sits separately from your notebook. This module should contain all of the functions and classes developed for data analysis. You can then produce a separate, new notebook that presents the exact results associated with your paper. A good example comes from the paper Model scale versus domain knowledge in statistical forecasting of chaotic systems and its associated GitHub repo. Which leads to my next point.
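A minimal sketch of the pattern (the file and function names here are hypothetical): the analysis code lives in a small importable module, and the notebook that accompanies the paper only imports and calls it.

```python
import numpy as np

# --- mypaper_utils.py: a hypothetical module holding the analysis code ---
def detrend(series):
    """Remove the best-fit linear trend from a 1-D series."""
    x = np.arange(len(series))
    slope, intercept = np.polyfit(x, series, 1)
    return series - (slope * x + intercept)

# --- in the paper's results notebook, all that remains is: ---
# from mypaper_utils import detrend
y = np.linspace(0.0, 9.0, 10) + 2.0  # a pure linear trend plus an offset
resid = detrend(y)                   # residuals are ~0 for purely linear data
```

The notebook stays short and readable, and the module can be tested and versioned on its own.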
Use GitHub (or something like it)
I prefer GitHub and use it across all of my projects. It has tools that support scientific communication (e.g., notebook rendering in the browser) and lets you create webpages associated with your work for free (e.g., a documentation website for the Python library you have developed). It integrates with Zenodo, so you can attach code directly from your repo to a paper, as many journals require, while still developing your code separately from the paper you are writing. Git and GitHub are also professional skills necessary for most computational jobs you might take after completing your PhD. GitHub lets readers and reviewers explore the commit history of the code, assuming the authors developed it there, in the event that the code fails. I cannot recommend strongly enough that scientists learn to use Git and GitHub for their science. If you have objections, I would be happy to hear an alternative (e.g., GitLab), but I have yet to see a system that offers scientists so much, essentially for free, to track progress, advertise results, build webpages, and more.
Conclusion
This is not an exhaustive list; there are many things I could add to it. But if you write code that works and is well documented, you will already be at the top of the stack of papers that I want to review. Bad coding practices in science make it harder for others to accept your ideas: they stand between you and your audience by limiting your ability to communicate. That is a travesty for a scientist, since a central goal of the day-to-day job is to win the scientific community's acceptance of your ideas about how the universe works. Happy coding!