Source control is an indispensable tool in software development. Being able to track changes in source files gives you freedom to make changes without worrying about being able to return to a stable state and allows you to review the history of changes to the software. The usefulness increases as more people are working on the same software product, as it’s important to coordinate between different developers and make sure that everyone is working on the same code base without accidentally interfering with one another. Source control systems help with these issues.
But source control systems don’t all work the same way. All systems have some idiosyncrasies that may make them difficult to work with at times, and some are better suited for some organizations than others. Even if one system is firmly established in your organization, it can be worthwhile from time to time to reevaluate your needs and see if switching to a different system might be advantageous.
Synergex went through this process recently. Different parts of the source base for our products had been split between three different source control systems: PVCS, Subversion (SVN), and Git. To consolidate and simplify our processes, we decided to switch everything to Git.
A lot could be written about a change like this, addressing the reasons for making such a change and the process of transferring the history of a source base from one control system to another. But in this article, I’ll instead address the challenges of changing source control paradigms for day-to-day use, trading the familiar advantages and limitations of one system for the unfamiliar. In particular, I’ll focus on some of the Git concepts and commands I learned while making this transition.
Source control paradigms
The three source control systems previously used by Synergex represent vastly different approaches. The system I used most frequently was PVCS, which is a centralized system based around tracking the versions of individual files. Any source file in a local work area has a corresponding archive file on the PVCS server, which tracks the history of changes to the source file. Work files are typically read-only, to prevent accidental modification. You can lock a file if you want to modify it, which makes your local copy writable and prevents anyone else from locking the file or committing changes. After editing a file, you can commit the changes, which automatically releases the lock and allows others to edit the same file. Because the version control is file-based, you can easily check out individual files without cloning the entire source tree, or get an updated version of a file without affecting the rest of the work area.
SVN tracks changes to an entire source repository, rather than keeping individual version archives for individual files. But it still maintains a client/server model, so there’s only one master copy of the archive. Each developer only has a working copy of the source base, and all changes must go through the server. By contrast, Git also tracks entire repositories but uses a distributed model, where each working copy of a repository also contains a copy of the archive itself, with all the history of changes to the various files. A developer can commit a change to a local repository without it immediately affecting any other developer.
I have less experience with SVN and its model of source control than with either PVCS or Git. Since PVCS and Git show a greater contrast, I’ll focus on the differences between them.
Potential PVCS problems
The locking system in PVCS can lead to collaboration issues, as one developer may keep a lock on a file that someone else needs to edit. While it’s possible to release a lock on a modified file without committing or losing the changes, this can cause its own problems, as a file can become out of date, and committing the changes later on can inadvertently overwrite someone else’s changes.
Another difficulty that PVCS can cause occurs with grouping changes together. Adding a feature or fixing a bug can often require modifying multiple files. But while changes to multiple files can be committed at the same time, the versions of each file are tracked separately, which can make it hard to keep track of the changes for a given feature or fix. The typical approach is to use the same label for all of the files related to the change. So if I modify files A, B, and C to implement Tracker #1234, the files might be at versions 1.21, 1.25, and 2.1.3 respectively, but all of those versions can have a “TR#1234” label, making it easier to keep track of them later. But labels are optional, and a developer may forget to apply a label to one or more of the files modified for a given change. And since the files might be spread out over different directories, another developer might get updated versions of files A and C but overlook file B, resulting in unexpected behavior or a build error.
Our use of PVCS also made it difficult to keep track of incomplete code changes. While PVCS does have support for branching versions for files, the feature is complicated, and we rarely used it. Instead, we typically committed changes to the most recent version of a file. This meant that changes should be complete before being committed, or else they might make things difficult for someone else. This could be especially challenging when changes made on one platform failed to build or caused unexpected problems on another platform.
Git and distributed source control
The distributed nature of Git may be the biggest hurdle when transitioning from a system like PVCS. There is no longer a single server that tracks the history for files. Instead, each working copy of a Git repository contains a copy of the archive. So the working copy has direct access to the full history (or partial history, if it’s configured that way) of its files and is responsible for updating that history when the files change. This gives developers a lot more freedom than PVCS, as one developer working on a given source file will no longer prevent other developers from working on the same file.
Of course, this additional freedom brings its own challenges as well. For instance, it becomes more likely that two people will make contradictory changes to the same file. Fortunately, Git does a fairly good job at merging changes in files when merging different branches, but at times there are irreconcilable differences, and you’ll have to manually edit a file to get it right.
While Git is distributed and doesn’t rely on a central server the way PVCS and SVN do, it does have concepts that serve a similar function of enabling developers to get the latest versions of source files from each other and to share their own changes. A working copy of a repository can be cloned from and/or configured to track a remote Git repository, which might be hosted on the internet or on a local network. The local repository can fetch changes from the remote repository so that it knows if any branch has been updated. It can then merge changes from a branch on the remote repository into its own copy of that branch. A pull operation combines fetching and merging, allowing a local branch to be updated to match the remote branch.
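As a minimal sketch of the fetch, merge, and pull relationship, the following script builds a throwaway "remote" and two clones in a temporary directory. The names "seed", "alice", "bob", and "notes.txt" are made up for illustration.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Throwaway "remote" with one commit, cloned by two developers.
git init -q seed
git -C seed commit -q --allow-empty -m "Initial commit"
git clone -q --bare seed remote.git
git clone -q remote.git alice
git clone -q remote.git bob

# Alice shares a change.
echo "hello" > alice/notes.txt
git -C alice add notes.txt
git -C alice commit -q -m "Add notes"
git -C alice push -q origin HEAD

# Bob fetches: his remote-tracking branch moves, but his own branch does not.
cd bob
branch=$(git rev-parse --abbrev-ref HEAD)
git fetch -q origin
behind=$(git rev-list --count "HEAD..origin/$branch")

# Merging the remote-tracking branch brings his branch up to date.
# "git pull" performs the fetch and the merge in one step.
git merge -q "origin/$branch"
behind_after=$(git rev-list --count "HEAD..origin/$branch")
echo "behind before merge: $behind, after merge: $behind_after"
```

Running the fetch and the merge separately gives you a chance to look at the incoming commits before they touch your branch.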
Changes made locally can also be shared with the remote repository. In some cases, this will be done through a “pull request,” which sends the changes to a branch along with a request for someone to review the changes and (assuming the changes are accepted) pull them into the remote branch. Alternatively, the system could be set up so that a local copy can simply do a “push” to update the remote repository, without needing a pull request. Git requires all changes from the remote repository to be merged into the local branch before allowing a push.
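The push rule can be seen in a small experiment. This is only a sketch: the repositories live in a temporary directory, and the names "alice" and "bob" are hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

git init -q seed
git -C seed commit -q --allow-empty -m "Initial commit"
git clone -q --bare seed remote.git
git clone -q remote.git alice
git clone -q remote.git bob

# Alice pushes first.
echo "from alice" > alice/a.txt
git -C alice add a.txt
git -C alice commit -q -m "Alice's change"
git -C alice push -q origin HEAD

# Bob commits without having Alice's change, so his push is rejected.
echo "from bob" > bob/b.txt
git -C bob add b.txt
git -C bob commit -q -m "Bob's change"
if git -C bob push -q origin HEAD 2>/dev/null; then
    first_push=accepted
else
    first_push=rejected
fi

# After merging the remote changes into his branch, the push goes through.
git -C bob pull -q --no-rebase --no-edit
git -C bob push -q origin HEAD
echo "first push: $first_push"
```

The "--no-rebase" option asks the pull to merge explicitly, which avoids depending on how pull is configured on a given machine.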
Because working copies of Git repositories include copies of the archives themselves, they can end up taking up significantly more disk space than an equivalent work area using a centralized source control. There are ways to reduce this size, such as limiting the depth when cloning a repository, to reduce the amount of history that the repository stores. But I haven’t had much success with this. It seems that a shallow clone like this initially omits all branches except the current one. But pulling from the remote repository will bring in any other branches, increasing the local archive to nearly full size without bringing back the history of the current branch that was initially sacrificed. Unless I’m missing something (or Git’s functionality greatly improves in this area), I wouldn’t recommend shallow clones for general use. A better way to reduce the size of a working copy might be to use a sparse checkout. (More on that later.)
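For reference, a shallow clone looks like the sketch below. The "source" repository is a throwaway created just for the demonstration. Note that "--depth" implies "--single-branch" unless you say otherwise, which matches the behavior described above.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Throwaway repository with three commits to clone from.
git init -q source
for n in 1 2 3; do
    git -C source commit -q --allow-empty -m "Commit $n"
done

# --depth is ignored for plain local paths, hence the file:// URL.
git clone -q --depth 1 "file://$work/source" shallow

full_count=$(git -C source rev-list --count HEAD)
shallow_count=$(git -C shallow rev-list --count HEAD)
echo "full history: $full_count commits, shallow clone: $shallow_count"
```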
Git and commits
While the difference between centralized and distributed source control may be the most significant change when moving from a system like PVCS to Git, the way the systems store version information is nearly as important. PVCS is based around files. The server has archive files with names similar to the equivalent work files (typically differing only in the extension). The archive files are likely in a directory structure that reflects that of the work area, and each archive file contains all the version information for the equivalent work file. If a change is committed, the archive file will be updated and will assign a new number to the updated version, typically an increment of the previous number. Everything in PVCS is centered around files. While some commands can affect multiple files at once, they must act on the files themselves, rather than on the work area as a whole.
Git, on the other hand, is focused on the entire repository. It tracks versions through the use of commits, which represent the state of the repository at the time when certain changes were committed. As I understand it, a commit records a snapshot of the whole repository along with a pointer to the commit (or commits) it was based on, and Git works out the changes between two commits (lines added or deleted; files added, removed, or renamed; etc.) by comparing their snapshots. Unchanged files are shared between snapshots rather than stored again. A single commit can contain a large number of changes to multiple files. Because each commit identifies a complete snapshot, Git can recreate the repository in the state it was at any point in its history. Branches and tags point to specific commits. While it’s possible to find the version of an individual file at a given point in the repository’s history, this is not done by looking at the specific file’s history but by looking at the history of the repository as a whole.
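You can inspect what a commit contains with a few low-level commands. The following is a small sketch using a scratch repository; the file names are invented for the example.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

echo "one" > a.txt
git add a.txt
git commit -q -m "First"
echo "two" > b.txt
git add b.txt
git commit -q -m "Second"

# A commit object names a tree (the repository contents) and its parent commit.
git cat-file -p HEAD

# The second commit's tree lists both files, including the unchanged a.txt.
files_in_tree=$(git ls-tree --name-only HEAD | wc -l | tr -d ' ')

# The "change" is found by comparing the commit with its parent.
changed=$(git diff --name-only HEAD~1 HEAD)
echo "files in tree: $files_in_tree, changed by the commit: $changed"
```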
As a result, versions aren’t necessarily sequential. While you can apply your own sequential version numbers to commits using tags, the commits themselves are identified by a 40-character hexadecimal hash, e.g., “df7d3431b8c73e7b7dac13c4eb318cf70c1a4d07”. Fortunately, you don’t have to use the full hash to identify a commit; you can use a shorter prefix of the hash as long as it unambiguously identifies a single commit. If Git complains that a hash is ambiguous, you can add more characters to get to one that’s unambiguous. A hash as short as four characters can be valid, although this might cause collisions in a repository with a large number of commits. Different Git tools and interfaces use prefixes of eight to eleven characters (e.g., “df7d3431” or “df7d3431b8c”), so those should be safe.
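A quick way to see full and abbreviated hashes side by side is sketched below, using a scratch repository created only for this purpose.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git commit -q --allow-empty -m "First"

full=$(git rev-parse HEAD)
# Ask for at least eight characters; Git adds more if needed for uniqueness.
short=$(git rev-parse --short=8 HEAD)
# An unambiguous prefix resolves to the same commit as the full hash.
resolved=$(git rev-parse --verify "$short^{commit}")

echo "full:  $full"
echo "short: $short"
```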
A common task when using source control is to compare changes. Git allows you to compare files directly in the terminal using “git diff filename” or to register another program as a difftool to compare files (“git difftool filename”). These simplified commands work fine if you just want to compare a modified copy of a file with the latest archived version. You can also use the HEAD identifier (“git difftool HEAD -- filename”) to explicitly state that you want to compare the file against HEAD, the commit that the working area is currently pointing to. (Specifying a commit like HEAD is required if you want to compare a file that you’ve already staged to commit; otherwise, it won’t detect differences.)
But what if you wanted to compare a file with a much older version or to compare two older versions of a file? This would be trivial in PVCS. For example, a command like “vdiff -r-3 -r-2 filename” would compare the revision of a file three versions before the current one with the revision two versions before the current one. Git has a similar concept: “HEAD~1” means one commit behind HEAD, “HEAD~2” means two commits behind, etc. But because Git considers the repository as a whole rather than the effect on individual files, a given file may be completely unchanged in most commits. So something like “git diff HEAD~3 HEAD~2 -- filename” may not generate any output, because the file was unchanged between those two commits.
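The effect of staging on these comparisons can be sketched with a scratch repository (the file name "notes.txt" is just an example):

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "second line" >> notes.txt
git add notes.txt    # stage the modification

# Working file vs. staged copy: nothing to report once the change is staged.
unstaged=$(git diff -- notes.txt)
# Working file vs. HEAD: the staged change shows up.
against_head=$(git diff HEAD -- notes.txt)
# Staged copy vs. HEAD: another way to review exactly what will be committed.
staged=$(git diff --staged -- notes.txt)
echo "$against_head"
```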
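Here is a small sketch of that situation. It uses a scratch repository in which "target.txt" changes once and several later commits touch only "other.txt" (both names are hypothetical).

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

echo "v1" > target.txt
git add target.txt
git commit -q -m "Add target"
for n in 1 2 3; do
    echo "$n" >> other.txt
    git add other.txt
    git commit -q -m "Unrelated change $n"
done

# target.txt did not change between these two commits, so there is no output.
target_diff=$(git diff HEAD~2 HEAD~1 -- target.txt)
# The same pair of commits does show a change to other.txt.
other_diff=$(git diff HEAD~2 HEAD~1 -- other.txt)
# Only one of the four commits touched target.txt.
touching=$(git rev-list --count HEAD -- target.txt)
echo "commits that changed target.txt: $touching"
```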
To compare two versions of a file in Git, you need to identify the commits where those two versions existed. If you have known tags for the commits, they should work fine to identify the different versions of the files. Otherwise, you may want to run a command like this:
git --no-pager log --date=short --pretty=format:"%h - %cd - %cn - %s" -- filename
It may look complicated, but it’s just a log command to examine the history of a given file and to summarize the information so each commit will hopefully fit on a single line. (You may want to redirect the output if a file has had too many revisions.) You can examine the output and identify the commits you want to compare, and then copy the abbreviated hashes from the first column into a diff command, e.g., “git diff df7d3431b8c 318cf70c1a4 -- filename”. I ended up writing a script that does this automatically, based on a given number of file revisions before the present, essentially recreating the functionality from PVCS.
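One way such a helper could look is sketched below. This is not the script mentioned above, just an illustration of the idea; the function names "file_rev" and "diff_file_revs" and the demo repository are hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Print the hash of the Nth most recent commit that changed a file
# (0 = the latest commit that changed it).
file_rev() {
    git log --format=%H -- "$1" | sed -n "$(( $2 + 1 ))p"
}

# Compare two revisions of a file, each counted back from the latest
# commit that changed it: diff_file_revs FILE OLDER NEWER
diff_file_revs() {
    old=$(file_rev "$1" "$2")
    new=$(file_rev "$1" "$3")
    git diff "$old" "$new" -- "$1"
}

# Demo: three revisions of target.txt, interleaved with unrelated commits.
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
for n in 1 2 3; do
    echo "revision $n" > target.txt
    git add target.txt
    git commit -q -m "Revision $n of target"
    git commit -q --allow-empty -m "Unrelated commit $n"
done

# Roughly what "vdiff -r-2 -r-1 target.txt" would do in PVCS.
result=$(diff_file_revs target.txt 2 1)
echo "$result"
```

Counting only the commits that changed the file is what makes the numbers behave like per-file revisions instead of repository-wide commits.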
In PVCS, a file isn’t checked out into a local copy unless you explicitly check it out (individually or as part of a group, e.g., based on file extension). Conversely, Git will automatically check out all files in a repository unless you exclude them. You can exclude files from a working copy of a repository by creating and editing the .git/info/sparse-checkout file, adding patterns for the files you want to include or exclude, and enabling sparse checkout. Likewise, Git notices all files in the local work area, even if they aren’t part of the repository, and reports them as untracked. To ignore local files, create a .gitignore file in the root directory of the working area and add the files you want it to ignore. See the Git documentation for more information.
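A minimal sketch of both mechanisms follows, using the file-based sparse checkout setup described above. The directory and file names are invented, and the demo runs in a scratch repository.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

mkdir docs src
echo "manual" > docs/guide.txt
echo "code" > src/main.txt
git add docs src
git commit -q -m "Add docs and src"

# Sparse checkout: keep only docs/ in the working copy.
git config core.sparseCheckout true
mkdir -p .git/info
echo "docs/" > .git/info/sparse-checkout
git read-tree -mu HEAD    # reapply the checkout with the new rules

# .gitignore: keep local build output out of "git status".
echo "*.log" > .gitignore
echo "scratch" > build.log
ignored=$(git check-ignore build.log)
echo "ignored: $ignored"
```

Recent versions of Git also provide a "git sparse-checkout" command that manages the same settings for you.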
Changes in workflow
The workflow for implementing a code change in PVCS goes something like this:
- Get a writeable copy of the files to be modified. This has the side effect of making sure you have the latest copies of the files. You may want to lock the files at this point, but if the change can take a while to implement and someone else might need to modify the same files, it might be better to wait. On the other hand, someone else may have already locked the files, in which case you’ll have to wait or negotiate with the other developer to determine who needs to have the lock.
- Make the code changes and do any building and testing necessary to ensure everything works.
- Lock the files if you haven’t done so already.
- Compare the files with the archived versions. Make sure all changes are correct (and see if anyone else has made other changes to the archived files that need to be incorporated).
- Put (commit) your changes.
- After enough review and testing has occurred to confirm the change is ready for release, add a label to the committed versions of the files, adding them to the appropriate release branch.
For Git, many workflows are possible. But the following roughly approximates the steps above:
- Pull from the remote repository to make sure your local files are up to date (“git pull”).
- Make the code changes and do any building and testing necessary to make sure everything works.
- Add each of the modified files to the current change set, staging them for a commit (“git add filename”).
- Pull from the remote repository again in case there were any changes since last time. Before you can push a modified branch back to a remote repository, it must be up to date with the latest remote changes. You could commit your changes and then pull to merge in any remote changes, but this results in automatic commits that can make it hard to track the actual history of file changes. The history is easier to follow if you make sure your local branch is up to date before you commit anything.
- If there are any remote changes specifically to files you’ve modified, the pull will fail and you’ll be prompted to commit or stash your changes. You can stash the changes (“git stash push”) to save copies of all modified local files and revert the working copies to the current HEAD versions. Then you can retry the pull operation and restore the modified working copies (“git stash pop”). If a merge conflict occurred between two versions of a source file, Git will inform you and allow you to resolve it the same way you would resolve a merge conflict between two commits. Note that, by default, a stash pop restores your changes to the working files without restaging modified files, so you may need to stage them again (“git add filename”).
- Check the status of the local branch (“git status”) to make sure it’s up to date with the remote branch and the correct files are staged for commit. If there’s a staged file unrelated to the change you’re making, be sure to unstage it before committing (“git restore --staged filename”).
- Compare the staged files with the previous versions to make sure the changes are correct, e.g., “git diff HEAD filename” (or “git difftool HEAD filename”, if you have a difftool configured).
- Commit the changes (“git commit -m "Message"”) and push them to the remote branch (“git push”).
- After enough review and testing has occurred to confirm the change is ready for release, cherry-pick the commit to the appropriate release branch. For example,
  - In a working area tracking the desired release branch, pull from the remote branch.
  - Identify the hash for the commit where the changes occurred. If you don’t remember what it was, you can examine the log of the branch where the commit took place. For example, if you made the commit on a branch named “trunk,” you can run “git log origin/trunk” to see the history of commits specifically on that branch. You can even run “git log origin/trunk filename” to limit the results to commits that modified the specified file.
  - Cherry-pick the specified commit into the current branch (“git cherry-pick commit_hash”). This will bring over the changes to the files and create a new commit. If the separate branches are maintaining separate copies of the files with other differences, these should be unaffected. Only the parts of the files changed in the specified commit on the feature branch will be brought over to the release branch.
  - Review the changes and push to the remote repository.
  - If the change on the original branch required more than one commit, you can cherry-pick multiple commits at once. You may want to disable automatic commits in the cherry-pick operation (e.g., “git cherry-pick -n commit_hash_1 commit_hash_2”). Then you can review the cumulative changes and commit and push them if everything looks good.
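The Git steps above can be sketched end to end as a script. Everything here is a stand-in: the remote is a bare repository in a temporary directory, the branches are named "trunk" and "release" for illustration, and "app.txt" plays the role of the source file being changed.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Hypothetical setup: a remote with "trunk" and "release" branches.
git init -q seed
echo "version 1" > seed/app.txt
git -C seed add app.txt
git -C seed commit -q -m "Initial commit"
git -C seed branch -M trunk
git -C seed branch release
git clone -q --bare seed remote.git
git clone -q remote.git dev
cd dev

git pull -q                           # start from the latest trunk
echo "new feature" >> app.txt         # make (and test) the change
git add app.txt                       # stage it
git stash push -q                     # set the change aside...
git pull -q                           # ...bring in any remote changes...
git stash pop -q                      # ...and restore the change
git add app.txt                       # stage again after restoring
git status --short                    # check what is staged
git --no-pager diff HEAD -- app.txt   # review against the last commit
git commit -q -m "Add new feature"
git push -q

# Later: bring the change over to the release branch.
commit_hash=$(git rev-parse HEAD)
git checkout -q release               # local branch tracking origin/release
git pull -q
git cherry-pick "$commit_hash" >/dev/null
git push -q
```

The stash step is shown unconditionally to keep the sketch short. In practice you only need it when the pull refuses to run because of your local modifications.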
This workflow assumes that all development is being done on a single feature branch. A better approach could be to create separate branches for individual features. These branches can be short-lived; you can switch between them as necessary, and you can commit work in progress to them as you go. Once development on the feature branch is finished, you can merge it back into a long-running branch.
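A short-lived feature branch might look like the following sketch. The branch name "feature/tr1234" is hypothetical, and the demo runs in a scratch repository.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"
long_running=$(git rev-parse --abbrev-ref HEAD)

# Create the feature branch and commit work in progress as you go.
git checkout -q -b feature/tr1234
echo "work in progress" >> app.txt
git commit -q -a -m "Start feature"
echo "finished" >> app.txt
git commit -q -a -m "Finish feature"

# When the feature is done, merge it back and delete the branch.
git checkout -q "$long_running"
git merge -q feature/tr1234
git branch -q -d feature/tr1234
commit_count=$(git rev-list --count HEAD)
echo "commits on $long_running: $commit_count"
```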
Hints and hazards for Git newbies
Like other software, Git has its own peculiarities. If you don’t know what you’re doing, you could end up in trouble, or at least confused. I ran into a number of issues when I started using Git, some of which may have been exacerbated by my PVCS background. Here are a few things I noticed that might be helpful for others.
Don’t run Git commands as root
This is specifically an issue on Unix systems (including Linux). There are times when it may be helpful or necessary to use the root account as part of software development. In such times, you may be tempted to run a Git command to update the local repository or to check on its status or history. Don’t do it! Git commands—even innocuous ones like “git status”—can modify and change the ownership of files in the “.git” directory. If this happens as root, the files can become inaccessible for non-root users, causing subsequent commands to fail. If this occurs, you’ll need to go to the .git directory and use root privileges to change the ownership and group of all affected files (which may go several directories deep).
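If you suspect this has happened, a check along these lines can list the affected files. It is only a sketch: it runs against a scratch repository here, and the repair command is shown as a comment because it needs root privileges.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

# List anything under .git that is not owned by the current user.
foreign=$(find .git ! -user "$(id -u)")
if [ -n "$foreign" ]; then
    echo "Files under .git owned by someone else:"
    echo "$foreign"
    # A typical repair, run with root privileges (not executed here):
    #   sudo chown -R "$(id -u):$(id -g)" .git
else
    echo "All files under .git belong to the current user"
fi
```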
You can’t store empty directories in Git
If you need an empty directory (e.g., to store some sort of output file), either set up a process to create it after cloning a repository or add a dummy file so it will be created automatically.
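The dummy-file approach can be sketched as follows. The name ".gitkeep" is a common convention rather than anything Git treats specially, and the repositories are throwaways.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

git init -q source
mkdir source/output source/truly-empty
touch source/output/.gitkeep    # placeholder so the directory is stored
git -C source add .
git -C source commit -q -m "Add output directory placeholder"

# Only the directory containing a file survives the clone.
git clone -q source copy
ls -a copy
```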
Files archived in Git will retain their executable file attributes on Unix
If you want to include an executable shell script in a Git repository, set it to executable before you stage and commit it. If it was committed without being executable (e.g., it was created on Windows), you can change the attribute on a Unix system and commit it again, and it should be executable on any other Unix system that clones the repository. You can then edit the file and commit changes on Windows without affecting the attributes on Unix.
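The sketch below shows the mode that Git records before and after a script is made executable. The script name is hypothetical. The "git update-index --chmod=+x" command records the executable bit directly in Git, which is also a way to set it from a system whose file system has no such attribute.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

printf '#!/bin/sh\necho hello\n' > run.sh
git add run.sh
git commit -q -m "Add script (not yet executable)"
mode_before=$(git ls-files -s run.sh | cut -d' ' -f1)

# Mark the file executable on disk and in Git's index, then commit again.
chmod +x run.sh
git update-index --chmod=+x run.sh
git commit -q -m "Make script executable"
mode_after=$(git ls-files -s run.sh | cut -d' ' -f1)

echo "mode before: $mode_before, after: $mode_after"
```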
Working copies of files use the system’s native line endings
If you commit a file on Windows and check it out on Unix, the copy on Unix will have lines that end in LF. If you check out the same file on another Windows system, its lines will end in CR LF. (This assumes Git’s line-ending conversion is enabled, e.g., through the core.autocrlf setting, as it commonly is with Git for Windows.) If you need files to be exact binary copies (including line endings) you’ll need to treat them as binary files, which goes beyond the scope of this article.
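Without going into detail, line-ending rules can be set per file type in a .gitattributes file. The patterns below are illustrative assumptions, not settings taken from our repositories; "git check-attr" reports how Git would treat a given path.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

cat > .gitattributes <<'EOF'
# Always check out shell scripts with LF endings, even on Windows.
*.sh   text eol=lf
# Never convert line endings in these files.
*.dat  binary
EOF

# Ask Git how it would treat two (hypothetical) paths.
sh_eol=$(git check-attr eol -- build.sh)
dat_text=$(git check-attr text -- sample.dat)
echo "$sh_eol"
echo "$dat_text"
```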
Press “q” to leave the output of a Git command
When the output from a Git command is larger than will fit on a single terminal screen, Git automatically puts it through a pager program to let you view one page at a time. This can be helpful, but it can also be annoying, especially when you’re looking at a long history for a file and just need to see the most recent commits. It can be even more annoying on Windows, where the pager program doesn’t automatically exit when it reaches the end of the output it’s displaying and doesn’t respond to common escape sequences. Fortunately, pressing “q” will exit the program. You can also use the “--no-pager” option to bypass the pager program altogether (e.g., “git --no-pager log”). This may end up generating too much output, so consider piping it to another program to make it easier to handle.
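Two simple ways to keep the output manageable without the pager are sketched here, using a scratch repository with a handful of commits.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
for n in 1 2 3 4 5; do
    git commit -q --allow-empty -m "Commit $n"
done

# Limit the log itself to the two most recent commits, one line each.
recent=$(git --no-pager log --oneline -n 2)
echo "$recent"

# Or pipe the full output to another program.
latest=$(git --no-pager log --oneline | head -n 1)
```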
Conclusion
All types of software can have advantages and disadvantages. I think the benefits of a distributed source control system like Git outweigh the problems and made the switch worthwhile for us. But Git has a learning curve, and the transition has not been without its challenges. Hopefully this article has provided some clarity that makes it easier for you to adopt Git or a similar system, or to decide whether it’s the best option for your organization.