Build vs buy: what nobody tells you about running your own procurement data pipeline
- Ian Makgill
- Guides
- 28 May, 2026
- 09 Mins read
If you're thinking about gathering government contract opportunities, the most pressing question is "how hard can it be to just scrape this ourselves?"
It's the right question to ask, but your context matters. For many people, building your own procurement data feed is the right answer; for others, buying is a much better option.
This piece will help you answer the question "when should I build my own procurement data feed?" We'll explain when that's a good idea and when it makes sense to buy a service like the Open Opportunities API. We talk about the build phase and what it actually takes to keep a procurement pipeline working over time.
Easy to do badly, harder to do well
Building a basic version of this is genuinely not difficult. If you have access to Claude and an AWS account, you can write scrapers and a cron job that will populate a nice little database. A bit more time with Claude and you can put a search interface on top. Add some microservices and an LLM and you can filter and route results to Teams or Slack. Ten days and a few tokens will give you a working beta application for tender alerts, one that will probably outperform many aggregation services.
There are broadly two issues that govern your decision: complexity and reliability.
If you only care about a small number of publishers, let's say six or seven, and those publishers have well-structured data feeds with uniform data (same language, same currency), then you're looking at a very simple system. That's a definite case for build, not buy.
The other factor is reliability. If you're working on something temporary, or something that can run on reactive maintenance, then again the best option is to build your own. So if you just need a sense of the landscape rather than a live feed of every opportunity, or you're OK with one of your feeds being broken for a few days, then building is the right option.
The build-vs-buy conversation gets harder when these conditions change.
What changes when coverage matters
If you need to gather data from more than a few sources, or you need to rely on that data being right, it becomes harder to justify building your own data gathering service. Once a project becomes even slightly more complex, you are likely to underestimate the work of keeping the service running to a decent standard.
Here is where most build-it-yourself projects underestimate the work.
Ingesting a source is a one-off engineering task. Maintaining a source is a perpetual operational commitment, and the two are not comparable in scale.
Even small projects can get big, quickly
The fundamental problem is that you do not control any of your inputs. Publishers change their formats without warning. Portals get retired. New portals appear. Authentication mechanisms change. Schemas drift: a field that was always populated starts coming through empty, or a date format flips from ISO to a local convention, or a category code system gets revised. None of these changes generates an error message in your pipeline. They generate wrong data, or partial data, or no data at all from a source that yesterday was fine.
Worse, these failures are silent. A scraper that should be returning five hundred notices a week and starts returning fifty doesn't crash. It just under-reports. From the outside, your pipeline looks healthy. The dashboard is green. The records are flowing. But somewhere upstream, a publisher changed their pagination, or added a CAPTCHA, or restructured their results page, and you are now missing ninety percent of what you should be getting. You will probably find out when a user asks why a particular opportunity didn't appear, and by then the gap has been growing for weeks.
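To make the silent under-reporting problem concrete, here is a minimal sketch of a volume check that compares each source's latest weekly count against its own recent history. The function name, the source names, the sample counts and the 50% threshold are all assumptions made for illustration; this is not a description of how any particular service does it.

```python
from statistics import median

def volume_alerts(weekly_counts, latest, min_ratio=0.5):
    """Return sources whose latest weekly count fell below min_ratio of their baseline.

    weekly_counts: dict of source name -> list of recent weekly notice counts
    latest: dict of source name -> notice count for the current week
    """
    alerts = []
    for source, history in weekly_counts.items():
        if not history:
            continue  # no baseline yet, nothing to compare against
        baseline = median(history)
        current = latest.get(source, 0)  # a source that returned nothing counts as zero
        if baseline > 0 and current < baseline * min_ratio:
            alerts.append((source, current, baseline))
    return alerts

# Hypothetical sources and counts, for illustration only.
history = {
    "portal_a": [510, 495, 480, 520],
    "portal_b": [40, 38, 45, 41],
}
this_week = {"portal_a": 50, "portal_b": 39}

for source, current, baseline in volume_alerts(history, this_week):
    print(f"{source}: {current} notices this week, baseline {baseline}")
```

With this sample data only portal_a is flagged. A median baseline is used so that one unusual week does not distort the comparison, but seasonal publishers would need a smarter baseline than this.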
Here's the nub of the problem: no one can be certain that they've gathered every opportunity, but if you're running your own service you'll be expected to be 99.9% right 99% of the time, and when you're wrong you'll be expected to know it before your user does.
That's because people are relying on your data. The sales team, the bid writers, the strategy function and the leadership making decisions about which markets to enter do not care that you captured ninety-nine percent of notices. They care about the one you missed, particularly if it was the one worth pursuing.
Don't misunderstand us: we're not going to tell you that we don't miss opportunities. We absolutely do. But because we've spent years on comprehensive data harvesting, field-level data validation and daily monitoring of all of our scrapers, we have a better chance of spotting failures than a tool made from a couple of Claude scripts.
So the real work is not the ingestion engine. It is the operational capability that sits around it: the systems that know what each source should be producing, so they can detect when something is wrong; the validation that catches schema changes before users do; the monitoring that flags when a publisher's volumes drop unexpectedly; the team that picks up new portals as they appear and absorbs them; the field-level checks that catch silent corruption. This is the part of the work that has no end state. It runs forever, because the ground is constantly shifting under our feet.
We employ people whose full-time job is to maintain this capability. That is the part of the cost that is invisible from the outside, and it is the part that determines whether the pipeline you built in 2026 still works in 2029.
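One of the field-level checks described above can be sketched in a few lines: measure how often each field is actually populated in a batch, and compare that with what the source normally delivers. The field names, expected fill rates and tolerance below are hypothetical, chosen only to show the shape of the check.

```python
def fill_rates(records, fields):
    """Share of records in which each field has a non-empty value."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, "", []))
        rates[field] = filled / total if total else 0.0
    return rates

def drifted_fields(expected, observed, tolerance=0.2):
    """Fields whose fill rate dropped more than `tolerance` below the expected rate."""
    return sorted(f for f, rate in expected.items()
                  if observed.get(f, 0.0) < rate - tolerance)

# Hypothetical expectations for one source, learned from earlier batches.
expected = {"title": 1.0, "deadline": 0.95, "value": 0.6}
batch = [
    {"title": "Road resurfacing", "deadline": "", "value": 120000},
    {"title": "IT support", "deadline": None, "value": None},
    {"title": "School catering", "deadline": "2026-07-01", "value": 45000},
]
observed = fill_rates(batch, expected.keys())
print(drifted_fields(expected, observed))
```

Here the deadline field is flagged because it is populated in only one record out of three. Nothing crashed and every record still arrived, which is exactly why this kind of failure goes unnoticed without a check.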
The standardisation dividend
Set the maintenance question aside for a moment, because there's a second layer of work that build-it-yourself teams often underestimate.
Procurement data, as it comes out of source systems, is inconsistent. Country names come through as "UK", "United Kingdom", "GB", "Great Britain" and "U.K." Currencies are sometimes labelled, sometimes implied. Timezones are usually missing. Dates arrive in half a dozen formats. The same notice might appear in two systems with different IDs and slightly different metadata.
Each of these inconsistencies is a small problem. Together they make the data difficult to query, difficult to aggregate, and difficult to trust. Standardisation is not glamorous work, but it is what makes everything downstream tractable. Clean, predictable categories mean you can search and filter by your chosen industry. Properly resolved country codes mean geographic analysis works. Currency conversion captured at point of collection means historical values are comparable and searchable. The same goes for language, dates, timezones and buyer names.
If you are building this yourself, and unless standardising data is your core business, this is the work that gets deferred. Not because you're not capable, but because it is hard to do at scale and hard to test.
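As a small illustration of what this standardisation looks like in code, the sketch below maps free-text country labels to ISO 3166-1 alpha-2 codes and parses a handful of date formats into ISO 8601. The alias table and the list of formats are assumptions and deliberately tiny; real sources need far longer tables and per-source rules.

```python
from datetime import datetime

# Hypothetical alias table; a real one would be much longer and multilingual.
COUNTRY_ALIASES = {
    "uk": "GB", "u.k.": "GB", "united kingdom": "GB",
    "gb": "GB", "great britain": "GB",
    "germany": "DE", "deutschland": "DE", "de": "DE",
}

# Formats are tried in order. Ambiguous dates such as 03/04/2026 are read
# day-first here, which is an assumption that must be checked per source.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%d %B %Y")

def normalise_country(raw):
    """Map a free-text country label to an ISO 3166-1 alpha-2 code, or None if unknown."""
    return COUNTRY_ALIASES.get(raw.strip().lower())

def normalise_date(raw):
    """Return the date as an ISO 8601 string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

for label in ["UK", "United Kingdom", "GB", "Great Britain", "U.K."]:
    print(label, "->", normalise_country(label))

for raw in ["2026-05-28", "28/05/2026", "28.05.2026", "28 May 2026"]:
    print(raw, "->", normalise_date(raw))
```

Returning None for anything unrecognised, rather than guessing, keeps bad values visible so they can be counted and fixed instead of quietly polluting the data.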
The intelligence layer you can't reasonably build
There is one more layer of work that we think is worth being explicit about, because we don't think most teams could reasonably replicate it even if they wanted to.
Publishers classify their notices using the Common Procurement Vocabulary, a hierarchical category system. In theory, this is great. In practice, the codes publishers apply are often missing, often too broad to be useful, and often simply wrong. A digital transformation contract gets coded as "office equipment". A complex consultancy engagement gets coded as a single top-level code that tells you almost nothing about what's actually being procured.
We address this by running an augmentation layer that re-classifies notices based on their content and assigns each augmented code a relevance score. This works across languages: a tender published in German, French or Italian gets the same treatment as one published in English. It also works across borders, which means you can run a query for, say, cybersecurity-related opportunities and get a coherent result set whether the underlying notices are from the UK, Germany or Poland.
If you wanted to build this yourself, you would need a multilingual classification pipeline, a training corpus, a scoring methodology, and a way of keeping it tuned as the procurement vocabulary evolves. It is, in our view, the single largest piece of value-add in the platform, and the one that takes the longest to build from a standing start.
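To show the shape of the output only, here is a deliberately naive sketch of content-based re-classification with a relevance score. It is a toy keyword matcher, not our augmentation layer and not a multilingual classifier: the category labels are illustrative rather than real CPV codes, the keyword lists are invented, and it only handles English text.

```python
# Hypothetical keyword lists per category; labels are illustrative, not real CPV codes.
CATEGORY_KEYWORDS = {
    "cybersecurity": {"cybersecurity", "penetration", "firewall", "soc", "encryption"},
    "office-equipment": {"printer", "desk", "stationery", "photocopier"},
}

def augment(text, min_score=0.2):
    """Return (category, relevance) pairs, highest relevance first.

    Relevance is the share of a category's keywords found in the text.
    """
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    scored = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        score = len(words & keywords) / len(keywords)
        if score >= min_score:
            scored.append((category, round(score, 2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

notice = "Provision of penetration testing, firewall management and encryption services."
print(augment(notice))
```

The gap between this and something usable is the point: keyword lists do not survive translation, synonyms or vague tender titles, which is why the real thing needs a training corpus and ongoing tuning.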
Then there's the document problem
One final thing. Finding a notice is the easy part. Getting the actual specification documents, the PDFs, the bid packs, the qualification questionnaires, can be where the real friction lives. Many government portals are greedy for your information and getting to documents can feel like a digital assault course.
If you're building your own data feed, you can deliver someone a notice within minutes of publication and they still can't get to work on it, because the spec is locked behind a registration flow on a portal they've never used. Building reliable, automatic gathering of attachments across multiple publishing systems is, in our experience, more work than the original ingestion, which is why we offer it as a separate, managed capability.
What buying actually gets you
Here's what you should expect if you decide to pay for a procurement data API service.
- Coverage you can rely on. Not "we've got everything"; nobody can honestly promise that. What you should get is a service that knows what each source ought to be producing, with a robust approach to validation and proactive fixes that give you confidence the product is working as it should.
- New sources included. A good service is a growing service, and when the database grows you should benefit. You should be able to see that coverage is growing, and you should get the new data without having to pay more.
- Standards as standard. You should get standardised, queryable data. Country codes that meet the ISO standard. Currency converted at the point of collection so historical values mean something. Categories that work the same way whether the underlying notice is in English, German or Polish. None of this is glamorous but all of it matters.
- Search that works. A query for cybersecurity opportunities should return cybersecurity opportunities, whether the underlying notice is in the UK, France or Italy, and whether the publisher coded it correctly or not. Search really matters.
- Documents that help. You should be able to throw the documentation at Claude and have it write clean, accurate code, because the documents are good and contain the right information. Good documentation makes the job of maintaining your code so much easier.
- Predictable pricing. You should know what you're signing up for: not just the data, the standards and the infrastructure, but also a known price. It can be by use or by flat rate, but you should be able to say "if I run this service, this is going to be my bill". You have to speak to us about the exact price of our API, but you can see our pricing page here.
And you should get all of this without becoming an expert in procurement data publishing. That's the real exchange: you stop running scrapers, and you start running whatever it is your users actually need you to run. The routing logic, the workflow on top, the integrations with your CRM or bid management system, that's the bit where you add value. The ingestion and maintenance becomes someone else's problem.
A framework for deciding
So, here are the questions that matter:
- How many sources do you actually need to cover? If it's six or seven, then build. If it's twenty, the maths changes. If you don't know yet, it is safe to assume that the number will grow.
- How much does it cost you when you miss an opportunity? If the answer is "not much, we're using this for general market awareness", you have a lot of flexibility. If the answer is "a missed opportunity is a six-figure pipeline event", you need to think very carefully about whether you can guarantee comprehensive coverage on your own infrastructure.
- Do you have the operational appetite to maintain scrapers forever? Not just write them, but monitor them, fix them, replace them when portals retire, and absorb new ones when they appear? This is a real organisational commitment, and it's harder to fund internally than people expect, because it never produces a feature to demo.
- Do you need cross-border or cross-language analysis? If yes, the classification problem alone might make your build more complex than you anticipate.
- Do your users need the actual tender documents, or just the notices? If they need documents, factor in that this is a second, larger pipeline problem.
The honest answer for most teams is that they can build something that works for a few sources. Not only that: with good use of LLMs they can build something much better than a lot of commercial offerings.
Where the build approach struggles is in year two. The original engineers have moved on, the number of scrapers has doubled and those sources keep breaking. You're expected to write three new scrapers, and users are complaining that they can't rely on your service.
That uncertainty is the thing that buying solves. Not the writing of scrapers, but the ability to keep them working, validate them, and absorb the changes that will inevitably get thrown at you whilst you're doing your actual job.
If you'd like to see what that looks like in practice, we'd be happy to give you access to the Open Opportunities API for a closer look. If you're not ready for the API, we've also written about how to buy a tender alert service.