Team included: system architects, senior software developers, QA engineers, a project manager, and a UI/UX designer
SivinTech Group is a software development company with a strong team of senior experts, serving a global clientele with innovative, cost-effective solutions across different industry verticals. We take you from potential to performance, helping you develop your business with planning, architecture, design, development, QA and customer support.
For its client Civic Journals, SivinTech has built a big data solution for integrating public statistics sources and increasing information transparency.
Growing attention to informational freedom creates new possibilities for open data initiatives all over the US. One such non-profit group was able to create an idea, develop it further and find support within state government structures to plan and start development of a product aimed at integrating data generated by many specialized institutions. The product aims to create rich datasets useful to everyone from interested individuals to businesses, which can rely on this data for estimation, planning and decision making.
It turned out that a lot of data is generated over time in various sources, and more and more of it becomes public as a result of various initiatives and new data transparency laws. Such information is usually scattered, hard to find and even harder to interpret. Civic Journals carefully collects this information, processes, groups and normalizes it, and makes it easily available to others. There is a lot of data, and along with processing speed, the solution should also be:
- Durable. The non-profit organization wanted to minimize support effort: it did not want to maintain an army of DevOps and support engineers handling performance spikes and occasional failures.
- Verified. A Continuous Integration approach should be used to constantly verify the implementation by running unit, functional and nightly tests that fully cover the current implementation. A bug found during manual testing should be an exceptional event.
- Monitored. Collecting and monitoring the necessary system statistics should be easy to do when needed.
- Scalable. New data is always coming and new datasets get added, so a certain level of expected scalability should be built into the initial system design.
- Cost-effective. While providing reasonable dataset processing latency, the system should be able to support dataset size growth without seriously increasing infrastructure maintenance costs.
- Flexible. Only a few data-providing organizations are willing to do the additional work of pushing data into the system. Civic Journals should be ready to do most of the heavy lifting by finding, downloading, parsing and processing data in arbitrary formats. Currently there are plenty of different sources, many requiring unique behavior to get the data prepared.
Given the data volumes, the required tooling and the scalability requirements, the Amazon AWS cloud platform was the most appropriate choice for building the project. Amazon provides a rich set of solutions for every need of a high-load project. Moreover, Amazon takes over many maintenance routines, which, given a proper setup and a smart architecture, enables one-click scalability within reasonable margins.
Primary modules of the system are:
- Dataset fetchers. Integrated into a single workflow, many different pluggable modules are available to connect external data providers to the system. Some providers are ready to push their data directly into the system via an API, some publish machine-friendly files on their file servers, and some just upload HTML pages or PDF attachments to their websites. A flexible worker infrastructure handles initial data acquisition.
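The pluggable-fetcher idea can be sketched as a small base class plus one subclass per provider; the class and method names below are illustrative, not the actual CJ interfaces:

```python
from abc import ABC, abstractmethod

class DatasetFetcher(ABC):
    """One pluggable fetcher per external data provider (hypothetical API)."""

    @abstractmethod
    def fetch(self) -> bytes:
        """Return the raw provider payload for downstream processors."""

class FileServerFetcher(DatasetFetcher):
    """Fetcher for providers that publish machine-friendly files."""

    def __init__(self, url: str):
        self.url = url

    def fetch(self) -> bytes:
        # A real implementation would download self.url; stubbed here.
        return b"raw,csv,payload"

def run_fetchers(fetchers):
    """Workers iterate over registered fetchers and collect raw payloads."""
    return {type(f).__name__: f.fetch() for f in fetchers}

raw = run_fetchers([FileServerFetcher("https://example.org/stats.csv")])
```

Each new provider then only requires registering one more subclass, which keeps the acquisition workflow itself unchanged.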
- Dataset processors. Another kind of pluggable worker processes the raw data created in the previous step: it parses documents, extracts meaningful numbers and cooks standard CJ datasets. This could be a simple solution, but aggregation scenarios are often non-trivial, since many CJ datasets are a product of one initial input batch that may pass through several computational and grouping steps. Adding the data volume requirements, a flexible MapReduce solution had to be implemented: generally slower, but an ultimately scalable distributed computational approach. Standalone workers are plugged in at certain steps of this algorithm, creating a hybrid data cooking pipeline able to solve the task in a generic, easily adjustable manner.
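A minimal single-process sketch of the map-group-reduce shape described above (the records and the summing aggregation worker are made-up examples, not CJ's real logic):

```python
from collections import defaultdict

# Toy records: (region, metric value) pairs extracted from a parsed document.
records = [("north", 3), ("south", 5), ("north", 4), ("south", 1)]

def map_step(record):
    # Emit a (key, value) pair for grouping.
    region, value = record
    return (region, value)

def reduce_step(pairs):
    # Group values by key, then apply the pluggable aggregation worker.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Aggregation worker plugged in at this step: here a plain sum.
    return {key: sum(values) for key, values in grouped.items()}

dataset = reduce_step(map(map_step, records))
print(dataset)  # {'north': 7, 'south': 6}
```

In the distributed version, the map and reduce steps run on separate workers over partitioned data, and several such passes can be chained for multi-step grouping.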
- Hot Preview. MapReduce solutions come at a cost: while providing high throughput, they hurt latency, so newly available datasets are sometimes ready only after a considerable delay. To level off this constraint, what is usually referred to as a λ-architecture was applied: once new data comes into the system, the processing workflow splits into two branches. One branch runs the normal MapReduce flow, while the other creates a preview of the result. This preview does not guarantee absolute accuracy, introducing some error margin or a somewhat incomplete representation, but it gives a sneak peek at the end result and is eventually replaced with the 100%-accurate results of the full flow.
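The two-branch idea can be illustrated with a toy average computation; sampling is just one assumed way to build a fast approximate preview, not necessarily the one CJ uses:

```python
import random

def batch_layer(values):
    """Full flow: exact result, but assumed to take a long time at scale."""
    return sum(values) / len(values)

def speed_layer(values, sample_size=100):
    """Hot preview: estimate from a random sample, available almost at once."""
    sample = random.sample(values, min(sample_size, len(values)))
    return sum(sample) / len(sample)

values = list(range(10_000))
preview = speed_layer(values)  # approximate, shown to users immediately
exact = batch_layer(values)    # later replaces the preview
```

Consumers see `preview` first and the system swaps in `exact` once the full MapReduce flow completes.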
- Data storage. A few different database solutions host the data. The system transparently switches and moves data between hot and cold storage and a few caches, optimizing access times along with storage costs.
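The hot/cold tiering behavior can be sketched with two in-memory stores and a TTL-based demotion pass; the TTL value and class shape are assumptions for illustration only:

```python
import time

HOT_TTL_SECONDS = 7 * 24 * 3600  # assumed retention window before demotion

class TieredStore:
    """Illustrative sketch: hot dict for recent data, cold dict for the rest."""

    def __init__(self):
        self.hot = {}   # key -> (value, stored_at)
        self.cold = {}  # key -> value

    def put(self, key, value):
        self.hot[key] = (value, time.time())

    def get(self, key):
        # Reads prefer hot storage and fall back to cold transparently.
        if key in self.hot:
            return self.hot[key][0]
        return self.cold.get(key)

    def demote_stale(self, now=None):
        # Move entries older than the TTL from hot to cold storage.
        now = time.time() if now is None else now
        stale = [k for k, (_, ts) in self.hot.items() if now - ts > HOT_TTL_SECONDS]
        for key in stale:
            value, _ = self.hot.pop(key)
            self.cold[key] = value
```

Callers never see the tier boundary: `get` answers from whichever store currently holds the key, which is the property that lets storage costs be tuned without API changes.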
After the project engineering and acceptance phases were complete, the solution was ready and went public. In the following months many data sources were connected, proving that our implementation works and matches the initial requirements. It is often said that information is power, and in this case the customer's initiative group managed to use this power to make the world a slightly better place, increasing transparency and making specialized knowledge available to larger groups of people.