How statistics are calculated
We count how many offers each candidate received and at what salary. For example, if a Data QA developer with PyTest asking $4,500 received 10 offers, that candidate is counted 10 times. Candidates who received no offers are not included in the statistics at all.
Each column of the graph shows the total number of offers. This is not the number of vacancies but an indicator of demand: the more offers there are, the more companies are trying to hire such a specialist. The 5k+ bucket includes candidates with salaries >= $5,000 and < $5,500.
Median Salary Expectation – the weighted average of market offers in the selected specialization, that is, the most frequent job offers received by candidates in that specialization. We do not count accepted or rejected offers.
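As a rough illustration of the counting and bucketing described above (a sketch with made-up numbers, not our actual pipeline):

from statistics import median

# Hypothetical salary expectations, one entry per offer received
# (a candidate with 10 offers appears 10 times).
offer_salaries = [4500] * 10 + [5200] * 3 + [6100] * 2

# Median salary expectation across all offers.
print("Median:", median(offer_salaries))

# Bucket offers into $500 bands, e.g. the 5k+ band covers >= $5,000 and < $5,500.
buckets = {}
for salary in offer_salaries:
    band = (salary // 500) * 500
    buckets[band] = buckets.get(band, 0) + 1
print(buckets)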
Trending Data QA tech & tools in 2024
Data QA
What is Data Quality
A data quality analyst maintains an organisation's data so that the organisation can have confidence in its accuracy, completeness, consistency, trustworthiness, and availability. DQA teams are in charge of conducting audits, defining data quality standards, spotting outliers, and fixing flaws, and they play a key role at every stage of the data lifecycle. Without DQA work, strategic plans fail, operations go awry, customers leave, and organisations face substantial financial losses, a loss of customer trust, and potential legal repercussions due to poor-quality data.
This job has changed as much as the hidden infrastructure that turns data into insight and powers the apps we all use. Which is to say, it has changed a lot.
Data Correctness/Validation
This is the largest stream of all the tasks. When we talk about data correctness, we should be asking: what does correctness mean for this dataset? The answer is different for every dataset and every organisation. The common-sense interpretation is that correct data is whatever your end user (or the business) expects from the dataset — its expected result.
We can find this out simply by asking questions, or by reading through the list of requirements. Here are some of the tests we might run in this stream:
Finding Duplicates — nobody wants these in their data.
– Check that a column/field that is supposed to hold unique/distinct values really does: will every returned value in that column/field be unique/distinct?
– Check that every value that can be found in your data is actually returned.
Data with KPIs – if a dataset has columns we can sum, min, or max over, those are key performance indicators: essentially any mostly numeric/integer columns, e.g. Budget, Revenue, Sales. If we are comparing two datasets, the tests below apply (a code sketch follows the list):
– Compare row counts between the two datasets and report the difference.
– Compare the unique/distinct values and their counts per column, and find values that are present in one dataset but not the other.
– Compare the KPIs between the two datasets and compute the percentage difference.
– Reconcile missing values — rows missing from either dataset, matched on a primary or composite primary key (this can also be done, with more effort, on a source that has no primary key).
– Break the metrics down by segment for individual column values — this helps pinpoint what is going wrong when, say, the count on the Zoopla side does not match the count on the Rightmove side, or some values are missing.
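A minimal sketch of these comparison checks, assuming two pandas DataFrames with made-up column names (listing_id, price):

import pandas as pd

# Hypothetical extracts from two sources we want to reconcile.
zoopla = pd.DataFrame({"listing_id": [1, 2, 3, 3], "price": [100, 200, 300, 300]})
rightmove = pd.DataFrame({"listing_id": [1, 2, 4], "price": [100, 210, 400]})

# Duplicates: rows that repeat on the supposedly unique key.
duplicates = zoopla[zoopla.duplicated(subset="listing_id", keep=False)]

# Row-count difference between the two datasets.
count_diff = len(zoopla) - len(rightmove)

# Distinct key values present in one dataset but not the other.
only_in_zoopla = set(zoopla["listing_id"]) - set(rightmove["listing_id"])
only_in_rightmove = set(rightmove["listing_id"]) - set(zoopla["listing_id"])

# Percentage difference of a KPI (total price) between the two datasets.
kpi_diff_pct = (zoopla["price"].sum() - rightmove["price"].sum()) / rightmove["price"].sum() * 100

print(duplicates, count_diff, only_in_zoopla, only_in_rightmove, kpi_diff_pct)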
Data Freshness
This is an easy set of checks. How do we know if the data is fresh?
An obvious one: if your dataset has a date column, just check the max date. Another is to check when the data was last loaded into a particular table. All of this can be turned into very simple automated checks, which we might cover in a later blog entry.
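One such automated check might look like this sketch (assuming a pandas DataFrame with an updated_at column; the names are illustrative):

import pandas as pd

def assert_fresh(df: pd.DataFrame, date_column: str = "updated_at", max_age_days: int = 1) -> None:
    # Fail if the newest row is older than the allowed age.
    newest = pd.to_datetime(df[date_column]).max()
    age = pd.Timestamp.now() - newest
    assert age <= pd.Timedelta(days=max_age_days), f"Data is stale: newest row is {age} old"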
Data Completeness
This can be seen as an intermediate step alongside data correctness: even if the values we have are correct, how do we know the data itself is complete?
One test is to check whether any column contains only null values. Perhaps that is okay, but most of the time it is bad news.
Another test is one-valuedness: whether every value in a column is identical. In some cases that is a perfectly fine result; in others it is something we would rather look into.
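A sketch of both completeness checks, again assuming a pandas DataFrame:

import pandas as pd

def completeness_report(df: pd.DataFrame) -> None:
    for column in df.columns:
        if df[column].isna().all():
            print(f"{column}: all values are null — probably bad news")
        elif df[column].nunique(dropna=True) == 1:
            print(f"{column}: only one distinct value — worth a closer look")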
What are Data Quality Tools and How are They Used?
Data quality tools are used to improve, and sometimes automate, the many processes required to keep data fit for analytics, data science, and machine learning. Such tools let teams evaluate their existing data pipelines, identify quality bottlenecks, and even automate many remediation steps. Activities involved in guaranteeing data quality include data profiling, data lineage, and data cleansing. Teams can use cleansing, profiling, measurement, and visualization tools to understand the shape and values of the data assets they have acquired, and how those assets are being collected. These tools call out outliers and mixed formats. In the data analytics pipeline, data profiling acts as a quality control gate. Each of these is a data management chore.
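As a rough, hand-rolled illustration of what profiling surfaces (a sketch with pandas, not any particular vendor tool):

import pandas as pd

def quick_profile(df: pd.DataFrame) -> None:
    for column in df.columns:
        series = df[column]
        print(
            f"{column}: dtype={series.dtype}, "
            f"nulls={series.isna().mean():.1%}, "
            f"distinct={series.nunique(dropna=True)}"
        )
        # Crude outlier flag for numeric columns: values far from the mean.
        if pd.api.types.is_numeric_dtype(series):
            z = (series - series.mean()) / series.std()
            print(f"  potential outliers: {(z.abs() > 3).sum()}")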
Where is PyTest used?
PyTest in the Wild: Wizards and Wands of Python Testing!
- Spellbinding CI/CD Rituals: PyTest conjures up a seamless pipeline to test your code spells before they go live. No more code curses in production!
- Plugin Potion Brewing: Brew your own enchanting PyTest potions! Plugins let developers expand their testing grimoire with magical extras.
- Parallel Potion Testing: Multiple cauldrons brewing at once? PyTest can handle lots of potions, I mean tests, in parallel. Get those results faster than a flying broomstick!
- Detective Work on Code: With its keen eye for detail, PyTest helps sleuth out those elusive bugs to keep the code realm safe and sound. (A minimal test example follows this list.)
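To ground the metaphors, here is what a typical PyTest test looks like — a minimal sketch, with a made-up add function under test:

import pytest

def add(a, b):
    # Made-up function under test.
    return a + b

@pytest.mark.parametrize("a, b, expected", [(1, 2, 3), (-1, 1, 0)])
def test_add(a, b, expected):
    # Plain assert statements — PyTest rewrites them to give rich failure messages.
    assert add(a, b) == expected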
PyTest Alternatives
unittest
Built into Python's standard library, unittest is a unit testing framework inspired by JUnit. It supports test automation, sharing of setup and shutdown code, aggregation of tests into collections, and independence of the tests from the reporting framework.
Example:
import unittest

class SimpleTest(unittest.TestCase):
    def test(self):
        self.assertTrue(True)

if __name__ == '__main__':
    unittest.main()
Pros:
- Included with Python, no installation required.
- Familiar xUnit style testing for those transitioning from other languages.
- Well-integrated with Python's development tools.
Cons:
- Less Pythonic, more boilerplate code compared to PyTest.
- Lacks some advanced features and plugins available in PyTest.
- Verbose and less flexible in writing test cases.
nosetests
Nose extends unittest to make testing easier. It supports fixtures and test discovery, which allows tests to be written with less boilerplate.
Example:
def test_numbers():
    assert 5 * 3 == 15

if __name__ == '__main__':
    import nose
    nose.run()
Pros:
- Simple to write tests due to less boilerplate compared to unittest.
- Automatic test discovery makes running tests easier.
- Highly extensible with a wide range of plugins.
Cons:
- Development of Nose has stalled; community has shifted towards Nose2 and PyTest.
- Not as feature-rich as PyTest for advanced testing needs.
- Some plugins may be outdated or lack maintenance.
doctest
The doctest module searches for pieces of text that look like interactive Python sessions, and then executes those sessions to verify that they work exactly as shown.
Example:
def multiply(a, b):
    """
    >>> multiply(2, 3)
    6
    """
    return a * b

if __name__ == "__main__":
    import doctest
    doctest.testmod()
Pros:
- Encourages writing documentation and tests concurrently.
- Tests are readable as they are part of the documentation.
- Simple to use for straightforward test cases.
Cons:
- Not suitable for complex testing scenarios.
- Tests can become cluttered in documentation if overused.
- Limited to testing only what can be expressed in docs as interactive sessions.
Quick Facts about PyTest
The Curious Birth of PyTest
Once upon a time in 2004, there was a dev named Holger Krekel who unleashed the testing champion PyTest into the wild. As the prodigy of Python testing frameworks, it came with a simple assert rewriting charm that captured the hearts of bug-squashers everywhere. Swifter than a speeding exception, it forged a place in the annals of testing lore.
PyTest's Plugin Pandemonium
Roll up, roll up to witness the magnificent plugin fiesta! PyTest, like a ringmaster, commands an exuberant crowd of over 315 plugins. With these magnificent critters, one can extend its capabilities to the moon and back, ensuring that tests are not just run but are performed like a Cirque du Soleil of code.
# Here's how you'd typically use a plugin
pytest --verbose --capture=no # Look at me, using options!
Continuous Testimony with PyTest
Imagine a world where writing tests is as fun as slurping spaghetti. PyTest made that culinary code dream come true with its unique fixture model. Its continuous integration dazzle made it the darling of devs and the terror of bugs, effortlessly integrating with Jenkins, GitHub Actions, and more. With each commit, it whispers a gentle "Shall I test for thee?"
# Integrating PyTest with CI tools
# ...is as simple as adding a few lines to your config
# script: pytest -v
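The fixture model mentioned above, in a minimal sketch (the throwaway "database" is made up for illustration):

import pytest

@pytest.fixture
def fake_db():
    # Set up a throwaway stand-in for a database connection...
    db = {"users": ["alice", "bob"]}
    yield db
    # ...and tear it down after the test finishes.
    db.clear()

def test_users_present(fake_db):
    assert "alice" in fake_db["users"]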
What is the difference between Junior, Middle, Senior and Expert PyTest developer?
Seniority Name | Years of Experience | Average Salary (USD/year) | Responsibilities & Activities
---|---|---|---
Junior PyTest Developer | 0-2 | 50,000 - 70,000 | 
Middle PyTest Developer | 2-5 | 70,000 - 90,000 | 
Senior PyTest Developer | 5-8 | 90,000 - 120,000 | 
Expert/Team Lead PyTest Developer | 8+ | 120,000 - 150,000+ | 
Top 10 PyTest Related Tech
Python Language
The almighty language at the heart of it all—Python! It's as essential as coffee is to programmers and as widely loved as cats are on the internet. You'll be scripting your tests with this easy-to-read, 'I-can't-believe-it's-not-English' syntax. Being bilingual is cool, but being Python-lingual is cooler.
PyTest Framework
If Python were the canvas, PyTest would be the paintbrush. It's the go-to framework for crafting your test masterpieces. Abundantly feature-packed, yet simpler than explaining why you need five minutes more sleep every time your alarm goes off.
import pytest

@pytest.mark.smoke
def test_the_obvious():
    assert True is not False

pytest-xdist
Why test one thing at a time when you can test ALL the things at once? pytest-xdist is like having extra arms to do more work. It's the octopus of plugins, letting you run tests in parallel, making your test suite go vroom!
pytest -n 4 # Run tests in four parallel gulp...I mean groups!
pytest-cov
Ever worried you're not covering enough test scenarios? Enter pytest-cov, a little like a nosy neighbor peering over your code's fence to ensure you've covered every nook & cranny. Code Coverage just got a bit less intimidating.
pytest --cov=your_package_name # Peeking time!
Selenium
For the adventurer in every tester—Selenium takes you on a quest through web applications, battling bugs and automating browsers. It is your Excalibur when you're venturing into the realm of GUI testing.
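A minimal sketch of such a quest, assuming ChromeDriver is installed locally and on your PATH:

from selenium import webdriver

driver = webdriver.Chrome()  # assumes a local ChromeDriver
driver.get("https://example.com")
assert "Example" in driver.title
driver.quit()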
Requests
Need to nudge an API and see how it groans back? 'Requests' is your trusty messenger. It's as simple as requesting a pizza delivery—no toppings confusion included—and integral for API testing.
import requests

response = requests.get("https://api.ultimate-pizza.com/pizzas")

Tox
Tox is like that overachiever who tests your package in different environments. It's a testament to the "works on my machine" philosophy, aiming to make this phrase as outdated as a beeper.
Mock and MonkeyPatch
Mock is like a stunt double for your code, taking the hits so your app doesn’t have to. MonkeyPatch, on the other hand, is like that cheeky chap who switches out the banana while no one's looking. Both are key for isolating tests from uncertainties.
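A minimal sketch of both, using PyTest's built-in monkeypatch fixture and unittest.mock (the dice-rolling function is made up):

import random
from unittest.mock import Mock

def roll_dice():
    # Made-up function under test: depends on randomness we want to control.
    return random.randint(1, 6)

def test_roll_is_deterministic(monkeypatch):
    # monkeypatch is a built-in PyTest fixture; here it pins the random result.
    monkeypatch.setattr(random, "randint", lambda a, b: 4)
    assert roll_dice() == 4

def test_mock_records_calls():
    # Mock stands in for a collaborator and records how it was used.
    notifier = Mock()
    notifier.send("hello")
    notifier.send.assert_called_once_with("hello")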
Docker
Imagine you could pack your entire testing setup and move it anywhere without breaking anything—welcome to Docker, the virtual Tupperware for your apps. Perfect for making sure PyTest runs in a completely controlled chaos...erm, I mean, environment.
Git and GitHub
Last but not least, Git along with GitHub are the dynamic duo for source control. Imagine a world where you magically never lose your code, can collaborate without overwriting each other's work like a bad game of Tetris—that's Git for you.