Help:Monitoring Data Quality

Current revision: 17:14, 20 August 2022, by DataAnalyst (replaced with updates from Prod)
 
About the Data Quality Issues page

WeRelate allows you to check for possible errors in your data by visiting the Data Quality Issues page.

Eventually the Data Quality Issues page will also support the Data Quality Patrol function - routine monitoring to catch errors such as typos so they can be fixed while the contributor is focused on that part of the tree. This routine monitoring will become feasible once the backlog of existing errors and anomalies is reduced. (Statistics on the backlog are available on the Data Quality Statistics page.) Please consider volunteering to address the backlog to make this possible.

Description and definitions

  • Data quality issues are identified by a job that runs periodically. The Data Quality Issues page shows the results from the last time the job was run. The run date/time is displayed at the top of the page. Note that the data can be up to 8 hours older than the run date/time due to the timing of the processing.
    • You cannot request a real-time issue check. If you just added or changed some data, you'll have to wait for the next run (or even the one after) to check for issues.
    • If you check an issue and the data doesn't match the message, check the page history to see if someone else fixed the issue within the last day or so.
  • Issues may be:
    • Anomalies - situations that are unusual enough to warrant review but might be correct, such as a person who married at age 6 or a person who was born before their parents were married
    • Errors - situations that are not correct, such as a person who married after they died, or a person who was born before a parent was born
    • Incomplete data - situations where minimal data about a person, such as gender, is missing
  • Note
    • Situations where sources are missing or incomplete might be added to this list in the future (or possibly a separate list)
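The three categories above can be illustrated with a minimal Python sketch. This is not the actual job that WeRelate runs: the Person record, its field names, and the marriage-age threshold are all assumptions made for illustration, and only years are compared to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    # All fields are hypothetical; years only, to keep the sketch short.
    gender: Optional[str] = None
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    marriage_year: Optional[int] = None
    parent_birth_year: Optional[int] = None
    parents_marriage_year: Optional[int] = None


MIN_USUAL_MARRIAGE_AGE = 12  # assumed threshold, not the real job's value


def find_issues(p: Person) -> list[tuple[str, str]]:
    """Return (category, message) pairs for one person."""
    issues = []
    # Incomplete data: minimal data such as gender is missing.
    if p.gender is None:
        issues.append(("incomplete", "missing gender"))
    # Errors: situations that cannot be correct.
    if p.marriage_year is not None and p.death_year is not None:
        if p.marriage_year > p.death_year:
            issues.append(("error", "married after death"))
    if p.birth_year is not None and p.parent_birth_year is not None:
        if p.birth_year < p.parent_birth_year:
            issues.append(("error", "born before a parent was born"))
    # Anomalies: unusual enough to review, but possibly correct.
    if p.birth_year is not None and p.marriage_year is not None:
        age = p.marriage_year - p.birth_year
        if 0 <= age < MIN_USUAL_MARRIAGE_AGE:
            issues.append(("anomaly", f"married at age {age}"))
    if p.birth_year is not None and p.parents_marriage_year is not None:
        if p.birth_year < p.parents_marriage_year:
            issues.append(("anomaly", "born before parents' marriage"))
    return issues
```

For example, find_issues(Person(birth_year=1700, marriage_year=1706)) reports both the missing gender (incomplete data) and the marriage at age 6 (anomaly).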

Interacting with the list

Click the links on the list to see and correct issues (see section Fixing issues below).

Alternatively, you can:

  • Mark an anomaly as verified by clicking the "Verified" button. This means that you have reviewed the situation and determined that the data is correct - for example, that a person was truly born before their biological parents married (i.e., was not from a previous marriage/relationship of one of the parents).
    • Before marking an issue as verified, ensure that the page (and the related page, such as the family page) has the sources that prove the information to be correct. Note that if the anomaly involves 2 dates (e.g., mother's age at the birth of a child depends on both the mother's birth date and the child's birth date), you must ensure that sources are provided for BOTH dates as part of the verification process.
    • Keep in mind that most situations identified as anomalies in the early stages of data cleansing are, in fact, errors. For example, it is rare for a woman to give birth over the age of 50 (4 per 100,000 births in the US between 1997 and 1999 - see Pregnancy over age 50), and any such situation should call into question the birth date of the child and the birth date or identity of the mother. Only after verifying all the facts should an anomaly be marked as verified.
    • When you select the "Verified" button, a template is added to the Talk page of the indicated Person or Family page (the Talk page will be automatically created if it doesn't already exist). You will have an opportunity to add a comment (e.g., "see notes on parents' family page", "based on cited sources"). The template identifies you and the date you clicked the "Verified" button, and includes the comment. Others will see this information when they open the Talk page.
  • Defer an issue by clicking the "Defer" button. This allows you to track issues that you are not prepared to address just yet or maybe ever.
    • For example:
      • Maybe you are working on your own project but choose to clean up a few issues each day, and are looking for "low-hanging fruit" such as simple date typos. You might want to defer larger problems such as a page that conflates 2 individuals until you are prepared to devote the time required for the necessary research.
      • Maybe you need to ask a family member for the correct data and are waiting for a reply.
      • Maybe you don't have the necessary expertise or access to sources to resolve the issue.
      • Maybe there are conflicting sources (such as one source saying the christening date was 6 Apr 1635 and another source saying the birth date was 3 Apr 1636), and you don't believe the issue is resolvable unless new sources are found. You might make a judgment call that the issue (events out of order) can simply be ignored.
    • When you select the "Defer" button, a template is added to the Talk page of the indicated Person or Family page (the Talk page will be automatically created if it doesn't already exist). You will have an opportunity to add a comment (e.g., "conflated persons", "waiting for a reply", "conflicting sources; issue can be ignored"). The template identifies you and the date you clicked the "Defer" button, and includes the comment. Others will see this information when they open the Talk page.

Just to be clear, you would normally take one of 3 actions:

  • fix the data
  • verify that the existing data is correct and sources support it, and select the "Verified" button
  • select the "Defer" button

If you have resolved the issue by fixing the data, please don't select either the "Verified" or the "Defer" button, as this will just be confusing when looking at the Talk page in the future.

Why the "Defer" button?

The "Defer" button allows users to keep track of issues they choose to ignore for now so that they can optimize their data correction efforts. Additionally, it is intended to ensure that users don't mark anomalies as "verified" simply to get them off the list, which can be tempting. This is an opportunity to say "I don't know whether or not the data is correct" but still track that the issue was looked at. Maybe someone else will be able to resolve the issue, or maybe the issue will remain unresolved due to conflicting or limited sources.

Filtering the list

The list is automatically filtered when you first open it:

  • If you select Data Quality Issues from the My Relate menu, the list reflects your watchlist.
  • If you select "check" beside a MyTree on the Manage My Trees page, the list reflects that MyTree.
  • If you select Data Quality patrol from the Volunteer portal, the list shows issues across the entire database.
  • In all cases, when you first open the list, verified anomalies are excluded.

In addition, you can filter the list by:

  • category (anomalies, errors, incomplete data)
  • century of birth year as stated on the Person page (keeping in mind it might be incorrect). Note that the birth year isn't considered if it includes "bet/and", "bef", or "aft", and Family pages aren't included at all. This filter is primarily intended to assist volunteers addressing the initial backlog of issues.
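The rule for the century filter might look roughly like the following Python sketch. The function name, the free-text date format, and the choice to represent a century by its first year (1600 for 1635) are assumptions; the page does not describe how the filter is actually implemented.

```python
import re
from typing import Optional

# Modifiers that, per the note above, cause the birth year to be ignored.
EXCLUDED_MODIFIERS = ("bet", "and", "bef", "aft")


def birth_century(birth_date: str) -> Optional[int]:
    """Return the first year of the birth century, or None if not filterable."""
    words = re.findall(r"[a-z]+", birth_date.lower())
    if any(w in EXCLUDED_MODIFIERS for w in words):
        return None  # ranges and before/after dates are skipped
    match = re.search(r"\b(\d{3,4})\b", birth_date)
    if match is None:
        return None  # no usable year in the date
    return int(match.group(1)) // 100 * 100
```

Under these assumptions, "6 Apr 1635" falls in the century starting at 1600, while "bef 1700" and "bet 1630 and 1635" are left out of the filter.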

You can also choose to include verified anomalies.

  • If you choose to include verified anomalies, the list will indicate who verified each anomaly. An anomaly can be verified by more than one user - in fact, a second set of eyes can increase the reliability of the data, since everyone makes mistakes at some point.

If you are signed in, you can switch between showing one MyTree, your entire watchlist, or the entire database:

  • For performance reasons, the MyTrees and watched/unwatched filters can't be used at the same time. If you want to filter on a MyTree, make sure you have selected "Watched and unwatched". If you want to filter on your watchlist, make sure the MyTree filter is set to "Whether or not in".

For performance reasons, when you filter on a MyTree or your watchlist, the system will restrict the number of issues displayed at a time. This is automatic, and you will be informed of the limit. Expect it to take several seconds for the list to appear.

Also for performance reasons, you cannot filter out deferred issues. Instead, any issue that you previously deferred is noted as such (if you are signed in).
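The filtering rules in this section can be summarized in a short Python sketch. The Issue record, the function, and its parameter names are hypothetical; the sketch only mirrors the behaviour described here (verified anomalies hidden by default, MyTree and watchlist filters not combinable) and says nothing about how the site implements it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    # Hypothetical record for one row of the list.
    title: str
    category: str               # "anomaly", "error" or "incomplete"
    verified: bool = False
    tree: Optional[str] = None  # MyTree the page belongs to, if any
    watched: bool = False


def filter_issues(issues, category=None, include_verified=False,
                  tree=None, watched_only=False):
    """Apply the list filters; verified anomalies are excluded by default."""
    if tree is not None and watched_only:
        # The two filters can't be used at the same time.
        raise ValueError("MyTree and watchlist filters cannot be combined")
    result = []
    for issue in issues:
        if category is not None and issue.category != category:
            continue
        if issue.verified and not include_verified:
            continue
        if tree is not None and issue.tree != tree:
            continue
        if watched_only and not issue.watched:
            continue
        result.append(issue)
    return result
```

Note that the sketch has no option to hide deferred issues, matching the restriction described above.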

List order

The list is in alphabetical order. Person pages come first, sorted by last name and then first name. Family pages follow, sorted by the husband's last name and first name, and then by the wife's last name and first name.
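As an illustration of this ordering, here is a small Python sketch of a sort key. The Page record and its fields are hypothetical, and comparing names case-insensitively is an assumption rather than documented behaviour.

```python
from typing import NamedTuple


class Page(NamedTuple):
    # Hypothetical record; for Person pages the wife fields stay empty.
    kind: str            # "Person" or "Family"
    last: str            # person's (or husband's) last name
    first: str           # person's (or husband's) first name
    wife_last: str = ""
    wife_first: str = ""


def list_order_key(page: Page) -> tuple:
    """Person pages sort before Family pages, then by the name fields."""
    kind_rank = 0 if page.kind == "Person" else 1
    return (kind_rank, page.last.lower(), page.first.lower(),
            page.wife_last.lower(), page.wife_first.lower())


pages = [
    Page("Family", "Adams", "John", "Smith", "Mary"),
    Page("Person", "Brown", "Ann"),
    Page("Person", "Adams", "John"),
    Page("Family", "Adams", "John", "Jones", "Jane"),
]
ordered = sorted(pages, key=list_order_key)
```

With this key, both Person pages (Adams, then Brown) precede the two Family pages, and the Adams/Jones family precedes the Adams/Smith family.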

Fixing issues

Some issues, such as date typos, can be resolved by checking sources cited on the page. Others require some research. If you find a source to support a correction, please add the source to the page.

If there is an obvious typo in a date, please don't assume that only the century digit or only the decade digit is wrong. It is common to accidentally transpose the century and decade digits, or to double one digit when doubling a different digit was intended. For example:

  • don't assume that 1984 should be 1884 - maybe it should be 1894
  • don't assume that 1883 should be 1783 - maybe it should be 1773
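To make the point concrete, the following Python sketch lists years that could have been intended when a 4-digit year looks wrong. The function and the three kinds of slip it models are assumptions chosen to cover the examples above; it is a research aid only, and the sources still decide which year is correct.

```python
def candidate_years(year: int) -> set[int]:
    """Years reachable from a 4-digit year by one plausible keying slip."""
    digits = str(year)
    if len(digits) != 4:
        raise ValueError("expected a 4-digit year")
    candidates = set()
    # Slip 1: a single digit is off by one (e.g. wrong century or decade).
    for i in range(4):
        for delta in (-1, 1):
            d = int(digits[i]) + delta
            if 0 <= d <= 9:
                candidates.add(int(digits[:i] + str(d) + digits[i + 1:]))
    # Slip 2: two adjacent digits were transposed (1984 typed for 1894).
    for i in range(3):
        swapped = digits[:i] + digits[i + 1] + digits[i] + digits[i + 2:]
        candidates.add(int(swapped))
    # Slip 3: the wrong digit was doubled (1883 typed for 1773).
    for i in range(3):
        if digits[i] == digits[i + 1]:
            for delta in (-1, 1):
                d = int(digits[i]) + delta
                if 0 <= d <= 9:
                    candidates.add(int(digits[:i] + str(d) * 2 + digits[i + 2:]))
    # Keep only 4-digit years other than the original (an assumption).
    return {c for c in candidates if c != year and c >= 1000}
```

For 1984 the candidates include both 1884 and 1894, and for 1883 they include both 1783 and 1773, which is why the cited sources should be checked before choosing a correction.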

Talk page

There is a Talk page associated with the Data Quality Issues page, although the link is not in the normal place. Look for it after the date the data was last updated. The Talk page can be used to coordinate data correction efforts, and to discuss usability of the Data Quality Issues page.