Channel: Troubleshooting Archives - dbi Blog
Viewing all 44 articles

ODA and KVM: Debugging of DBsystem creation failure


Debugging errors when working with ODA is not always the easiest thing to do… 😛

It may become a bit tricky and not a straightforward process. In this blog I wanted to show you an example we faced: debugging a DB System creation failure and how we found out the real reason it failed.

Before starting let’s do a short reminder about KVM virtualisation on ODA.

Since release 19.9, ODA supports hard partitioning for database virtualisation. This is based on two types of VMs:

  1. Compute instance (more info here)
  2. DB Systems

While the first one is intended as a traditional VM hosting any workload except Oracle databases, the second one is dedicated to database virtualisation.
A DB System is an Oracle Linux VM with a software stack similar to the ODA bare metal one (GI, DB, …).

Provisioning a new DB System is pretty easy and straightforward using the command odacli create-dbsystem with a JSON file as input… as long as it works and you don't make any mistake.

In our case, here is the error we got when trying to create a new DB System:

Job details
----------------------------------------------------------------
                     ID:  75115716-4ce3-4eb1-af1a-4d3d8bef441a
            Description:  DB System srvdb01 creation
                 Status:  Failure
                Created:  November 5, 2021 11:37:48 AM CET
                Message:  DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.

Task Name                                Start Time                          End Time                            Status
---------------------------------------- ----------------------------------- ----------------------------------- ----------
Create DB System metadata                November 5, 2021 11:37:48 AM CET    November 5, 2021 11:37:48 AM CET    Success
Persist new DB System                    November 5, 2021 11:37:48 AM CET    November 5, 2021 11:37:48 AM CET    Success
Validate DB System prerequisites         November 5, 2021 11:37:48 AM CET    November 5, 2021 11:37:52 AM CET    Success
Setup DB System environment              November 5, 2021 11:37:52 AM CET    November 5, 2021 11:37:53 AM CET    Success
Create DB System ASM volume              November 5, 2021 11:37:53 AM CET    November 5, 2021 11:38:00 AM CET    Success
Create DB System ACFS filesystem         November 5, 2021 11:38:00 AM CET    November 5, 2021 11:38:09 AM CET    Success
Create DB System VM ACFS snapshots       November 5, 2021 11:38:09 AM CET    November 5, 2021 11:38:39 AM CET    Success
Create temporary SSH key pair            November 5, 2021 11:38:39 AM CET    November 5, 2021 11:38:39 AM CET    Success
Create DB System cloud-init config       November 5, 2021 11:38:39 AM CET    November 5, 2021 11:38:40 AM CET    Success
Provision DB System VM(s)                November 5, 2021 11:38:40 AM CET    November 5, 2021 11:38:41 AM CET    Success
Attach disks to DB System                November 5, 2021 11:38:41 AM CET    November 5, 2021 11:38:41 AM CET    Success
Add DB System to Clusterware             November 5, 2021 11:38:41 AM CET    November 5, 2021 11:38:41 AM CET    Success
Start DB System                          November 5, 2021 11:38:41 AM CET    November 5, 2021 11:38:44 AM CET    Success
Wait DB System VM first boot             November 5, 2021 11:38:44 AM CET    November 5, 2021 11:39:56 AM CET    Success
Setup Mutual TLS (mTLS)                  November 5, 2021 11:39:56 AM CET    November 5, 2021 11:40:15 AM CET    Success
Export clones repository                 November 5, 2021 11:40:15 AM CET    November 5, 2021 11:40:15 AM CET    Success
Setup ASM client cluster config          November 5, 2021 11:40:16 AM CET    November 5, 2021 11:40:18 AM CET    Success
Install DB System                        November 5, 2021 11:40:18 AM CET    November 5, 2021 11:40:26 AM CET    InternalError

So… it failed while installing the DB into the newly created VM. The error code is DCS-10001: Internal error.

The first thing we tried was to get more information on this error code using dcserr:

[root@dbi-oda-x8 log]# dcserr 10001
10001, Internal_Error, "Internal error encountered: {0}."
// *Cause: An internal error occurred.
// *Action: Contact Oracle Support Services for assistance.
/

Not helping very much… Unfortunately, odacli describe-job doesn't point to any log file either…

The only remaining solution is then to analyse the DCS log files. All operations run with odacli go through the dcsagent, which writes its logs to:

/opt/oracle/dcs/log

There you will find several types of log files, such as the dcs-admin and dcs-components ones and, obviously, the dcs-agent log files:

[root@dbi-oda-x8 log]# pwd
/opt/oracle/dcs/log
[root@dbi-oda-x8 log]# ls -l dcs-agent*
-rw-r--r-- 1 root root 144752279 Nov 3 23:30 dcs-agent-2021-11-03.log
-rw-r--r-- 1 root root 231235959 Nov 4 23:30 dcs-agent-2021-11-04.log
-rw-r--r-- 1 root root 151900 Nov 3 11:59 dcs-agent-requests-2021-11-03-03.log
-rw-r--r-- 1 root root 60331 Nov 3 12:59 dcs-agent-requests-2021-11-03-11.log
-rw-r--r-- 1 root root 122337 Nov 3 13:58 dcs-agent-requests-2021-11-03-13.log
-rw-r--r-- 1 root root 74029 Nov 3 14:59 dcs-agent-requests-2021-11-03-14.log
-rw-r--r-- 1 root root 112641 Nov 3 15:59 dcs-agent-requests-2021-11-03-15.log
-rw-r--r-- 1 root root 154503 Nov 3 16:59 dcs-agent-requests-2021-11-03-16.log
-rw-r--r-- 1 root root 10575 Nov 3 17:03 dcs-agent-requests-2021-11-03-17.log
-rw-r--r-- 1 root root 184 Nov 4 07:53 dcs-agent-requests-2021-11-04-07.log
-rw-r--r-- 1 root root 24097 Nov 4 08:42 dcs-agent-requests-2021-11-04-08.log
-rw-r--r-- 1 root root 6556 Nov 4 09:59 dcs-agent-requests-2021-11-04-09.log
-rw-r--r-- 1 root root 7711 Nov 4 10:56 dcs-agent-requests-2021-11-04-10.log
-rw-r--r-- 1 root root 17646 Nov 4 11:52 dcs-agent-requests-2021-11-04-11.log
-rw-r--r-- 1 root root 1837 Nov 4 12:58 dcs-agent-requests-2021-11-04-12.log
-rw-r--r-- 1 root root 122202 Nov 4 13:59 dcs-agent-requests-2021-11-04-13.log
-rw-r--r-- 1 root root 71837 Nov 4 14:59 dcs-agent-requests-2021-11-04-14.log
-rw-r--r-- 1 root root 215518 Nov 4 15:59 dcs-agent-requests-2021-11-04-15.log
-rw-r--r-- 1 root root 4497 Nov 4 16:24 dcs-agent-requests-2021-11-04-16.log
-rw-r--r-- 1 root root 660 Nov 5 07:56 dcs-agent-requests-2021-11-05-07.log
-rw-r--r-- 1 root root 513 Nov 5 08:00 dcs-agent-requests-2021-11-05-08.log
-rw-r--r-- 1 root root 45592 Nov 5 10:59 dcs-agent-requests-2021-11-05-10.log
-rw-r--r-- 1 root root 126945 Nov 5 11:59 dcs-agent-requests-2021-11-05-11.log
-rw-r--r-- 1 root root 17460 Nov 5 12:21 dcs-agent-requests.log
-rw-r--r-- 1 root root 75603907 Nov 5 12:21 dcs-agent.log

However, the challenge is that this log file is pretty verbose and therefore pretty long.
Just to give you an idea, on our test ODA (where not much was running) we already had almost 1 million lines in half a day.

So the option we used was to run a grep command to gather only the lines concerning the DB System we tried to create:

grep srvdb01 dcs-agent.log

…which still represents 850+ lines 😉

Going bottom up, we first found all the entries about the DELETE DB System job we ran after the failure, such as:

...
2021-11-05 11:47:50,962 INFO [dw-19811 - DELETE /dbsystem/srvdb01] [] c.o.d.a.k.o.l.SingleNodeLockController: Thread 'dw-19811 - DELETE /dbsystem/srvdb01' released READ lock for Resource type 'Metadata' with name 'metadata'
2021-11-05 11:47:50,963 INFO [dw-19811 - DELETE /dbsystem/srvdb01] [] c.o.d.a.k.m.KvmBaseModule: Starting new job 586fce36-8131-4f46-b447-36fab882f060 for taskFlow: seq(id: 586fce36-8131-4f46-b447-36fab882f060, name: 586fce36-8131-4f46-b447-36fab882f060, jobId: 586fce36-8131-4f46-b447-36fab882f060, status: Created,exposeTaskResultToJob: false, result: null, output: , on_failure: FailOnAny):
2021-11-05 11:47:50,963 INFO [dw-19811 - DELETE /dbsystem/srvdb01] [] c.o.d.a.k.m.KvmBaseModule: Job report: ServiceJobReport(jobId=586fce36-8131-4f46-b447-36fab882f060, status=Created, message=null, reports=[], createTimestamp=2021-11-05 11:47:50.957, resourceList=[], description=DB System srvdb01 deletion, updatedTime=2021-11-05 11:47:50.957)
  "description" : "DB System srvdb01 deletion",
  "description" : "DB System srvdb01 deletion",
2021-11-05 11:47:50,973 INFO [DeleteDbSystem_KvmLockContainer_38554 : JobId=586fce36-8131-4f46-b447-36fab882f060] [] c.o.d.a.k.o.l.SingleNodeLockController: Thread 'DeleteDbSystem_KvmLockContainer_38554 : JobId=586fce36-8131-4f46-b447-36fab882f060' trying to acquire WRITE lock for Resource type 'DB System' with name 'srvdb01'
2021-11-05 11:47:50,973 INFO [DeleteDbSystem_KvmLockContainer_38554 : JobId=586fce36-8131-4f46-b447-36fab882f060] [] c.o.d.a.k.o.l.SingleNodeLockController: Thread 'DeleteDbSystem_KvmLockContainer_38554 : JobId=586fce36-8131-4f46-b447-36fab882f060' acquired WRITE lock for Resource type 'DB System' with name 'srvdb01'
	 Mountpath: /u05/app/sharedrepo/srvdb01
...

So we could simply skip all lines containing DELETE or Operation Type = Delete.
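As a quick sketch, the two grep steps can be combined into one pipeline. The sample file below just reproduces two lines from the excerpts above so the pipeline is self-contained; on a real ODA you would run it against /opt/oracle/dcs/log/dcs-agent.log:

```shell
# Build a tiny sample of agent-log lines (contents taken from the excerpts above).
cat > /tmp/dcs-agent-sample.log <<'EOF'
2021-11-05 11:47:50,962 INFO [dw-19811 - DELETE /dbsystem/srvdb01] [] c.o.d.a.k.o.l.SingleNodeLockController: released READ lock
2021-11-05 11:46:48,745 DEBUG [Install DB System] srvdb01 status: InternalError, causedBy=RestClientException: DCS-11002:Password for database admin user does not comply with the password policy.
EOF

# Keep only the lines about our DB System, dropping the deletion-job noise.
grep 'srvdb01' /tmp/dcs-agent-sample.log | grep -v 'DELETE'
```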

Then come plenty of lines containing the error message you receive in odacli describe-job, as well as the content of the JSON file used to run the job.

...
2021-11-05 11:46:48,763 DEBUG [Process new DB System] [] c.o.d.a.k.t.KvmBaseTaskBuilder$KvmTaskExecutor: Output request: DbSystemCreateRequest(systemInfo=DbSystemCreateRequest.SystemInfo(dbSystemName=srvdb01, shapeName=odb2, cpuPoolName=cpupool4srv, diskGroup=DATA, systemPassword=*****, provisionType=rhp, timeZone=Europe/Zurich, enableRoleSeparation=true, customRoleSeparationInfo=DbSystemCreateRequest.CustomRoleSeparationInfo(groups=[DbSystemCreateRequest.GroupInfo(id=1001, role=oinstall, name=oinstall), DbSystemCreateRequest.GroupInfo(id=1002, role=dbaoper, name=dbaoper), DbSystemCreateRequest.GroupInfo(id=1003, role=dba, name=dba), DbSystemCreateRequest.GroupInfo(id=1004, role=asmadmin, name=asmadmin), DbSystemCreateRequest.GroupInfo(id=1005, role=asmoper, name=asmoper), DbSystemCreateRequest.GroupInfo(id=1006, role=asmdba, name=asmdba)], users=[DbSystemCreateRequest.UserInfo(id=1000, role=gridUser, name=grid), DbSystemCreateRequest.UserInfo(id=1001, role=oracleUser, name=oracle)])), networkInfo=DbSystemCreateRequest.NetworkInfo(domainName=dbi-lab.ch, ntpServers=[216.239.35.0], dnsServers=[8.8.8.8, 8.8.4.4], scanName=null, scanIps=null, nodes=[DbSystemCreateRequest.NetworkNodeInfo(number=0, name=srvdb01, ipAddress=10.36.0.245, netmask=255.255.255.0, gateway=10.36.0.1, vipName=null, vipAddress=null)], publicVNetwork=pubnet), gridInfo=DbSystemCreateRequest.GridInfo(language=en, enableAfd=false), dbInfo=DbSystemCreateRequest.DbInfo(name=srvTEST, uniqueName=srvTEST, domainName=dbi-lab.ch, adminPassword=**********, version=19.12.0.0.210720, edition=EE, type=SI, dbClass=OLTP, shape=odb2, role=PRIMARY, redundancy=MIRROR, characterSet=DbSystemCreateRequest.DbCharacterSetInfo(characterSet=AL32UTF8, nlsCharacterSet=AL16UTF16, dbTerritory=AMERICA, dbLanguage=ENGLISH), enableDbConsole=false, enableFlashStorage=false, enableFlashCache=false, enableSEHA=false, rmanBackupPassword=*****, level0BackupDay=null, tdePassword=*****, enableTde=false, enableUnifiedAuditing=true, 
isCdb=false, pdbName=null, pdbAdminUser=null, targetNodeNumber=null), devInfo=null)
2021-11-05 11:46:48,763 DEBUG [CreateDbSystem_KvmLockContainer_38327 : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.c.t.TaskDetail: set task result as DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
2021-11-05 11:46:48,763 INFO [CreateDbSystem_KvmLockContainer_38327 : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.a.k.t.KvmBaseTaskBuilder$KvmLockContainer:  Task[id: CreateDbSystem_KvmLockContainer_38327, TaskName: CreateDbSystem_KvmLockContainer_38327] result: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
2021-11-05 11:46:48,763 DEBUG [33793dd8-6704-407a-8dd0-f2b83a9deb10 : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.c.t.TaskDetail: set task result as DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
2021-11-05 11:46:48,763 DEBUG [33793dd8-6704-407a-8dd0-f2b83a9deb10 : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.a.k.m.i.KvmJobHelper$KvmTaskReportRecorder: Recording job report: id: 33793dd8-6704-407a-8dd0-f2b83a9deb10, name: 33793dd8-6704-407a-8dd0-f2b83a9deb10, jobId: 33793dd8-6704-407a-8dd0-f2b83a9deb10, status: Failure,exposeTaskResultToJob: false, result: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''., output:
  "message" : "DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.",
  "description" : "DB System srvdb01 creation",
  "message" : "DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.",
  "description" : "DB System srvdb01 creation",
...

Still not very useful… so we skipped these too and continued our journey upward. Finally, looking for the first line (going up) without any error, we found the following message right after it:

2021-11-05 11:46:47,948 INFO [dw-18140 - GET /instances/storage/dgSpace/ALL] [] c.o.i.a.IDMAgentAuthorizer: IDMAgentAuthorizer::user:ODA-srvdb01:role:list-dgstorages
! Causing: com.oracle.dcs.commons.exception.DcsException: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
! Causing: com.oracle.dcs.commons.exception.DcsException: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
! Causing: com.oracle.dcs.commons.exception.DcsException: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
2021-11-05 11:46:48,745 DEBUG [Install DB System : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.a.k.m.i.KvmJobHelper$KvmTaskReportRecorder: Recording task report: id: CreateDbSystem_KvmTask_38345,name: Install DB System, jobId: 33793dd8-6704-407a-8dd0-f2b83a9deb10, status: InternalError,exposeTaskResultToJob: false, result: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.,output: DcsException{errorHttpCode=InternalError, msg=Internal error encountered: Error creating job 'Provision DB System 'srvdb01''., msgId=10001,causedBy=com.oracle.pic.commons.client.exceptions.RestClientException: DCS-11002:Password for database admin user does not comply with the password policy.}
  "taskResult" : "DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.",
  "taskResult" : "DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.",

Look at the 4th line 😉 …yes, at the end… scroll a bit more… here we go:

client.exceptions.RestClientException: DCS-11002:Password for database admin user does not comply with the password policy.}


So finally the root cause of the failure was "simply" that the password given for the sys/system accounts was not compliant… 😕
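With hindsight, the quickest filter for a next time is to grep directly for the causedBy field, which carries the nested exception. A sketch (the sample line is copied, shortened, from the dcs-agent.log excerpt above):

```shell
# Sample line (shortened) copied from the dcs-agent.log excerpt above.
cat > /tmp/dcs-agent-line.log <<'EOF'
output: DcsException{errorHttpCode=InternalError, msg=Internal error encountered: Error creating job 'Provision DB System 'srvdb01''., msgId=10001,causedBy=com.oracle.pic.commons.client.exceptions.RestClientException: DCS-11002:Password for database admin user does not comply with the password policy.}
EOF

# Print only the nested root cause, without the surrounding wrapper.
grep -o 'causedBy=[^}]*' /tmp/dcs-agent-line.log
```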

However, the remaining question is: why don't we get this error message back in odacli describe-job instead of a useless generic error message?

It would have been so much easier:

[root@dbi-oda-x8 log]# dcserr 11002
11002, Password_too_simple, "Password for {0} does not comply with the password policy."
// *Cause: The user provided password does not satisfy the password policy rules.
// *Action: Refer to the Deployment and User's Guide for the password policy.
//          Provide a password which meets the criteria.
/

I hope that this can help.

Enjoy! 😎

The article ODA and KVM: Debugging of DBsystem creation failure first appeared on dbi Blog.


How to delete a resource with the error: failed calling webhook


The original mistake

In preparation for the GitLab essentials workshop, I am using Helm to deploy it. After a few tests, I wanted to clean up my cluster, and accidentally deleted the namespace before running helm uninstall. As a result, the namespace got stuck in the "Terminating" state…

Troubleshooting

Now the namespace is stuck, but why?

At first sight, no resources seem to remain in the namespace:

rocky@gitlab-master1:dbi-gitlab-ws:~$ kubectl get all -n gitlab
> No resources found

By default, GitLab installs the cert-manager controller, which comes with CRDs. However, kubectl get all does not list custom resources:

rocky@gitlab-master1:dbi-gitlab-ws:~$ kubectl get challenges.acme.cert-manager.io -n gitlab
NAME                                              STATE     DOMAIN                                   AGE
gitlab-gitlab-tls-c5nxj-1256604583-3239988248     invalid   gitlab-workshop.dbi-services.com     27m
gitlab-kas-tls-qghrb-3784695029-3983492218        invalid   kas-workshop.dbi-services.com        27m
gitlab-minio-tls-l8676-2620392232-3964581703      invalid   minio-workshop.dbi-services.com      27m
gitlab-registry-tls-k9j6n-1904257687-1249029966   invalid   registry-workshop.dbi-services.com   27m

Deleting these custom resources does not work either, because their finalizer does not respond during deletion.
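A finalizer is just an entry in the resource's metadata that Kubernetes waits on before actually removing the object. On one of these challenges it looks roughly like this (illustrative sketch; the exact finalizer name may differ between cert-manager versions):

```yaml
# Illustrative sketch: the delete stays pending until this list is empty.
metadata:
  name: gitlab-gitlab-tls-c5nxj-1256604583-3239988248
  finalizers:
    - finalizer.acme.cert-manager.io
```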

The easiest way to unblock them is to remove the finalizer from the resource:

rocky@gitlab-master1:dbi-gitlab-ws:~$ kubectl patch challenges.acme.cert-manager.io/gitlab-gitlab-tls-c5nxj-1256604583-3239988248 --type=json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]' -n gitlab
> Error from server (InternalError): Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://gitlab-certmanager-webhook.gitlab.svc:443/mutate?timeout=10s": service "gitlab-certmanager-webhook" not found

Unfortunately, in this case the patch doesn't work, because deleting the namespace has removed some resources needed by the finalizer…

Solution

cert-manager installs webhooks to manage its custom resources:

rocky@gitlab-master1:dbi-gitlab-ws:~$ kubectl get ValidatingWebhookConfiguration
NAME                            WEBHOOKS   AGE
cert-manager-webhook            1          81m

rocky@gitlab-master1:dbi-gitlab-ws:~$ kubectl get MutatingWebhookConfiguration
NAME                         WEBHOOKS   AGE
gitlab-certmanager-webhook   1          81m

In our case, these webhooks call a service and pods that no longer exist. As a result, the webhook call fails and blocks the finalizer.

To correct the problem, simply delete the webhooks:

rocky@gitlab-master1:dbi-gitlab-ws:~$ kubectl delete ValidatingWebhookConfiguration cert-manager-webhook
rocky@gitlab-master1:dbi-gitlab-ws:~$ kubectl delete MutatingWebhookConfiguration gitlab-certmanager-webhook

After that, it is possible to patch the remaining custom resources and let their deletion complete:

rocky@gitlab-master1:dbi-gitlab-ws:~$ kubectl patch challenges.acme.cert-manager.io/gitlab-gitlab-tls-c5nxj-1256604583-3239988248 --type=json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]' -n gitlab
> challenge.acme.cert-manager.io/gitlab-gitlab-tls-c5nxj-1256604583-3239988248 patched

The namespace will be automatically deleted once all custom resources have been cleaned up.
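If many challenges are stuck, patching them one by one gets tedious. A small hypothetical helper (using the same resource names as above, untested against a real cluster) loops over every remaining challenge in the namespace:

```shell
# Hypothetical helper: strip the finalizers from every remaining
# cert-manager challenge in a namespace so their deletion can complete.
strip_challenge_finalizers() {
  local ns="$1"
  local c
  for c in $(kubectl get challenges.acme.cert-manager.io -n "$ns" -o name); do
    kubectl patch "$c" -n "$ns" --type=json \
      --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'
  done
}
```

It would then be called as `strip_challenge_finalizers gitlab`.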

The article How to delete a resource with the error: failed calling webhook first appeared on dbi Blog.


How I killed my M-Files instance…and brought it back to life!

Oops it's broken

I have been working with M-Files for a year now, and to be honest the solution is pretty robust.
I have had really few issues at "server level", but sooner or later, while playing sorcerer's apprentice, you end up breaking something.

The inevitable happened: my M-Files instance became unusable. I assume the root cause is a combination of several things:

  • First, I created too many Vaults on an undersized VM.
  • Then I restarted (violently) the host while M-Files was still working on some Vaults (which use embedded Firebird DBs).

As a result, it was impossible to launch the Admin console: infinite loading when I wanted to list the Vaults.
Same behavior with the Desktop client and the Web Interface.

My first, naive thought was:

"OK, I applied the last monthly update and it might be related to that."

But no one is talking about this problem in the community… so I had to look for something else.

Then I checked the logs, and I found some events like:

M-Files database error

I was in a situation where the snake bites its own tail: the services cannot be stopped gracefully because some Vaults are not responding properly.
The Admin tool is also not responding, so I can neither bring the Vaults offline nor run a "Verify and Repair".

It's unusual to say, but one of the downsides of M-Files' stability is that there are very few resources (admin guide or community threads) about troubleshooting and recovery.


How I fixed it

I then had to "improvise" and find a way to get back to a stable state.

  1. First I changed the M-Files services startup mode to "Manual" and restarted the host.
  2. After the reboot, I moved all the folders containing Vault data to a temporary location.
  3. I changed the services startup mode back to "Automatic" and started M-Files.
  4. I was able to access the Admin tool and list the Vaults (all flagged Offline, as their folders were missing).
  5. I moved the Vault folders back one by one, brought each Vault online and ran a "Verify and Repair" on it.
  6. Some of the Vaults needed to be fixed:
Vault inconsistency

Luckily it worked

it's fixed!

Finally my M-Files server is back with all the Vaults running and without having to restore any backup.

In conclusion

This mishap highlighted one thing that may have played a role. My server is hosted in the cloud and is stopped during the night. I had moved the backup schedules to evening hours when the server is still up, but I forgot to also re-schedule the maintenance activities to times when the VM is running.

But the most important thing to point out is that it took a year before I encountered a major incident. I have worked with several ECMs in recent years, and I remain impressed by the stability of M-Files.

Feel free to contact us for any question about M-Files.

The article How I killed my M-Files instance…and brought it back to life! first appeared on dbi Blog.


Oracle 21c: Attention Log – Useful or Superfluous?


The attention.log is a feature introduced in Oracle Database 21c, designed to capture high-level summaries of significant database events and activities. It differs from the alert.log in the following points:

High-Level Summaries: The attention.log focuses on summarizing critical and significant events rather than logging every minor detail, including database startups and shutdowns, major configuration changes, errors or warnings that need immediate attention.

Consolidation of Critical Events: It provides a consolidated view of the most important events, making it easier for database administrators to quickly review and identify critical issues without rummaging through detailed logs.

Accessibility: Designed to be easily readable and quickly accessible for a high-level overview of the database’s health and significant activities.

Complementary to Alert Log: While the attention.log highlights major events, it complements the alert.log rather than replacing it. Database administrators can use the attention.log for a quick overview and the alert.log for detailed diagnostics.

Location: Like the alert.log, the attention.log is found in the DIAGNOSTIC_DEST directory, usually under $ORACLE_BASE/diag/rdbms/<db_name>/<instance_name>/trace.

It can be very helpful for less experienced database administrators, or to get a quick overview in difficult or unexpected cases, as I had on a production environment some time ago: an internal error occurred, the database crashed, and the alert.log was far too large to read without splitting it up, which of course makes troubleshooting unnecessarily difficult (under time pressure).

How to get information about the attention.log:

The location of the attention log can be found by querying the V$DIAG_INFO view; it is in the same directory as the alert.log (since Oracle 11g: $ORACLE_BASE/diag/rdbms/…):

select name, value
from   v$diag_info
where  name = 'Attention Log';

NAME                      VALUE
--------------------- -------------------------------------------------------------
Attention Log         /u01/app/oracle/diag/rdbms/cdb1/cdb1/trace/attention_cdb1.log

The Oracle documentation proposes querying the V$DIAG_ALERT_EXT view to get relevant attention-log information, but it is a view over the XML-based alert log (in the Automatic Diagnostic Repository of the current container), not the attention log! Nevertheless, we can get very useful information out of it, divided into the same categories as in the attention.log:

--message_type 2=INCIDENT_ERROR, message_type 3=ERROR
SELECT message_type, message_level, message_text
FROM V$DIAG_ALERT_EXT 
WHERE message_type in (2, 3);

MESSAGE_TYPE MESSAGE_LEVEL MESSAGE_TEXT
------------ ------------- ---------------------------------------------------------
           3    4294967295 PMON (ospid: 3565): terminating the instance due to ORA error 471 

When querying V$DIAG_ALERT_EXT, the most important columns are:

MESSAGE_LEVEL:

1: CRITICAL: critical errors

2: SEVERE: severe errors

8: IMPORTANT: important message

16: NORMAL: normal message

MESSAGE_TYPE:

1: UNKNOWN: essentially the NULL type

2: INCIDENT_ERROR: the program has encountered an error for some internal or unexpected reason, and it must be reported to Oracle Support

3: ERROR: an error of some kind has occurred

4: WARNING: an action occurred or a condition was discovered that should be reviewed and may require action

5: NOTIFICATION: reports a normal action or event, this could be a user action such as "logon completed"

6: TRACE: output of a diagnostic trace

Opening the attention.log with vi

vi /u01/app/oracle/diag/rdbms/cdb1/cdb1/trace/attention_cdb1.log

will give you an output like this (JSON formatted), which is obviously pretty comfortable to read:

{
IMMEDIATE : "PMON (ospid: 3565): terminating the instance due to ORA error 471" 
CAUSE: "PMON detected fatal background process death"
ACTION: "Termination of fatal background is not recommended, Investigate cause of process termination"
CLASS : CDB-INSTANCE / CDB_ADMIN / ERROR / DBAL-35782660
TIME : 2024-07-10T14:15:16.159-07:00
INFO : "Some additional data on error PMON error"
}
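Since each entry is a small set of KEY : "value" lines, even plain grep extracts the interesting fields without any JSON tooling. A sketch against a scratch file reproducing the entry above (on a real system you would point it at the attention_<SID>.log path returned by V$DIAG_INFO):

```shell
# Reproduce the sample attention.log entry shown above in a scratch file.
cat > /tmp/attention-sample.log <<'EOF'
{
IMMEDIATE : "PMON (ospid: 3565): terminating the instance due to ORA error 471"
CAUSE: "PMON detected fatal background process death"
ACTION: "Termination of fatal background is not recommended, Investigate cause of process termination"
CLASS : CDB-INSTANCE / CDB_ADMIN / ERROR / DBAL-35782660
TIME : 2024-07-10T14:15:16.159-07:00
INFO : "Some additional data on error PMON error"
}
EOF

# Pull out only the message, cause and recommended action of each entry.
grep -E '^(IMMEDIATE|CAUSE|ACTION)' /tmp/attention-sample.log
```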

It is also possible to convert the output into plain text format, for instance with jq (the exact filter depends on the JSON structure of your log):

jq -r '.tags[].name' input.json > output.txt

which gives us a formatted expression that might look like this:

2024-07-01T10:15:32.123456+00:00
[SEVERE] ORA-00600: internal error code, arguments: [1234], [], [], [], [], [], [], [], [], [], []
Action: Please contact Oracle Support Services.

2024-07-01T11:20:45.789012+00:00
[CRITICAL] ORA-01578: ORACLE data block corrupted (file # 23, block # 220734)
Action: This error signifies a corrupted data block. The data block has been marked as corrupt. Consider restoring from backup.

2024-07-02T08:42:27.654321+00:00
[ALERT] Database instance crashed due to unexpected termination.
Action: Investigate the cause of the instance termination. Review related logs and diagnostic information.

2024-07-03T12:34:56.987654+00:00
[INFO] System global area (SGA) resized. New size: 68GB.
Action: No immediate action required. Monitor performance and stability.

Key Points

Severity Levels: Entries are tagged with severity levels such as [SEVERE], [CRITICAL], [IMPORTANT] or [NORMAL] to highlight their importance.

Timestamp: Each entry begins with a timestamp in ISO 8601 format.

Messages and Actions: Each entry provides a brief description of the event and recommended actions.

Benefits for DBAs

Quick Identification: The attention.log helps DBAs quickly identify and respond to critical issues without sifting through the more detailed alert.log.

Conciseness: It captures only the most significant events, reducing noise and making it easier to focus on urgent matters.

Complementary to alert.log: It complements the alert.log by summarizing critical events, while the alert.log continues to provide detailed information for troubleshooting.

Overall, the attention.log is a useful addition for DBAs, enabling more efficient monitoring and quicker responses to significant database events.

https://oracle-base.com/articles/21c/attention-log-oracle-database-21c

https://blogs.oracle.com/cloud-infrastructure/post/alert-log-support-for-oci-database-management

https://docs.oracle.com/en/database/oracle/oracle-database/21/nfcon/management-solutions.html#GUID-F2EB58EC-4B22-473F-A2D3-40161372610E

The article Oracle 21c: Attention Log – Useful or Superfluous? first appeared on dbi Blog.
