Need help optimizing database

Forum for things that don't really have anything to do with hMailServer. Such as php.ini, beer, etc etc.
palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Need help optimizing database

Post by palinka » 2020-02-13 00:04

As part of my firewall ban project I have a table for information scraped from the firewall log. It has 2 columns: timestamp and IP. Each of those columns purposefully contains duplicate information. For example, an IP can attempt to connect multiple times and therefore be duplicated the same number of times. The firewall log timestamp only resolves to seconds (not milliseconds), so it's possible to have duplicate entries for that as well.

As I approach a million rows, it's starting to get very slow. I cannot think of a way (shocking, I know) to optimize the table, because I really need this duplicate data for counting purposes. Ideally, I'd like to be able to split it somehow to make a unique column in order to speed up queries.

Any ideas?

mattg
Moderator
Moderator
Posts: 22437
Joined: 2007-06-14 05:12
Location: 'The Outback' Australia

Re: Need help optimizing database

Post by mattg » 2020-02-13 00:26

unique entries and a third column for count
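A minimal sketch of that idea in MySQL (table and column names here are made up for illustration):

Code: Select all

CREATE TABLE ip_hits (
    ipaddress VARCHAR(45) NOT NULL PRIMARY KEY, -- one row per IP, so the column stays unique
    hits      INT NOT NULL DEFAULT 1,           -- how many times the IP was seen
    lastseen  DATETIME NOT NULL
);

-- new IPs get a row; repeat IPs just bump the counter
INSERT INTO ip_hits (ipaddress, hits, lastseen)
VALUES ('192.0.2.1', 1, NOW())
ON DUPLICATE KEY UPDATE hits = hits + 1, lastseen = NOW();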
Just 'cause I link to a page and say little else doesn't mean I am not being nice.
https://www.hmailserver.com/documentation

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-13 04:09

mattg wrote:
2020-02-13 00:26
unique entries and a third column for count
I'm trying to visualize this.

Table 1:
id AUTOINCREMENT PRIMARY UNIQUE
ipaddress UNIQUE
ipcount INT
last-timestamp NOT NULL
Trigger = insert into table 2 id, timestamp

Table 2:
id (matches table 1 id)(not unique or primary - duplicates allowed)
timestamp (not unique or primary - duplicates allowed)

Don't I still end up in the same place with a million rows without a unique column in table 2? Seems like what I have is slightly more efficient than adding more columns and tables that effectively do the same thing. I know I'm missing something here. I just can't wrap my head around the missing link.

tunis
Senior user
Senior user
Posts: 351
Joined: 2015-01-05 20:22
Location: Sweden

Re: Need help optimizing database

Post by tunis » 2020-02-13 09:08

All columns you use in queries should have an index on them.
If you use "select x where timestamp > 'y' order by id", both timestamp and id should have an index on them to speed up the query.
HMS 5.6.8 B2534.28 on Windows Server 2019 Core VM.
HMS 5.6.9 B2641.67 on Windows Server 2016 Core VM.

jimimaseye
Moderator
Moderator
Posts: 10060
Joined: 2011-09-08 17:48

Re: Need help optimizing database

Post by jimimaseye » 2020-02-13 09:50

tunis wrote:
2020-02-13 09:08
All columns you use in queries should have an index on them.
If you use "select x where timestamp > 'y' order by id", both timestamp and id should have an index on them to speed up the query.
+1

Eg

CREATE INDEX tmp_hm_fwban_timestamp ON hm_fwban (timestamp);

[Entered by mobile. Excuse my spelling.]
5.7 on test.
SpamassassinForWindows 3.4.0 spamd service
AV: Clamwin + Clamd service + sanesecurity defs : https://www.hmailserver.com/forum/viewtopic.php?f=21&t=26829

Dravion
Senior user
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany
Contact:

Re: Need help optimizing database

Post by Dravion » 2020-02-13 12:57

First of all, use SQL Prepared Statements.

If you insert dynamic string variables into your SQL statements, you force the database query planner to re-evaluate every new SQL query that needs to be executed, while prepared statements need to be parsed and compiled only once.

You should also avoid table scans (a full search of the table) in favor of index scans, which are in most cases faster; you get index scans by creating indexes. Avoid foreign key relations.

Reduce your PHP result sets by fetching only the data you need, not long lists of data.
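A rough sketch of a server-side prepared statement in MySQL syntax (hm_fwban is the table name from the index example above; the column names and date values are assumptions for illustration):

Code: Select all

-- parsed and planned once...
PREPARE stmt FROM
    'SELECT ipaddress, `timestamp` FROM hm_fwban WHERE `timestamp` >= ? AND `timestamp` < ?';
SET @from = '2020-02-01 00:00:00', @to = '2020-03-01 00:00:00';
-- ...then executed as often as needed with different values
EXECUTE stmt USING @from, @to;
DEALLOCATE PREPARE stmt;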

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-13 13:08

Dravion wrote:
2020-02-13 12:57

You should also avoid table scans (a full search of the table)
That is definitely what I'm doing now. :(

Thanks for the ideas. When I get to my computer I'm going to try a couple of things and report back.

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-23 16:09

tunis wrote:
2020-02-13 09:08
All columns you use in queries should have an index on them.
If you use "select x where timestamp > 'y' order by id", both timestamp and id should have an index on them to speed up the query.
I get that, but that's exactly the heart of my problem. It's impossible for either timestamp or ipaddress to be unique. Both NECESSARILY must be able to contain duplicates in order to get an accurate count, for the reasons I laid out in the OP.

Here's what I did so far:

hm_fwban_blocks_ip
id AUTOINCREMENT PRIMARY UNIQUE
ipaddress UNIQUE
hits INT (ON DUPLICATE KEY UPDATE hits=(hits+1))
lasttimestamp NOT NULL
Trigger = insert into table 2 id, ipaddress, timestamp

hm_fwban_rh (this is the original one that is causing all the problems)
ipid (matches table 1 id)(not unique or primary - duplicates allowed)
ipaddress (not unique or primary - duplicates allowed)
timestamp (not unique or primary - duplicates allowed)

This setup allows minimal information about the total number of dropped connections to come from hm_fwban_blocks_ip - which, with its primary and unique columns, returns results very quickly. But I still have the original problem when I need to draw more detailed data from hm_fwban_rh.
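A hedged guess at what that trigger could look like in MySQL, using the column names from the lists above (untested; note MySQL fires UPDATE triggers rather than INSERT triggers when INSERT ... ON DUPLICATE KEY UPDATE hits an existing row, so a matching AFTER UPDATE trigger would be needed to log repeat IPs):

Code: Select all

DELIMITER //
CREATE TRIGGER trg_fwban_log AFTER INSERT ON hm_fwban_blocks_ip
FOR EACH ROW
BEGIN
    -- copy the new row's id, IP and timestamp into the detail table
    INSERT INTO hm_fwban_rh (ipid, ipaddress, `timestamp`)
    VALUES (NEW.id, NEW.ipaddress, NEW.lasttimestamp);
END//
DELIMITER ;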

What if I split hm_fwban_rh into 3 tables? One for ipaddress, one for timestamp, and one for ????unique identifier???? I can't figure out how to bind them together without ending up with the same all-duplicates table.

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-23 17:23

OK, I thought of a way to fudge milliseconds in the timestamp column in order to make it UNIQUE: just add milliseconds to the insert statement. This works if the column is formatted as BIGINT.

1) read firewall log, get timestamp (2020-02-23 10:18:43)
2) add ".001" to the end (2020-02-23 10:18:43.001)
3) INSERT INTO test_table (timestamp) VALUES (UNIX_TIMESTAMP('2020-02-23 12:22:44.001') * 1000)

OK, all of that is no problem. Now, if there is a duplicate timestamp, I need to add 1. I can't find the syntax for this operation. Unlike ON DUPLICATE KEY UPDATE, I want to insert a new value: timestamp=(timestamp+1)

ON DUPLICATE KEY UPDATE only updates the record with the same key. I need to add a new record. Is this possible?
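As far as I can tell there's no single-statement way to do that; one possible workaround (a sketch assuming the column is BIGINT with a UNIQUE index, plus a retry loop on the application side) is INSERT IGNORE:

Code: Select all

-- silently skips the insert instead of erroring when the value already exists
INSERT IGNORE INTO test_table (`timestamp`)
VALUES (UNIX_TIMESTAMP('2020-02-23 12:22:44.001') * 1000);
-- if ROW_COUNT() returns 0, the value collided: retry from the caller with value + 1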

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-23 18:53

I made the unique timestamp thing work using PowerShell instead of directly in the query. Running the "foreach" update now. Will let you know if this speeds things up in a few hours when it's done. :mrgreen:

tunis
Senior user
Senior user
Posts: 351
Joined: 2015-01-05 20:22
Location: Sweden

Re: Need help optimizing database

Post by tunis » 2020-02-24 09:38

You don't need to add a unique index.

ALTER TABLE tbl_name ADD INDEX index_name (column_list) - this adds an ordinary index in which any value may appear more than once.
HMS 5.6.8 B2534.28 on Windows Server 2019 Core VM.
HMS 5.6.9 B2641.67 on Windows Server 2016 Core VM.

jimimaseye
Moderator
Moderator
Posts: 10060
Joined: 2011-09-08 17:48

Re: Need help optimizing database

Post by jimimaseye » 2020-02-24 09:51

(I suggested this a week and a half ago. Not sure why he chose not to do it.)
jimimaseye wrote:
2020-02-13 09:50
CREATE INDEX tmp_hm_fwban_timestamp ON hm_fwban (timestamp);
[Entered by mobile. Excuse my spelling.]
5.7 on test.
SpamassassinForWindows 3.4.0 spamd service
AV: Clamwin + Clamd service + sanesecurity defs : https://www.hmailserver.com/forum/viewtopic.php?f=21&t=26829

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-24 13:05

tunis wrote:
2020-02-24 09:38
You don't need to add a unique index.

ALTER TABLE tbl_name ADD INDEX index_name (column_list) - this adds an ordinary index in which any value may appear more than once.
I was not aware of that. I misunderstood and thought it must be unique. I'll give that a shot. Thanks.

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-24 16:57

OK I indexed both ipaddress and timestamp, here's a typical query:

Code: Select all

SELECT DISTINCT(ipaddress), COUNT(ipaddress) AS countip, timestamp FROM hm_fwban_rh WHERE timestamp LIKE '2020-02%' GROUP BY ipaddress
And results: (1550 total, Query took 0.3457 seconds.) <-- from phpmyadmin

However, the reality is it takes over 20 seconds. I'm not sure what the disconnect is.

Buuuuttt... Searches on ipaddress are now lightning fast (as well as page loads). I'm really confused over the whole thing. Neither ipaddress nor timestamp are unique. There are a helluva lot more "non-duplicated" timestamps than there are ipaddresses.

tunis
Senior user
Senior user
Posts: 351
Joined: 2015-01-05 20:22
Location: Sweden

Re: Need help optimizing database

Post by tunis » 2020-02-24 18:21

palinka wrote:
2020-02-24 16:57
OK I indexed both ipaddress and timestamp, here's a typical query:

Code: Select all

SELECT DISTINCT(ipaddress), COUNT(ipaddress) AS countip, timestamp FROM hm_fwban_rh WHERE timestamp LIKE '2020-02%' GROUP BY ipaddress
And results: (1550 total, Query took 0.3457 seconds.) <-- from phpmyadmin

However, the reality is it takes over 20 seconds. I'm not sure what the disconnect is.

Buuuuttt... Searches on ipaddress are now lightning fast (as well as page loads). I'm really confused over the whole thing. Neither ipaddress nor timestamp are unique. There are a helluva lot more "non-duplicated" timestamps than there are ipaddresses.
For more speed you should avoid using %.
Can't you use LEFT(string, length)?

Code: Select all

SELECT DISTINCT(ipaddress), COUNT(ipaddress) AS countip, timestamp FROM hm_fwban_rh WHERE LEFT(timestamp,7) = '2020-02' GROUP BY ipaddress
This is untested
HMS 5.6.8 B2534.28 on Windows Server 2019 Core VM.
HMS 5.6.9 B2641.67 on Windows Server 2016 Core VM.

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-24 18:41

tunis wrote:
2020-02-24 18:21
For more speed you should avoid using %.
Can't you use LEFT(string, length)?

Code: Select all

SELECT DISTINCT(ipaddress), COUNT(ipaddress) AS countip, timestamp FROM hm_fwban_rh WHERE LEFT(timestamp,7) = '2020-02' GROUP BY ipaddress
This is untested
Not the way it currently is. Would need a page redesign, but yeah, I'm willing to do whatever it takes to speed things up.

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-25 03:21

tunis wrote:
2020-02-24 18:21
For more speed you should avoid using %.
Can't you use LEFT(string, length)?

Code: Select all

SELECT DISTINCT(ipaddress), COUNT(ipaddress) AS countip, timestamp FROM hm_fwban_rh WHERE LEFT(timestamp,7) = '2020-02' GROUP BY ipaddress
This is untested
I tried it and it doesn't help. At least, it's not noticeable.

Also, I experimented a little. After the index suggestion, I put it on both ipaddress and timestamp and that made ip queries super fast but timestamp queries really slow - like 20 seconds. Then I dropped the index from timestamp and got the same results. Then I dropped the index on ipaddress and put it back on timestamp. That made the whole thing a little faster than before.

Fortunately, because I run a demo site with separate data (and unaltered columns), I was able to compare speed.

Old: no indexes
New: index on timestamp

Result: with the index on timestamp, pages load in about 2 seconds. The demo loads the same pages in about 4 seconds. So that's definitely an improvement, but I'm looking to make it even faster, if possible.

mattg
Moderator
Moderator
Posts: 22437
Joined: 2007-06-14 05:12
Location: 'The Outback' Australia

Re: Need help optimizing database

Post by mattg » 2020-02-25 05:01

What about indexing BOTH columns together?
https://www.mysqltutorial.org/mysql-ind ... ite-index/
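E.g. a composite index (index name here is made up; with a composite index the leftmost column is the one a range filter can use, so column order matters):

Code: Select all

CREATE INDEX idx_rh_ts_ip ON hm_fwban_rh (`timestamp`, ipaddress);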
Just 'cause I link to a page and say little else doesn't mean I am not being nice.
https://www.hmailserver.com/documentation

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-25 13:16

mattg wrote:
2020-02-25 05:01
What about indexing BOTH columns together?
https://www.mysqltutorial.org/mysql-ind ... ite-index/
I'll give that a careful read.

In the meantime, this query solves half my problem. It's like a magical speed incantation.

Code: Select all

SELECT 
	ipaddress, 
	timestamp 
FROM (
	SELECT * 
	FROM hm_fwban_rh 
	WHERE '2020-01-01 00:00:00' <= timestamp
) AS A 
WHERE timestamp < '2020-02-01 00:00:00'
It's all about the WHERE, I guess. About half of the most used queries are listing or counting within a timespan, and this works great.
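Likely reason: a plain range comparison on an indexed column can be answered with an index range scan, while LIKE '2020-02%' or LEFT(timestamp, 7) on a datetime column forces a per-row conversion. If that's right, the derived table isn't even needed; the same range scan should come from a simple AND (untested):

Code: Select all

SELECT ipaddress, `timestamp`
FROM hm_fwban_rh
WHERE `timestamp` >= '2020-01-01 00:00:00'
  AND `timestamp` <  '2020-02-01 00:00:00';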

I'm going to research query structure for just plain counting, which is the bulk of the other most used queries.

Dravion
Senior user
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany
Contact:

Re: Need help optimizing database

Post by Dravion » 2020-02-25 16:49

Cast your TIMESTAMP string to the TIMESTAMP datatype before you compare dates and times, like this:

< STR_TO_DATE('2012-12-23 18:00:00', '%Y-%m-%d %H:%i:%s')
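In context that might look like this (query shape borrowed from earlier in the thread; untested):

Code: Select all

SELECT ipaddress, `timestamp`
FROM hm_fwban_rh
WHERE `timestamp` < STR_TO_DATE('2012-12-23 18:00:00', '%Y-%m-%d %H:%i:%s');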


Next, ALWAYS use PREPARED STATEMENTS!
Avoid dynamic SQL like WHERE x = 'blah'.

Dynamic SQL forces the query optimizer to re-analyse your whole SQL query every time, which comes at a huge cost.
Prepared statements, on the other hand, don't need to be re-analyzed and optimized over and over again, and are way faster.

//prepared

Code: Select all

$stmt = $pdo->prepare("SELECT * FROM users WHERE id=?");
$stmt->execute([$id]); // bind $id to the placeholder and run the query

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-25 18:58

Dravion wrote:
2020-02-25 16:49
Cast your TIMESTAMP string to the TIMESTAMP datatype before you compare dates and times, like this:

< STR_TO_DATE('2012-12-23 18:00:00', '%Y-%m-%d %H:%i:%s')
Awesome. Thank you.

Many of my queries use DATE(timestamp). Is there any more efficient way of doing this?
Dravion wrote:
2020-02-25 16:49
Next, ALWAYS use PREPARED STATEMENTS!

Code: Select all

$stmt = $pdo->prepare("SELECT * FROM users WHERE id=?");
Already do.


Another issue I have that slows things down (I think) is DISTINCT for searching ipaddress results. Is that the fastest method of finding distinct entries? Mainly it's used for COUNT(DISTINCT(ipaddress)).

Dravion
Senior user
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany
Contact:

Re: Need help optimizing database

Post by Dravion » 2020-02-25 19:23

Check your slow query log for queries that take longer than 500 ms.
Anything under 500 ms is good; anything above should be optimized.

https://serverpilot.io/docs/how-to-read ... query-log/

You can also put the EXPLAIN command in front of your SQL query to find out which part of your SQL statement is taking a long time.
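For example (query shape borrowed from earlier in the thread; in the output, the type, key and rows columns show whether an index is used and roughly how many rows MySQL expects to examine):

Code: Select all

EXPLAIN SELECT ipaddress, COUNT(*) AS countip
FROM hm_fwban_rh
WHERE `timestamp` >= '2020-02-01' AND `timestamp` < '2020-03-01'
GROUP BY ipaddress;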

palinka
Senior user
Senior user
Posts: 4461
Joined: 2017-09-12 17:57

Re: Need help optimizing database

Post by palinka » 2020-02-26 14:43

Dravion wrote:
2020-02-25 16:49
Cast your TIMESTAMP string to the TIMESTAMP datatype before you compare dates and times, like this:

< STR_TO_DATE('2012-12-23 18:00:00', '%Y-%m-%d %H:%i:%s')
This turned out to be the best advice. You get a gold star.

I had to jump through some logical hoops to maintain flexibility but I got it working nicely. Between this and the suggestion to index, I have now brought these page loads down to between "immediate" and "reasonably fast". I'm satisfied finally.

Thanks, guys, for all the help. I'll be back after I'm up to 5 million rows. :D
