Need help optimizing database
As part of my firewall ban project I have a table for information scraped from the firewall log. It has two columns: timestamp and IP. Each of those columns intentionally contains duplicate information. For example, an IP can attempt to connect multiple times and will therefore be duplicated the same number of times. The firewall log timestamp only resolves to seconds (not milliseconds), so it's possible to have duplicate entries in that column as well.
As I approach a million rows, it's starting to get very slow. I can't think of a way (shocking, I know) to optimize the table because I really need this duplicate data for counting purposes. Ideally, I'd like to be able to split it somehow to create a unique column in order to speed up queries.
Any ideas?
Re: Need help optimizing database
Use unique entries and a third column for the count.
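A minimal, untested MySQL-style sketch of that idea (the table and column names here are illustrative, not from the project):

```sql
-- One row per IP; bump the counter on repeat connections.
CREATE TABLE ip_counts (
    ipaddress VARCHAR(45) NOT NULL PRIMARY KEY,  -- 45 chars also covers IPv6
    hits      INT NOT NULL DEFAULT 1,
    lastseen  DATETIME NOT NULL
);

-- New IP: inserts with hits = 1. Known IP: increments hits instead.
INSERT INTO ip_counts (ipaddress, lastseen)
VALUES ('203.0.113.5', '2020-02-23 10:18:43')
ON DUPLICATE KEY UPDATE hits = hits + 1, lastseen = VALUES(lastseen);
```

The PRIMARY KEY on ipaddress is what makes ON DUPLICATE KEY UPDATE fire, and it doubles as the unique index that speeds up lookups.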
Just 'cause I link to a page and say little else doesn't mean I am not being nice.
https://www.hmailserver.com/documentation
Re: Need help optimizing database
I'm trying to visualize this.
Table 1:
id AUTOINCREMENT PRIMARY UNIQUE
ipaddress UNIQUE
ipcount INT
last-timestamp NOT NULL
Trigger = insert into table 2 id, timestamp
Table 2:
id (matches table 1 id)(not unique or primary - duplicates allowed)
timestamp (not unique or primary - duplicates allowed)
Don't I still end up in the same place with a million rows without a unique column in table 2? Seems like what I have is slightly more efficient than adding more columns and tables that effectively do the same thing. I know I'm missing something here. I just can't wrap my head around the missing link.
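For reference, the two-table layout above could be written out roughly like this (an untested sketch; the names and column types are assumptions):

```sql
CREATE TABLE table1 (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    ipaddress     VARCHAR(45) NOT NULL UNIQUE,
    ipcount       INT NOT NULL DEFAULT 1,
    lasttimestamp DATETIME NOT NULL
);

CREATE TABLE table2 (
    id        INT NOT NULL,        -- matches table1.id, duplicates allowed
    timestamp DATETIME NOT NULL    -- duplicates allowed
);

-- Copy the id/timestamp pair into table2 whenever a row lands in table1.
CREATE TRIGGER table1_after_insert
AFTER INSERT ON table1
FOR EACH ROW
INSERT INTO table2 (id, timestamp) VALUES (NEW.id, NEW.lasttimestamp);
```

Note that table2 still grows by one row per connection attempt, which is the "same place with a million rows" concern raised above.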
Re: Need help optimizing database
Every column you use in a query should have an index on it.
If you use "select x where timestamp > 'y' order by id", both timestamp and id should have indexes on them to speed up the query.
HMS 5.6.8 B2534.28 on Windows Server 2019 Core VM.
HMS 5.6.9 B2641.67 on Windows Server 2016 Core VM.
- jimimaseye
- Moderator
- Posts: 10060
- Joined: 2011-09-08 17:48
Re: Need help optimizing database
+1
Eg
CREATE INDEX tmp_hm_fwban_timestamp ON hm_fwban (timestamp);
[Entered by mobile. Excuse my spelling.]
5.7 on test.
SpamassassinForWindows 3.4.0 spamd service
AV: Clamwin + Clamd service + sanesecurity defs : https://www.hmailserver.com/forum/viewtopic.php?f=21&t=26829
Re: Need help optimizing database
First of all, use SQL Prepared Statements.
If you insert dynamic string variables into your SQL statements, you force the database query planner to re-evaluate every new SQL query that gets executed, while prepared statements need to be parsed and compiled only once.
You should also avoid table scans (full searches of a table) in favor of index scans, which are faster in most cases; create indexes to make that possible. Avoid foreign key relations.
Reduce your PHP result sets by fetching only the data you need, not long lists of data.
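As an untested MySQL-flavoured sketch of a server-side prepared statement (using the table name from later in this thread; the query itself is just a placeholder):

```sql
-- Parsed and planned once; re-executed cheaply with different values.
PREPARE count_ip FROM
    'SELECT COUNT(*) FROM hm_fwban_rh WHERE ipaddress = ?';

SET @ip = '203.0.113.5';
EXECUTE count_ip USING @ip;

DEALLOCATE PREPARE count_ip;
```

In PHP the same effect is usually achieved through PDO's prepare/execute, as shown later in the thread.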
Re: Need help optimizing database
I get that, but that's exactly the heart of my problem. It's impossible for either timestamp or ipaddress to be unique. Both NECESSARILY must be able to contain duplicates in order to get an accurate count, for the reasons I laid out in the OP.
Here's what I did so far:
hm_fwban_blocks_ip
id AUTOINCREMENT PRIMARY UNIQUE
ipaddress UNIQUE
hits INT (ON DUPLICATE KEY UPDATE hits=(hits+1))
lasttimestamp NOT NULL
Trigger = insert into table 2 id, ipaddress, timestamp
hm_fwban_rh (this is the original one that is causing all the problems)
ipid (matches table 1 id)(not unique or primary - duplicates allowed)
ipaddress (not unique or primary - duplicates allowed)
timestamp (not unique or primary - duplicates allowed)
This setup allows for minimal information about the total number of dropped connections to come from hm_fwban_blocks_ip - which, with primary and unique columns, returns results very quickly. But I still have the original problem when I need to draw more detailed data from hm_fwban_rh.
What if I split hm_fwban_rh into 3 tables? One for ipaddress, one for timestamp, and one for ????unique identifier???? I can't figure out how to bind them together without ending up with the same all-duplicates table.
Re: Need help optimizing database
OK, I thought of a way to fudge milliseconds in timestamp column in order to make it UNIQUE: just add milliseconds to the insert statement. This works if the column is formatted as BIGINT.
1) read firewall log, get timestamp (2020-02-23 10:18:43)
2) add ".001" to the end (2020-02-23 10:18:43.001)
3) INSERT INTO test_table (timestamp) VALUES (UNIX_TIMESTAMP('2020-02-23 12:22:44.001') * 1000)
OK, all of that is no problem. Now, if there is a duplicate timestamp, I need to add 1. I can't find the syntax for this operation. Unlike ON DUPLICATE KEY UPDATE, I want to insert a new value: timestamp=(timestamp+1)
ON DUPLICATE KEY UPDATE only updates the record with the same key. I need to add a new record. Is this possible?
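One untested way to sketch the collision handling in a single statement, assuming log lines are processed in chronological order (so any existing timestamp at or above the new value means a collision):

```sql
-- @ts is the log timestamp converted to milliseconds, as in step 3 above.
SET @ts = UNIX_TIMESTAMP('2020-02-23 10:18:43') * 1000;

-- If a row with timestamp >= @ts already exists, insert max + 1 instead;
-- otherwise (MAX returns NULL) insert @ts unchanged.
INSERT INTO test_table (timestamp)
SELECT COALESCE(MAX(timestamp) + 1, @ts)
FROM test_table
WHERE timestamp >= @ts;
```

This is only a sketch of the idea; it sidesteps ON DUPLICATE KEY UPDATE entirely by computing the value before the insert.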
Re: Need help optimizing database
I made the unique timestamp thing work using PowerShell instead of doing it directly in the query. Running the "foreach" update now. Will let you know if this speeds things up in a few hours when it's done.
Re: Need help optimizing database
You don't need to add a unique index.
ALTER TABLE tbl_name ADD INDEX index_name (column_list): this adds an ordinary index, in which any value may appear more than once.
Re: Need help optimizing database
(I suggested this a week and a half ago. Not sure why he chose not to do it.)
Re: Need help optimizing database
I was not aware of that. I misunderstood and thought it must be unique. I'll give that a shot. Thanks.
Re: Need help optimizing database
OK, I indexed both ipaddress and timestamp. Here's a typical query:
Code: Select all
SELECT DISTINCT(ipaddress), COUNT(ipaddress) AS countip, timestamp FROM hm_fwban_rh WHERE timestamp LIKE '2020-02%' GROUP BY ipaddress
And the results: (1550 total, Query took 0.3457 seconds.) <-- from phpMyAdmin
However, the reality is that it takes over 20 seconds. I'm not sure what the disconnect is.
Buuuuttt... searches on ipaddress are now lightning fast (as are page loads). I'm really confused by the whole thing. Neither ipaddress nor timestamp is unique, and there are a helluva lot more "non-duplicated" timestamps than there are ipaddresses.
Re: Need help optimizing database
palinka wrote: ↑2020-02-24 16:57 OK I indexed both ipaddress and timestamp, here's a typical query [...]
For more speed you should avoid using %. Can't you use LEFT(string, length)? This is untested:
Code: Select all
SELECT DISTINCT(ipaddress), COUNT(ipaddress) AS countip, timestamp FROM hm_fwban_rh WHERE LEFT(timestamp,7) = '2020-02' GROUP BY ipaddress
Re: Need help optimizing database
tunis wrote: ↑2020-02-24 18:21 For more speed you should avoid using %. Can't you use LEFT(string, length)?
Not the way it currently is. It would need a page redesign, but yeah, I'm willing to do whatever it takes to speed things up.
Re: Need help optimizing database
tunis wrote: ↑2020-02-24 18:21 For more speed you should avoid using %. Can't you use LEFT(string, length)?
I tried it and it doesn't help. At least, it's not noticeable.
Also, I experimented a little. After the index suggestion, I put it on both ipaddress and timestamp and that made ip queries super fast but timestamp queries really slow - like 20 seconds. Then I dropped the index from timestamp and got the same results. Then I dropped the index on ipaddress and put it back on timestamp. That made the whole thing a little faster than before.
Fortunately, because I run a demo site with separate data (and unaltered columns), I was able to compare speed.
Old: no indexes
New: index on timestamp
Result: with the index on timestamp, pages load in about 2 seconds. The demo loads the same pages in about 4 seconds. So that's definitely an improvement, but I'm looking to speed it up even more, if possible.
Re: Need help optimizing database
What about indexing BOTH columns together
https://www.mysqltutorial.org/mysql-ind ... ite-index/
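An untested sketch of a composite (multi-column) index on the table from this thread:

```sql
-- Column order matters: this index can serve queries that filter on
-- timestamp alone, or on timestamp together with ipaddress, but it
-- cannot serve queries that filter on ipaddress alone.
CREATE INDEX idx_rh_timestamp_ip ON hm_fwban_rh (timestamp, ipaddress);
```

Because it contains both columns, a count-by-IP-within-a-date-range query can potentially be answered from the index alone, without touching the table rows.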
Re: Need help optimizing database
mattg wrote: ↑2020-02-25 05:01 What about indexing BOTH columns together
I'll give that a careful read.
In the meantime, this query solves half my problem. It's like a magical speed incantation.
Code: Select all
SELECT
ipaddress,
timestamp
FROM (
SELECT *
FROM hm_fwban_rh
WHERE '2020-01-01 00:00:00' <= timestamp
) AS A
WHERE timestamp < '2020-02-01 00:00:00'
I'm going to research query structure for just plain counting, which is the bulk of the other most used queries.
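For the plain counting case, an untested variant of the same range-predicate pattern might look like this (keeping the comparisons on the raw column so an index on timestamp can be used):

```sql
-- Connection attempts per IP for January 2020, using a half-open range.
SELECT ipaddress, COUNT(*) AS countip
FROM hm_fwban_rh
WHERE timestamp >= '2020-01-01 00:00:00'
  AND timestamp <  '2020-02-01 00:00:00'
GROUP BY ipaddress;
```

The half-open range (>= start, < next-period start) avoids both LIKE '%' and LEFT(), which prevent index range scans.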
Re: Need help optimizing database
Cast your TIMESTAMP string to the TIMESTAMP datatype before you compare date and time, like this:
< STR_TO_DATE('2012-12-23 18:00:00', '%Y-%m-%d %H:%i:%s')
Next, ALWAYS use PREPARED STATEMENTS!
Avoid dynamic SQL like WHERE x = 'blah'.
Dynamic SQL forces the query optimizer to re-analyse your whole SQL query every time it runs, which comes at a huge cost.
Prepared statements, on the other hand, don't need to be re-analyzed and optimized over and over again, and are much faster.
//prepared
Code: Select all
$stmt = $pdo->prepare("SELECT * FROM users WHERE id=?");
Re: Need help optimizing database
Awesome. Thank you.
Many of my queries use DATE(timestamp). Is there any more efficient way of doing this?
Already do:
Code: Select all
$stmt = $pdo->prepare("SELECT * FROM users WHERE id=?");
Another issue I have that slows things down (I think) is DISTINCT for searching ipaddress results. Is that the fastest method of finding distinct entries? Mainly it's used for COUNT(DISTINCT(ipaddress)).
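On the DATE(timestamp) question: wrapping the column in a function defeats an index on timestamp, because the index stores raw values, not function results. One untested way around that is to rewrite the function call as a half-open range on the raw column:

```sql
-- Instead of: WHERE DATE(timestamp) = '2020-02-24'
SELECT COUNT(DISTINCT ipaddress) AS ips
FROM hm_fwban_rh
WHERE timestamp >= '2020-02-24 00:00:00'
  AND timestamp <  '2020-02-25 00:00:00';
```

Both queries select the same rows, but only the range form lets the optimizer do an index range scan on timestamp.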
Re: Need help optimizing database
Check your slow query log and see which queries take longer than 500 ms.
Anything under 500 ms is good; anything above must be optimized.
https://serverpilot.io/docs/how-to-read ... query-log/
You can also put the EXPLAIN command in front of your SQL query to find out which part of your SQL statement is taking a long time.
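For example (untested; the query is the one palinka posted earlier in the thread):

```sql
-- EXPLAIN reports the chosen index, access type, and estimated row count
-- without actually executing the query.
EXPLAIN SELECT ipaddress, COUNT(ipaddress) AS countip
FROM hm_fwban_rh
WHERE timestamp LIKE '2020-02%'
GROUP BY ipaddress;
```

If the "type" column shows ALL (a full table scan) or the "rows" estimate is close to the whole table, that query is a candidate for an index or a rewrite.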
Re: Need help optimizing database
This turned out to be the best advice. You get a gold star.
I had to jump through some logical hoops to maintain flexibility, but I got it working nicely. Between this and the suggestion to index, I have now brought these page loads down to between "immediate" and "reasonably fast". I'm finally satisfied.
Thanks, guys, for all the help. I'll be back after I'm up to 5 million rows.